DigiTech - H&H - Summary
Summary of the textbook D. Harris and S. Harris, "Digital Design and Computer Architecture (2nd
Edition)."
Author: Ruben W. G. Schenk, [email protected]
Date: 01.05.2020
Version: 0.6
Current scope: Chapters 1 - 8 (Full lecture)
I, Ruben Schenk, take no responsibility for the accuracy, currency, reliability and correctness of any
information included in the work provided. This work is not licensed under a Creative Commons License.
Sharing this work is prohibited without requesting permission to do so.
Chapter 1 - From Zero to One
1.3 The Digital Abstraction
Digital systems as we know them represent information with discrete variables. These are variables
with a finite number of distinct values. The amount of information $D$ in a discrete value with $N$
distinct states is measured in units of bits as $D = \log_2 (N) \ bits$.
1.4 Number Systems
1.4.1 Decimal Numbers
The decimal number representation is the one we learned at school. There are ten digits, ranging from 0
to 9. Decimal numbers are referred to as base $10$ .
1.4.2 Binary Numbers
Bits represent one of two values, $0$ or $1$, and are joined together to form binary numbers. Each
column of a binary number has twice the weight of the previous column, so binary numbers are base
$2$. If we want to convert decimal to binary numbers, we can work from right to left: we divide the
decimal number repeatedly by 2, and each remainder goes into the next binary position.
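The repeated-division method can be sketched in Python (an illustrative sketch, not code from the book):

```python
def dec_to_bin(n):
    """Convert a non-negative decimal integer to a binary string
    by repeated division by 2; remainders fill positions right to left."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # remainder becomes the next binary digit
        n //= 2
    return "".join(reversed(bits))  # remainders were produced lsb-first
```

For example, dec_to_bin(13) walks through 13, 6, 3, 1 and collects the remainders 1, 0, 1, 1.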
1.4.3 Hexadecimal Numbers
If we want to represent large numbers, it is sometimes more convenient to represent them in
hexadecimal numbers, since each column has 16 times the weight of the previous one, so hexadecimal
numbers are base $16$. They use the digits 0 to 9 along with the letters A to F. If we want to convert
hexadecimal to binary, we simply convert the hexadecimal digits one by one, since each hexadecimal
digit corresponds to a 4-bit binary number. If we want to convert decimal to hexadecimal, we start
from the left and look for the highest power of 16 that still fits in our decimal number. E.g. if the 2nd
power of 16 fits in our number once, then there is a $1$ in the 3rd position (counted from the right) of
our hexadecimal number. We repeat this with the remainder.
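The highest-power-of-16 method can likewise be sketched in Python (illustrative, not from the book):

```python
def dec_to_hex(n):
    """Convert a non-negative decimal integer to a hexadecimal string by
    finding the highest power of 16 that fits, then repeating on the remainder."""
    if n == 0:
        return "0"
    digits = "0123456789ABCDEF"
    p = 0
    while 16 ** (p + 1) <= n:  # highest power of 16 that still fits
        p += 1
    out = []
    for k in range(p, -1, -1):
        d, n = divmod(n, 16 ** k)  # how often does 16^k fit? rest is remainder
        out.append(digits[d])
    return "".join(out)
```

For example, dec_to_hex(5837) finds $16^3 = 4096$ fits once, then works through the remainder to produce 16CD.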
1.4.4 Bytes and Nibbles
We introduce some definitions of important terms:
Byte: A group of eight bits
Nibble: Half a byte, i.e. a group of four bits
Word: Data chunks a microprocessor handles, e.g. a group of 64 bits in a 64-bit architecture
Least significant bit (LSB): The right-most bit in a group of bits
Most significant bit (MSB): The left-most in a group of bits
Kilobyte (1KB): $2^{10} \approx 1'000$ ("a thousand")
Megabyte (1MB): $2^{20} \approx 1'000'000$ ("a million")
Gigabyte (1GB): $2^{30} \approx 1'000'000'000$ ("a billion")
1.4.6 Signed Binary Numbers
Until now we only looked at unsigned binary numbers that represent positive quantities. If we want to
represent negative numbers too, we need to use signed binary numbers. The two most common schemes
to represent signed binary numbers are:
Sign/Magnitude Numbers
Sign/Magnitude numbers are intuitively appealing, because they simply pair a binary magnitude with an
additional sign bit in the msb position. An $N$-bit sign/magnitude number therefore uses the msb as
the sign ($0$ indicating positive) and the remaining $N-1$ bits as the magnitude, i.e. the absolute
value. For example $+5 = 0101$ and $-5 = 1101$. The main problem with this representation is that it
doesn't allow for ordinary binary addition.
Two's Complement Numbers
This representation overcomes the problem of sign/magnitude numbers since it allows for ordinary
binary addition. In the two's complement representation, we obtain a negative number by taking the
complement of the positive counterpart and adding one to the lsb. For example $+5 = 0101$ and $-5 =
1010 + 1 = 1011$. Again, the msb represents the sign of the number ($0$ for positive, $1$ for negative).
One exception is the most negative number . It is sometimes called the weird number , since it doesn't
have a positive counterpart in the two's complement representation. Keep in mind that when adding
more bits to a two's complement number, we must perform sign extension . For example $0101
\rightarrow 00000101$ and $1011 \rightarrow 11111011$.
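These rules can be checked with a small Python sketch (illustrative, not from the book):

```python
def twos_complement(value, n):
    """N-bit two's complement encoding of an integer as a bit string.
    Negating a number = invert all bits, then add 1 to the lsb."""
    assert -(1 << (n - 1)) <= value < (1 << (n - 1)), "value out of range"
    return format(value & ((1 << n) - 1), f"0{n}b")

def sign_extend(bits, n):
    """Widen a two's complement bit string by replicating the msb (sign bit)."""
    return bits[0] * (n - len(bits)) + bits
```

Note that twos_complement(-8, 4) yields 1000, the "weird number" without a positive 4-bit counterpart.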
1.5 Logic Gates
Logic gates are simple digital circuits that take one or more binary inputs and produce a binary output.
The defined logic gates are:
1.6 Beneath The Digital Abstraction
1.6.1 Supply Voltage
In reality, binary signals are represented with a voltage on a wire. The lowest voltage in a system is
called the ground (GND) and the highest voltage in the system is often denoted with $V_{DD}$. If we
focus on two logic gates, the first one is called the driver and the second one the receiver. The driver
produces either a LOW(0) in the range from $0$ to $V_{OL}$ or a HIGH(1) in the range from $V_{OH}$ to
$V_{DD}$. Similarly, if the receiver gets an input between $0$ and $V_{IL}$, it will consider the input as
LOW(0), and if it's an input between $V_{IH}$ and $V_{DD}$, it will consider it as HIGH(1). Every signal
outside of those ranges is said to be in the forbidden zone.
1.6.3 Noise Margins
The noise margin is the amount of noise that could be added to a worst-case output such that the
signal can still be correctly interpreted by the receiver. The margins are computed with:
$NM_L = V_{IL} - V_{OL}$
$NM_H = V_{OH} - V_{IH}$
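A worked example in Python; the four voltage levels (in volts) are assumed illustrative values, not taken from the book:

```python
# Assumed logic levels (volts) for an illustrative 5 V logic family
V_OL, V_IL, V_IH, V_OH = 0.4, 1.35, 3.15, 4.4

NM_L = V_IL - V_OL  # noise a worst-case LOW output can tolerate
NM_H = V_OH - V_IH  # noise a worst-case HIGH output can tolerate
```

With these levels, $NM_L = 0.95 \ V$ and $NM_H = 1.25 \ V$.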
1.7 CMOS Transistors
Transistors are electrically controlled switches that turn $ON$ or $OFF$ when a voltage is applied to a
control terminal. The most common transistors are metal-oxide-semiconductor field effect transistors
(MOSFETs or MOS transistors).
1.7.3 Capacitors
A capacitor consists of two conductors separated by an insulator. When a voltage $V$ is applied to
one of the conductors, the conductor accumulates electric charge $Q$. The capacitance $C$ of the
capacitor is the ratio of charge to voltage: $C = \frac{Q}{V}$.
1.7.4 nMOS and pMOS Transistors
There are two flavors of MOSFETs: nMOS and pMOS. The n-type transistors, called nMOS, have regions
of n-type dopants (impurity atoms) adjacent to the gate, called the source and the drain, and are built
on a p-type semiconductor substrate. The pMOS transistors are just the opposite, consisting of p-type
source and drain regions in an n-type substrate.
When a positive voltage is applied to the top plate of a capacitor, it establishes an electric field that
attracts positive charge on the top plate and negative charge to the bottom plate. If the voltage is
sufficiently large, so much negative charge is attracted to the underside of the gate that the region
inverts from p-type to effectively become n-type. The inverted region is called the channel . Now the
transistor has a continuous path from the n-type source through the n-type channel to the n-type drain,
so electrons can flow from source to drain . The transistor is $ON$. pMOS transistors work in just the
opposite fashion.
In summary, CMOS processes give us two types of electrically controlled switches . The voltage at the
gate ($g$) regulates the flow of current between source ($s$) and drain ($d$). nMOS transistors are
$OFF$ when the gate is $0$ and $ON$ when the gate is $1$. pMOS transistors are $ON$ when the gate is
$0$ and $OFF$ when the gate is $1$.
Chapter 2 - Combinational Logic Design
2.1 Introduction
In digital electronics, a circuit is a network that processes discrete-valued variables. It has:
one or more input terminals
one or more output terminals
a functional specification
a timing specification
Digital circuits are classified as combinational or sequential. The difference is that a combinational
circuit combines the current input values to compute the output, whereas in a sequential circuit the
output depends on the input sequence, i.e. it depends on both current and previous values of the input.
Put another way, a combinational circuit is memoryless and a sequential circuit has memory.
We can therefore classify a circuit as combinational if it satisfies the following requirements:
Every circuit element is itself combinational
Every node (wire) of the circuit is either designated as an input or connects to exactly one output
terminal
The circuit contains no cyclic paths
2.2 Boolean Equations
2.2.1 Terminology
We define the following terms for a better understanding of the underlying topic:
Complement: The inverse of a variable, denoted $\bar A$
Literal: A variable in a boolean equation
Product (implicant): The $AND$ of one or more literals
Minterm: The product of all inputs (or their complement) of a function
Sum: The $OR$ of one or more literals
Maxterm: The sum of all inputs (or their complements) of a function
Precedence: The order of operations (NOT $\rightarrow$ AND $\rightarrow$ OR)
Sum-of-products canonical form: The sum of all minterms for which the output of a function is
$TRUE$
Prime implicant: An implicant that cannot be combined with other implicants to form a new
implicant with fewer literals
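The sum-of-products canonical form can be generated mechanically from a truth table; a Python sketch (not from the book; complements are written with a trailing ' instead of an overbar):

```python
from itertools import product

def sop_canonical(f, names):
    """Sum-of-products canonical form: the OR of all minterms (products of
    every input or its complement) for which the function f is TRUE."""
    terms = []
    for values in product([0, 1], repeat=len(names)):
        if f(*values):
            # build the minterm: complemented literal where the input is 0
            literals = [n if v else n + "'" for n, v in zip(names, values)]
            terms.append("".join(literals))
    return " + ".join(terms)
```

For example, XOR of $A$ and $B$ yields the two minterms A'B and AB'.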
2.4 From Logic to Gates
A schematic is a diagram of a digital circuit showing the elements and the wires that connect them
together. When drawing schematics, we will generally obey the following guidelines:
Inputs are on the left or top of a schematic
Outputs are on the right or bottom of a schematic
Whenever possible, gates should flow from left to right
Wires always connect at a T-junction
A dot where wires cross indicates a connection between the wires
Wires crossing without a dot make no connection
2.5 Multilevel Combinational Logic
2.5.2 Bubble Pushing
Bubble pushing is a helpful way to redraw combinational circuits so that the bubbles cancel out and
the function can be more easily determined. The guidelines for bubble pushing are as follows:
Begin at the output of the circuit and work toward the inputs
Push any bubbles on the final output back toward the inputs, so that you can read an equation in
terms of the inputs
Working backward, draw each gate in a form so that the bubbles cancel. If the current gate has an
input bubble, draw the preceding gate with an output bubble and vice versa.
2.6 X's, Z's, Oh My
2.6.1 Illegal Value: X
The symbol $X$ indicates that a circuit node has an unknown or illegal value. This commonly
happens if the node is being driven to both $0$ and $1$ at the same time. This situation is also called
contention. Digital designers also use the symbol $X$ to indicate "don't care" values in truth tables.
2.6.2 Floating Value: Z
The symbol $Z$ denotes that a node is being driven neither by $HIGH$ nor by $LOW$. The node is said
to be floating, high impedance or high Z . In reality, a floating node might be $0$, might be $1$, or
might be at some voltage in between.
2.7 Karnaugh Maps
Karnaugh maps (K-maps) are a graphical method for simplifying Boolean equations. They work well for
problems with up to four variables and, more importantly, they give insight into manipulating Boolean
equations.
The top row of the K-map gives the four possible values for the $A$ and $B$ inputs. The left column
gives the two (or four) possible values for input $C$ (and $D$). Each square in the K-map corresponds to
a row in the truth table and contains the value of output $Y$. The order of $AB$ and $CD$ inputs is in
Gray code . It differs from ordinary binary order ($00, 01, 10, 11$) in that adjacent entries differ only in a
single variable ($00, 01, 11, 10$).
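The Gray-code ordering can be generated with the standard $g = b \oplus (b \gg 1)$ trick (a Python sketch, not from the book):

```python
def gray_code(n):
    """n-bit Gray code sequence: adjacent entries differ in exactly one bit.
    Each binary index b maps to the Gray code b ^ (b >> 1)."""
    return [format(b ^ (b >> 1), f"0{n}b") for b in range(2 ** n)]
```

gray_code(2) produces the K-map column order 00, 01, 11, 10 described above.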
2.7.2 Logic Minimization with K-Maps
Simply put, we can minimize a Boolean equation by circling all the rectangular blocks of $1$'s in the
map, using the fewest possible number of circles. Each circle should be as large as possible. Then we
can read off the implicants that we circled. The rules for finding a minimized equation from a K-map
are as follows:
Use the fewest circles necessary to cover all the $1$'s
All the squares in each circle must contain only $1$'s
Each circle must span a rectangular block whose width and height are powers of 2 (i.e. 1, 2, or 4)
Each circle should be as large as possible
A circle may wrap around the edges of the K-map
A $1$ in a K-map may be circled multiple times if doing so allows fewer circles to be used
Example of how we can use a K-map to minimize a Boolean equation:
2.8 Combinational Building Blocks
2.8.1 Multiplexers
Multiplexers are among the most commonly used combinational circuits. They choose an output from
among several possible inputs based on the value of a select signal. A multiplexer is sometimes
affectionately called a mux. The combinational schematic of a multiplexer looks as follows:
2.8.2 Decoders
A decoder has $N$ inputs and $2^N$ outputs. It asserts exactly one of its outputs depending on the
input combination. The outputs of a decoder are called one-hot , because exactly one is hot ($HIGH$)
at a given time. The combinational schematic of a 2:4 decoder looks as follows:
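The one-hot behavior of a decoder can also be sketched in Python (illustrative, not from the book):

```python
def decoder(n, a):
    """N:2^N decoder: assert exactly one of the 2^N outputs (one-hot),
    selected by the N-bit input value a."""
    assert 0 <= a < 2 ** n
    y = [0] * (2 ** n)
    y[a] = 1  # exactly one output is hot
    return y
```

For a 2:4 decoder with input 11 (= 3), only output $Y_3$ is asserted.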
2.9 Timing
An output takes time to change in response to an input change. We can draw a timing diagram to
portray the transient response of a combinational circuit. The transition from $LOW$ to $HIGH$ is
furthermore called the rising edge and similarly, the transition from $HIGH$ to $LOW$ is called the
falling edge . The following example shows a timing diagram of a buffer:
2.9.1 Propagation and Contamination Delay
Combinational logic is characterized by its propagation delay $t_{pd}$, the maximum time from when
an input changes until the outputs reach their final value, and the contamination delay $t_{cd}$, the
minimum time from when an input changes until any output starts to change its value.
Since $t_{pd}$ and $t_{cd}$ are determined by signal paths in a combinational circuit, we introduce
the following two definitions:
Critical path: The longest and therefore slowest path in a circuit
Short path: The shortest and therefore fastest path through a circuit
We can now calculate $t_{pd}$ by adding up the propagation delays of each element along the
critical path, and similarly calculate $t_{cd}$ by adding up the contamination delays of each element
along the short path.
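A sketch of this calculation; the per-gate delays (in picoseconds) and the path lengths are assumed example values, not from the book:

```python
t_pd_gate = 100  # propagation delay of a single gate (assumed)
t_cd_gate = 60   # contamination delay of a single gate (assumed)

critical_path_gates = 3  # gates on the longest (critical) path
short_path_gates = 1     # gates on the shortest path

t_pd = critical_path_gates * t_pd_gate  # sum along the critical path
t_cd = short_path_gates * t_cd_gate     # sum along the short path
```

With these numbers, $t_{pd} = 300 \ ps$ and $t_{cd} = 60 \ ps$.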
Chapter 3 - Sequential Logic Design
3.1 Introduction
We will now analyze sequential logic. As mentioned before, the outputs of sequential logic depend
on both current and prior input values. Hence, sequential logic has memory.
3.2 Latches and Flip-Flops
The fundamental building block of memory is a bistable element, an element with two stable states.
Those states are either $0$ or $1$. Bistable elements also have a third possible state with the output
being approximately halfway between $0$ and $1$. This is called a metastable state and will be
discussed further on.
3.2.1 SR Latch
One of the simplest sequential circuits is the SR Latch , which is composed of two cross-coupled $NOR$
gates. The state of the SR Latch can be controlled through the $S$ and $R$ inputs, which set and
reset the output.
The SR Latch has the simple but powerful functionality, that if both inputs R and S are $0$, $Q$ will
keep the previous value $Q_{prev}$. The circuit has memory .
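This memory behavior can be simulated by iterating the two cross-coupled NOR gates until they settle (a Python sketch, not from the book):

```python
def sr_latch(s, r, q_prev):
    """Cross-coupled NOR SR latch, iterated to a stable state; returns Q.
    With S = R = 0 the latch holds q_prev: the circuit has memory."""
    q, qn = q_prev, 1 - q_prev
    for _ in range(4):                      # settle the feedback loop
        q_new = 0 if (r or qn) else 1       # Q  = NOR(R, Qn)
        qn_new = 0 if (s or q_new) else 1   # Qn = NOR(S, Q)
        if (q_new, qn_new) == (q, qn):
            break
        q, qn = q_new, qn_new
    return q
```

Setting (S=1) drives Q to 1, resetting (R=1) drives Q to 0, and S=R=0 keeps the previous value.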
3.2.2 D Latch
The limitation of the SR Latch is that an input on either $R$ or $S$ determines not only what the state
should be but also when the state should change. Designing circuits becomes easier once we can
separate those two actions. The D Latch solves this limitation.
It has two inputs. The data input $D$ controls what the next state should be and the clock input $CLK$
controls when the state should change. Observe that when $CLK = 0$, $Q$ remembers its old value
$Q_{prev}$ and if $CLK = 1$, $Q = D$. One also says that if $CLK = 1$, the latch is transparent and
opaque otherwise.
3.2.3 D Flip-Flop
A D Flip-Flop can be built from two back-to-back D Latches controlled by complementary clocks. The
first Latch L1 is called the master , the second latch L2 is called the slave .
The functionality of the D Flip-Flop is essential for many digital designers, therefore its description
should be known by heart:
A D flip-flop copies $D$ to $Q$ on the rising edge of the clock, and remembers its state at all other
times.
3.2.4 Register
An $N$-bit register is a bank of $N$ flip-flops that share a common $CLK$ input, so that all bits of the
register are updated at the same time. Registers are the key building block of most sequential circuits.
3.2.5 Enabled Flip-Flop
An enabled flip-flop adds another input called $EN$ or $ENABLE$ to determine whether data is loaded
on the clock edge. When $EN$ is $TRUE$, the enabled flip-flop behaves like an ordinary D flip-flop. When
$EN$ is $FALSE$, the enabled flip-flop ignores $D$ and retains its state.
3.2.6 Resettable Flip-Flop
A resettable flip-flop adds another input called $RESET$. When $RESET$ is $FALSE$, the resettable flip-
flop behaves like an ordinary D flip-flop. If $RESET$ is $TRUE$, the resettable flip-flop ignores $D$ and
sets the output to $0$.
We distinguish between synchronously and asynchronously resettable flip-flops. The difference is that
synchronously resettable flip-flops only reset themselves on the rising edge of $CLK$, whereas
asynchronously resettable flip-flops reset themselves as soon as $RESET$ becomes $TRUE$.
3.3 Synchronous Logic Design
In general, sequential circuits include all circuits that are not combinational, that is, those whose
outputs cannot be determined simply by looking at the current inputs.
3.3.2 Synchronous Sequential Circuits
In general, a sequential circuit has a finite set of discrete states $\{S_0, S_1,...,S_{k-1}\}$. A
synchronous sequential circuit has a clock input, whose rising edges indicate a sequence of times at
which state transitions occur. The rules of synchronous sequential circuit composition are:
Every circuit element is either a register or a combinational circuit
At least one circuit element is a register
All registers receive the same clock signal
Every cyclic path contains at least one register
Similarly, if a circuit is not synchronous, it's said to be asynchronous .
3.4 Finite State Machines
Finite state machines are synchronous sequential circuits with $k$ registers and a finite number
($2^k$) of unique states the machine can be in. An FSM consists of two blocks of combinational logic,
next state logic and output logic, and a register that stores the state.
There are two general classes of finite state machines. One is the Moore machine (a); in those, the
outputs depend only on the current state of the machine. In Mealy machines (b), the outputs
depend on both the current state of the machine and the current inputs.
3.4.1 FSM Design Example
Designing an FSM involves several different steps. Those steps usually follow this procedure (for
a Moore machine):
1. Identify the inputs and outputs
2. Sketch a state transition diagram
3. Write a state transition table
4. Write an output table
5. Select state encodings (this affects the hardware design)
6. Write Boolean equations for the next state and output logic
7. Sketch the circuit schematic
The above mentioned terms are defined and explained as follows:
State transition diagram: It indicates all the possible states of the system and the transitions between
these states. Example of a state transition diagram:
State transition table: The state transition diagram is rewritten as a state transition table. It indicates,
for each state and each input, what the next state should be. In the state transition table one often
makes use of the don't care symbol $X$.
State encoding: To implement the FSM in hardware it is necessary to encode each state. The two
most common encoding schemes are binary encoding and one-hot encoding. In the binary encoding
scheme, each state is represented as a binary number, therefore we only need $\lceil \log_2 (K) \rceil$
bits to represent a system with $K$ states. In a one-hot encoded system, each state is represented by
one separate bit. Therefore, in a system with one-hot encoding we use many more flip-flops than in a
binary encoded system, but we get much simpler output logic and fewer gates in general.
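The steps above can be illustrated with a small Moore machine in Python; the machine itself (output 1 once the input has been 1 for two consecutive cycles) is a hypothetical example, not from the book:

```python
# Next state logic as a lookup table: (state, input) -> next state
NEXT_STATE = {
    ("S0", 0): "S0", ("S0", 1): "S1",
    ("S1", 0): "S0", ("S1", 1): "S2",
    ("S2", 0): "S0", ("S2", 1): "S2",
}
# Moore machine: the output depends only on the current state
OUTPUT = {"S0": 0, "S1": 0, "S2": 1}

def run_fsm(inputs, state="S0"):
    """Apply one input per clock cycle and record the output after each edge."""
    outputs = []
    for a in inputs:
        state = NEXT_STATE[(state, a)]  # state register updates on the edge
        outputs.append(OUTPUT[state])
    return outputs
```

A Mealy version would instead look up the output from both the state and the current input.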
3.5 Timing of Sequential Logic
Recall that a flip-flop copies the input $D$ to the output $Q$ on the rising edge of the clock. This
process is called sampling $D$ on the clock edge. We define the aperture of a sequential element by a
setup time and a hold time, before and after the clock edge, respectively.
3.5.1 The Dynamic Discipline
When a clock rises, the output may start to change after the clock-to-Q contamination delay
$t_{ccq}$ and must definitely settle to its final value within the clock-to-Q propagation delay
$t_{pcq}$. For the circuit to sample its input correctly, the input must have stabilized at least some
setup time $t_{setup}$ before the rising edge of the clock and must remain stable for at least some
hold time $t_{hold}$ after the rising edge of the clock. The sum of the setup and hold time is called the
aperture time of the circuit, and during this time the input must be stable.
The dynamic discipline states that inputs of a synchronous sequential circuit must be stable during the
setup and hold aperture time around the clock edge.
3.5.2 System Timing
We introduce the following two terms that are of importance:
Clock period/Cycle time: $T_c$, the time between rising edges of a repetitive clock signal
Clock frequency: $f_c = \frac{1}{T_c}$
All those principles are shown in the following diagram:
3.5.4 Metastability
As noted earlier, it is not always possible to guarantee that the input to a sequential circuit is stable
during the aperture time. When a flip-flop samples an input that is changing during its aperture, the
output $Q$ may momentarily take on a voltage between $0$ and $V_{DD}$ that is in the forbidden
zone. This is called a metastable state.
3.6 Parallelism
The speed of a system is characterized by the latency and throughput of information moving through it.
We define a token to be a group of inputs that are processed to produce a group of outputs. The
latency of a system is the time required for one token to pass through the system from start to end. The
throughput is the number of tokens that can be produced per unit time.
The throughput can be improved by processing several tokens at the same time. This is called
parallelism, and it comes in two forms: spatial and temporal. With spatial parallelism, multiple copies
of the hardware are provided so that multiple tasks can be done at the same time. With temporal
parallelism, a task is broken into stages, like an assembly line. These tasks can be spread across stages,
and if spread correctly, a different task will be in each stage at any given time, so multiple tasks can
overlap. Temporal parallelism is commonly referred to as pipelining.
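A small numeric sketch of the two forms of parallelism; the latency and stage counts are assumed example values:

```python
L = 10.0  # latency: time for one token to pass through the system (assumed)

serial_throughput = 1 / L   # one token at a time
spatial_throughput = 2 / L  # spatial parallelism: two copies of the hardware

stages = 5                  # temporal parallelism: a 5-stage pipeline
temporal_throughput = 1 / (L / stages)  # one token completes per stage time
```

Note that pipelining improves throughput but not the latency of an individual token.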
Chapter 4 - Hardware Description Language
4.1 Introduction
!!!
The following summary of chapter four is based on the HDL SystemVerilog . Verilog is very similar to
SystemVerilog, though it doesn't allow all advanced syntax SystemVerilog does. Syntax for VHDL is
excluded from the summary.
!!!
4.1.1 Modules
A block of hardware with inputs and outputs is called a module . The two general styles for describing
module functionality are behavioral and structural . Behavioral models describe what a module does,
and structural models describe how a module is built from simpler pieces.
4.1.3 Simulation and Synthesis
The two major purposes of HDLs are logic simulation and synthesis. During simulation, inputs are
applied to a module, and the outputs are checked to verify that the module operates correctly. During
synthesis, the textual description of a module is transformed into logic gates.
4.2 Combinational Logic
4.2.1 Bitwise Operators
Bitwise operators act on single-bit signals or on multi-bit busses. A multi-bit bus is a wire with a width
$N>1$. a[3:0] represents a 4-bit wide bus. Here the bits are listed from most significant to least
significant, which is called little-endian order (i.e. a[3] is the msb, a[0] is the lsb). We can also
declare a multi-bit bus with a[0:3]; in this case the bits are listed in big-endian order, i.e. a[0]
is the msb and a[3] is the lsb.
Logic gates can either be declared with modules or with simple bitwise operators:

assign y0 = ~a;       // NOT
assign y1 = a & b;    // AND
assign y2 = a | b;    // OR
assign y3 = a ^ b;    // XOR
assign y4 = ~(a & b); // NAND
assign y5 = ~(a | b); // NOR

4.2.4 Conditional Assignment
Conditional assignments select the output from among alternatives based on an input called the
condition. The conditional operator ?:, also called the ternary operator, chooses, based on a first
expression, between a second (if the first expression is $TRUE$) and a third (if the first expression is
$FALSE$) expression. One can easily describe a 2:1 MUX with a conditional assignment:

assign y = s ? d1 : d0; // 2:1 MUX with s as the select signal
A 4:1 MUX can be described with two nested conditional assignments:

assign y = s1 ? (s0 ? d3 : d2) : (s0 ? d1 : d0); // 4:1 MUX with s0 and s1 as select signals

4.2.5 Internal Variables
Signals that are neither inputs nor outputs are called internal variables. In SystemVerilog, internal
signals are usually declared with logic. This differs from Verilog, where internal variables are usually
declared as wire or reg, but this will be explained later on.
4.2.6 Precedence
If we do not use parentheses, the default operation order (precedence) is defined by the language. The
precedence in SystemVerilog is defined as follows:
4.2.7 Numbers
In SystemVerilog, numbers can be specified in binary, octal, decimal, or hexadecimal. The format for
declaring constants is N'Bvalue, where N is the size in bits, B is the letter indicating the base, and
value gives the value. SystemVerilog supports 'b for binary, 'o for octal, 'd for decimal, and 'h for
hexadecimal. If the base is omitted, it defaults to decimal. In SystemVerilog we can further use the
idioms '1 and '0 for filling a bus with $1$s and $0$s, respectively.
4.2.9 Bit Swizzling
Often it is necessary to operate on a subset of a bus or to concatenate (join together) signals from
busses. These operations are collectively known as bit swizzling. This could look like this:

assign y = {c[2:1], {3{d[0]}}, c[0], 3'b101};

The operator { } is used to concatenate busses. {3{d[0]}} is the syntax for concatenating three
copies of d[0].
4.4 Sequential Modeling
4.4.1 Registers
In SystemVerilog always statements, signals keep their old value until an event in the sensitivity list
takes place that explicitly causes them to change. In general, a SystemVerilog always statement is
written in the form:

always @(sensitivity_list)
  statement;

The statement is executed only when the event(s) in the sensitivity list occur. SystemVerilog
furthermore introduces always_ff, always_latch, and always_comb to reduce the risk of common
errors. In Verilog, only the standard always @(...) block is valid syntax.
4.4.2 Resettable Registers
Generally, it is good practice to use resettable registers so that on powerup you can put the system in a
known state. The reset can occur either synchronously or asynchronously. The respective syntax looks as
follows:

// asynchronous reset
always_ff @(posedge clk, posedge reset)
  if (reset) q <= 4'b0;
  else       q <= d;

// synchronous reset
always_ff @(posedge clk)
  if (reset) q <= 4'b0;
  else       q <= d;

Remark: always_ff behaves the same way as always but is used exclusively to imply flip-flops.
4.4.3 Enabled Registers
Enabled registers respond to the clock only when the enable is asserted.

// asynchronously resettable enabled register
always_ff @(posedge clk, posedge reset)
  if (reset)       q <= 4'b0;
  else if (enable) q <= d;

4.4.5 Latches
Recall that a D latch is transparent when the clock is $HIGH$, allowing data to flow from input to
output. The latch becomes opaque when the clock is $LOW$, retaining its old state. An example for a D
latch in SystemVerilog is:

always_latch
  if (clk) q <= d;

Remark: always_latch is equivalent to always @(clk, d).
4.5 More Combinational Logic
SystemVerilog always statements are used to describe sequential circuits, because they remember the
old state when no new state is prescribed. However, always statements can also be used to describe
combinational logic behaviorally if the sensitivity list is written to respond to changes in all of the
inputs and the body prescribes the output value for every possible input combination. The following
code snippet is an example of a combinational circuit using an always statement:

// inverter using always_comb
always_comb
  y = ~a;

Remark: always_comb is equivalent to always @(*). It re-evaluates the statements in the always
block every time a signal on the right-hand side of any assignment changes.
HDLs support blocking (=) and non-blocking (<=) assignments in an always statement. A group of
blocking assignments is evaluated in the order in which they appear in the code. A group of non-
blocking assignments is evaluated concurrently: all of the statements are evaluated before any of the
signals on the left-hand sides are updated.
4.5.1 Case Statements
A case statement performs different actions depending on the value of its inputs. A case statement
implies combinational logic if all possible input combinations are defined; otherwise it implies
sequential logic, because the output will keep its old value in the undefined cases. A case statement
is of the following form:

// case statement
always_comb
  case (data)
    c1: variable = value1;
    c2: variable = value2;
    c3: variable = value3;
    default: variable = defaultValue;
  endcase

The default clause is a convenient way to describe the output for all cases not explicitly listed,
guaranteeing combinational logic.
With case statements it's also possible to use don't care variables. Since casez matches its arms in
order, the most specific patterns must come first. A possible implementation could look like this:

always_comb
  casez (data)
    3'b111: variable = value3;
    3'b11?: variable = value2;
    3'b1??: variable = value1;
  endcase

Remark: The casez statement acts like a case statement except that it also recognizes ? as don't
care.
4.5.4 Blocking and Nonblocking Assignments
We follow the upcoming guidelines to decide when to use blocking (=) and when non-blocking (<=)
assignments:
1. Use always_ff @(posedge clk) and non-blocking assignments to model synchronous sequential
logic.
2. Use continuous (blocking) assignments to model simple combinational logic.
3. Use always_comb and blocking assignments to model more complicated combinational logic
where the always statement is helpful.
4. Do not make assignments to the same signal in more than one always statement or continuous
assignment statement.
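The difference between the non-blocking assignments of guideline 1 and ordinary blocking assignments can be mimicked in Python (a conceptual sketch, not SystemVerilog): non-blocking evaluates every right-hand side before updating any left-hand side.

```python
def nonblocking_swap(a, b):
    """Mimic non-blocking (<=): all right-hand sides are evaluated
    before any left-hand side is updated, so the swap succeeds."""
    a_next, b_next = b, a
    return a_next, b_next

def blocking_swap(a, b):
    """Mimic blocking (=): a updates immediately, so the second
    statement reads the NEW value of a and the swap fails."""
    a = b
    b = a
    return a, b
```

This is why synchronous sequential logic is modeled with non-blocking assignments: every register samples its input before any register updates.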
4.7 Data Types
4.7.1 SystemVerilog
Prior to SystemVerilog, Verilog primarily used two types: reg and wire. Despite its name, a reg
signal might or might not be associated with a register. In Verilog, if a signal appears on the left-
hand side of <= or = in an always block, it must be declared as reg. Otherwise, it should be
declared as wire. Input and output ports default to the wire type unless their type is explicitly
defined as reg.
SystemVerilog introduces the logic type. logic is a synonym for reg, but avoids misleading users
about whether or not the signal is actually a flip-flop. Moreover, SystemVerilog also relaxes the rules
on assign statements and allows the use of logic outside of always blocks.
In SystemVerilog, signals are considered unsigned by default. Adding the signed modifier (e.g. logic
signed [3:0] a) causes the signal to be treated as signed.
4.8 Parameterized Modules
HDLs permit variable bit widths using parameterized modules . One such use could be a parameterized
2:1 MUX. Its implementation looks as follows:
4.9 Testbenches
A testbench is an HDL module that is used to test another module, called the device under test (DUT) or
unit under test (UUT). The testbench contains statements to apply inputs to the DUT and, ideally, to
check that the correct outputs are produced. The input and desired output patterns are called
testvectors.
// parameterized 2:1 mux
module muxTwo2One
  #(parameter width = 8)
  (input [width-1:0] d0, d1, input s, output [width-1:0] y);
  assign y = s ? d1 : d0;
endmodule

// instantiation with 12-bit wide busses
muxTwo2One #(12) mux1(d0, d1, s, y);
Testbenches are simulated like other HDL modules; however, they are not synthesizable. The
initial statement executes the statements in the body at the start of simulation. The assert
statement checks if a specified condition is true. If not, the else statement is executed. Testbenches
use the === and !== operators for comparisons of equality and inequality, because these operators
work correctly with operands like x and z .
$readmemb reads a file of binary numbers into the testvectors array. Similarly, $readmemh reads a
file of hexadecimal numbers. $display is a system task to print in the simulator window. For example,
$display("%b %b", y, yexpected) prints the two values y and yexpected , in binary. %d is used to print
values in decimal, and %h to print values in hexadecimal. $finish terminates a simulation.
Chapter 5 - Digital Building Blocks
5.1 Introduction
This chapter introduces more elaborate combinational and sequential building blocks used in digital
systems. These blocks include arithmetic circuits, counters, shifters, memory arrays, and logic arrays.
Each building block has a well-defined interface and can be treated as a black box when the
underlying implementation is unimportant.
5.2 Arithmetic Circuits
Arithmetic circuits are the central building blocks of computers. They perform operations such as
addition, subtraction, comparison, shifts, multiplication, and division.
5.2.1 Addition
Following are the most important building blocks for performing addition :
1-bit Half Adder
It has two inputs $A$ and $B$ and two outputs $S$ and $C_{out}$. It adds the two 1-bit inputs and
produces a sum $S$ and a carry-out $C_{out}$ that carries the overflow. The Boolean equations are:
$S = A \ xor \ B$
$C_{out} = A \ and \ B$
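These two equations can be checked exhaustively with a small behavioral model; the following Python sketch (the function name is mine, not from the book) mirrors the Boolean equations:

```python
def half_adder(a, b):
    """1-bit half adder: S = A xor B, Cout = A and B."""
    s = a ^ b        # sum bit
    cout = a & b     # carry-out bit
    return s, cout

# Exhaustive check: the pair (Cout, S) equals the 2-bit sum A + B
for a in (0, 1):
    for b in (0, 1):
        s, cout = half_adder(a, b)
        assert (cout << 1) | s == a + b
```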
1-bit Full Adder
It works exactly the same as the half adder does, but has an additional input $C_{in}$. This allows
connecting multiple 1-bit adders to form an $N$-bit adder. The Boolean equations are:
$S = A \ xor \ B \ xor \ C_{in}$
$C_{out} = (A \ and \ B) \ or \ (A \ and \ C_{in}) \ or \ (B \ and \ C_{in})$
Ripple-Carry Adder
The simplest way to build an $N$-bit carry-propagate adder (CPA) is to chain together $N$ full adders.
The $C_{out}$ of one stage acts as the $C_{in}$ of the next stage. The delay of the adder, $t_{ripple}$,
grows directly with the number of bits, where $t_{FA}$ is the delay of the full adder:
$t_{ripple} = N \cdot t_{FA}$
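The chaining of full adders can be modeled directly; below is a Python sketch (the names are mine) in which the carry-out of each stage feeds the carry-in of the next, just as in the ripple-carry structure:

```python
def full_adder(a, b, cin):
    """1-bit full adder: S = A xor B xor Cin, Cout = majority(A, B, Cin)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a, b, n, cin=0):
    """Chain n full adders; the Cout of stage i acts as the Cin of stage i+1."""
    s, carry = 0, cin
    for i in range(n):
        bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        s |= bit << i
    return s, carry  # n-bit sum and the final carry-out

assert ripple_carry_add(9, 5, 4) == (14, 0)
assert ripple_carry_add(12, 7, 4) == (3, 1)  # 19 wraps to 3 with carry-out 1
```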
Carry-Lookahead Adder
If we divide the $N$ bits to add into blocks of $k$-bit adders, we can shorten the delay of the carry-
propagate adder. An $N$-bit adder divided into $k$-bit blocks has a delay:
$t_{CLA} = t_{pg} + t_{pg-block} + \Big (\frac{N}{k} - 1 \Big) \cdot t_{AND-OR} + k \cdot t_{FA}$
5.2.2 Subtraction
Subtraction is just addition of a negated number. Therefore, subtraction is very easy to
implement. To compute $Y = A - B$, first create the two's complement of $B$: Invert the bits of $B$ to
obtain $\bar B$ and add $1$ to get $-B = \bar B + 1$. Add this quantity to $A$ to get $Y = A + \bar B + 1 =
A - B$. This operation can be performed with a single CPA and $C_{in} = 1$.
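As a sanity check, this procedure can be expressed in a few lines of Python (a sketch with names of my own, not the book's code):

```python
def cpa_subtract(a, b, n):
    """Compute Y = A - B as A + ~B + 1 on an n-bit carry-propagate adder."""
    mask = (1 << n) - 1
    b_inv = ~b & mask              # invert the bits of B
    return (a + b_inv + 1) & mask  # single addition with Cin = 1

assert cpa_subtract(9, 5, 8) == 4
assert cpa_subtract(3, 5, 8) == 0b11111110  # -2 in 8-bit two's complement
```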
5.2.3 Comparators
A comparator determines whether two binary numbers are equal or whether one is greater or less than
the other. There are two common types of comparators: An equality comparator produces a single
output indicating whether $A$ is equal to $B$ ($A == B$). A magnitude comparator produces one
or more outputs indicating the relative values of $A$ and $B$. Magnitude comparison of signed
numbers is usually done by computing $A - B$ and looking at the sign of the result. If the result is
negative (i.e. the sign bit is $1$), then $A$ is less than $B$; otherwise $A$ is greater than or equal to $B$.
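The sign-bit rule can be modeled as follows (a Python sketch with my own names; note that, like the simple hardware described above, it ignores overflow of the subtraction):

```python
def signed_less_than(a, b, n):
    """Magnitude comparison: compute A - B and inspect the sign bit."""
    mask = (1 << n) - 1
    diff = (a - b) & mask              # n-bit two's-complement difference
    return (diff >> (n - 1)) & 1 == 1  # sign bit 1 means A < B

assert signed_less_than(3, 7, 8) is True
assert signed_less_than(7, 3, 8) is False
```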
5.2.4 ALU
An Arithmetic/Logical Unit (ALU) combines a variety of mathematical and logic operations into a single
unit. For example, a typical ALU might perform addition, subtraction, AND, and OR operations. The ALU
forms the heart of most computer systems. The ALU takes multiple $N$-bit inputs on which it performs
the operations. It has an additional control input that defines which operation the ALU should
perform. Some ALUs also output, next to the result, additional flags that indicate edge cases (e.g. a
negative result, overflow, etc.). One implementation of an ALU could look like this:
5.2.5 Shifters and Rotators
Shifters and rotators move bits and multiply or divide by powers of 2. There are several kinds of
commonly used shifters:
Logical shifter: shifts the number to the left (LSL) or to the right (LSR) and fills empty spots with
$0$s.
Example: $11001 \ LSR \ 2 = 00110$
Arithmetic shifter: is the same as a logical shifter, but on a right shift fills the most significant
bits with copies of the old most significant bit.
Example: $11001 \ ASR \ 2 = 11110$
Rotator: rotates the number in a circle such that empty spots are filled with bits shifted off the
other end.
Example: $11001 \ ROR \ 2 = 01110$
A left shift is a special case of multiplication : a left shift by $N$ bits multiplies the number by $2^N$.
An arithmetic right shift is a special case of division: an arithmetic right shift by $N$ bits divides the
number by $2^N$.
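The three shifter types and the worked examples above can be modeled on 5-bit values (a Python sketch; the function names are mine):

```python
def lsl(x, k, n):
    return (x << k) & ((1 << n) - 1)   # left shift, drop bits beyond n

def lsr(x, k, n):
    return x >> k                      # fill with 0s from the left

def asr(x, k, n):
    msb = (x >> (n - 1)) & 1           # copy of the old most significant bit
    fill = (((1 << k) - 1) << (n - k)) if msb else 0
    return (x >> k) | fill

def ror(x, k, n):
    return ((x >> k) | (x << (n - k))) & ((1 << n) - 1)

x = 0b11001  # the 5-bit example value from the text
assert lsr(x, 2, 5) == 0b00110
assert asr(x, 2, 5) == 0b11110
assert ror(x, 2, 5) == 0b01110
assert lsl(0b00011, 2, 5) == 0b01100  # left shift by 2 multiplies by 4
```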
5.2.6 Multiplication
Multiplying binary numbers is very similar to multiplying decimal numbers. In both cases, partial
products are formed by multiplying a single digit of the multiplier with the entire multiplicand. The
shifted partial products are summed to form the result. In general, an $N \times N$ multiplier
multiplies two $N$-bit numbers and produces a $2N$-bit result. An implementation of a $4
\times 4$ multiplier could look as follows:
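The hardware figure is not reproduced here, but the shift-and-add scheme it implements can be sketched behaviorally in Python (names are mine):

```python
def shift_add_multiply(a, b, n):
    """N x N multiplier: sum the shifted partial products, 2N-bit result."""
    result = 0
    for i in range(n):
        if (b >> i) & 1:          # partial product: a times bit i of b ...
            result += a << i      # ... shifted left by the bit position
    return result & ((1 << (2 * n)) - 1)

assert shift_add_multiply(13, 11, 4) == 143  # 4x4 multiply, 8-bit result
```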
5.2.7 Division
Binary division can be performed by using the following algorithm for $N$-bit numbers in the range $[0,
2^{N-1}]$:
The result satisfies $\frac{A}{B} = Q + \frac{R}{B}$, i.e. the divider computes $\frac{A}{B}$ and produces
a quotient $Q$ and a remainder $R$. Specifically, each row calculates $D = R - B$. The signal $N$
indicates whether or not $D$ is negative. So a row's multiplexer select lines receive the most
significant bit of $D$, which is $1$ when the difference is negative. The quotient bit $Q_i$ is $0$ when
$D$ is negative and $1$ otherwise.
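The row-by-row behavior described above corresponds to restoring division; the following Python sketch follows the algorithm (the variable names match the text, the function name is mine):

```python
def restoring_divide(a, b, n):
    """Bit-serial division: quotient Q and remainder R with A/B = Q + R/B."""
    r, q = 0, 0
    for i in range(n - 1, -1, -1):
        r = (r << 1) | ((a >> i) & 1)  # R = {R' << 1, A_i}
        d = r - b                      # each row calculates D = R - B
        if d < 0:                      # D negative: Q_i = 0, keep old R
            q_bit = 0
        else:                          # otherwise: Q_i = 1, take difference
            q_bit = 1
            r = d
        q |= q_bit << i
    return q, r

assert restoring_divide(17, 5, 8) == (3, 2)  # 17 = 3*5 + 2
```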
5.3 Number Systems
Computers operate on both integers and fractions. This section introduces fixed- and floating-point
number systems that can represent rational numbers.
5.3.1 Fixed-Point Number Systems
The fixed-point notation has an implied binary point between integer and fraction bits, analogous to
the decimal point between the integer and fraction digits of an ordinary decimal number. The integer
bits are called the high word and the fraction bits are called the low word . The following shows the
representation of $-2.375$ in the fixed-point number system:
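The encoding with four integer and four fraction bits can also be computed in Python (a sketch; the helper name is mine):

```python
def to_fixed(x, int_bits, frac_bits):
    """Two's-complement fixed point with an implied binary point."""
    n = int_bits + frac_bits
    raw = round(x * (1 << frac_bits))  # scale by 2^frac_bits
    return raw & ((1 << n) - 1)        # wrap into n-bit two's complement

# -2.375 with 4 integer and 4 fraction bits (implied point: 1101.1010)
assert format(to_fixed(2.375, 4, 4), '08b') == '00100110'
assert format(to_fixed(-2.375, 4, 4), '08b') == '11011010'
```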
5.4 Sequential Building Blocks
5.4.1 Counters
An $N$-bit binary counter is a sequential arithmetic circuit with clock and reset inputs and an $N$-
bit output $Q$. Reset initializes the output to $0$. The counter then advances through all $2^N$ possible
outputs in binary order, incrementing on the rising edge of the clock.
5.4.2 Shift Registers
# Division algorithm (from Section 5.2.7)
R' = 0
for (i = N-1 to 0)
    R = {R' << 1, A_i}
    D = R - B
    if (D < 0)
        Q_i = 0
        R' = R
    else
        Q_i = 1
        R' = D
R = R'
0010.0110 // absolute value
1010.0110 // sign and magnitude
1101.1010 // two's complement
A shift register has a clock, a serial input $S_{in}$, a serial output $S_{out}$, and $N$ parallel outputs
$Q_{N-1:0}$. On each rising edge of the clock, a new bit is shifted in from $S_{in}$ and all the
subsequent contents are shifted forward. The last bit in the shift register is available at $S_{out}$.
Shift registers can be viewed as serial-to-parallel converters . After $N$ cycles, the past $N$ inputs are
available in parallel at $Q$.
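The serial-to-parallel behavior can be modeled with a small Python sketch (names are mine):

```python
from collections import deque

def shift_register(n, serial_bits):
    """After n clock edges, the past n serial inputs appear in parallel at Q."""
    q = deque([0] * n, maxlen=n)  # N parallel outputs, initially 0
    for s_in in serial_bits:
        q.appendleft(s_in)        # new bit shifts in; oldest falls out at S_out
    return list(q)                # parallel outputs Q[N-1:0]

assert shift_register(4, [1, 0, 1, 1]) == [1, 1, 0, 1]
```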
5.5 Memory Arrays
Digital systems also require memories to store the data used and generated by such circuits. This
section describes memory arrays that can efficiently store large amounts of data.
5.5.1 Overview
Memory is organized as a two-dimensional array of memory cells . The memory reads or writes the
contents of one of the rows of the array. This row is specified by an address . The value read or written is
called data . An array with $N$-bit addresses and $M$-bit data has $2^N$ rows and $M$ columns. Each
row of data is called a word . The depth of an array is the number of rows, the width is the number of
columns.
Bit Cells
Memory arrays are built as an array of bit cells , each of which stores $1$ bit of data. Each bit cell is
connected to a wordline and a bitline . When the wordline is $HIGH$, the stored bit transfers to or from
the bitline.
Memory Types
Memories are classified based on how they store bits in the bit cell. The broadest classification is
random access memory (RAM) versus read only memory (ROM) . RAM is volatile , meaning that it loses its
data when the power is turned off. ROM is nonvolatile , meaning that it retains its data indefinitely,
even without power. The two major types of RAMs are dynamic RAM (DRAM) and static RAM (SRAM) .
Dynamic RAM stores data as charge on a capacitor, whereas static RAM stores data using a pair of
cross-coupled inverters.
5.5.2 Dynamic Random Access Memory (DRAM)
Dynamic RAM (DRAM) stores a bit as the presence or absence of charge on a capacitor. The nMOS
transistor behaves as a switch that either connects or disconnects the capacitor from the bitline. When
the wordline is asserted, the nMOS transistor turns $ON$, and the stored bit value transfers to or from
the bitline.
Reading destroys the bit value stored on the capacitor, so the data word must be restored ( rewritten )
after each read. Even when DRAM is not read, the contents must be refreshed ( read and rewritten )
every few milliseconds, because the charge on the capacitor gradually leaks away.
5.5.3 Static Random Access Memory (SRAM)
Static RAM (SRAM) is static because stored bits do not need to be refreshed. The data bit is stored on
cross-coupled inverters. When the wordline is asserted, both nMOS transistors turn $ON$, and data
values are transferred to or from the bitlines. Unlike DRAM, if noise degrades the value of the stored bit,
the cross-coupled inverters restore the value themselves.
5.6 Logic Arrays
Like memory, gates can be organized into regular arrays. If the connections are made programmable,
these logic arrays can be configured to perform any function without the user having to connect wires
in specific ways. This section introduces two types of logic arrays: programmable logic arrays (PLAs) ,
and field programmable gate arrays (FPGAs) .
5.6.2 Programmable Logic Array
Programmable logic arrays (PLAs) implement two-level combinational logic in sum-of-products (SOP)
form. PLAs are built from an AND array followed by an OR array.
5.6.2 Field Programmable Gate Array
A field programmable gate array (FPGA) is an array of reconfigurable gates. Using software
programming tools, a user can implement designs on the FPGA using either an HDL or a schematic. FPGAs
are built as an array of configurable logic elements (LEs). Each LE can be configured to perform
combinational or sequential functions. The LEs are surrounded by input/output elements (IOEs) for
interfacing with the outside world.
Chapter 6 - Architecture
6.1 Introduction
In this chapter, we jump up a few levels of abstraction to define the architecture of a computer. The
architecture is the programmer's view of a computer. It is defined by the instruction set ( language )
and operand locations ( registers and memory ). Since computer hardware only understands $1$'s and
$0$'s, instructions are encoded as binary numbers in a format called machine language . The specific
arrangement of registers, memories, ALUs, and other building blocks is called the microarchitecture .
Throughout the following sub-chapters, we motivate the design of the MIPS architecture using four
principles articulated by Patterson and Hennessy :
1. Simplicity favors regularity
2. Make the common case fast
3. Smaller is faster
4. Good design demands good compromises
6.2 Assembly Language
6.2.1 Instructions
The most common operation computers perform is addition. The following example shows how addition
and subtraction is performed in MIPS:
The first part of the assembly instruction, add , is called the mnemonic and indicates what operation
to perform. The operation is performed on b and c , the source operands , and the result is written to
a , the destination operand .
In assembly language, only single-lined comments are used. They begin with # and continue until the
end of the line.
The MIPS instruction set makes the common case fast by including only simple, commonly used
instructions. MIPS is a reduced instruction set computer (RISC) architecture. Architectures with many
complex instructions, such as Intel's IA-32 architecture, are complex instruction set computer (CISC)
architectures.
6.2.2 Operands: Registers, Memory, and Constants
An instruction operates on operands . Operands can be stored in registers or memory, or they may be
constants stored in the instruction itself.
Registers
Instructions need to access operands quickly so that they can run fast. Therefore, most architectures
specify a small number of registers that hold commonly used operands. The MIPS architecture uses 32
registers, called the register set or register file .
add a, b, c # a = b + c
sub a, b, c # a = b - c
MIPS register names are preceded by the $ sign. MIPS generally stores variables in 18 of the 32
registers: $\$ s0 - \$ s7$, and $\$ t0 - \$ t9$. Register names beginning with $\$ s$ are called saved
registers. Register names beginning with $\$ t$ are called temporary registers and are used for storing
temporary variables. Addition and subtraction using registers look as follows:
The Register Set
The MIPS architecture defines 32 registers. Each register has a name and a number ranging from 0 to
31. The register file is built up as follows:
Memory
MIPS uses byte-addressable memory . That is, each byte in memory has a unique address. However, for
explanation purposes only, we first introduce word-addressable memory, and afterwards introduce the
MIPS byte-addressable memory. In word-addressable memory , each 32-bit data word has a unique 32-
bit address. Both the 32-bit data word and the 32-bit address are written in hexadecimal.
MIPS uses the load word instruction, $lw$, to read a data word from memory into a register. The $lw$
instruction specifies the effective address in memory as the sum of a base address and an offset .
Similarly, MIPS uses the store word instruction, $sw$, to write a data word from a register into memory.
The MIPS memory model, however, is byte-addressable, not word-addressable . The word-address is four
times the word number. The offset can be written either in decimal or hexadecimal. The MIPS
architecture also provides the $lb$ and $sb$ instructions that load and store single bytes in memory
rather than words. The following code example describes how to access byte-addressable memory:
# calculating a = b + c - d
sub $t0, $s2, $s3 # t = c - d
add $s0, $s1, $t0 # a = b + t
# The following code assumes (unlike MIPS) word-addressable memory
lw $s3, 1($0) # read memory word 1 into $s3
sw $s7, 5($0) # write $s7 to memory word 5
In the MIPS architecture, word addresses for $lw$ and $sw$ must be word aligned . That is, the address
must be divisible by 4. Thus, the instruction $lw \ \$ s0, \ 7(\$ 0)$ is an illegal instruction.
Constants / Immediates
Load word and store word, $lw$ and $sw$, also illustrate the use of constants in MIPS instructions.
These constants are called immediates , because their values do not require a register or memory
access. Add immediate , $addi$, adds the immediate specified in the instruction to a value in a register.
Subtraction is equivalent to adding a negative number, so, in the interest of simplicity, there is no
$subi$ instruction in the MIPS architecture. Following are examples of how we are using immediate
operands:
6.3 Machine Language
Digital circuits understand only $1$'s and $0$'s. Therefore, a program written in assembly language is
translated from mnemonics to a representation using only $1$'s and $0$'s, called machine language .
MIPS makes the compromise of defining three instruction formats: R-type, I-type, and J-type. R-type
instructions operate on three registers. I-type instructions operate on two registers and a 16-bit
immediate. J-type (jump) instructions operate on one 26-bit immediate.
6.3.1 R-Type Instructions
The name R-type is short for register-type . R-type instructions use three registers as operands: two as
sources, and one as a destination. The 32-bit instruction has six fields: op, rs, rt, rd, shamt and funct .
The operation to perform is encoded in the two fields op , also called the opcode , and funct , the
function. All R-type operations have an opcode of $0$. The operation is solely determined by the
funct code. The operands are encoded in the three fields rs, rt, and rd . The first two registers, rs and
rt , are the source registers; rd is the destination register. The fifth field, shamt , is used only in shift
operations, where it indicates the amount to shift.
6.3.2 I-Type Instructions
The name I-type is short for immediate-type . I-type instructions use two register operands and one
immediate operand. The 32-bit instruction has four fields: op, rs, rt, and imm . The first three fields are
like those of R-type instructions. The imm field holds the 16-bit immediate. The operation is solely
determined by the opcode .
lw $s0, 8($0) # read data word 2 into $s0
lw $s1, 0xC($0) # read data word 3 into $s1
sw $s3, 4($0) # write $s3 to data word 1
sw $s4, 0x20($0) # write $s4 to data word 9
addi $s0, $s0, 4 # a = a + 4
addi $s1, $s0, -12 # b = a - 12
6.3.3 J-Type Instructions
The name J-type is short for jump-type . This format is used only with jump instructions . Like other
formats, J-type instructions begin with a 6-bit opcode . The remaining 26-bits are used to specify an
address, addr .
6.4 Programming
6.4.1 Arithmetic / Logical Instructions
Logical Instructions
MIPS logical operations include and, or , xor and nor . These R-type instructions operate bit-by-bit on
two source registers and write the result to the destination register. The and instruction is useful for
masking bits, i.e. forcing unwanted bits to $0$. The or instruction is useful for combining bits from
two registers. MIPS does not provide a not instruction, but $A \ NOR \ \$ 0 = NOT \ A$, so the nor
instruction can substitute. Logical operations can also operate on immediates. These I-type instructions
are $andi, \ ori,$ and $xori$.
Shift Instructions
MIPS shift operations are $sll$ (shift left logical), $srl$ (shift right logical), and $sra$ (shift right
arithmetic). MIPS also has variable-shift operations (shift by a register): $sllv, srlv,$ and $srav$.
Multiplication and Division Instructions
The MIPS architecture has two special-purpose registers , $hi$ and $lo$, which are used to hold the
results of multiplication and division . $mult \ \$ s0, \ \$ s1$ multiplies the values in $\$ s0$ and $\$
s1$. The 32 most significant bits are placed in $hi$ and the 32 least significant bits are placed in $lo$.
Similarly, $div \ \$ s0, \ \$ s1$ computes $\frac{\$ s0}{\$ s1}$. The quotient is placed in $lo$ and the
remainder in $hi$.
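The hi/lo split can be illustrated with a Python model of the 32-bit case (function names are mine, not MIPS syntax):

```python
def mips_mult(a, b):
    """32x32 multiply: upper 32 bits of the product go to hi, lower 32 to lo."""
    p = (a * b) & 0xFFFFFFFFFFFFFFFF   # 64-bit product
    return p >> 32, p & 0xFFFFFFFF     # (hi, lo)

def mips_div(a, b):
    """Division: the remainder goes to hi, the quotient to lo."""
    q = int(a / b)                     # truncating division (small operands)
    return a - q * b, q                # (hi = remainder, lo = quotient)

assert mips_mult(0x10000, 0x10000) == (1, 0)  # 2^16 * 2^16 = 2^32
assert mips_div(17, 5) == (2, 3)              # 17/5: remainder 2, quotient 3
```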
6.4.2 Branching
Conditional Branches
The MIPS instruction set has two conditional branch instructions: branch if equal ($beq$) and branch
if not equal ($bne$). $beq$ branches when the values in two registers are equal, and $bne$ branches
when they are not equal. Note that branches are written as $beq \ \$ rs, \ \$ rt, \ imm$, where $\$ rs$ is
the first source register. If the branch condition is met, one says that the branch is taken and the
next instruction executed is the one at the label , called the target . This could look as follows:
Jump
A program can unconditionally branch , or jump, using the three types of jump instructions : jump
($j$), jump and link ($jal$), and jump register ($jr$). Jump ($j$) jumps directly to the instruction at
the specified label. Jump and link ($jal$) is similar to $j$ but is used by procedures to save a return
address, as will be discussed later on. Jump register ($jr$) jumps to the address held in a register.
Example:
6.4.3 Conditional Statements
If Statements
An if statement executes a block of code, the if block , only when a condition is met. The following
example shows how to translate an if statement into MIPS assembly code:
If/Else Statements
If/else statements execute one of two blocks of code depending on a condition. When the condition of
the if block is met, the if block is executed. Otherwise, the else block is executed. Following an
example if/else statement:
addi $s0, $0, 4 # $s0 = 0 + 4
addi $s1, $0, 2 # $s1 = 0 + 2
bne $s0, $s1, target # $s1 != $s0, so branch is taken
addi $s0, $s0, 1 # not executed
target:
add $s1, $s1, $s0 # $s1 = 2 + 4
addi $s0, $0, 4 # $s0 = 4
addi $s1, $0, 2 # $s1 = 2
j target # jump to target
addi $s1, $s1, 2 # not executed
target:
add $s1, $s1, $s0 # $s1 = 6
bne $s3, $s4, L1 # if i != j, skip if-block
add $s0, $s1, $s2 # if block: f = g + h
L1:
sub $s0, $s0, $s3 # f = f - i
6.4.4 Getting Loopy
While Loops
While loops repeatedly execute a block of code until a condition is no longer met. The while loop in
the following example code determines the value of $x$ such that $2^x = 128$:
For Loops
For loops , like while loops, repeatedly execute a block of code until a condition is no longer met.
However, for loops add support for a loop variable , which typically keeps track of the number of loop
executions. The following example code adds the numbers from 0 to 9 in a for loop:
6.4.5 Arrays
Arrays are useful for accessing large amounts of similar data. An array is organized as sequential data
addresses in memory. Each array element is identified by a number called its index .
Array Indexing
The base address gives the address of the first array element, array[0] . The first step in accessing an
array element is to load the base address of the array into a register. Recall that the load upper
immediate ( lui ) and or immediate ( ori ) instructions can be used to load a 32-bit constant into a
register. The offset can be used to access subsequent elements of the array. For example, array[1] is
bne $s3, $s4, else # if i != j, branch to else
add $s0, $s1, $s2 # if-block: f = g + h
j end # skip past the else block
else:
sub $s0, $s0, $s3 # else-block: f = f - i
end:
addi $s0, $0, 1 # pow = 1
addi $s1, $0, 0 # x = 0
addi $t0, $0, 128 # t0 = 128 for comparison
while:
beq $s0, $t0, done # if pow = 128, exit while loop
sll $s0, $s0, 1 # pow = pow * 2
addi $s1, $s1, 1 # x = x + 1
j while
done:
add $s1, $0, $0 # sum = 0
addi $s0, $0, 0 # i = 0
addi $t0, $0, 10 # n = 10
for:
beq $s0, $t0, done # if i == 10, branch to done
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i by one
j for
done:
stored one word or 4 bytes after array[0] , so it is accessed at an offset of 4 past the base address.
Following is an example of how to access arrays in MIPS:
6.4.6 Procedure Calls
Procedures have inputs, called arguments , and an output, called the return value . When one procedure
calls another, the calling procedure, the caller , and the called procedure, the callee , must agree on
where to put the arguments and the return value. In MIPS, the caller conventionally places up to four
arguments in registers $\$ a0 - \$ a3$ before making the procedure call, and the callee places the
return value in registers $\$ v0 - \$ v1$ before finishing.
Procedure Calls and Returns
MIPS uses the jump and link ($jal$) instruction to call a procedure and the jump register ($jr$)
instruction to return from a procedure. jal performs two functions: it stores the address of the next
instruction (the instruction after jal ) in the return address register $\$ ra$, and it jumps to the target
instruction.
Input Arguments and Return Values
By MIPS convention, procedures use $\$ a0 - \$ a3$ for input arguments and $\$ v0 - \$ v1$ for return
values . A procedure that returns a 64-bit value, such as a double-precision floating point number, uses
both return registers.
The Stack
The Stack is memory that is used to save local variables within a procedure. The stack is a last-in-
first-out (LIFO) queue. The last item pushed onto the stack is the first one that can be pulled
( popped ) off. The MIPS stack grows down in memory . The stack expands to lower memory addresses
when a program needs more scratch space. The stack pointer $\$ sp$ is a special register that points
to the top of the stack.
Recall that a procedure should calculate a return value but have no other unintended side effects. To
obey this rule, a procedure saves registers on the stack before it modifies them, then restores them
from the stack before it returns. Specifically, it performs the following steps:
1. Makes space on the stack to store the values of one or more registers.
2. Stores the values of the registers on the stack.
3. Executes the procedure using the registers.
4. Restores the original values of the registers from the stack.
5. Deallocates space on the stack.
Preserved Registers
lui $s0, 0x1000 # $s0 = 0x10000000
ori $s0, $s0, 0x7000 # $s0 = 0x10007000
lw $t1, 0($s0) # $t1 = array[0]
sll $t1, $t1, 3 # $t1 = $t1 << 3 = $t1 * 8
sw $t1, 4($s0) # array[1] = $t1
MIPS divides registers into preserved and nonpreserved categories. The preserved registers include $\$
s0 - \$s7$ (hence their name saved ). The nonpreserved registers include $\$ t0 - \$ t9$ (hence their
name temporary ). A procedure must save and restore any of the preserved registers that it wishes to
use, but it can change the nonpreserved registers freely. Following is an example of a procedure saving
preserved registers on the stack:
6.6 Compiling, Assembling, and Loading
6.6.1 The Memory Map
With 32-bit addresses, the MIPS address space spans $2^{32}$ bytes = 4 gigabytes (GB). Word addresses
are divisible by 4 and range from 0 to 0xFFFFFFFC . The MIPS architecture divides the address space into
four parts or segments. The following sections describe each segment:
The Text Segment
The text segment stores the machine language program. It is large enough to accommodate almost
256 MB of code.
The Global Data Segment
The global data segment stores global variables that, in contrast to local variables, can be seen by all
procedures in a program. Global variables are defined at start-up , before the program begins
executing. The global data segment is large enough to store 64 KB of global variables.
The Dynamic Data Segment
The dynamic data segment holds the stack and the heap. This is the largest segment of memory used
by a program, spanning almost 2 GB of the address space.
The Reserved Segments
The reserved segments are used by the operating system and cannot directly be used by the
programmer. Part of the reserved memory is used for interrupts and for memory-mapped I/O.
6.6.2 Translating and Starting a Program
Following are the steps required to translate a program from a high-level language into machine
language and to start executing a program:
diffofsums:
addi $sp, $sp, -4 # make space on stack to store one register
sw $s0, 0($sp) # save $s0 on stack
add $t0, $a0, $a1 # $t0 = f + g
add $t1, $a2, $a3 # $t1 = h + i
sub $s0, $t0, $t1 # result = (f + g) - (h + i)
add $v0, $s0, $0 # put return value in $v0
lw $s0, 0($sp) # restore $s0 from stack
addi $sp, $sp, 4 # deallocate stack space
jr $ra # return to caller
Step 1 - Compilation: A compiler translates high-level code into assembly language.
Step 2 - Assembling: The assembler turns the assembly language code into an object file containing
machine language code. The assembler makes two passes through the assembly code. On the first pass,
the assembler assigns instruction addresses and finds all the symbols , such as labels and global
variable names. On the second pass through the code, the assembler produces the machine language
code.
Step 3 - Linking: Most large programs contain more than one file. The job of the linker is to combine all
of the object files into one machine language file called the executable .
Step 4 - Loading: The operating system loads the program by reading the text segment of the
executable file from a storage device into the text segment memory.
6.7 Odds and Ends
6.7.1 Pseudoinstructions
MIPS defines pseudoinstructions that are not actually part of the instruction set but are commonly
used by programmers and compilers. When converted to machine code, pseudoinstructions are
translated into one or more MIPS instructions. Following are some useful pseudoinstructions:
6.7.3 Signed and Unsigned Instructions
Addition and Subtraction
MIPS provides signed and unsigned versions of addition and subtraction. The signed versions are add,
addi, and sub . The unsigned versions are addu, addiu, and subu . The two versions are identical
except that the signed versions trigger an exception on overflow , whereas the unsigned versions do not.
Multiplication and Division
Multiplication and division come in both signed and unsigned flavors. mult and div treat the
operands as signed numbers. multu and divu treat the operands as unsigned numbers.
Set Less Than
Set less than instructions can compare either two registers ( slt ) or a register and an immediate ( slti ).
Set less than also comes in signed and unsigned ( sltu and sltiu ) versions.
Chapter 7 - Microarchitecture
7.1 Introduction
This chapter covers microarchitecture , which is the connection between logic and architecture.
Microarchitecture is the specific arrangement of registers, ALUs, finite state machines, memories, and
other logic building blocks needed to implement an architecture.
7.1.1 Architectural State and Instruction Set
Recall that a computer architecture is defined by its instruction set and architectural state . The
architectural state for the MIPS processor consists of the program counter and the 32 registers . To
keep the microarchitecture easy to understand, we handle the following instructions:
R-type arithmetic/logic instructions: $add, \ sub, \ and, \ or, \ slt$
Memory instructions: $lw, \ sw$
Branches: $beq$
7.1.2 Design Process
We will divide our microarchitecture into two interacting parts: the datapath and the control . The
datapath operates on words of data. It contains structures such as memories, registers, ALUs, and
multiplexers.
A good way to design a complex system is to start with hardware containing the state elements. These
elements include the memories and the architectural state (the program counter and registers).
Program Counter: The program counter is an ordinary 32-bit register. Its output, $PC$, points to the
current instruction. Its input, $PC'$, indicates the address of the next instruction.
Instruction Memory: The instruction memory has a single read port. It takes a 32-bit instruction
address input, $A$, and reads the 32-bit data (i.e. the instruction) from that address onto the read
data output, $RD$.
Register File: The 32-element $\times$ 32-bit register file has two read ports and one write port. The
read ports take 5-bit address inputs, $A1$ and $A2$, each specifying one of $2^5 = 32$ registers as
source operands. The write port takes a 5-bit address input, $A3$, a 32-bit write data input, $WD$, a
write enable input, $WE3$, and a clock. If write enable is 1, the register file writes the data into the
specified register on the rising edge of the clock.
Data Memory: The data memory has a single read/write port. If the write enable, $WE$, is 1, it
writes data $WD$ into address $A$ on the rising edge of the clock. If write enable is 0, it reads
address $A$ into $RD$.
The instruction memory, register file, and data memory are all read combinationally . In other words, if
the address changes, the new data appears at $RD$ after some propagation delay; no clock is involved.
Since the state elements change their state only on the rising edge of the clock, they are synchronous
sequential circuits.
7.1.3 MIPS Microarchitectures
In this chapter, we distinguish between three types of microarchitectures for the MIPS processor architecture:
The single-cycle microarchitecture executes an entire instruction in one cycle. The cycle time is
therefore limited by the slowest instruction.
The multicycle microarchitecture executes instructions in a series of shorter cycles. Simpler
instructions execute in fewer cycles than complicated ones. The multicycle processor executes only
one instruction at a time, but each instruction takes multiple clock cycles.
The pipelined microarchitecture applies pipelining to the single-cycle microarchitecture. It
therefore can execute several instructions simultaneously, improving the throughput significantly.
7.2 Performance Analysis
The only gimmick-free way to measure performance is by measuring the execution time of a program.
The execution time of a program, measured in seconds, is given by the following equation:
$\text{Execution Time} = (\#\text{instructions}) \left( \frac{\text{cycles}}{\text{instruction}} \right) \left( \frac{\text{seconds}}{\text{cycle}} \right)$
The number of cycles per instruction , often called CPI , is the number of clock cycles required to
execute an average instruction. The number of seconds per cycle is the clock period $T_C$. The clock
period is determined by the critical path through the logic on the processor.
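As a quick numeric illustration of the equation, the instruction count, CPI, and clock frequency below are made-up values:

```python
# Hypothetical program and processor parameters (for illustration only)
instructions = 100_000_000   # number of dynamic instructions executed
cpi = 1.25                   # average clock cycles per instruction
f_clk = 1e9                  # 1 GHz clock, so T_C = 1 ns

t_c = 1 / f_clk                            # seconds per cycle
execution_time = instructions * cpi * t_c  # seconds

assert abs(execution_time - 0.125) < 1e-9  # 0.125 s
```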
7.3 Single-Cycle Processor
We first design a MIPS microarchitecture that executes instructions in a single cycle. We begin by
constructing the datapath by connecting the state elements from Figure 7.1 with combinational logic
that can execute the various instructions.
7.3.1 Single-Cycle Datapath
As already mentioned, the program counter $PC$ register contains the address of the instruction to
execute. The first step is to read this instruction from the instruction memory. The instruction memory
reads out, or fetches , the 32-bit instruction, labeled $Instr$.
Load-Word Instructions
For a $lw$ instruction, the next step is to read the source register containing the base address. This
register is specified in the $rs$ field of the instruction, $Instr_{25:21}$. The $lw$ instruction also
requires an offset. The offset is stored in the immediate field of the instruction, $Instr_{15:0}$. Because
the 16-bit immediate might be either positive or negative, it must be sign-extended to a 32-bit value,
called $SignImm$.
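Sign extension simply copies the sign bit (bit 15) of the immediate into the upper 16 bits of the 32-bit result, preserving the two's-complement value. A small Python sketch of this step:

```python
def sign_extend_16_to_32(imm16):
    # Copy bit 15 into bits 31:16 (two's-complement sign extension).
    if imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16

assert sign_extend_16_to_32(0x0004) == 0x00000004  # positive offset stays the same
assert sign_extend_16_to_32(0xFFFC) == 0xFFFFFFFC  # -4 stays -4 in 32 bits
```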
The processor must add the base address to the offset to find the address to read from memory. The
ALU receives two operands, $SrcA$ and $SrcB$. $SrcA$ comes from the register file, and $SrcB$ comes
from the sign-extended immediate. The 3-bit $ALUControl$ signal specifies the operation the ALU
should execute. The ALU generates a 32-bit $ALUResult$ and a $Zero$ flag. For the $lw$ instruction, the
ALU must add the base address and the offset, and the $ALUResult$ is sent to the data memory as the
address for the load instruction. The data is read from the data memory onto the $ReadData$ bus, then
written back to the destination register, which is specified in the $rt$ field, $Instr_{20:16}$, in the
register file at the end of the cycle.
A control signal called $RegWrite$ is connected to the port 3 write enable input, $WE3$, and is asserted
during a $lw$ instruction so that the data value is written into the register file. While the instruction is
being executed, the processor must compute the address of the next instruction, $PC'$. Because
instructions are 32 bits = 4 bytes, the next instruction is at $PC + 4 = PC'$.
Store-Word Instructions
Like the $lw$ instruction, the $sw$ instruction reads a base address from port 1 of the register file and
sign-extends an immediate. The $sw$ instruction also reads a second register from the register file and
writes it to the data memory. The register is specified in the $rt$ field, $Instr_{20:16}$. These bits of the
instruction are connected to the second register file read port, $A2$. The write enable port of the data
memory, $WE$, is controlled by $MemWrite$. For a $sw$ instruction, $MemWrite = 1$, to write data to
memory.
R-Type Instructions
We will now consider extending the datapath to handle the R-type instructions $add, \ sub, \ and, \ or,$
and $slt$. All of these instructions read two registers from the register file, perform some ALU operation
on them, and write the result back to a third register. Up to here, the ALU always received its $SrcB$
operand from the sign-extended immediate $SignImm$. Now, we add a multiplexer to choose $SrcB$
from either the register file $RD2$ port or $SignImm$. The multiplexer is controlled by a new signal,
$ALUSrc$. $ALUSrc$ is $0$ for R-type instructions to choose $SrcB$ from the register file and it is $1$ for
$lw$ and $sw$ instructions to choose $SignImm$. Also, until now, the register file always got its write
data from the data memory. However, R-type instructions write the $ALUResult$ to the register file.
Therefore, we add another multiplexer to choose between $ReadData$ and $ALUResult$. We call this
output $Result$. This multiplexer is controlled by another new signal, $MemtoReg$.
Similarly, the register to write was specified by the $rt$ field of the instruction, $Instr_{20:16}$.
However, for R-type instructions, the register is specified in the $rd$ field, $Instr_{15:11}$. Thus we add
a third multiplexer to choose $WriteReg$ from the appropriate field of the instruction. The multiplexer is
controlled by $RegDst$. $RegDst$ is $1$ for R-type instructions to choose $WriteReg$ from the $rd$ field.
Branch-If-Equal Instructions
The $beq$ instruction compares two registers. If they are equal, it takes the branch by adding the
branch offset to the program counter. Recall that the offset is a positive or negative number, stored in
the $imm$ field of the instruction, $Instr_{15:0}$. Hence the immediate must be sign-extended and
multiplied by 4 to get the new program counter value $PC' = PC + 4 + SignImm \times 4$. The next $PC$
value for a taken branch, $PCBranch$, is computed by shifting $SignImm$ left by $2$ bits, then adding
it to $PCPlus4$. The two registers are compared by computing $SrcA - SrcB$ using the ALU. If
$ALUResult$ is $0$, as indicated by the $Zero$ flag from the ALU, the registers are equal. We add a
multiplexer to choose $PC'$ from either $PCPlus4$ or $PCBranch$.
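The branch-target arithmetic above can be checked with a short sketch; the PC values and offsets below are hypothetical:

```python
def branch_target(pc, imm16):
    # PCBranch = PC + 4 + SignImm * 4; shifting left by 2 multiplies by 4.
    sign_imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16  # sign extension
    return (pc + 4 + (sign_imm << 2)) & 0xFFFFFFFF

assert branch_target(0x00400000, 3) == 0x00400010       # forward by 3 instructions
assert branch_target(0x00400010, 0xFFFB) == 0x00400000  # backward by -5 instructions
```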
7.3.2 Single-Cycle Control
The control unit computes the control signals based on the $opcode$ and $funct$ fields of the
instruction, $Instr_{31:26}$ and $Instr_{5:0}$. Most of the control information comes from the
$opcode$, but R-type instructions also use the $funct$ field to determine the ALU operation. Thus, we
will simplify our design by factoring the control unit into two blocks of combinational logic. The main
decoder computes most of the outputs from the $opcode$. It also determines a 2-bit $ALUOp$ signal.
The following truth table for the main decoder summarizes the control signals as a function of the
$opcode$:
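As a sketch, the main decoder can be modeled as a lookup from opcode to control word. The values below follow the standard single-cycle MIPS decoder for these four instruction classes (field order: RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemtoReg, ALUOp); don't-care fields are written as 0 and the whole table should be treated as illustrative:

```python
MAIN_DECODER = {
    # opcode: (RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemtoReg, ALUOp)
    0b000000: (1, 1, 0, 0, 0, 0, 0b10),  # R-type
    0b100011: (1, 0, 1, 0, 0, 1, 0b00),  # lw
    0b101011: (0, 0, 1, 0, 1, 0, 0b00),  # sw  (RegDst, MemtoReg are don't-cares)
    0b000100: (0, 0, 0, 1, 0, 0, 0b01),  # beq (RegDst, MemtoReg are don't-cares)
}

regwrite, regdst, alusrc, branch, memwrite, memtoreg, aluop = MAIN_DECODER[0b100011]
assert regwrite == 1 and alusrc == 1 and memtoreg == 1  # lw writes loaded data back
```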
7.3.3 More Instructions
Excluded from this summary.
7.3.4 Performance Analysis
Each instruction in the single-cycle processor takes one clock cycle, so the $CPI = 1$. The critical path in
our processor is the $lw$ instruction, hence the cycle time is:
In most implementation technologies, the ALU, memory, and register file accesses are substantially
slower than other operations. Therefore, the cycle time simplifies to:
7.4 Multicycle Processor
The single-cycle processor has three primary weaknesses:
It requires a clock cycle long enough to support the slowest instruction, even though most
instructions are faster.
It requires three adders (one in the ALU and two for the PC logic), which are relatively expensive
circuits, especially if they must be fast.
It has separate instruction and data memories, which may not be realistic.
The multi-cycle processor addresses these weaknesses by breaking an instruction into multiple shorter
steps. In each short step, the processor can read or write from the memory or register file or use the
ALU.
7.4.1 Multicycle Datapath
In the single-cycle design, we used separate instruction and data memories because we needed to read
the instruction memory and read or write the data memory all in one cycle. Now, we choose to use a
combined memory for both instructions and data. This is more realistic, and it is feasible because we
can read the instruction in one cycle, then read or write the data in a separate cycle.
The $PC$ contains the address of the instruction to execute. The instruction is read and stored in a new
nonarchitectural Instruction Register so that it is available for further cycles. The instruction register
receives an enable signal, called $IRWrite$, that is asserted when it should be updated with a new
instruction.
Load-Word Instructions
For a $lw$ instruction, the next step is to read the source register containing the base address,
specified in the $rs$ field of the instruction, $Instr_{25:21}$. The register file reads the value of the
register onto $RD1$, which is then stored in another nonarchitectural register $A$.
We again have to sign-extend the offset and add it to the base address to get the address of the load.
We use an ALU to compute this sum. $ALUControl$ should be set to $010$ to perform an addition.
$ALUResult$ is stored in another nonarchitectural register called $ALUOut$.
The next step is to load the data from the calculated address in the memory. We add a multiplexer in
front of the memory to choose the memory address, $Adr$, from either $PC$ or $ALUOut$. The
multiplexer select signal is called $IorD$, to indicate either an instruction or data address. The data
read from the memory is stored in another nonarchitectural register, called $Data$.
Finally, the data is written back to the register file. The destination register is specified in the $rt$ field
of the instruction, $Instr_{20:16}$. At the same time, the processor must also update the program
counter $PC$. We can use the existing ALU on one of the steps when it is not busy. To do so, we must
insert source multiplexers to choose the $PC$ and the constant $4$ as ALU inputs. A two-input
multiplexer controlled by $ALUSrcA$ chooses either the $PC$ or register $A$ as $SrcA$. A four-input
multiplexer controlled by $ALUSrcB$ chooses either $4$ or $SignImm$ as $SrcB$.
Store-Word Instructions
The only new feature for $sw$ is that we must read a second register from the register file and write it
into the memory. The register is specified in the $rt$ field of the instruction, $Instr_{20:16}$. When the
register is read, it is stored in a nonarchitectural register $B$. On the next step, it is sent to the write
data port $WD$ of the data memory to be written. The memory receives an additional $MemWrite$
control signal to indicate that the write should occur.
R-Type Instructions
For R-type instructions, the instruction is again fetched, and the two source registers are read from the
register file. The ALU performs the appropriate operation and stores the result in $ALUOut$. On the next
step, $ALUOut$ is written back to the register specified by the $rd$ field of the instruction,
$Instr_{15:11}$. This requires two new multiplexers. The $MemtoReg$ multiplexer selects whether $WD3$
comes from $ALUOut$ (for R-type instructions) or from $Data$ (for $lw$ instructions). The $RegDst$
multiplexer selects whether the destination register is specified in the $rt$ or $rd$ field of the
instruction.
Branch-If-Equal Instructions
For the $beq$ instruction, the instruction is again fetched, and the two source registers are read from
the register file. To determine whether or not the two registers are equal, the ALU subtracts the
registers and examines the $Zero$ flag. Meanwhile, the datapath must compute the next value of the
$PC$ if the branch is taken. In the single-cycle processor, yet another adder was needed to compute the
branch address. In the multi-cycle processor, the ALU can be reused again to save hardware. On one
step, the ALU computes $PC + 4$ and writes it back to the program counter, as we have done for other
instructions. On another step, the ALU uses this updated $PC$ value to compute $PC + SignImm \times
4$. A new multiplexer, controlled by $PCSrc$, chooses what signal should be sent to $PC'$. The program
counter should be written either when $PCWrite$ is asserted or when a branch is taken. A new control
signal, $Branch$, indicates that the $beq$ instruction is being executed.
7.4.2 Multicycle Control
As in the single-cycle processor, the control unit is partitioned into a main controller and an ALU
decoder. Now, however, the main controller is an FSM that applies the proper control signals on the
proper cycles or steps.
The main controller produces multiplexer select and register enable signals for the datapath. The
select signals are $MemtoReg, \ RegDst, \ IorD, \ PCSrc, \ ALUSrcB$, and $ALUSrcA$. The enable signals
are $IRWrite, \ MemWrite, \ PCWrite, \ Branch$, and $RegWrite$.
The first step for any instruction is to fetch the instruction from memory at the address held in the
$PC$. The FSM enters this state on reset. To read memory, $IorD = 0$. $IRWrite$ is asserted to write the
instruction into the instruction register. $ALUSrcA = 0$, so $SrcA$ comes from the $PC$. $ALUSrcB = 01$,
so $SrcB$ is constant 4. $ALUOp = 00$, so the ALU decoder produces $ALUControl = 010$ to make the
ALU add. To update the $PC$ with this new value, $PCSrc = 0$, and $PCWrite$ is asserted.
The next step is to read the register file and decode the instruction. No control signals are necessary to
decode the instruction, but the FSM must wait $1$ cycle for the reading and decoding to complete.
If the instruction is a memory load or store ($lw$ or $sw$), the multicycle processor computes the
address by adding the base address to the sign-extended immediate. This requires $ALUSrcA = 1$ to
select register $A$ and $ALUSrcB = 10$ to select $SignImm$. $ALUOp = 00$, so the ALU adds. If the
instruction is $lw$, the multicycle processor must next read data from memory and write it to the
register file.
To read from memory, $IorD = 1$ to select the memory address that was just computed and saved in
$ALUOut$. On the next step, S4 , $Data$ is written to the register file. $MemtoReg = 1$ to select $Data$,
and $RegDst = 0$ to pull the destination register from the $rt$ field of the instruction. $RegWrite$ is
asserted to perform the write, completing the $lw$ instruction.
From state S2, if the instruction is $sw$, the data read from the second port of the register file is
simply written to memory. $IorD = 1$ to select the address computed in S2 and saved in $ALUOut$.
$MemWrite$ is asserted to write the memory.
If the $opcode$ indicates an R-type instruction, the multicycle processor must calculate the result
using the ALU and write that result back to the register file. In S6 , the instruction is executed by
selecting the $A$ and $B$ registers ($ALUSrcA = 1, \ ALUSrcB = 00$) and performing the ALU operation
indicated by the $funct$ field of the instruction. $ALUOp = 10$ for all R-type instructions. The
$ALUResult$ is stored in $ALUOut$.
In S7, $ALUOut$ is written to the register file. $RegDst = 1$, because the destination register is specified
in the $rd$ field of the instruction. $MemtoReg = 0$ because the write data, $WD3$, comes from
$ALUOut$. $RegWrite$ is asserted to write the register file.
For a $beq$ instruction, the processor must calculate the destination address and compare the two
source registers to determine whether the branch should be taken. $ALUSrcA = 0$ to select the
incremented $PC$, $ALUSrcB = 11$ to select $SignImm \times 4$, and $ALUOp = 00$ to add. The
destination address is stored in $ALUOut$.
In S8 , the processor compares the two registers by subtracting them and checking to determine
whether the result is $0$. $ALUSrcA = 1$ to select register $A$, $ALUSrcB = 00$ to select register $B$,
$ALUOp = 01$ to subtract, $PCSrc = 1$ to take the destination address from $ALUOut$, and $Branch = 1$
to update the $PC$ with this address if the ALU result is $0$.
7.4.3 More Instructions
Remark: The exact explanations for the determination of the signals are left out in this summary and can
be found in the book.
7.4.4 Performance Analysis
The execution time of an instruction depends on both the number of cycles it uses and the cycle time.
The multicycle processor requires three cycles for $beq$ and $j$ instructions, four cycles for $sw, addi,$
and R-type instructions, and five cycles for $lw$ instructions. The CPI depends on the relative likelihood
that each instruction is used. Examining the datapath reveals two possible critical paths that would
limit the cycle time:
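The average CPI of the multicycle processor follows from a weighted sum over the instruction mix. The per-instruction cycle counts come from the text above; the fractions below are made up for illustration:

```python
# (cycles, fraction of dynamic instructions) -- fractions are hypothetical
mix = [
    (5, 0.25),  # lw
    (4, 0.10),  # sw
    (3, 0.13),  # beq and j
    (4, 0.52),  # R-type and addi
]

cpi = sum(cycles * frac for cycles, frac in mix)

assert abs(sum(f for _, f in mix) - 1.0) < 1e-9  # fractions must cover all instructions
assert abs(cpi - 4.12) < 1e-9                    # weighted average CPI
```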
7.5 Pipelined Processor
Pipelining is a powerful way to improve the throughput of a digital system. We design a pipelined
processor by sub-dividing the single-cycle processor into five pipeline stages. Thus, five instructions
can execute simultaneously, one in each stage. Specifically, we call the five stages Fetch, Decode,
Execute, Memory and Writeback :
$Fetch :$ The processor reads the instruction from instruction memory.
$Decode :$ The processor reads the source operands from the register file and decodes the
instruction to produce control signals.
$Execute :$ The processor performs a computation with the ALU.
$Memory :$ The processor reads or writes data memory.
$Writeback :$ The processor writes the result to the register file, when applicable.
In the following figure, an abstracted view of the pipeline in operation is shown in which each stage is
represented pictorially. Each pipeline stage is represented with its major component – Instruction
memory ( IM ), register file ( RF ) read, ALU execution, data memory ( DM ), and register file writeback –
to illustrate the flow of instructions through the pipeline.
A central challenge in pipelined systems is handling hazards that occur when the results of one
instruction are needed by a subsequent instruction before the former instruction has completed.
7.5.1 Pipelined Datapath
The pipelined datapath is formed by chopping the single-cycle datapath into five stages separated by
pipeline registers. Signals are given a suffix ( F, D, E, M or W ) to indicate the stage in which they reside.
The register file is peculiar because it is read in the Decode stage and written in the Writeback stage.
It is drawn in the Decode stage, but the write address and data come from the Writeback stage. One of
the subtle but critical issues in pipelining is that all signals associated with a particular instruction
must advance through the pipeline in unison.
7.5.2 Pipelined Control
The pipelined processor takes the same control signals as the single-cycle processor and therefore uses
the same control unit. These control signals must be pipelined along with the data so that they
remain synchronized with the instruction.
7.5.3 Hazards
In a pipelined system, multiple instructions are handled concurrently. When one instruction is
dependent on the result of another that has not yet completed, a hazard occurs.
The register file can be read and written in the same cycle. Let us assume that the write takes place
during the first half of the cycle and the read takes place during the second half of the cycle, so that
the register can be written and read back in the same cycle without introducing a hazard. The following
figure illustrates hazards that occur when one instruction writes a register ($\$s0$) and subsequent
instructions read this register. This is called a read after write (RAW) hazard .
In principle, we should be able to forward the result from one instruction to the next to resolve the
RAW hazard without slowing down the pipeline.
Hazards are classified as data hazards and control hazards. A data hazard occurs when an instruction
tries to read a register that has not yet been written back by a previous instruction. A control hazard
occurs when the decision of what instruction to fetch next has not been made by the time the fetch
takes place.
Solving Data Hazards with Forwarding
Some data hazards can be solved by forwarding (also called bypassing ) a result from the Memory or
Writeback stage to a dependent instruction in the Execute stage. This requires adding multiplexers in
front of the ALU to select the operand from either the register file or the Memory or Writeback stage.
Forwarding is necessary when an instruction in the Execute stage has a source register matching the
destination register of an instruction in the Memory or Writeback stage. We therefore add a hazard
detection unit and two forwarding multiplexers. The hazard detection unit receives the two source
registers from the instruction in the Execute stage and the destination registers from the instructions in
the Memory and Writeback stages. It also receives the $RegWrite$ signals from the Memory and
Writeback stages to know whether the destination register will actually be written.
In summary, the function of the forwarding logic for $SrcA$ ($ForwardAE$) is given below. The forwarding
logic for $SrcB$ ($ForwardBE$) is identical except that it checks $rt$ rather than $rs$:
if ((rsE != 0) AND (rsE == WriteRegM) AND (RegWriteM)) then
    ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND (RegWriteW)) then
    ForwardAE = 01
else
    ForwardAE = 00
Solving Data Hazards with Stalls
Unfortunately, the $lw$ instruction does not finish reading data until the end of the Memory stage, so
its result cannot be forwarded to the Execute stage of the next instruction. We say that $lw$ has a two-
cycle latency , because dependent instructions cannot use its result until two cycles later. The
alternative solution to forwarding is to stall the pipeline , holding up operation until the data is
available.
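As an illustration of the forwarding priority (a result in the Memory stage is more recent than one in the Writeback stage, so it wins), the $ForwardAE$ decision can be sketched in Python; signal names mirror the pseudocode and are not the book's HDL:

```python
def forward_ae(rs_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    # Forward from the Memory stage first (most recent result), then Writeback.
    # Register 0 is never forwarded because $0 is hardwired to zero in MIPS.
    if rs_e != 0 and rs_e == write_reg_m and reg_write_m:
        return 0b10  # take SrcA from the Memory-stage ALU result
    if rs_e != 0 and rs_e == write_reg_w and reg_write_w:
        return 0b01  # take SrcA from the Writeback-stage result
    return 0b00      # no hazard: take SrcA from the register file

assert forward_ae(8, 8, True, 8, True) == 0b10  # Memory stage takes priority
assert forward_ae(8, 9, True, 8, True) == 0b01  # only Writeback matches
assert forward_ae(0, 0, True, 0, True) == 0b00  # $0 is never forwarded
```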
In summary, stalling a stage is performed by disabling the pipeline register, so that the contents do not
change. When a stage is stalled, all previous stages must also be stalled, so that no subsequent
instructions are lost. The pipeline register directly after the stalled stage must be cleared to prevent
bogus information from propagating forward. Stalls degrade performance, so they should only be used
when necessary.
Stalls are supported by adding enable inputs ($EN$) to the Fetch and Decode pipeline registers and a
synchronous reset/clear ($CLR$) input to the Execute pipeline register.
The logic to compute the stalls and flushes is:
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
Solving Control Hazards
The $beq$ instruction presents a control hazard: the pipelined processor does not know what instruction
to fetch next, because the branch decision has not been made by the time the next instruction is
fetched. One mechanism for dealing with the control hazard is to stall the pipeline until the branch
decision is made (i.e. $PCSrc$ is computed). Because the decision is made in the Memory stage, the
pipeline would have to be stalled for three cycles at every branch. This would severely degrade the
system performance.
An alternative is to predict whether a branch will be taken and begin executing instructions based on
the prediction. Once the branch decision is available, the processor can throw out the instructions if
the prediction was wrong. These wasted instruction cycles are called the branch misprediction penalty.
We could reduce the branch misprediction penalty if the branch decision could be made earlier. Making
the decision simply requires comparing the values of two registers. Using a dedicated equality
comparator is much faster than performing a subtraction and zero detection. If the comparator is fast
enough, it could be moved back into the Decode stage, so that the operands are read from the register
file and compared to determine the next $PC$ by the end of the Decode stage.
Hazard Summary
In summary, RAW data hazards occur when an instruction depends on the result of another instruction
that has not yet been written into the register file. The data hazards can be resolved by forwarding if
the result is computed soon enough, otherwise they require stalling the pipeline until the result is
available. Control hazards occur when the decision of what instruction to fetch has not been made by
the time the next instruction must be fetched. Control hazards are solved by predicting which
instruction should be fetched and flushing the pipeline if the prediction is later determined to be
wrong.
7.6 HDL Representation
Remark: Chapter 7.6 is excluded from this summary. The exact implementation in Verilog HDL can be read
in the book.
7.7 Exceptions
Remark: Chapter 7.7 is excluded from the summary since it is not required by the course.
7.8 Advanced Microarchitecture
Recall that the time required to run a program is proportional to the period of the clock and to the
number of clock cycles per instruction (CPI). Thus, to increase performance we would like to speed up
the clock and/or reduce the CPI.
7.8.1 Deep Pipelines
The easiest way to speed up the clock is to chop the pipeline into more stages. Each stage contains
less logic, so it can run faster. This chapter considered a classic five-stage pipeline, but 10 to 20 stages
are now commonly used. The maximum number of pipeline stages is limited by pipeline hazards,
sequencing overhead, and cost. The pipeline registers between each stage have a sequencing overhead
from their setup time and clk-to-Q delay. This sequencing overhead makes adding more pipeline stages
give diminishing returns.
7.8.2 Branch Prediction
An ideal pipelined processor would have a CPI of $1$. The branch misprediction penalty is a major
reason for increased CPI. To address this problem, most pipelined processors use a branch predictor to
guess whether the branch should be taken.
The simplest form of branch prediction checks the direction of the branch and predicts that backward
branches should be taken. This is called static branch prediction , because it does not depend on the
history of the program. Forward branches are difficult to predict without knowing more about the
specific program. Therefore, most processors use dynamic branch predictors , which use the history of
the program execution to guess whether a branch should be taken. A one-bit dynamic branch predictor
remembers whether the branch was taken the last time and predicts that it will do the same thing the
next time. In summary, a $1$-bit branch predictor mispredicts the first and last branches of a loop.
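The loop-misprediction behavior of a 1-bit predictor can be demonstrated with a short simulation. This is a sketch of a single predictor bit; a real predictor indexes a table of such bits by the branch's PC:

```python
def mispredictions_1bit(outcomes, initial=False):
    """Count mispredictions of a single 1-bit predictor over a branch history."""
    state = initial  # last observed outcome; also the prediction for the next branch
    wrong = 0
    for taken in outcomes:
        if state != taken:
            wrong += 1
        state = taken  # remember what actually happened
    return wrong

# A loop branch taken 9 times, then not taken on exit, executed twice in a row:
history = ([True] * 9 + [False]) * 2
# Mispredicts the first branch (stale "not taken" state) and the last branch
# (loop exit) of each loop execution: 2 mispredictions per run, 4 in total.
assert mispredictions_1bit(history, initial=False) == 4
```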
7.8.3 Superscalar Processor
A superscalar processor contains multiple copies of the datapath hardware to execute multiple
instructions simultaneously. For example, a two-way superscalar processor has a datapath that
fetches two instructions at a time from the instruction memory. Unfortunately, executing many
instructions simultaneously is difficult because of dependencies.
7.8.4 Out-of-Order Processor
To cope with the problem of dependencies, an out-of-order processor looks ahead across many
instructions to issue , or begin executing, independent instructions as rapidly as possible. The
instructions can be issued in a different order than written by the programmer, as long as dependencies
are honored so that the program produces the intended result.
The dependence of $add$ on $lw$ by way of $\$t0$ is a read after write (RAW) hazard . This is the type
of dependency we are accustomed to handling in the pipelined processor. It inherently limits the speed
at which the program can run, even if infinitely many execution units are available.
The dependence between $sub$ and $add$ by way of $\$t0$ is called a write after read (WAR) hazard or
an antidependence. WAR hazards could not occur in the simple MIPS pipeline, but they may happen in
an out-of-order processor if the dependent instruction (in this case, $sub$) is moved too early.
A third type of hazard, not shown in the program, is called a write after write (WAW) hazard or an
output dependence. A WAW hazard occurs if an instruction attempts to write a register after a
subsequent instruction has already written it. The hazard would result in the wrong value being written
to the register.
The instruction level parallelism (ILP) is the number of instructions that can be executed
simultaneously for a particular program and microarchitecture. Theoretical studies have shown that
the ILP can be quite large for out-of-order microarchitectures with perfect branch predictors and
enormous numbers of execution units.
7.8.5 Register Renaming
Out-of-order processors use a technique called register renaming to eliminate WAR hazards. Register
renaming adds some nonarchitectural renaming registers to the processor. The programmer cannot use
these registers directly, because they are not part of the architecture. However, the processor is free to
use them to eliminate hazards.
7.8.6 Single Instruction Multiple Data
The term SIMD stands for single instruction multiple data , in which a single instruction acts on
multiple pieces of data in parallel. A common application of SIMD is to perform many short arithmetic
operations at once, especially for graphics processing. This is also called packed arithmetic. For
example, a 32-bit microprocessor might pack four 8-bit data elements into one 32-bit word. Packed
$add$ and $sub$ instructions operate on all four data elements within the word in parallel.
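Packed arithmetic can be illustrated by packing four 8-bit lanes into one 32-bit word. Real SIMD hardware adds all lanes in a single instruction; the Python sketch below emulates that lane by lane, with no carry propagating across lane boundaries:

```python
def pack4(a, b, c, d):
    # Pack four 8-bit values into one 32-bit word (a in the most significant byte).
    return (a << 24) | (b << 16) | (c << 8) | d

def packed_add8(x, y):
    # Add corresponding 8-bit lanes; each lane wraps around independently,
    # i.e. no carry crosses a lane boundary.
    result = 0
    for shift in (0, 8, 16, 24):
        lane = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)
        result |= (lane & 0xFF) << shift
    return result

x = pack4(1, 2, 3, 250)
y = pack4(4, 5, 6, 10)
assert packed_add8(x, y) == pack4(5, 7, 9, 4)  # last lane wraps: 250 + 10 = 260 mod 256
```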
7.8.7 Multithreading
Multithreading is a technique that helps keep a processor with many execution units busy even if the
ILP of a program is low or the program is stalled waiting for memory. A program running on a computer
is called a process . Computers can run multiple processes simultaneously. Each process consists of one
or more threads that also run simultaneously.
In a conventional processor, the threads only give the illusion of running simultaneously. The threads
actually take turns being executed on the processor under control of the OS. This procedure is called
context switching . As long as the processor switches through all the threads fast enough, the user
perceives all of the threads as running at the same time. A multithreaded processor contains more than
one copy of its architectural state, so that more than one thread can be active at a time.
Multithreading does not improve the performance of an individual thread, because it does not increase
the ILP. However, it does improve the overall throughput of the processor, because multiple threads can
use processor resources that would have been idle when executing a single thread.
7.8.8 Multiprocessors
A multiprocessor system consists of multiple processors and a method for communication between the
processors. A common form of multi-processing in computer systems is symmetric multiprocessing
(SMP) , in which two or more identical processors share a single main memory. The multiple processors
may be separate chips or multiple cores on the same chip.
Multiprocessors can be used to run more threads simultaneously or to run a particular thread faster.
Running more threads simultaneously is easy; the threads are simply divided up among the processors.
7.9 Real-World Perspective: IA-32 Microarchitecture
Remark: Chapter 7.9 is excluded in this summary. It can be read in the book.
Chapter 8 - Memory Systems
8.1 Introduction
Computer system performance depends on the memory system as well as the processor
microarchitecture. Chapter 7 assumed an ideal memory system that could be accessed in a single
clock cycle. However, this would only be true for a very small memory or a very slow processor.
The processor communicates with the memory system over a memory interface . The processor sends
an address over the $Address$ bus to the memory system. For a read, $MemWrite$ is $0$ and the
memory returns the data on the $ReadData$ bus. For a write, $MemWrite$ is $1$ and the processor sends
data to memory on the $WriteData$ bus.
The principles used to make memory fast are temporal locality and spatial locality. Temporal locality
means that if you have used a piece of data recently, you are likely to use it again soon. Spatial locality
means that if you have used a piece of data, you are likely to use nearby data soon. Furthermore, memory
systems use a hierarchy of storage to quickly access the most commonly used data while still having
the capacity to store large amounts of data.
Computer memory is generally built from DRAM chips. But since the speed of processors grows much
faster than the speed of DRAM, which results in loss of overall performance, computers store the most
commonly used instructions and data in a faster but smaller memory, called cache. The cache is
usually built out of SRAM on the same chip as the processor. The cache speed is comparable to the
processor speed, because SRAM is inherently faster than DRAM, and because the on-chip memory
eliminates lengthy delays caused by traveling to and from a separate chip.
If the processor requests data that is available in the cache, it is returned quickly. This is called a
cache hit. Otherwise, the processor retrieves the data from the main memory (DRAM). This is called a
cache miss. The third level in the memory hierarchy is the hard drive. In the same way that a library
uses the basement to store books that do not fit in the stacks, computer systems use the hard drive to
store data that does not fit in main memory.
The hard drive provides an illusion of more capacity than actually exists in the main memory. It is thus
called virtual memory. The main memory can therefore be seen as a cache for the most commonly used
data from the hard drive.
8.2 Memory System Performance Analysis
Memory system performance metrics are miss rate or hit rate and average memory access time. Miss
and hit rates are calculated as:

$Miss \ Rate = \frac{\text{Number of misses}}{\text{Number of total memory accesses}} = 1 - Hit \ Rate$

$Hit \ Rate = \frac{\text{Number of hits}}{\text{Number of total memory accesses}} = 1 - Miss \ Rate$

Average memory access time (AMAT) is the average time a processor must wait for memory per load or
store instruction. AMAT is calculated as:

$AMAT = t_{cache} + MR_{cache}(t_{MM} + MR_{MM} t_{VM})$
where $t_{cache}, t_{MM}$, and $t_{VM}$ are the access times of the cache, main memory, and virtual
memory, and $MR_{cache}$ and $MR_{MM}$ are the cache and main memory miss rates, respectively.
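As a small sketch, the AMAT formula can be evaluated in Python. The cycle counts and miss rates below are made-up illustrative values, not figures from the book:

```python
def amat(t_cache, t_mm, t_vm, mr_cache, mr_mm):
    """Average memory access time: AMAT = t_cache + MR_cache * (t_MM + MR_MM * t_VM).

    Times are in clock cycles; miss rates are fractions between 0 and 1.
    """
    return t_cache + mr_cache * (t_mm + mr_mm * t_vm)

# Assumed numbers: 1-cycle cache, 100-cycle main memory, 1,000,000-cycle
# hard drive, 10% cache miss rate, 1% main memory miss rate.
print(amat(t_cache=1, t_mm=100, t_vm=1_000_000, mr_cache=0.1, mr_mm=0.01))
```

Note how even a 1% main memory miss rate dominates the result, because $t_{VM}$ is orders of magnitude larger than the other access times.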
8.3 Caches
A cache holds commonly used memory data. The number of data words that it can hold is called the
capacity, $C$. As we explain in the following sections, caches are specified by their capacity ($C$),
number of sets ($S$), block size ($b$), number of blocks ($B$), and degree of associativity ($N$).
8.3.1 What Data is Held in the Cache?
The cache must guess what data will be needed based on the past pattern of memory accesses. In
particular, the cache exploits temporal and spatial locality to achieve a low miss rate. Recall that
temporal locality means that the processor is likely to access a piece of data again soon if it has
accessed that data recently. Spatial locality means that, when the processor accesses a piece of data,
it is also likely to access data in nearby memory locations. Therefore, when the cache fetches one word
from memory, it may also fetch several adjacent words. The number of words in the cache block, $b$, is
called the block size . A cache of capacity $C$ contains $B = \frac{C}{b}$ blocks.
8.3.2 How is Data Found?
A cache is organized into $S$ sets, each of which holds one or more blocks of data. The relationship
between the address of data in main memory and the location of that data in the cache is called the
mapping. Each memory address maps to exactly one set in the cache. We distinguish between the
following types of mappings:
Direct Mapped Cache
A direct mapped cache has one block in each set, so it is organized into $S = B$ sets. An address in
block 0 of main memory maps to set 0 of the cache, an address in block 1 of main memory maps to
set 1 of the cache, and so forth.
Because many addresses map to a single set, the cache must also keep track of the address of the
data actually contained in each set. The two least significant bits of the 32-bit address are called the
byte offset, because they indicate the byte within the word. The next three bits are called the set bits,
because they indicate the set to which the address maps (generally, the number of set bits is
$\log_2 S$). The remaining 27 tag bits indicate the memory address of the data stored in a given cache
set. The cache uses a valid bit for each set to indicate whether the set holds meaningful data. If the
valid bit is 0, the contents are meaningless.
The cache is constructed as an eight-entry SRAM. Each entry, or set, contains one line consisting of 32
bits of data, 27 bits of tag, and 1 valid bit.
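The address breakdown for this eight-set example can be sketched in Python. This is a minimal illustration; the function name is invented, and the field widths follow the 2 byte-offset bits, 3 set bits, and 27 tag bits described above:

```python
S = 8                  # number of sets
SET_BITS = 3           # log2(S)
BYTE_OFFSET_BITS = 2   # byte within a 32-bit word

def decompose(addr):
    """Split a 32-bit byte address into (tag, set index, byte offset)."""
    byte_offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    set_index = (addr >> BYTE_OFFSET_BITS) & (S - 1)
    tag = addr >> (BYTE_OFFSET_BITS + SET_BITS)
    return tag, set_index, byte_offset

# Word addresses 0x00 and 0x20 are 8 words apart, so they map to the
# same set with different tags -- a conflict if both are in use.
print(decompose(0x00000000))  # (tag 0, set 0, offset 0)
print(decompose(0x00000020))  # (tag 1, set 0, offset 0)
```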
When two recently accessed addresses map to the same cache block, a conflict occurs, and the most
recently accessed address evicts the previous one from the block.
Multi-way Set Associative Cache
An $N$-way set associative cache reduces conflicts by providing $N$ blocks in each set where data
mapping to that set might be found. Each memory address still maps to a specific set, but it can map
to any one of the $N$ blocks in the set. $N$ is also called the degree of associativity of the cache. The
following figure shows the hardware for a $C=8$-word, $N=2$-way set associative cache. The cache
now has only $S= 4$ sets rather than 8. Thus, only $\log_2 4 = 2$ set bits rather than 3 are used to select
the set.
Set associative caches generally have lower miss rates than direct mapped caches of the same
capacity, because they have fewer conflicts. However, set associative caches are usually slower and
somewhat more expensive to build because of the output multiplexer and additional comparators.
Fully Associative Cache
A fully associative cache contains a single set with $B$ ways, where $B$ is the number of blocks. A
fully associative cache is another name for a $B$-way set associative cache with one set. The
following figure shows the SRAM array of a fully associative cache with eight blocks. Upon data
request, eight tag comparisons must be made, because the data could be in any block.
Fully associative caches tend to have the fewest conflict misses for a given cache capacity, but they
require more hardware for additional tag comparison.
Block Size
To exploit spatial locality, a cache uses larger blocks to hold several consecutive words. The
advantage of a block size greater than one is that when a miss occurs and the word is fetched into the
cache, the adjacent words in the block are also fetched. Therefore, subsequent accesses are more likely
to hit because of spatial locality. The following figure shows the hardware of a $C = 8$-word direct
mapped cache with a $b=4$-word block size. The cache now has only $B = \frac{C}{b} = 2$ blocks.
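A sketch of the field extraction for this $b = 4$-word, $B = 2$-block configuration (illustrative Python; the two block-offset bits sit between the byte offset and the single set bit):

```python
b = 4                   # words per block
B = 2                   # number of blocks (= sets, since the cache is direct mapped)
BYTE_OFFSET_BITS = 2
BLOCK_OFFSET_BITS = 2   # log2(b)
SET_BITS = 1            # log2(B)

def map_address(addr):
    """Split a byte address into (tag, set index, block offset)."""
    block_offset = (addr >> BYTE_OFFSET_BITS) & (b - 1)
    set_index = (addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS)) & (B - 1)
    tag = addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS + SET_BITS)
    return tag, set_index, block_offset

# The four consecutive words 0x00, 0x04, 0x08, 0x0C all land in set 0 of the
# same block, so one miss fetches all of them and the next three accesses hit.
print([map_address(a) for a in (0x00, 0x04, 0x08, 0x0C)])
```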
Putting it All Together
Caches are organized as two-dimensional arrays. The rows are called sets, and the columns are called
ways. Each entry in the array consists of a data block and its associated valid and tag bits.
8.3.3 What Data is Replaced?
If a set is full when new data must be loaded, the block in that set is replaced with the new data. In
set associative and fully associative caches, the cache must choose which block to evict when a set is
full. The principle of temporal locality suggests that the best choice is to evict the least recently used
block, because it is least likely to be used again soon. Hence, most associative caches have a least
recently used (LRU) replacement policy.
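The LRU policy for one $N$-way set can be sketched as follows (a minimal Python illustration; the class and method names are invented for this example):

```python
class LRUSet:
    """One N-way set with least recently used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = []              # tags ordered from LRU (front) to MRU (back)

    def access(self, tag):
        """Return True on a hit; on a miss, load the tag, evicting the LRU block if full."""
        if tag in self.blocks:
            self.blocks.remove(tag)   # hit: move tag to the MRU position
            self.blocks.append(tag)
            return True
        if len(self.blocks) == self.ways:
            self.blocks.pop(0)        # miss with full set: evict the least recently used
        self.blocks.append(tag)
        return False

s = LRUSet(ways=2)
s.access(0xA)   # miss
s.access(0xB)   # miss
s.access(0xA)   # hit: 0xB is now the least recently used block
s.access(0xC)   # miss in a full set: evicts 0xB
print(s.blocks) # 0xA and 0xC remain
```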
8.3.4 Advanced Cache Design
Multi Level Caches
Large caches are beneficial because they are more likely to hold data of interest and therefore have
lower miss rates. Modern systems often use at least two levels of caches, as shown in the following
figure. The first-level (L1) cache is small enough to provide a one- or two-cycle access time. The
second-level (L2) cache is also built from SRAM but is larger, and therefore slower, than the L1 cache.
Reducing Miss Rate
Cache misses can be reduced by changing capacity, block size, and/or associativity. We classify the
cache misses into the following categories:
Compulsory miss: The first request to a cache block is called a compulsory miss, because it must be
read from memory regardless of the cache design.
Capacity miss: Occurs when the cache is too small to hold all concurrently used data.
Conflict miss: Caused when several addresses map to the same set and evict blocks that are still
needed.
Changing cache parameters can affect one or more types of cache miss.
8.4 Virtual Memory
Most modern computer systems use a hard drive made of magnetic or solid state storage as the lowest
level in the memory hierarchy. A hard drive made of magnetic storage is also called a hard disk. As the
name implies, the hard disk contains one or more rigid disks or platters, each of which has a
read/write head on the end of a long triangular arm. The head moves to the correct location on the disk
and reads or writes data magnetically as the disk rotates beneath it.
The object of adding a hard drive to the memory hierarchy is to inexpensively give the illusion of a very
large memory while still providing the speed of faster memory for most accesses. A computer with only
128MB of DRAM, for example, could effectively provide 2GB of memory using the hard drive. This larger
2-GB memory is called virtual memory , and the smaller 128-MB main memory is called physical
memory.
Programs can access data anywhere in virtual memory, so they must use virtual addresses that specify
the location in virtual memory. Virtual memory is divided into virtual pages , typically 4KB in size.
Physical memory is likewise divided into physical pages of the same size. A virtual page may be
located in physical memory (DRAM) or on the hard drive. The process of determining the physical
address from the virtual address is called address translation. If the processor attempts to access a
virtual address that is not in physical memory, a page fault occurs, and the operating system loads the
page from the hard drive into physical memory.
8.4.1 Address Translation
In a system with virtual memory, programs use virtual addresses so that they can access a large
memory. The computer must translate these virtual addresses to either find the address in physical
memory or take a page fault and fetch the data from the hard drive. The most significant bits of the
virtual or physical address specify the virtual or physical page number . The least significant bits
specify the word within the page and are called the page offset.
8.4.2 The Page Table
The processor uses a page table to translate virtual addresses to physical addresses. The page table
contains an entry for each virtual page. This entry contains a physical page number and a valid bit. If
the valid bit is $1$, the virtual page maps to the physical page specified in the entry. Otherwise, the
virtual page is found on the hard drive. The page table can be stored anywhere in physical memory.
The processor typically uses a dedicated register, called the page table register , to store the base
address of the page table in physical memory.
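The translation walk can be sketched in Python for 4KB pages (the page table contents below are made-up illustration values, not from the book):

```python
PAGE_OFFSET_BITS = 12            # log2(4KB): low 12 bits are the page offset

page_table = {                   # virtual page number -> (valid bit, physical page number)
    0x00000: (1, 0x7),           # VPN 0 is resident in physical page 7
    0x00001: (0, None),          # valid bit 0: this page lives on the hard drive
}

def translate(vaddr):
    """Return the physical address for vaddr, or raise on a page fault."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, ppn = page_table.get(vpn, (0, None))
    if not valid:
        raise LookupError("page fault: the OS must load the page from the hard drive")
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x00000123)))  # 0x7123: VPN 0 maps to PPN 0x7, offset kept
```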
8.4.3 The Translation Lookaside Buffer
Virtual memory would have a severe performance impact if it required a page table read on every load
or store, doubling the delay of loads and stores. Fortunately, page table accesses have great temporal
locality. In general, the processor can keep the last several page table entries in a small cache called
a translation lookaside buffer (TLB) . The processor "looks aside" to find the translation in the TLB before
having to access the page table in physical memory.
8.4.4 Memory Protection
As we already know, modern computers typically run several programs or processes at the same time.
In a well-designed computer system, the programs should be protected from each other. Specifically,
no program should be able to access another program's memory without permission. This is called
memory protection .
Virtual memory systems provide memory protection by giving each program its own virtual address
space . Each program can use its entire virtual address space without having to worry about where
other programs are physically located. However, a program can access only those physical pages that
are mapped in its page table. In this way, a program cannot accidentally or maliciously access
another program's physical pages.