CHAPTER 1
Introduction:
What is Pipelining?
Definition:
• In computing, a pipeline is a set of data processing elements connected in series, so that
the output of one element is the input of the next one
• An instruction pipeline is a technique used in the design of computers and other digital
electronic devices to increase their instruction throughput (the number of instructions that
can be executed in a unit of time).
The fundamental idea is to split the processing of a computer instruction into a series of
independent steps, with storage at the end of each step. This allows the computer's control
circuitry to issue instructions at the processing rate of the slowest step, which is much faster than
the time needed to perform all steps at once. The term pipeline refers to the fact that each step is
carrying data at once (like water), and each step is connected to the next (like the links of a pipe).
Most modern CPUs are driven by a clock. The CPU consists internally of logic and registers
(flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then
requires a period of time to decode the new values. Then the next clock pulse arrives and the
flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and
inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is
reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is
broken into four stages with a set of flip-flops between each stage.
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access & Register write back
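As a rough illustration of inserting flip-flops between pieces of logic, here is a minimal two-stage
Verilog sketch (not taken from this project; the two logic functions are placeholders):

module two_stage_sketch(input clk, input [7:0] in, output reg [7:0] out);
  reg  [7:0] stage_reg;                 // flip-flops inserted between the two logic pieces
  wire [7:0] logic1 = in + 8'd1;        // first piece of combinational logic (placeholder)
  wire [7:0] logic2 = stage_reg << 1;   // second piece of combinational logic (placeholder)
  always @(posedge clk) begin
    stage_reg <= logic1;                // on each clock edge, results advance one stage
    out       <= logic2;
  end
endmodule

Because each clock period now has to cover only one of the two logic pieces (plus the register
delay), the clock can run faster than if both pieces sat between a single pair of flip-flops.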
When a programmer (or compiler) writes assembly code, they make the assumption that each
instruction is executed before execution of the subsequent instruction is begun. This assumption
is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is
known as a hazard. Various techniques for resolving hazards such as forwarding and stalling
exist.
A non-pipeline architecture is inefficient because some CPU components (modules) are idle
while another module is active during the instruction cycle. Pipelining does not completely
cancel out idle time in a CPU, but making those modules work in parallel improves program
execution significantly.
Processors with pipelining are organized internally into stages which can work semi-independently
on separate jobs. Each stage is organized and linked into a 'chain', so each stage's output is fed to
another stage until the job is done. This organization of the processor allows overall processing
time to be significantly reduced.
A deeper pipeline means that there are more stages in the pipeline, and therefore, fewer logic
gates in each stage. This generally means that the processor's frequency can be increased as the
cycle time is lowered. This happens because there are fewer components in each stage of the
pipeline, so the propagation delay is decreased for the overall stage.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an
instruction may require 4 stages. To operate at full performance, this pipeline will need to run 3
subsequent independent instructions while the first is completing. If 3 instructions that do not
depend on the output of the first instruction are not available, the pipeline control logic must
insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately,
techniques such as forwarding can significantly reduce the cases where stalling is required.
While pipelining can in theory increase performance over an unpipelined core by a factor of the
number of stages (assuming the clock frequency also scales with the number of stages), in
reality, most code does not allow for ideal execution.
Pipelining incorporates the concept of time-overlapped handling of several identical tasks by
several non-identical stages. Each stage is made to handle a distinct section / sub-task of
each of the tasks. At any point of time, ideally, each of the stages is busy processing its
own sub-part belonging to a different task; hence if there are N stages then ideally N tasks are
being processed concurrently, i.e. the Nth task has started without any of the earlier N-1 tasks
being complete. Each of the stages can proceed independently of all the other stages provided it
has some job to do / some input to handle. Each of the stages, except the very first, gets
its input from the previous stage and feeds the next stage [except the very last one].
Project Objective:
The main objective of this project is to design a new pipelined RISC processor and to develop
Verilog code for it, verifying it through simulation.
We designed a pipelined RISC processor from the ground up to implement some simple
operations like AND, OR, MOVE, STORE, ADD and SUBTRACT, simulated it using the code
we developed, and found the results satisfactory.
Usage of Pipelining
1. Used in many everyday applications without our noticing:
a) Concrete casting involving a number of people passing on the concrete mix among different
levels.
b) Fire-fighting (passing buckets of water along a line).
2. Has proved to be a very popular and successful way to exploit Instruction Level
Parallelism [explained in the next section].
Instruction pipelines are used in almost all modern processors.
Consider a sufficiently large number of identical tasks [dumping concrete mix, throwing a
bucket of water, executing instructions in a computer].
Break up each task into several smaller sub-tasks. Design and employ one sub-unit for carrying
out each of these sub-tasks. Each sub-unit takes input from its previous stage / unit and
delivers output to its next stage / unit. Keep each of these sub-units busy all the time, i.e.
operate them in a time-overlapped fashion. If there are N sub-units and the slowest among
them takes K units of time, then once the line is full our assembly line will complete one task
every K units of time.
Classic Examples:
Consider the way in which any typical undergraduate engineering college works:
1. It offers a 4-year curriculum.
2. It has facilities [sub-units] to train / teach students of a particular year.
3. Starting from a particular year onwards, it admits M students every year. After the first 4
years, the number of students graduating per year = M in each of the subsequent years,
assuming no failures / an ideal scenario: pipelining student admissions.
Salient Features of This Pipeline:
1. Fixed number of stages: 4.
2. Identical stages: each training stage is of one year's duration, and each handles the same set
of students as was admitted in the first year.
3. No stage is starved of inputs: each gets an adequate number of students.
4. Synchronized stages: through the common exam schedule.
Time Overlapped Processing:
[Temporal Parallelism / Another Real Life Example]
A. Task: WASH, DRY & IRON ten (10) dirty clothes.
B. Units available [capacity] and time taken:
WASHER [can wash 5 clothes in one go] takes 40 minutes. [Hence the total time required to
WASH 10 dirty clothes = 80 minutes.]
DRIER [can dry 8 clothes at a time] takes 20 minutes. [Therefore the total time required to DRY
10 clothes = 40 minutes.]
IRON [manual ironing of 1 cloth at a time] takes 4 minutes. [Total time required to IRON 10
clothes = 40 minutes.]
TOTAL time needed if operated in a strict, time-non-overlapped sequence = 160 minutes.
Time Overlapped Processing
[ Temporal Parallelism / A Real Life Example - 2]
C. Time Overlapped Operation Sequence:
1. Put 5 clothes in the WASHER [DRIER, IRON idle].
2a. After 40 minutes [WASHER finishes washing the 1st lot], put the washed clothes [5] in the
DRIER to dry.
2b. Load the WASHER with the left-over 5 clothes, so for the subsequent period both WASHER
and DRIER get to work in a time-overlapped fashion. IRON is still idle.
Time Overlapped Processing
[ Temporal Parallelism / A Real Life Example - 3]
3a. After 20 minutes [total 60 minutes] the DRIER will finish; one can take the clothes for
IRONING (provided there is space to keep those clothes). Meanwhile the WASHER is still
washing. The DRIER is idle.
3b. After 20 more minutes [total 80 minutes] IRONING of the first 5 clothes is finished, while
the WASHER has also finished washing all 10 clothes. The DRIER remains idle.
Time Overlapped Processing
[ Temporal Parallelism / A Real Life Example - 4]
4. Engaging the DRIER to dry the remaining 5 clothes takes 20 more minutes [total time taken =
100 minutes]. The IRONING activity is idle due to lack of clothes. The WASHER could be kept
busy if more clothes were there.
5. IRONING these clothes will take 20 more minutes [total time taken = 120 minutes].
Time Overlapped Processing
[Real Life Example – Key Observations]
1. Net time saved due to time-overlapped processing = 160 - 120 = 40 minutes.
2. Slowest stage in the pipeline = IRONING.
3. After all the 3 stages (WASHER, DRIER, IRON) have been made busy (after step 3a), one
will get one cloth ready every 4 minutes.
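In summary, using the figures computed above:

\[
T_{\text{sequential}} = 80 + 40 + 40 = 160\ \text{min}, \qquad
T_{\text{overlapped}} = 120\ \text{min}, \qquad
\text{saving} = 160 - 120 = 40\ \text{min}.
\]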
Time Overlapped Usage of Different Processing Stations in an Assembly Line
( General Observations) – 1
Motivation: To Decrease the Processing Time of a Number of Identical Jobs.
The trick is to sub-divide the entire processing of a single job into a number of sub-tasks.
Each sub-task is to be handled by a separate processing station / stage.
Time Overlapped Usage of Different Processing Stations in an Assembly Line
( General Observations) – 2
4. Each of the processing stages should have some input to work on in order to keep that unit
busy as often as possible.
5. Each of the processing stages except the very last one generates some output to be consumed
by the next processing stage only.
6. Each of these processing stages may not take the same time and also need not be
synchronized. Hence there will have to be some intermediate store / buffer to temporarily hold
the inputs to any particular processing station.
Time Overlapped Usage of Different Processing Stations in an Assembly Line
( General Observations) – 3
7. Since each processing stage depends on its predecessor stage only, and feeds its successor
stage only, one cannot reduce the processing time for any particular task / job below the slowest
processing stage's processing time.
8. Each task normally passes through each of the processing stages regardless of its requirements.
9. Hence the time taken to process a single task may increase as compared to the case where the
given task is processed based on its specific requirements, since a task may have to go through
some unnecessary stages.
Time Overlapped Usage of Different Processing Stations in an Assembly Line
( General Observations) – 4
10. System throughput, i.e. the number of tasks completed over a specific period of time, will
increase because of the time-overlapped operation of the various processing stages.
11. However, if during the course of processing any of the processing stages fails / stalls, then
the entire assembly line will either crash or get stalled.
Using Pipeline Inside a Computer
( Salient Queries - 1)
1. How is this assembly-line concept applicable to instruction processing in a typical
computer?
Ans. a) The CPU of any computer essentially fetches, decodes and then executes instructions
belonging to a program.
b) The processing of each instruction is composed of an almost identical set of stages / machine
cycles. Hence one can view the CPU as an assembly line for instruction processing.
Using Pipeline Inside a Computer
( Salient Queries - 2)
2. Is the improved throughput, i.e. the number of tasks completed over a period of time,
dependent on / proportional to the number of processing / pipeline stages?
To be answered later in the context of instruction processing in a computer.
CHAPTER 2
The Typical Instruction Handling Sequence in a CPU
Typical Instruction Processing Stages Inside a CPU- 1
1. Fetch the instruction op-code [CISC] / the entire instruction [RISC] from the instruction
cache / memory into the Instruction Register using the Instruction Pointer / PC (appended to the
Code Segment Register), and update the Instruction Pointer / PC to point to the next instruction.
[IF]
2. Decode the instruction op-code inside the CPU and select some register operands [RISC] (in
this case the Instruction Pointer / PC can be used to fetch the next instruction), or decide on
future operand address reads as well as the next instruction location, as in CISC. Update the PC
accordingly. [ID]
Typical Instruction Processing Stages Inside a CPU- 2
3. Read operand addresses into the Instruction Register from the I-cache using the instruction
memory address register [CISC only]. [ROA] May have to be carried out a number of times,
once for each of the operand addresses. (Optional) Not required for RISC.
4. Execute the instruction's processing op-code / calculate the linear operand address offset
using the ALU. [EX] In the former case (processing), the operation may vary in time depending
on the type of operation being carried out.
Typical Instruction Processing Stages Inside a CPU – 3
5. Read operand values from the data cache / memory using the computed linear offset as
obtained in the previous step, appended to the appropriate segment registers
(DATA / STACK / EXTRA). [MEM]
N.B.: For CISC, the above two steps 4 & 5 may need to be executed a number of times, once
each for reading each of the operand addresses and at least once for performing the computation.
This computation time need not be fixed.
Typical Instruction Processing Stages Inside a CPU - 4
6. Write back the result [into the designated destination]. [WB] In case of memory being the
destination, the processor needs to compute the linear address offset using step 4.
7. Interrupt handling: here the main issues are two-fold, namely preserving the current context
in the system stack, followed by computing / locating the target and loading it into the
Instruction Pointer.
One can time-overlap these operations provided:
A. There is no resource conflict among the various stages [no structural hazards].
B. Each instruction, once in the pipeline, in no way affects the execution pattern of any of its
successor instructions in the pipeline [there exists no inter-instruction dependency in the form
of either data hazards or control hazards].
A Representative RISC Processor [MIPS / DLX] – Salient Features:
1. 32-bit processor, i.e. it can handle 32-bit operands in one go.
2. Fixed instruction length (32 bits); hence an instruction can be fetched in one machine cycle.
3. Load-store architecture, i.e. all the source operands need to be brought into some CPU
register before processing, and all results are computed into some CPU register before being
stored in some memory location.
4. Restricted addressing modes [register direct, indexed, relative, implied].
5. Large GPR file set.
MIPS, a RISC processor, uses the following 5-stage pipeline:
1. IF: Instruction fetch from instruction memory.
2. ID: Decode operands and select CPU register operands.
3. EX: ALU operation or memory data operand linear address generation.
4. MEM: Data memory reference to read operand values.
5. WB: Write back into the CPU register file.
MIPS Pipeline Stages
The 5 stages of the MIPS pipeline:
1. IF Stage: Needs access to the program memory to fetch the whole instruction.
Needs a dedicated adder to update the PC.
2. ID Stage: Needs access to the register file.
3. EX Stage: Needs an ALU and associated registers.
4. MEM Stage: Needs access to the data memory.
5. WB Stage: Needs access to the register file for writing the result.
Pipeline Registers:
Pipeline registers are an essential part of pipelines, serving as inter-stage buffers / latches.
There are N-1 groups of pipeline registers in an N-stage pipeline, one group lying between two
successive pipeline stages.
Each stage, after completing its part of the processing, saves all the relevant outputs it has
generated into the intermediate register lying at its output. In the MIPS pipeline these registers
happen to be:
1. IF/ID (the Instruction Register writes the fetched instruction into this; condition code flags
are also written into it).
2. ID/EX (the ALU operand registers are written from it; hence this stores the contents of all the
input register operands).
3. EX/MEM (the ALU result register + flags write into it).
4. MEM/WB (the memory data / buffer register writes into it).
This way, each time something is computed (an effective address, an immediate value, register
contents, etc.), it is saved and can be made available in the context of the instruction that needs it.
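As a rough Verilog sketch of such an inter-stage buffer (the names and widths here are assumed
for illustration, not taken from this project's design), an IF/ID register might look like:

module if_id_reg(input clk, input rst,
                 input  [31:0] instr_in,      input  [31:0] pc_in,
                 output reg [31:0] instr_out, output reg [31:0] pc_out);
  // Everything the IF stage produces is latched on the clock edge,
  // so the ID stage sees a stable copy one cycle later.
  always @(posedge clk) begin
    if (rst) begin instr_out <= 32'b0;    pc_out <= 32'b0; end
    else     begin instr_out <= instr_in; pc_out <= pc_in; end
  end
endmodule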
Pipeline Register Depiction
Historically, there are two different types of pipelines:
1. Instruction pipelines
2. Data / arithmetic pipelines [the SIMD case]
Arithmetic pipelines (e.g. floating-point processing) are mostly found within special-purpose
processors / co-processors, since they are employed only occasionally and such data pipelines
need a continuous stream of arithmetic operations, e.g. vector processors operating on an array.
On the other hand, instruction pipelines are used in almost every modern processor to increase
instruction execution throughput. They are assumed as the default here.
Instruction Level Parallelism [ILP]:
It is a measure of how many of the instructions in a computer program can be executed
simultaneously [in a time-overlapped fashion] without violating the various inter-instruction
dependencies that may exist.
Consider the following program:
I#1. e = a + b
I#2. f = c + d
I#3. g = e * f
Instruction I#3 depends on the results of instruction I#1 as well as of instruction I#2 [a true
(data) [RAW] dependency].
However, instructions I#1 and I#2 do not depend on any other instruction, so they can be
executed simultaneously.
If we assume that each instruction can be completed in one unit of time, then these three
instructions can be completed in a total of two units of time, giving an ILP of 3/2.
Goal & Motivation to Achieve Speed-Up:
Ordinary programs are typically written under a sequential execution model, where instructions
execute one after the other and in the order specified by the programmer.
ILP allows the compiler and the processor to overlap the execution of multiple instructions or
even to change the order in which instructions are executed.
A goal of compiler and processor designers is to identify and take advantage of as much ILP as
possible in a specified sequential code.
How much ILP exists in programs is very application specific. In certain fields, such as graphics
[manipulation of individual pixels in a group] and scientific computing [matrix multiplication],
the amount can be very large. However, workloads such as cryptography exhibit much less
parallelism because of the inherent RAW data dependency among the constituent operations.
Micro-Architectural Techniques Used to Exploit ILP - 1
Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
Superscalar execution, in which multiple execution units are used to execute multiple
instructions in parallel. In typical superscalar processors, the instructions executing
simultaneously are adjacent in the original program order.
Out-of-order execution, where instructions execute in any order that does not violate data
dependencies. Note that this technique is independent of both pipelining and superscalar
execution.
Register renaming, a technique used to avoid unnecessary serialization of program operations
imposed by the reuse of registers by those operations, used to enable out-of-order execution.
Speculative execution, which allows the execution of complete instructions or parts of
instructions before it is certain whether this execution should take place.
A commonly used form of speculative execution is control-flow speculation, where instructions
past a control-flow instruction (e.g., a branch) are executed before the target of the control-flow
instruction is determined [branch prediction (used to avoid stalling while control dependencies
are resolved)].
Several other forms of speculative execution have been proposed and are in use, including
speculative execution driven by value prediction, memory dependence prediction and cache
latency prediction.
Factors Affecting ILP Implementation
Inter-instruction dependencies: data dependency & control dependency.
Various Types of Data Dependencies
A data dependency in computer science is a situation in which a program statement (instruction)
refers to the data / operand of a preceding statement / instruction in some way or other.
In compiler theory, the technique used to discover data dependencies among statements (or
instructions) is called dependence analysis.
Data Dependency
Defn: Let us consider that in any computer program there are two statements S1 & S2, where
statement S1 precedes statement S2 in the program.
Statement S2 is said to be data dependent on statement S1 if any one of the following 3 cases
exists.
Data Dependency Conditions
Bernstein Conditions: Assuming statements S1 and S2, S2 depends on S1 if:
[I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ ∅
where I(Si) is the set of memory locations read by Si, O(Sj) is the set of memory locations
written by Sj, and there is a feasible run-time execution path from S1 to S2. This condition is
called the Bernstein Condition, named after A. J. Bernstein.
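Applying the condition to the three-instruction program from the ILP example above, with
S1: e = a + b and S2: g = e * f:

\[
I(S_1)=\{a,b\},\quad O(S_1)=\{e\},\quad I(S_2)=\{e,f\},\quad O(S_2)=\{g\}
\]
\[
O(S_1)\cap I(S_2)=\{e\}\neq\varnothing \;\Rightarrow\; S_2 \text{ is data (RAW) dependent on } S_1.
\]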
Cases of Data Dependency
True (data) dependence: O(S1) ∩ I(S2) ≠ ∅.
Statement S1 precedes statement S2, and S1 writes into some place (memory / register) that will
be read by the successor statement S2. [Read After Write (RAW)]
Anti- (name) dependence: I(S1) ∩ O(S2) ≠ ∅, the mirror relationship of true dependence: here
the predecessor instruction S1 reads from some memory location or register which is later
modified / written by the successor instruction S2. [Write After Read (WAR)]
Output dependence: O(S1) ∩ O(S2) ≠ ∅, S1 -> S2, and both instructions S1 & S2 write to the
same memory location or register. [Write After Write (WAW)]
True Data [RAW] Dependency – 1
Statement S1 precedes statement S2, and S1 writes into some place (memory / register) that will
be read by the successor statement S2. [Read After Write (RAW)]
Example:
A true dependency, also known as a data dependency, occurs when an instruction depends on
the result of a previous instruction:
I#1. A = 3
I#2. B = A
I#3. C = B
True Data [RAW] Dependency - 2
Here instruction I#3 is truly dependent on instruction I#2, as the final value of C depends on the
instruction updating B. Instruction I#2 is truly dependent on instruction I#1, as the final value of
B depends on the instruction updating A.
Since instruction I#3 is truly dependent upon instruction I#2 and instruction I#2 is truly
dependent on instruction I#1, instruction I#3 is also truly dependent on instruction I#1.
Instruction level parallelism is therefore not an option in this example.
Anti (Name) [WAR] Dependency
An anti-dependency occurs when an instruction requires a value that is later updated. In the
following example, instruction I#3 anti-depends on instruction I#2: the ordering of these
instructions cannot be changed, nor can they be executed in parallel (possibly changing the
instruction ordering), as this would affect the final value of A.
I#1. B = 3
I#2. A = B + 1
I#3. B = 7
An anti-dependency is an example of a name dependency. That is, renaming of variables could
remove the dependency, as shown below:
Removing Anti-Dependency through Renaming of Variables
I#1. B = 3
I#N. B2 = B
I#2. A = B2 + 1
I#3. B = 7
Here a new variable, B2, has been declared as a copy of B in a new instruction, instruction I#N.
The anti-dependency between instruction I#2 and instruction I#3 has been removed, meaning
that these instructions may now be executed in parallel. However, the modification has
introduced a new set of RAW dependencies: instruction I#2 is now truly dependent on
instruction I#N, which is in turn truly dependent upon instruction I#1.
Being true dependencies, these new dependencies are impossible to safely remove.
Output [WAW] Dependency
An output dependency occurs when the ordering of instructions will affect the final output value
of a variable. In the example above (I#1. B = 3, I#2. A = B + 1, I#3. B = 7), there is an output
dependency between instructions I#3 and I#1, since both write to the same variable B; changing
their relative order would change the final value of B.
However, dependencies among statements or instructions may hinder parallelism, i.e. the
parallel execution of multiple instructions, either by a parallelizing compiler or by a processor
exploiting instruction-level parallelism [ILP].
Recklessly executing multiple instructions without considering related dependences may
produce wrong results; such situations are known as hazards.
Non-Pipelined Floating-Point Processing:
Pipelined Floating-Point Processing:
Pipeline cycle:
The time required to move an instruction one step further in the pipeline.
Not to be confused with the clock cycle. Determined by the time required by the slowest stage.
Basic Pipelining Terminologies
Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage.
For a perfectly balanced N-stage pipeline, the execution time per instruction is t/N,
where t is the execution time per instruction on the non-pipelined machine and N is the number
of pipeline stages.
However, it is very difficult to make the different pipeline stages perfectly balanced, so different
pipeline stages may have different processing times.
Besides, pipelining itself involves some overhead arising due to the registers / latches used
between two successive pipeline stages.
Some Important Pipeline Issues
Timing Factors in a Typical Pipeline
Pipeline cycle τ:
If the inter-stage latch / register delay is d, and τ_m is the processing time of stage m, then
τ = max{τ_m} + d
Pipeline frequency: f = 1/τ
Ideal Pipeline Speedup
A k-stage pipeline processes n tasks in k + (n-1) clock cycles:
k cycles for the first task and n-1 cycles for the remaining n-1 tasks.
Total time to process n tasks:
T_k = [k + (n-1)] τ
For the non-pipelined processor:
T_1 = n k τ   [n tasks pass through k stages, each having delay τ]
Pipeline Speedup Expression
Speedup S_k = T_1 / T_k = n k τ / ([k + (n-1)] τ) = n k / [k + (n-1)]
Observe that the memory bandwidth must increase by a factor of S_k; otherwise, the processor
would stall waiting for data to arrive from memory.
Exercise – 1
Consider an unpipelined processor that:
takes 4 cycles for ALU and other operations, and
5 cycles for memory operations.
Assume the relative frequencies:
ALU and other = 60%, memory operations = 40%.
Cycle time = 1 ns. Compute the speedup due to pipelining;
ignore the effects of branching and assume a pipeline overhead of 0.2 ns.
Solution
Average instruction execution time, over a large number of instructions:
unpipelined = 1 ns * (60% * 4 + 40% * 5) = 4.4 ns
pipelined = 1 + 0.2 = 1.2 ns
Speedup = 4.4 / 1.2 ≈ 3.7 times
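Written out as equations:

\[
T_{\text{unpipelined}} = 1\,\text{ns}\times(0.6\times 4 + 0.4\times 5) = 4.4\,\text{ns},\qquad
T_{\text{pipelined}} = 1 + 0.2 = 1.2\,\text{ns},\qquad
S = \frac{4.4}{1.2}\approx 3.7.
\]

Note that without the 0.2 ns latch overhead the pipelined time would be 1.0 ns and the speedup
4.4; the overhead alone therefore costs roughly one-sixth of the ideal gain.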
Pipeline Types:
Synchronous pipeline:
Either the pipeline cycle is constant, OR
the pipeline cycle through any pipeline stage is an integer multiple of the clock period, known
a priori to each of the pipeline stages, so each stage knows when its input will be available.
N.B.: Assumed as the default.
Asynchronous pipeline:
The time for moving from stage to stage varies.
Individual stages need not be aware of the timing of any other stage.
Handshaking communication between stages.
A stage may have to wait for input availability, thereby requiring interlocking of stages.
Synchronous Pipeline
Transfers between stages are simultaneous.
One task or operation enters the pipeline per cycle.
No. of Pipeline Stages vs Performance – 1
Various Pipelined Processing Stages 1 – 8086:
In the 8086, the Bus Interface Unit and the Execution Unit work independently (enabling
two-stage pipelined processing), so fetch and execution overlap. It is only 2-stage pipelining:
F E
  F E
    F E
F = fetch the instruction and decode it; E = execute the instruction and write into memory
Various Pipelined Processing Stages – 2:
Pipelined CPU – Memory Interface:
Pipelined CPU – GPR Interface:
Speedup Factors with Instruction Pipelining
Performance Evaluation Method
Amdahl's Law
Quantifies the overall performance gain due to an improvement in a part of a computation.
The performance improvement gained from using some faster mode of execution is limited by
the amount of time the enhancement is actually used.
Amdahl's Law:
Speedup = (execution time for the task without the enhancement) / (execution time for the task
using the enhancement)
Amdahl's Law and Speedup
Speedup tells us how much faster a machine will run due to an enhancement.
For using Amdahl's law, two things should be considered:
1. The fraction of the computation time in the original machine that can use the enhancement.
If a program executes in 30 seconds and 15 seconds of the execution can use the enhancement,
the fraction = 1/2. This value, termed Fraction(enhanced), is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode, that is, how much faster the task
would run if the enhanced mode were used for the entire program. If the enhanced task takes
3.5 seconds and the original task took 7 seconds, we say the speedup is 2.
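Combining the two factors gives the overall speedup (the standard form of Amdahl's law),
checked here with the numbers above, Fraction(enhanced) = 1/2 and an enhanced-mode
speedup of 2:

\[
\text{Speedup}_{\text{overall}}
= \frac{1}{(1-\text{Fraction}_{\text{enhanced}})+\dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
= \frac{1}{0.5 + 0.5/2} = \frac{1}{0.75} \approx 1.33.
\]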
CISC processors are not suitable for pipelining because of:
Variable instruction format.
Variable execution time.
Complex addressing modes.
RISC processors are suitable for pipelining because of:
Fixed instruction format.
Fixed execution time.
Limited addressing modes.
Advantages and Disadvantages:
Pipelining does not help in all cases, and there are several possible disadvantages. An instruction
pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A
pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of Pipelining:
1. An n-stage pipeline can improve performance up to n times.
2. Not much investment in hardware: no replication of hardware resources is necessary. The
principle deployed is to keep the units as busy as possible.
3. Transparent to the programmers: easy to use.
4. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most
cases.
5. Some combinational circuits such as adders or multipliers can be made faster by adding more
circuitry. If pipelining is used instead, it can save circuitry vs. a more complex combinational
circuit.
Pipelines: Few Key Observations - 1
A pipeline increases instruction throughput, but does not decrease the execution time of the
individual instructions. In fact, it slightly increases the execution time of each instruction due to
pipeline overheads, since each instruction passes through identical pipeline stages.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch
delays (in effect, every branch is delayed) and problems with serial instructions being executed
concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined
equivalent. This is because extra flip-flops must be added to the data path of a pipelined
processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a
pipelined processor is much harder to predict and may vary more widely between different
programs.
Pipeline Overheads
Pipeline register delay: caused by the set-up time of the registers.
Clock skew: the maximum difference between clock arrival times at any two registers.
Once the clock cycle is as small as the pipeline overhead, no further pipelining is useful; very
deep pipelines may therefore not be useful.
EXAMPLES:
Four Stages of an Instruction:
Instruction Fetch (F): fetch the instruction from the instruction memory.
Operand Fetch and Instruction Decode (D): fetch the operand data from the memory or registers
and decode the instruction.
Execute (E): calculate the memory address and/or execute the function.
Memory & Write-back (M): read the data from the data memory and write back to the register.
INSTRUCTIONS WAITING: A B C D (the queue shrinks each cycle: A B C D, B C D, C D, D)

Cycle:     1   2   3   4   5   6   7   8
FETCH      A   B   C   D   X   X   X   X
DECODE     X   A   B   C   D   X   X   X
EXECUTE    X   X   A   B   C   D   X   X
MEMORY     X   X   X   A   B   C   D   X

(Cycle 0: all four instructions are waiting; X = stage idle.)
INSTRUCTIONS COMPLETED:
This is a 4-stage pipeline; the boxes represent instructions independent of each other.
The top box is the list of instructions waiting to be executed, the bottom grey box is the list of
instructions that have been completed, and the middle white box is the pipeline.
Execution is as follows:
Time  Execution
0     Four instructions are awaiting execution.
1     The A instruction is fetched from memory.
2     The A instruction is decoded; the B instruction is fetched from memory.
3     The A instruction is executed (the actual operation is performed); the B instruction is
      decoded; the C instruction is fetched.
4     The A instruction's results are written back to the register file or memory; the B
      instruction is executed; the C instruction is decoded; the D instruction is fetched.
5     The A instruction is completed; the B instruction is written back; the C instruction is
      executed; the D instruction is decoded.
6     The B instruction is completed; the C instruction is written back; the D instruction is
      executed.
7     The C instruction is completed; the D instruction is written back.
8     The D instruction is completed.
9     All instructions are executed.
Bubble:
Instructions waiting: A B C D

Cycle:     0   1   2   3   4   5   6   7   8   9
FETCH      X   A   B   B   C   D   X   X   X   X
DECODE     X   X   A   O   B   C   D   X   X   X
EXECUTE    X   X   X   A   O   B   C   D   X   X
MEMORY     X   X   X   X   A   O   B   C   D   X

(O = bubble; X = stage idle.)
COMPLETED INSTRUCTIONS: a bubble in cycle 3 delays execution.
Bubble (computing):
When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing
useful happens. In cycle 2, the fetching of the 'B' instruction is delayed, and the decoding stage
in cycle 3 now contains a bubble. Everything "behind" the 'B' instruction is delayed as well, but
everything "ahead" of the 'B' instruction continues with execution.
Clearly, when compared to the execution above, the bubble yields a total execution time of 8
clock ticks instead of 7.
Bubbles are like stalls, in which nothing useful happens in the fetch, decode, execute and
write-back stages. A bubble can be implemented with a NOP (no-operation) code.
Example 2:
Pipelined Execution of Six Instructions:
[Timing diagram omitted: six instructions, each passing through stages F, D, E and M with stage
times F = 2, D = 3, E = 4 and M = 3 units; the shaded region is the time an instruction waits for
a processing unit (F, D, E or M).]
Total time taken for six instructions = 32 ....... (b)
= 2 + 3 + 4 + 3 + (6-1) * 4 (can be derived easily from the figure)
If there are N instructions, then the total time required
= (total time required for a single instruction) + (N-1) * (time of the slowest stage)
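In symbols, with τ_max the time of the slowest stage:

\[
T_N = T_1 + (N-1)\,\tau_{\max} = 12 + (6-1)\times 4 = 32 \text{ units.}
\]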
Similarly, in the 8086, with F+D (2+3 = 5 units) and E+M (4+3 = 7 units) as the two stages:
Total time = 12 + (6-1) * 7 = 47 .................. (c)
Throughput:
From (1):
Sequential processing = 72/6 = 12 units ....... (from (a) & (1))
8086 (2-stage) = 47/6 ≈ 8 units ....... (from (c) & (1))
4-stage pipelined processing = 32/6 ≈ 6 units ....... (from (b) & (1))
From the above we can conclude that with the use of the 4-stage pipelined architecture we
reduce the average time per instruction, so that the number of instructions processed by the
processor in a given time (the throughput) increases.
Example 3
A typical instruction to add two numbers might be ADD A, B, C, which adds the values
found in memory locations A and B, and then puts the result in memory location C. In a
pipelined processor the pipeline controller would break this into a series of tasks similar to:
LOAD R1, A
LOAD R2, B
ADD R3, R1, R2
STORE C, R3
LOAD next instruction
The locations 'R1', 'R2' and 'R3' are registers in the CPU. The values stored in memory
locations labeled 'A' and 'B' are loaded (copied) into the R1 and R2 registers, then added,
and the result (which is in register R3) is stored in a memory location labeled 'C'.
In this example the pipeline is three stages long: load, execute, and store. Each of the steps is
called a pipeline stage.
On a non-pipelined processor, only one stage can be working at a time so the entire
instruction has to complete before the next instruction can begin. On a pipelined processor,
all of the stages can be working at once on different instructions. So when this instruction is
at the execute stage, a second instruction will be at the decode stage and a 3rd instruction
will be at the fetch stage.
Pipelining doesn't reduce the time it takes to complete an instruction; it increases the
number of instructions that can be processed at once and reduces the delay between
completed instructions. The more pipeline stages a processor has, the more instructions it
can be working on at once and the less of a delay there is between completed instructions.
Every microprocessor manufactured today uses at least 2 stages of pipeline. (The Atmel
AVR and the PIC microcontroller each have a 2-stage pipeline.) Intel Pentium 4 processors
have 20-stage pipelines.
Example 4
To better visualize the concept, we can look at a theoretical 3-stage pipeline:
Stage     Description
Load      Read instruction from memory
Execute   Execute instruction
Store     Store result in memory and/or registers
and a pseudo-code assembly listing to be executed:
LOAD A, #40 ; load 40 in A
MOVE B, A ; copy A in B
ADD B, #20 ; add 20 to B
STORE 0x300, B ; store B into memory cell 0x300
This is how it would be executed:
Clock 1:  Load = LOAD | Execute = - | Store = -
The LOAD instruction is fetched from memory.

Clock 2:  Load = MOVE | Execute = LOAD | Store = -
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.

Clock 3:  Load = ADD | Execute = MOVE | Store = LOAD
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored
in the register A. In the meantime, the MOVE instruction is being executed. Since it must
move the contents of A into B, it must wait for the end of the LOAD instruction.

Clock 4:  Load = STORE | Execute = ADD | Store = MOVE
The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD
is calculating. And so on.

Note that, sometimes, an instruction will depend on the result of another one (like our MOVE
example). When more than one instruction references a particular location for an operand,
either reading it (as an input) or writing it (as an output), executing those instructions in an
order different from the original program order can lead to hazards (mentioned above). There
are several established techniques for either preventing hazards from occurring, or working
around them if they do.
Complications
Many designs include pipelines as long as 7, 10 and even 20 stages (as in
the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and
their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer
computing. The Xelerator X10q has a pipeline more than a thousand stages long. The
downside of a long pipeline is that when a program branches, the processor cannot know
where to fetch the next instruction from and must wait until the branch instruction finishes,
leaving the pipeline behind it empty. In the extreme case, the performance of a pipelined
processor could theoretically approach that of an un-pipelined processor, or even be slightly
worse, if all but one of the pipeline stages are idle and a small overhead is present between
stages. Branch prediction attempts to alleviate this problem by guessing whether the branch
will be taken or not and speculatively executing the code path that it predicts will be taken.
When its predictions are correct, branch prediction avoids the penalty associated with
branching. However, branch prediction itself can end up exacerbating the problem if
branches are predicted poorly, as the incorrect code path which has begun execution must
be flushed from the pipeline before resuming execution at the correct location.
In certain applications, such as supercomputing, programs are specially written to branch
rarely, and so very long pipelines can speed up computation by reducing cycle time. If
branching happens constantly, re-ordering branches such that the instructions more likely to
be needed are placed into the pipeline can significantly reduce the speed losses associated
with having to flush failed branches.
Self-Modifying Programs: Because of the instruction pipeline, code that the processor
loads will not immediately execute. Due to this, updates in the code very near the current
location of execution may not take effect because they are already loaded into the Prefetch
Input Queue. Instruction caches make this phenomenon even worse. This is only relevant
to self-modifying programs.
Mathematical pipelines: Mathematical or arithmetic pipelines are different from
instruction pipelines in that, when mathematically processing large arrays or vectors, a
particular mathematical process, such as a multiply, is repeated many thousands of times. In
this environment, an instruction need only kick off an event whereby the arithmetic logic
unit (which is pipelined) takes over and begins its series of calculations. Most of these
circuits can be found today in math processors and the math-processing sections of CPUs
like the Intel Pentium line.
History
Math processing (super-computing) began in earnest in the late 1970s with vector processors
and array processors: usually very large, bulky super-computing machines that needed special
environments and super-cooling of the cores. One of the early supercomputers was the Cyber
series built by Control Data Corporation. Its main architect was Seymour Cray, who later
resigned from CDC to head up Cray Research. Cray developed the X-MP line of
supercomputers, using pipelining for both the multiply and the add/subtract functions. Later,
Star Technologies took pipelining to another level by adding parallelism (several pipelined
functions working in parallel), developed by their engineer, Roger Chen. In 1984, Star
Technologies made another breakthrough with the pipelined divide circuit, developed by
James Bradley. By the mid 1980s, super-computing had taken off with offerings from many
different companies around the world.
Today, most of these circuits can be found embedded inside most microprocessors.
CHAPTER 3
ARCHITECTURE:
[Block diagram, flattened in the source: a four-stage pipeline organized as FETCH(1),
DECODE(2), EXECUTE(3) and MEMORY(4). Visible blocks: PC(1) with a +1 incrementer
and the PROGRAM MEMORY; INST REG(2) with its fields (IR)8-11 = R3, (IR)4-7 = R2 and
(IR)0-3 = R1, the DECODER, the micro-program memory (MPME) and the REGISTER
ARRAY; the operand register OA, RSE(3) and MPAE(3); the ALU with the ACCUMULATOR,
STR(4), RSM(4) and MPAM(4); the DATA MEMORY, the LDR register and a 2-to-1 MUX
(inputs 0/1) on the write-back path; CONTROL signals SAF, S, RD, WR and LRG.]
INSTRUCTION FORMAT
OPCODE   R3            R2        R1
AND      SOURCE        SOURCE    DESTINATION
OR       SOURCE        SOURCE    DESTINATION
ADD      SOURCE        SOURCE    DESTINATION
SUB      SOURCE        SOURCE    DESTINATION
MOVE     XXXX          SOURCE    DESTINATION
LOAD     XXXX          SOURCE    DESTINATION
STORE    DESTINATION   SOURCE    XXXX
NOT      XXXX          SOURCE & DESTINATION     XXXX
EXAMPLE PROGRAM
NUMBER   INSTRUCTION      OPERATION            BINARY CODE
I1       ADD R5 R4 R1     [R1] <- [R5]+[R4]    16'h0541
I2       SUB R6 R4 R7     [R7] <- [R4]-[R6]    16'h1647
I3       MOVE R4 R3       [R3] <- [R4]         16'h4043
I4       OR R3 R7 R0      [R0] <- [R3]|[R7]    16'h3370
I5       LOAD R0 R3       [R3] <- [[R0]]       16'h5503
I6       AND R7 R0 R2     [R2] <- [R7]&[R0]    16'h2702
I7       STORE R1 R6      [[R6]] <- [R1]       16'h6160
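As a sketch of how this program could be preloaded into the program memory (the memory
array name and the use of an initial block are assumptions; the thesis code loads its memories
via $readmemb):

initial begin
  memory[0] = 16'h0541; // I1: ADD R5 R4 R1    [R1] <- [R5]+[R4]
  memory[1] = 16'h1647; // I2: SUB R6 R4 R7    [R7] <- [R4]-[R6]
  memory[2] = 16'h4043; // I3: MOVE R4 R3      [R3] <- [R4]
  memory[3] = 16'h3370; // I4: OR R3 R7 R0     [R0] <- [R3]|[R7]
  memory[4] = 16'h5503; // I5: LOAD R0 R3      [R3] <- [[R0]]
  memory[5] = 16'h2702; // I6: AND R7 R0 R2    [R2] <- [R7]&[R0]
  memory[6] = 16'h6160; // I7: STORE R1 R6     [[R6]] <- [R1]
end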
MICRO PROGRAM (micro-program memory content)
MNEMONIC   SAF(4-bit)   S(1-bit)   RGW(1-bit)   MW(1-bit)   MR(1-bit)   CODE
ADD        4'h0         1          1            0           0           8'h0C
SUB        4'h1         1          1            0           0           8'h1C
AND        4'h2         1          1            0           0           8'h2C
OR         4'h3         1          1            0           0           8'h3C
MOVE       4'h4         1          1            0           0           8'h4C
LOAD       4'h5         0          1            0           1           8'h55
STORE      4'h6         1          0            1           0           8'h6A
NOT        4'h7         1          1            0           0           8'h7C
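The 8-bit code is simply the concatenation of the five fields. A small Verilog fragment showing
the slicing (consistent with how the top-level code later uses mpae_out[7:4] and
mpam_out[3:1]):

wire [7:0] code;            // one micro-program word, e.g. 8'h55 for LOAD
wire [3:0] SAF = code[7:4]; // ALU function select
wire       S   = code[3];   // write-back MUX select (1 = AR, 0 = LDR)
wire       RGW = code[2];   // register write enable
wire       MW  = code[1];   // data-memory write enable
wire       MR  = code[0];   // data-memory read enable
// e.g. LOAD: 8'h55 = 0101_0101 -> SAF = 5, S = 0, RGW = 1, MW = 0, MR = 1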
SIGNAL   FULL FORM
SAF      SELECT ALU FUNCTION
S        MUX SELECT LINE
RGW      REGISTER WRITE
MW       MEMORY WRITE
MR       MEMORY READ
Stage micro-operations and the outputs each stage latches:
FETCH:    [IR] <- [[PC]]; [PC] <- [PC]+1
          (outputs: [IR], [PC])
DECODE:   DECODER = [IR]15-12; [MPAE] <- DECODER; [OB] <- [[IR]11-8];
          [OA] <- [[IR]7-4]; [RSE] <- [IR]3-0
          (outputs: [MPAE], [OB], [OA], [RSE])
EXECUTE:  [STR] <- [OA]; [AR] <- ALU_OUT; [RSM] <- [RSE]; [MPAM] <- [MPAE]
          (outputs: [STR], [AR], [RSM], [MPAM])
MEMORY:   memory access / register write-back (see stage (4) below)
I1 (clk1-clk4):
I1-FETCH (clk1):   [IR] <- [[0000]]; [PC] <- 0000+1
I1-DECODE (clk2):  [IR] = 16'h0541; [PC] = 0001; [MPAE] <- DECODER; [OB] <- [R5];
                   [OA] <- [R4]; [RSE] <- 1   (NOTE: initially [R5] = 4, [R4] = 7)
I1-EXECUTE (clk3): [OB] = 4; [OA] = 7; [RSE] = 1; [MPAE] = 8'h0C;
                   [STR] <- 7; [AR] <- 11; [RSM] <- 1; [MPAM] <- 8'h0C
I1-MEMORY (clk4):  [STR] = 7; [AR] = 11; [RSM] = 1; [MPAM] = 8'h0C; [R1] = 11
I2 (clk2-clk5):
I2-FETCH (clk2):   [IR] <- [[0001]]; [PC] <- 0001+1
I2-DECODE (clk3):  [IR] = 16'h1647; [PC] = 0002; [MPAE] <- DECODER; [OB] <- [R6];
                   [OA] <- [R4]; [RSE] <- 7   (NOTE: initially [R6] = 4)
I2-EXECUTE (clk4): [OB] = 4; [OA] = 7; [RSE] = 7; [MPAE] = 8'h1C;
                   [STR] <- 7; [AR] <- 3; [RSM] <- 7; [MPAM] <- 8'h1C
I2-MEMORY (clk5):  [STR] = 7; [AR] = 3; [RSM] = 7; [MPAM] = 8'h1C; [R7] = 3
I3 (clk3-clk6):
I3-FETCH (clk3):   [IR] <- [[0002]]; [PC] <- 0002+1
I3-DECODE (clk4):  [IR] = 16'h4043; [PC] = 0003; [MPAE] <- DECODER; [OB] <- [R0];
                   [OA] <- [R4]; [RSE] <- 3
I3-EXECUTE (clk5): [OB] = X; [OA] = 7; [RSE] = 3; [MPAE] = 8'h4C;
                   [STR] <- 7; [AR] <- 7; [RSM] <- 3; [MPAM] <- 8'h4C
I3-MEMORY (clk6):  [STR] = 7; [AR] = 7; [RSM] = 3; [MPAM] = 8'h4C; [R3] = 7
(2). DECODE:
R2 = [IR](7-4); [OA] <-- [R2];
R3 = [IR](11-8); [OB] <-- [R3];
[RSE] = [IR](3-0);
CONTROL SIGNAL: SAF;
I4 (clk4-clk7):
I4-FETCH (clk4):   [IR] <- [[0003]]; [PC] <- 0003+1
I4-DECODE (clk5):  [IR] = 16'h3370; [PC] = 0004; [MPAE] <- DECODER; [OB] <- [R3];
                   [OA] <- [R7]; [RSE] <- 0
I4-EXECUTE (clk6): [OB] = 7; [OA] = 3; [RSE] = 0; [MPAE] = 8'h3C;
                   [STR] <- 3; [AR] <- 7; [RSM] <- 0; [MPAM] <- 8'h3C
I4-MEMORY (clk7):  [STR] = 3; [AR] = 7; [RSM] = 0; [MPAM] = 8'h3C; [R0] = 7
I5 (clk5-clk8):
I5-FETCH (clk5):   [IR] <- [[0004]]; [PC] <- 0004+1
I5-DECODE (clk6):  [IR] = 16'h5503; [PC] = 0005; [MPAE] <- DECODER; [OB] <- [R5];
                   [OA] <- [R0]; [RSE] <- 3
I5-EXECUTE (clk7): [OB] = 4; [OA] = 7; [RSE] = 3; [MPAE] = 8'h55;
                   [STR] <- 7; [AR] <- 7; [RSM] <- 3; [MPAM] <- 8'h55
I5-MEMORY (clk8):  [STR] = 7; [AR] = 7; [RSM] = 3; [MPAM] = 8'h55;
                   [R3] = MEM[7] (contents at memory location 7)
I6 (clk6-clk9):
I6-FETCH (clk6):   [IR] <- [[0005]]; [PC] <- 0005+1
I6-DECODE (clk7):  [IR] = 16'h2702; [PC] = 0006; [MPAE] <- DECODER; [OB] <- [R7];
                   [OA] <- [R0]; [RSE] <- 2
I6-EXECUTE (clk8): [OB] = 3; [OA] = 7; [RSE] = 2; [MPAE] = 8'h2C;
                   [STR] <- 7; [AR] <- MEM[7] & R3; [RSM] <- 2; [MPAM] <- 8'h2C
I6-MEMORY (clk9):  [STR] = 7; [AR] = MEM[7] & 3; [RSM] = 2; [MPAM] = 8'h2C;
                   [R2] = MEM[7] & R3
I7 (clk7-clk10):
I7-FETCH (clk7):   [IR] <- [[0006]]; [PC] <- 0006+1
I7-DECODE (clk8):  [IR] = 16'h6160; [PC] = 0007; [MPAE] <- DECODER; [OB] <- [R1];
                   [OA] <- [R6]; [RSE] <- 0
I7-EXECUTE (clk9): [OB] = 3; [OA] = 4; [RSE] = 0; [MPAE] = 8'h6A;
                   [STR] <- 4; [AR] <- 3; [RSM] <- 0; [MPAM] <- 8'h6A
I7-MEMORY (clk10): [STR] = 4; [AR] = 3; [RSM] = 0; [MPAM] = 8'h6A; MEM[4] = 4
(3). EXECUTE:
[STR] <-- [OA]; [AR] <-- ALU_OUT; [RSM] <-- [RSE];
CONTROL SIGNALS: S4, RD, WR, LRG;
(4). MEMORY:
R1 = [RSM];
DATA.MEMORY ADRS <-- [AR];
[DATA.MEMORY ADRS] <-- [STR];  // FOR STORE INSTRUCTIONS ONLY
[R1] <-- [DATA.MEMORY ADRS];   // FOR LOAD INSTRUCTIONS ONLY
[R1] <-- [AR];                 // FOR ARITHMETIC AND LOGIC INSTRUCTIONS ONLY
SAF --- SELECT ALU FUNCTION
S4 ---- MUX SELECT [1 = AR; 0 = LDR]
RGW --- REGISTER WRITE
MW ---- MEMORY WRITE
MR ---- MEMORY READ
CHAPTER 4
VERILOG CODE:
module data_memory();
//parameter dataaddress=16;
//parameter datasize=256;
parameter data_address=65536;
parameter word_size=16;
//integer i;
//parameter data_address=16;
reg [word_size-1:0] datamemory[0:data_address-1]; // memory with 16-bit words and 65536 locations
initial
begin
$readmemb("init.data",datamemory);
/*for(i=0;i<12;i=i+1)
$display("datamemory [%d]=%b",i,datamemory[i]);*/
end
endmodule
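$readmemb expects a text file of binary words. A sketch of what init.data might contain (the
actual file is not included in the thesis), one 16-bit word per memory location:

// init.data (assumed contents)
0000000000000001
0000000000000010
0000000000000011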
module program_memory(memory_out,address,data_in_memory,write_memory,clk,rst);
parameter wordsize=16;
parameter memorysize=256;
parameter addrsize=8;
output[wordsize -1 :0] memory_out;
input [addrsize -1 :0] address;
input [wordsize -1 :0] data_in_memory;
//input read_memory;
input write_memory; //not necessary
input clk;
input rst;
reg [wordsize-1:0] memory[0:memorysize-1];
// minimal assumed body: the program (e.g. 16'h0541, 16'h1647, ...) is preloaded,
// e.g. via $readmemb or an initial block, and read asynchronously as the
// top-level instantiation expects
assign memory_out = memory[address];
endmodule
module ir(ir_out,data_in_ir,clk,rst);
parameter wordsize=16;
//parameter memorysize=256;
//parameter addrsize=8;
//input load_ir;
input [wordsize-1:0] data_in_ir; // INSTRUCTION FROM PROGRAM MEMORY
input clk;
input rst;
output [wordsize-1:0] ir_out;
reg [wordsize-1:0] ir_out;
always @(posedge clk) begin
if(rst) begin ir_out<=0; end
else begin ir_out<=data_in_ir; end
end
endmodule
module micro_memory(memory_out,address,data_in_memory,write_memory,clk,rst);
parameter uwordsize=8;
parameter umemorysize=16;
parameter uaddrsize=4;
output[uwordsize -1 :0] memory_out;
input [uaddrsize -1 :0] address;
input [uwordsize -1 :0] data_in_memory;
//input read_memory;
input write_memory;
input clk;
input rst;
reg [uwordsize-1:0] memory[0:umemorysize-1];
// minimal assumed body: preload the control words from the micro-program
// table of Chapter 3 and read asynchronously
initial begin
memory[4'h0]=8'h0C; memory[4'h1]=8'h1C; memory[4'h2]=8'h2C; memory[4'h3]=8'h3C;
memory[4'h4]=8'h4C; memory[4'h5]=8'h55; memory[4'h6]=8'h6A; memory[4'h7]=8'h7C;
end
assign memory_out = memory[address];
endmodule
module register8(register_out,register_in,clk,rst);
parameter r8wordsize=8;
output [r8wordsize -1 :0] register_out;
input [r8wordsize -1 : 0] register_in;
input clk;
input rst;
reg [r8wordsize -1:0] register_out;
initial begin register_out=0; end
always @(posedge clk)
begin
if(rst) begin register_out<=0;end
else if(clk) begin register_out<=register_in; end
end
endmodule
module register_array(register_out1,register_out2,address1,address2,address3,data_in_register,write_register,clk,rst);
parameter regwordsize=16;
parameter regmemorysize=16;
parameter regaddrsize=4;
output[regwordsize -1 :0] register_out1;
output[regwordsize -1 :0] register_out2;
input [regaddrsize -1 :0] address1;
input [regaddrsize -1 :0] address2;
input [regaddrsize -1 :0] address3;
input [regwordsize -1 :0] data_in_register;
//input read_memory;
input write_register;
input clk;
input rst;
reg [regwordsize-1:0] memory[regmemorysize-1:0];
initial begin
memory[4'h0]<=16'h0001;
memory[4'h1]<=16'h0002;
memory[4'h2]<=16'h0003;
memory[4'h3]<=16'h0013;
memory[4'h4]<=16'h0023;
memory[4'h5]<=16'h0001;
memory[4'h6]<=16'h0002;
memory[4'h7]<=16'h0003;
memory[4'h8]<=16'h0013;
memory[4'h9]<=16'h0023;
memory[4'ha]<=16'h0001;
memory[4'hb]<=16'h0002;
memory[4'hc]<=16'h0003;
memory[4'hd]<=16'h0013;
memory[4'he]<=16'h0023;
memory[4'hf]<=16'h0001;
end
//asynchronous read operation for data_output
assign register_out1 = memory[address1];
assign register_out2 = memory[address2];
//data write operation
always @(posedge clk) begin
if(write_register) begin memory[address3]<=data_in_register; end
end
endmodule
// oa, ob
module register16(register_out,register_in,clk,rst);
parameter r16wordsize=16;
output [r16wordsize -1 :0] register_out;
input [r16wordsize -1 : 0] register_in;
input clk;
input rst;
reg [r16wordsize -1:0] register_out;
initial begin register_out=0; end
always @(posedge clk)
begin
if(rst) begin register_out<=0;end
else if(clk) begin register_out<=register_in; end
end
endmodule
//rse
module register4(register_out,register_in,clk,rst);
parameter r4wordsize=4;
output [r4wordsize -1 :0] register_out;
input [r4wordsize -1 : 0] register_in;
input clk;
input rst;
reg [r4wordsize -1:0] register_out;
initial begin register_out=0; end
always @(posedge clk)
begin
if(rst) begin register_out<=0;end
else if(clk) begin register_out<=register_in; end
end
endmodule
module alu(alu_out,OB,OA,SAF);
parameter wordsize=16;
parameter N=4;
output [wordsize-1:0] alu_out;
input [wordsize-1:0] OB;
input [wordsize-1:0] OA;
input [N-1:0] SAF;
reg [wordsize-1:0] alu_out;
always@(SAF or OA or OB) begin
case(SAF)
4'd0 : alu_out = OA + OB; // ADDITION
4'd1 : alu_out = OA - OB; // SUBTRACTION
4'd2 : alu_out = OA & OB; // AND OF OA AND OB
4'd3 : alu_out = OA | OB; // OR OF OA AND OB (SAF = 3 per the micro-program table)
4'd4 : alu_out = OA;      // MOVE INSTRUCTION
4'd5 : alu_out = OA;      // LOAD INSTRUCTION
4'd6 : alu_out = OB;      // STORE INSTRUCTION
4'd7 : alu_out = ~OA;     // NOT OF OA (SAF = 7 per the micro-program table)
default : alu_out = {wordsize{1'b0}}; // INVALID ALU_CONTROL SIGNAL
endcase
end
endmodule
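A minimal testbench sketch for this ALU (not part of the original code; the operand values
mirror the Chapter 3 walkthrough, [OA] = 7 and [OB] = 4):

module alu_tb;
  reg  [15:0] OA, OB;
  reg  [3:0]  SAF;
  wire [15:0] alu_out;
  integer i;
  alu dut(.alu_out(alu_out), .OB(OB), .OA(OA), .SAF(SAF));
  initial begin
    OA = 16'd7;
    OB = 16'd4;
    for (i = 0; i < 8; i = i + 1) begin
      SAF = i;  // step through the eight SAF codes of the micro-program table
      #10 $display("SAF=%0d OA=%0d OB=%0d -> alu_out=%0d", SAF, OA, OB, alu_out);
    end
    $finish;
  end
endmodule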
module program_memory1(memory_out,address,data_in_memory,write_memory,clk,rst);
parameter dwordsize=16;
parameter uwordsize=8;
parameter r8wordsize=8;
parameter r16wordsize=16;
parameter regwordsize=16;
parameter r4wordsize=4;
input clk;
input rst;
output [wordsize -1 :0] mem_out;
output [addrsize -1 :0] pc_out;
output [wordsize -1 :0] ir_out;
//decode inputs
output [uwordsize -1 :0] umemory_out;
output [r8wordsize -1 :0] mpae_out;
output[regwordsize -1 :0] register_out1;
output[regwordsize -1 :0] register_out2;
output [r16wordsize -1 :0] oa_out;
output [r16wordsize -1 :0] ob_out;
output [r4wordsize -1 :0] rse;
//execute inputs
output [r16wordsize -1 :0] alu_out;
output [r16wordsize -1 :0] ar_out;
output [r16wordsize -1 :0] str_out;
output [r4wordsize -1 :0] rsm_out;
output [r8wordsize -1 :0] mpam_out;
//module decode1(rse,oa_out,ob_out,register_out1,register_out2,mpae_out,umemory_out,ir_out,mem_out,pc_out,clk,rst);
decode1 decode(rse,oa_out,ob_out,register_out1,register_out2,mpae_out,umemory_out,ir_out,mem_out,pc_out,clk,rst);
//module alu(alu_out,OB,OA,SAF);
alu alu1(.alu_out(alu_out),.OB(ob_out),.OA(oa_out),.SAF(mpae_out[7:4]));
//module register16(register_out,register_in,clk,rst);
register16 ar(.register_out(ar_out),.register_in(alu_out),.clk(clk),.rst(rst));
register16 str(.register_out(str_out),.register_in(oa_out),.clk(clk),.rst(rst));
register4 rsm(.register_out(rsm_out),.register_in(rse),.clk(clk),.rst(rst));
output [addrsize -1 :0] pc_out;
output [wordsize -1 :0] ir_out;
//decode inputs
output [uwordsize -1 :0] umemory_out;
output [r8wordsize -1 :0] mpae_out;
output[regwordsize -1 :0] register_out1;
output[regwordsize -1 :0] register_out2;
output [r16wordsize -1 :0] oa_out;
output [r16wordsize -1 :0] ob_out;
output [r4wordsize -1 :0] rse;
//execute inputs
output [r16wordsize -1 :0] alu_out;
output [r16wordsize -1 :0] ar_out;
output [r16wordsize -1 :0] str_out;
output [r4wordsize -1 :0] rsm_out;
output [r8wordsize -1 :0] mpam_out;
output [dwordsize-1:0] ldr;
output [dwordsize-1:0] register_wire;
//FETCH PHASE
// [pc]<----[pc]+1
pc pc1(.pc_out(pc_out),.clk(clk),.rst(rst)); //pc is incremented at the positive edge of clock
// [mem_out]<----[[pc]]
program_memory pm(.memory_out(mem_out),.address(pc_out),.clk(clk),.rst(rst));
//asynchronous memory read
// [ir_out]<----[[pc]];
ir ir1(.ir_out(ir_out),.data_in_ir(mem_out),.clk(clk),.rst(rst));
// DECODE PHASE
// [umemory_out]<----[[ir_out[15:12]]]
micro_memory umemory(.memory_out(umemory_out),.address(ir_out[15:12]),.clk(clk),.rst(rst)); //decode of opcode
// [mpae_out]<----[umemory_out]
register8 mpae(.register_out(mpae_out),.register_in(umemory_out),.clk(clk),.rst(rst));
// register_out1<----[ir_out[7:4]]
// register_out2<----[ir_out[11:8]]
// [[rsm_out]]<----register_wire
register_array register(.register_out1(register_out1),.register_out2(register_out2),.address3(rsm_out),.data_in_register(register_wire),.write_register(mpam_out[2:2]),.address1(ir_out[7:4]),.address2(ir_out[11:8]),.clk(clk),.rst(rst));
// [oa]<----register_out1
register16 oa(.register_out(oa_out),.register_in(register_out1),.clk(clk),.rst(rst));
// [ob]<----register_out2
register16 ob(.register_out(ob_out),.register_in(register_out2),.clk(clk),.rst(rst));
// [rse]<----[ir_out[3:0]]
register4 rse1(.register_out(rse),.register_in(ir_out[3:0]),.clk(clk),.rst(rst));
//EXECUTE PHASE
// [alu_out]<----[oa] SAF [ob]
alu alu1(.alu_out(alu_out),.OB(ob_out),.OA(oa_out),.SAF(mpae_out[7:4]));
// [ar_out]<----[alu_out]
register16 ar(.register_out(ar_out),.register_in(alu_out),.clk(clk),.rst(rst));
// [str_out]<----[oa]
register16 str(.register_out(str_out),.register_in(oa_out),.clk(clk),.rst(rst));
// [rsm_out]<----[rse]
register4 rsm(.register_out(rsm_out),.register_in(rse),.clk(clk),.rst(rst));
// [mpam_out]<----[mpae_out]
register8 mpam(.register_out(mpam_out),.register_in(mpae_out),.clk(clk),.rst(rst));
// MEMORY PHASE
// DATA MEMORY
// [[ar_out]]<----[str_out] .....IF STORE INSTRUCTION
// ldr<----[[ar_out]] ...........IF LOAD INSTRUCTION
program_memory1 data_memory(.memory_out(ldr),.address(ar_out),.data_in_memory(str_out),.write_memory(mpam_out[1:1]),.clk(clk),.rst(rst));
// MUX
// register_wire<----[ar_out]......IF 1 IS SELECTED
// register_wire<----ldr ..........IF 0 IS SELECTED
assign register_wire = mpam_out[3:3] ? ar_out : ldr;
endmodule
SIMULATION RESULTS FOR SOME ENTITIES:
ALU
FETCH
EXECUTE
INST REGISTER