PIPELINE PROCESSING

Upload: prabhakrishnan

Post on 21-Nov-2014


TRANSCRIPT

Page 1: Pipeline Processing (1)

PIPELINE PROCESSING

Page 2: Pipeline Processing (1)

PARALLEL PROCESSING

• A parallel processing system is able to perform concurrent data processing to achieve faster execution time

• The system may have two or more ALUs and be able to execute two or more instructions at the same time

A computer employing parallel processing is also called a parallel computer.

Page 3: Pipeline Processing (1)

Parallel processing classification

Single instruction stream, single data stream – SISD

Single instruction stream, multiple data stream – SIMD

Multiple instruction stream, single data stream – MISD

Multiple instruction stream, multiple data stream – MIMD

Page 4: Pipeline Processing (1)

Types of Parallel Computers

• Based on architectural configurations:
– Pipelined computers
– Array processors
– Multiprocessor systems

Page 5: Pipeline Processing (1)

Architectural Classification

– Flynn's classification
• Based on the multiplicity of Instruction Streams and Data Streams
• Instruction Stream
– Sequence of instructions read from memory
• Data Stream
– Operations performed on the data in the processor

                                Number of Data Streams
                                Single        Multiple
  Number of        Single       SISD          SIMD
  Instruction
  Streams          Multiple     MISD          MIMD

Page 6: Pipeline Processing (1)

COMPUTER ARCHITECTURES FOR PARALLEL PROCESSING

• Von Neumann based
– SISD
• Superscalar processors
• Superpipelined processors
• VLIW (Very Long Instruction Word architectures)
– MISD
• Nonexistence (no practical machines)
– SIMD
• Array processors
• Systolic arrays
• Associative processors
– MIMD
• Shared-memory multiprocessors
– Bus based
– Crossbar switch based
– Multistage IN based
• Message-passing multicomputers
– Hypercube
– Mesh
– Reconfigurable
• Dataflow
• Reduction

Page 7: Pipeline Processing (1)

PIPELINE PROCESSING

• Pipelining is a technique of overlapping the execution of several instructions to reduce the total execution time of a set of instructions.

• It is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to another.
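The "cascade of processing stages" can be illustrated with a minimal sketch (not from the slides; the stage operations are arbitrary examples): each stage is a generator that consumes the stream produced by the previous stage, so data flows linearly from one end to the other.

```python
# Hypothetical sketch: a linear cascade of processing stages, each modeled
# as a Python generator operating on the stream from the previous stage.

def stage(func, source):
    """Wrap a per-item operation as one pipeline stage over a data stream."""
    for item in source:
        yield func(item)

# Three example stages applied to a stream of integers.
data = range(5)
s1 = stage(lambda x: x + 1, data)   # stage 1: increment
s2 = stage(lambda x: x * 2, s1)     # stage 2: double
s3 = stage(lambda x: x - 3, s2)     # stage 3: offset

result = list(s3)
print(result)  # each value has passed through all three stages in order
```

Each item is transformed by every stage in sequence, which mirrors the fixed-function, linearly connected structure the slide describes.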

Page 8: Pipeline Processing (1)

Advantages of Pipeline Processing

• Reduced access time

• Increased throughput

Page 9: Pipeline Processing (1)

Types of Pipeline Models

• Asynchronous pipeline

• Synchronous pipeline

In both models, external inputs are fed into the first stage. The processed results are passed from stage Si to stage Si+1 for all i = 1, 2, …, k-1, and the final result appears in stage Sk.

Page 10: Pipeline Processing (1)

Asynchronous Pipeline Model

• Data flow between adjacent stages is controlled by a handshaking protocol.

• When stage Si is ready to transmit data, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an ACK signal to stage Si.
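The ready/ACK exchange can be sketched with two cooperating routines (an illustration only, not from the slides; the logged signal names are assumptions): the sender announces data, and the receiver acknowledges each item before the next one is sent.

```python
# Illustrative sketch of a two-stage asynchronous handshake:
# stage S_i raises "ready", stage S_{i+1} consumes the item and logs "ack".

def sender(items, log):
    for item in items:
        log.append(("ready", item))   # S_i signals it has data to transmit
        yield item                    # data crosses to the next stage here

def receiver(source, log):
    results = []
    for item in source:
        results.append(item)
        log.append(("ack", item))     # S_{i+1} acknowledges receipt
    return results

log = []
received = receiver(sender([10, 20, 30], log), log)
print(received)
print(log)  # strictly alternating ready/ack pairs, one pair per item
```

Because the generator yields lazily, no new "ready" is issued until the previous item has been acknowledged, which is the essence of the handshaking protocol.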

Page 11: Pipeline Processing (1)

• Advantages
– Useful for designing the communication channels in message-passing multicomputers

• Disadvantages
– Variable throughput rate
– Different amounts of delay may be experienced in different stages

Page 12: Pipeline Processing (1)

Synchronous Pipeline Model

• Here clocked latches are used to interface between stages. The latches isolate inputs from outputs.

• Upon arrival of the clock pulse, all latches transfer data to the next stage simultaneously.

• Advantage
– Equal delay in all stages

(Figure: input feeds stage S1; latches R1–R4 follow stages S1–S4, all driven by a common clock.)
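The simultaneous latch transfer can be simulated cycle by cycle (a sketch with assumed two-stage logic, not from the slides): on every tick, all latch contents advance together, and a result emerges from the last latch once the pipe has filled.

```python
# Illustrative sketch of a synchronous pipeline: every inter-stage latch
# transfers on the same clock edge. The two stage functions are arbitrary
# examples chosen for demonstration.

def clock_tick(latches, stages, new_input):
    """Advance every latch simultaneously (updated back-to-front so each
    stage sees the value latched on the previous cycle)."""
    for i in range(len(latches) - 1, 0, -1):
        prev = latches[i - 1]
        latches[i] = stages[i - 1](prev) if prev is not None else None
    latches[0] = new_input
    return latches

stages = [lambda x: x + 1, lambda x: x * 2]   # two combinational stages
latches = [None, None, None]                  # input latch + one per stage

inputs = [1, 2, 3]
outputs = []
for cyc in range(len(inputs) + len(stages)):  # extra cycles drain the pipe
    new = inputs[cyc] if cyc < len(inputs) else None
    latches = clock_tick(latches, stages, new)
    if latches[-1] is not None:
        outputs.append(latches[-1])

print(outputs)  # each input after (x + 1) * 2
```

Updating latches from the last stage backward is what makes all transfers appear simultaneous, mirroring a common clock edge.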

Page 13: Pipeline Processing (1)

Instruction Execution steps

• Instruction fetch (IF) from MM

• Instruction Decoding (ID)

• Operand Fetch (OF), if any

• Execution of the decoded instruction (EX)

Page 14: Pipeline Processing (1)

Non-pipelined computer

– 6 stages: Instruction Fetch, Instruction Decode, Operand Address Calculation, Operand Fetch, Execute, Write Result

Space-Time Diagram

(Figure: space-time diagram showing tasks T1–T6 passing through segments 1–4 over successive clock cycles; without pipelining, a new task cannot begin until the previous task has completed every segment, so the segments are never active simultaneously.)

Page 15: Pipeline Processing (1)

Pipelined Computer

Stages/Time   1    2    3    4    5    6    7
IF            I1   I2   I3   I4   ...
ID                 I1   I2   I3   ...
OF                      I1   I2   I3   ...
EX                           I1   I2   I3  ...

In the first cycle instruction I1 is fetched from memory. In the second cycle another instruction I2 is fetched from memory and simultaneously I1 is decoded by the instruction decoding unit.
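The overlap described above can be generated programmatically; a minimal sketch (stage names IF/ID/OF/EX as on the slide): in an ideal pipeline, instruction i enters stage s at cycle i + s.

```python
# Sketch: which instruction occupies which stage of an ideal 4-stage
# pipeline on each clock cycle.

STAGES = ["IF", "ID", "OF", "EX"]

def schedule(n_instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            table.setdefault(i + s + 1, {})[name] = f"I{i + 1}"
    return table

sched = schedule(3)
print(sched[1])  # cycle 1: I1 is being fetched
print(sched[2])  # cycle 2: I2 is fetched while I1 is decoded
```

Cycle 2 shows exactly the overlap in the text: I2 in IF at the same time as I1 in ID.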

Page 16: Pipeline Processing (1)

INSTRUCTION PIPELINE

Execution of Three Instructions in a 4-Stage Pipeline

Instruction Pipeline

Conventional (sequential) execution:

i      FI DA FO EX
i+1                FI DA FO EX
i+2                            FI DA FO EX

Pipelined execution:

i      FI DA FO EX
i+1       FI DA FO EX
i+2          FI DA FO EX

Page 17: Pipeline Processing (1)

PIPELINING

A technique of decomposing a sequential process into suboperations, with each suboperation executed in a dedicated segment that operates concurrently with all other segments.

Example: compute Ai * Bi + Ci for i = 1, 2, 3, ..., 7

Segment 1:  R1 <- Ai,  R2 <- Bi         Load Ai and Bi
Segment 2:  R3 <- R1 * R2,  R4 <- Ci    Multiply and load Ci
Segment 3:  R5 <- R3 + R4               Add

(Figure: memory supplies Ai, Bi, and Ci; registers R1 and R2 feed a multiplier whose result is latched in R3 and, together with R4, feeds an adder producing R5.)

Page 18: Pipeline Processing (1)

OPERATIONS IN EACH PIPELINE STAGE

Clock    Segment 1        Segment 2          Segment 3
Pulse    R1      R2       R3         R4      R5
1        A1      B1
2        A2      B2       A1*B1      C1
3        A3      B3       A2*B2      C2      A1*B1 + C1
4        A4      B4       A3*B3      C3      A2*B2 + C2
5        A5      B5       A4*B4      C4      A3*B3 + C3
6        A6      B6       A5*B5      C5      A4*B4 + C4
7        A7      B7       A6*B6      C6      A5*B5 + C5
8                         A7*B7      C7      A6*B6 + C6
9                                            A7*B7 + C7
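The pattern in the table above can be reproduced with a short sketch (the string formats are an assumed notation): at clock pulse t, segment 1 holds operands for task t, segment 2 holds task t-1, and segment 3 holds task t-2.

```python
# Sketch: regenerate the three-segment pipeline contents for
# A_i * B_i + C_i, i = 1..7 (load; multiply and load C; add).

N = 7  # number of tasks

def segment_contents(clock):
    """Return (segment1, segment2, segment3) contents at a clock pulse."""
    s1 = f"A{clock} B{clock}" if clock <= N else ""
    s2 = f"A{clock-1}*B{clock-1} C{clock-1}" if 2 <= clock <= N + 1 else ""
    s3 = f"A{clock-2}*B{clock-2}+C{clock-2}" if 3 <= clock <= N + 2 else ""
    return s1, s2, s3

for clk in range(1, N + 3):   # 9 pulses total for 7 tasks in 3 segments
    print(clk, segment_contents(clk))
```

Note that 7 tasks finish in 7 + (3 - 1) = 9 clock pulses, as the table shows.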

Page 19: Pipeline Processing (1)

INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

(Figure: timing diagram of seven instructions flowing through the four segments FI, DA, FO, EX over clock cycles 1–13; instruction 3 is a branch, so the instructions behind it are delayed until the branch resolves.)

Steps performed in each segment:

Segment 1: Fetch instruction from memory (FI)
Segment 2: Decode instruction and calculate the effective address (DA); if the instruction is a branch, empty the pipe and update the PC
Segment 3: Fetch operand from memory (FO)
Segment 4: Execute instruction (EX); if an interrupt is pending, perform interrupt handling, empty the pipe, and update the PC

Page 20: Pipeline Processing (1)

Example: 6 tasks, divided into 4 segments

Clock cycle  1    2    3    4    5    6    7    8    9
Segment 1:   T1   T2   T3   T4   T5   T6
Segment 2:        T1   T2   T3   T4   T5   T6
Segment 3:             T1   T2   T3   T4   T5   T6
Segment 4:                  T1   T2   T3   T4   T5   T6

Page 21: Pipeline Processing (1)

Pipeline Performance

• Latency
– The amount of time that a single operation takes to execute

• Throughput
– The rate at which operations get executed (operations/second or operations/cycle)

For a non-pipelined processor, throughput = 1 / latency.

For a pipelined processor, throughput > 1 / latency.
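These two relations can be expressed directly (a sketch with example numbers, not values from the slides): a non-pipelined machine delivers one result per full latency, while an ideal k-stage pipeline delivers one result per stage time once full.

```python
# Sketch: throughput of a non-pipelined processor vs. an ideal k-stage
# pipeline (which approaches k times the non-pipelined rate).

def nonpipelined_throughput(latency_ns):
    """One operation completes per full latency."""
    return 1.0 / latency_ns                 # operations per ns

def ideal_pipelined_throughput(latency_ns, stages):
    """Once full, one operation completes per stage time."""
    return 1.0 / (latency_ns / stages)      # operations per ns

print(nonpipelined_throughput(20))          # 20 ns operation -> 0.05 ops/ns
print(ideal_pipelined_throughput(20, 4))    # 4 stages -> 0.2 ops/ns
```

The ideal figure ignores latch latency and hazards, which is why a real pipeline achieves throughput greater than 1/latency but less than the k-fold ideal.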

Page 22: Pipeline Processing (1)

• Cycle time of a pipelined processor
– Dependent on 4 factors:

• Cycle time of the unpipelined processor
• Number of pipeline stages
• How evenly the datapath logic is divided among the stages
• Latency of the pipeline latches

• If the logic is evenly divided, the clock period of the pipelined processor is

Cycle Time (pipelined) = (Cycle Time (unpipelined) / No. of pipeline stages) + pipeline latch latency

• If the logic cannot be evenly divided, the clock period of the pipelined processor is

Cycle Time = Longest pipeline stage + pipeline latch latency

• Latency of the pipeline = Cycle time of the pipeline x No. of pipeline stages
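The formulas above translate into three small functions; the example values are taken from the worked problems that follow.

```python
# Sketch of the pipeline cycle-time formulas (all times in nanoseconds).

def cycle_time_even(unpipelined_ns, n_stages, latch_ns):
    """Cycle time when datapath logic divides evenly among the stages."""
    return unpipelined_ns / n_stages + latch_ns

def cycle_time_uneven(stage_latencies_ns, latch_ns):
    """Cycle time is set by the slowest stage plus the latch latency."""
    return max(stage_latencies_ns) + latch_ns

def pipeline_latency(cycle_ns, n_stages):
    """Total time for one instruction to traverse the whole pipeline."""
    return cycle_ns * n_stages

print(cycle_time_even(25, 5, 1))               # evenly divided: 6.0 ns
print(cycle_time_uneven([5, 7, 3, 6, 4], 1))   # uneven stages: 8 ns
print(pipeline_latency(6, 5))                  # 30 ns total latency
```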

Page 23: Pipeline Processing (1)

Questions

• An unpipelined processor has a cycle time of 25 ns. What is the cycle time of a pipelined version of the processor with 5 evenly divided pipeline stages, if each pipeline latch has a latency of 1 ns? What if the processor is divided into 50 pipeline stages? What is the total latency of the 5-stage pipeline? And of the 50-stage pipeline?

Page 24: Pipeline Processing (1)

Solution

Given data:
Cycle Time (unpipelined) = 25 ns
No. of pipeline stages = 5
Pipeline latch latency = 1 ns

Cycle Time (pipelined) = (Cycle Time (unpipelined) / No. of pipeline stages) + latch latency
= (25 / 5) + 1 ns = 6 ns

Therefore, cycle time of the 5-stage pipeline = 6 ns

Latency of the pipeline = Cycle time of the pipeline x No. of pipeline stages
= 6 ns x 5 = 30 ns

For the 50-stage pipeline, cycle time = (25 ns / 50) + 1 ns = 1.5 ns

Therefore, cycle time of the 50-stage pipeline = 1.5 ns
Latency of the pipeline = Cycle time of the pipeline x No. of pipeline stages
= 1.5 ns x 50 = 75 ns

Page 25: Pipeline Processing (1)

Questions

• Suppose an unpipelined processor with a 25 ns cycle time is divided into 5 pipeline stages with latencies of 5, 7, 3, 6 and 4 ns. If the pipeline latch latency is 1 ns, what is the cycle time of the pipelined processor? What is the latency of the resulting pipeline?

Page 26: Pipeline Processing (1)

Solution

• Here the stages are unevenly divided, so the cycle time is set by the longest stage.

• The longest pipeline stage is 7 ns.

• Pipeline latch latency = 1 ns
• Cycle time = Longest pipeline stage + pipeline latch latency

= 7 + 1 = 8 ns

Therefore, cycle time of the pipelined processor = 8 ns

There are 5 pipeline stages.

Total latency = Cycle Time of the pipeline x No. of pipeline stages

= 8 ns x 5 = 40 ns

Page 27: Pipeline Processing (1)

Question

• Suppose that an unpipelined processor has a cycle time of 25 ns and that its datapath is made up of modules with latencies of 2, 3, 4, 7, 3, 2 and 4 ns (in that order). In pipelining this processor, it is not possible to rearrange the order of the modules (for example, putting the register read stage before the instruction decode stage) or to divide a module into multiple pipeline stages (for complexity reasons). Given pipeline latches with 1 ns latency, what is the minimum cycle time that can be achieved by pipelining this processor?

Page 28: Pipeline Processing (1)

Solution

• There is no limit on the number of pipeline stages.

• The minimum cycle time
= Latency of the longest module in the datapath + pipeline latch latency
= 7 + 1 ns
= 8 ns

Page 29: Pipeline Processing (1)

Question

• Given an unpipelined processor with a 10 ns cycle time and pipeline latches with 0.5 ns latency:

a. What are the cycle times of pipelined versions of the processor with 2, 4, 7 and 16 stages if the datapath logic is evenly divided among the pipeline stages?

b. What is the latency of the pipelined versions of the processor?

c. How many stages of pipelining are required to achieve a cycle time of 2 ns and 1 ns?

Page 30: Pipeline Processing (1)

Solution – a

Given data:
Cycle Time (unpipelined) = 10 ns
No. of pipeline stages = 2, 4, 7 and 16
Pipeline latch latency = 0.5 ns

Cycle Time (pipelined) = (Cycle Time (unpipelined) / No. of pipeline stages) + latch latency

Cycle time for the 2-stage pipeline = (10 ns / 2) + 0.5 = 5.5 ns

Cycle time for the 4-stage pipeline = (10 ns / 4) + 0.5 = 3 ns

Cycle time for the 7-stage pipeline = (10 ns / 7) + 0.5
= 1.42857 + 0.5 = 1.92857 ns

Cycle time for the 16-stage pipeline = (10 ns / 16) + 0.5
= 0.625 + 0.5 = 1.125 ns

Page 31: Pipeline Processing (1)

Solution – b

• Latency of each pipeline

= Cycle time of the pipeline x No. of pipeline stages

Latency for 2 stage pipeline = 5.5 x 2 = 11 ns

Latency for 4 stage pipeline = 3 x 4 = 12 ns

Latency for 7 stage pipeline = 1.92857 x 7 = 13.5 ns (approx.)

Latency for 16 stage pipeline = 1.125 x 16 = 18 ns

Page 32: Pipeline Processing (1)

Solution – c

• First solve for the number of pipeline stages. From

Cycle Time (pipelined) = (Cycle Time (unpipelined) / No. of pipeline stages) + latch latency

rearranging gives

No. of pipeline stages = Cycle Time (unpipelined) / (pipeline cycle time - latch latency)

= 10 ns / (2 ns - 0.5 ns) = 10 / 1.5 = 6.6667

Therefore, the number of pipeline stages required to achieve a 2 ns cycle time is 7 (rounded up, since a fractional number of pipeline stages is not allowed).

Similarly, the number of pipeline stages required to achieve a 1 ns cycle time = 10 ns / (1 ns - 0.5 ns) = 10 / 0.5 = 20 stages.

Page 33: Pipeline Processing (1)

Pipeline Hazards

Page 34: Pipeline Processing (1)

Pipeline Hazards

• Pipelining increases processor performance.
– When several instructions are overlapped in the pipeline, the cycle time can be reduced, increasing the rate at which instructions are executed.

– However, a number of factors limit a pipeline's ability to execute instructions at its peak rate, including dependencies between instructions, branches, and the time required to access memory.

Page 35: Pipeline Processing (1)

Types of Hazards

• Instruction Hazards

• Structural Hazards

• Control Hazards

• Branches

Page 36: Pipeline Processing (1)

Hazards in Pipelining

• Procedural dependencies => Control hazards
– conditional and unconditional branches, calls/returns

• Data dependencies => Data hazards
– RAW (read after write)
– WAR (write after read)
– WAW (write after write)

• Resource conflicts => Structural hazards
– use of the same resource in different stages

Page 37: Pipeline Processing (1)

Instruction Hazards
– Occur when instructions read or write registers that are used by other instructions.

• RAR (read after read)
– Occurs when two instructions both read from the same register (this causes no actual hazard)
– Example:
ADD R1, R2, R3
SUB R4, R5, R3

• RAW Hazards
– Occur when an instruction reads a register that was written by a previous instruction
– Example:
ADD R1, R2, R3
SUB R4, R5, R1

• WAR Hazards
– Occur when the output register of an instruction has been read by a previous instruction
– Example:
ADD R1, R2, R3
SUB R2, R5, R6

• WAW Hazards
– Occur when the output register of an instruction has been written by a previous instruction
– Example:
ADD R1, R2, R3
SUB R1, R5, R6
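The three data-hazard definitions above reduce to register-set comparisons; here is a minimal sketch (the tuple encoding of an instruction as destination register plus source registers is an assumption for the example):

```python
# Illustrative sketch: classify RAW / WAR / WAW dependences between an
# earlier and a later instruction, each given as (dest_reg, [source_regs]).

def classify_hazards(earlier, later):
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    hazards = []
    if e_dst in l_srcs:
        hazards.append("RAW")   # later reads what earlier writes
    if l_dst in e_srcs:
        hazards.append("WAR")   # later writes what earlier reads
    if l_dst == e_dst:
        hazards.append("WAW")   # both write the same register
    return hazards

# ADD R1, R2, R3  followed by  SUB R4, R5, R1  -> RAW on R1
print(classify_hazards(("R1", ["R2", "R3"]), ("R4", ["R5", "R1"])))
```

The same function covers all three example pairs on this slide, one hazard class per pair.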

Page 38: Pipeline Processing (1)

STRUCTURAL HAZARDS

Structural hazards occur when the processor's hardware is not capable of executing all the instructions in the pipeline simultaneously.

Example: with a single memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.

The pipeline is stalled for a structural hazard (two loads with a one-port memory); a two-port memory would serve both accesses without a stall:

i      FI DA FO EX
i+1       FI DA FO EX
i+2          stall stall FI DA FO EX

Page 39: Pipeline Processing (1)

Control Hazards

• The delay between when a branch instruction enters the pipeline and the time at which the next instruction enters the pipeline is called the processor's branch delay; it is a form of control hazard.

• The delay is mainly due to the control flow of the program.

Page 40: Pipeline Processing (1)

CONTROL HAZARDS

Branch Instructions

- The branch target address is not known until the branch instruction is completed

- Stall -> wasted clock cycles

Branch instruction   FI DA FO EX
Next instruction                  FI DA FO EX
                     (target address available only after EX)

Page 41: Pipeline Processing (1)

Branches

• Branch instructions can also cause delays in a pipelined processor because the processor cannot determine which instruction to fetch next until the branch has executed.

• A conditional branch instruction creates a dependency between the branch instruction and the instruction fetch stage of the pipeline.

Page 42: Pipeline Processing (1)

Question

A. Identify all of the RAW hazards in this instruction queue:

DIV R2, R5, R8
SUB R9, R2, R7
ASH R5, R14, R6
MUL R11, R9, R5
BEQ R10, #0, R12
OR R8, R15, R2

B. Identify all of the WAR hazards in the previous instruction sequence

C. Identify all of the WAW hazards in the previous instruction sequence

D. Identify all of the control hazards in the previous instruction sequence

Page 43: Pipeline Processing (1)

Solution

A) RAW hazards exist between the following instructions:
– Between DIV and SUB (R2)
– Between ASH and MUL (R5)
– Between SUB and MUL (R9)
– Between DIV and OR (R2)

B) WAR hazards exist between the following instructions:
– Between DIV and ASH (R5)
– Between DIV and OR (R8)

C) There are no WAW hazards.

D) There is only one control hazard: between BEQ and OR.