Chapter 8: Pipeline and Vector Processing

Parallel Processing

Parallel processing is a term used to denote a large class of techniques that are used to provide

simultaneous data-processing tasks for the purpose of increasing the computational speed of a

computer system. Instead of processing each instruction sequentially as in a conventional computer, a

parallel processing system is able to perform concurrent data processing to achieve faster execution

time.

For example, while an instruction is being executed in the ALU, the next instruction can be read from

memory. The system may have two or more ALUs and be able to execute two or more instructions at the

same time. Furthermore, the system may have two or more processors operating concurrently. The

purpose of parallel processing is to speed up the computer processing capability and increase its

throughput, that is, the amount of processing that can be accomplished during a given interval of time.

The amount of hardware increases with parallel processing, and with it, the cost of the system increases. However,

technological developments have reduced hardware costs to the point where parallel processing

techniques are economically feasible.

Parallel processing can be viewed from various levels of complexity. At the lowest level, we distinguish

between parallel and serial operations by the type of registers used. Shift registers operate in serial fashion one bit at a time, while registers with parallel load operate with all the bits of the word

simultaneously. Parallel processing at a higher level of complexity can be achieved by having a multiplicity

of functional units that perform identical or different operations simultaneously. Parallel processing is

established by distributing the data among the multiple functional units. For example, the arithmetic

logic and shift operations can be separated into three units and the operands diverted to each unit under

the supervision of a control unit.

A parallel processing system is able to perform concurrent data processing to achieve faster execution

time. The system may have two or more ALUs and be able to execute two or more instructions at the

same time. Also, the system may have two or more processors operating concurrently. The goal is to

increase the throughput, that is, the amount of processing that can be accomplished during a given interval of

time. Parallel processing increases the amount of hardware required.

Example: The ALU can be separated into three units and the operands diverted to each unit under the

supervision of a control unit. All units are independent of each other. A multifunctional organization is

usually associated with a complex control unit to coordinate all the activities among the various

components.

There are a variety of ways that parallel processing can be classified. It can be considered from the

internal organization of the processors, from the interconnection structure between processors, or from

the flow of information through the system. One classification introduced by M. J. Flynn considers the organization of a computer system by the number of instructions and data items that are manipulated

simultaneously. The normal operation of a computer is to fetch instructions from memory and execute

them in the processor. The sequence of instructions read from memory constitutes an instruction

stream. The operations performed on the data in the processor constitute a data stream. Parallel

processing may occur in the instruction stream, in the data stream, or in both.

Flynn’s classification divides computers into four major groups as follows:

1.  Single instruction stream, single data stream (SISD)

2.  Single instruction stream, multiple data stream (SIMD)

3.  Multiple instruction streams, single data stream (MISD)

4.  Multiple instruction streams, multiple data stream (MIMD)

Single Instruction Stream, Single Data Stream (SISD)

In computing, SISD (single instruction, single data) is a term referring to a computer architecture in which

a single processor, a uniprocessor, executes a single instruction stream, operating on data stored in a

single memory.

SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit. Instructions are executed sequentially and the system may or may not have internal

parallel processing capabilities. Parallel processing in this case may be achieved by means of multiple

functional units or by pipeline processing. Instruction fetching and pipelined execution of instructions are

common examples found in most modern SISD computers.

Single Instruction Stream, Multiple Data Stream (SIMD)

Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes

computers with multiple processing elements that perform the same operation on multiple data points

simultaneously. Thus, such machines exploit data level parallelism. SIMD is particularly applicable to

common tasks like adjusting the contrast in a digital image or adjusting the volume of digital audio. Most

modern CPU designs include SIMD instructions to improve the performance of multimedia applications.

SIMD represents an organization that includes many processing units under the supervision of a common

control unit. All processors receive the same instruction from the control unit but operate on different

items of data. The shared memory unit must contain multiple modules so that it can communicate with

all the processors simultaneously.
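As a minimal illustration of the data-level parallelism that SIMD exploits, the sketch below contrasts an element-wise scalar loop with a single vectorized NumPy operation, which is typically dispatched to optimized native loops that can use SIMD instructions. The array contents are arbitrary example values, not data from the text.

# A minimal sketch of data-level parallelism in the SIMD spirit (Python with NumPy).
# The same "add" operation is applied to every element of the data arrays at once.
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])   # example data items
b = np.array([5.0, 6.0, 7.0, 8.0])

# Scalar (SISD-style) view: one operation per data item, issued sequentially.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vectorized (SIMD-style) view: a single operation applied to all elements.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
print(c_vector)   # [ 6.  8. 10. 12.]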

Multiple Instruction Streams, Single Data Stream (MISD)

In computing, MISD (multiple instruction, single data) is a type of parallel computing architecture where

many functional units perform different operations on the same data. Pipeline architectures belong to

this type.

The MISD structure is mainly of theoretical interest, since few practical systems have been constructed using this

organization. One frequently cited example of MISD in computing is the Space Shuttle flight control

computers. A systolic array is an example of a MISD structure.

Multiple Instruction Streams, Multiple Data Stream (MIMD)

In computing, MIMD (multiple instruction, multiple data) is a technique employed to achieve parallelism.

Machines using MIMD have a number of processors that function asynchronously and independently. At

any time, different processors may be executing different instructions on different pieces of data.

MIMD organization refers to a computer system capable of processing several programs at the same

time. Most multiprocessor and multicomputer systems can be classified in this category. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-

aided manufacturing, simulation, modeling, and communication switches. MIMD machines belong to

either the shared-memory or the distributed-memory category. These classifications are based on how MIMD

processors access memory. Shared memory machines may be of the bus-based, extended, or

hierarchical type. Distributed memory machines may have hypercube or mesh interconnection schemes.

Flynn’s classification depends on the distinction between the performance of the control unit and the

data processing unit. It emphasizes the behavioral characteristics of the computer system rather than its

operational and structural interconnections. One type of parallel processing that does not fit Flynn’s

classification is pipelining.

Here we are considering parallel processing under the following main topics:

1. Pipeline processing

2. Vector processing

3. Array processors

Pipeline Processing

Pipelining is a technique of decomposing a sequential process into suboperations, with each suboperation

being executed in a special dedicated segment that operates concurrently with all other segments. A

pipeline can be visualized as a collection of processing segments through which binary information flows.

Each segment performs partial processing dictated by the way the task is partitioned. The result obtained

from the computation in each segment is transferred to the next segment in the pipeline. The final result is obtained after the data have passed through all segments. The name “pipeline” implies the flow of

information analogous to an industrial assembly line. It is characteristic of pipelines that several

computations can be in progress in distinct segments at the same time. The overlapping of computations

is made possible by associating a register with each segment in the pipeline. The registers provide

isolation between each segment so that each can operate on distinct data simultaneously.

The pipeline organization will be demonstrated by means of a simple example. Suppose that we want to

perform the combined multiply and add operation with a stream of numbers.

Ai * Bi + Ci   for i = 1, 2, 3, ..., 7

Each suboperation is to be implemented in a segment within a pipeline. Each segment has one or two

registers and a combinational circuit as shown in figure below. R1 through R5 are registers that receive

new data with every clock pulse. The multiplier and adder are combinational circuits.

The sub operations performed in each segment of the pipeline are as follows:

R1 ← Ai, R2 ← Bi Input Ai and Bi 

R3 ← R1 * R2, R4 ← Ci Multiply and input Ci 

R5 ← R3 + R4 Add Ci to product

The five registers are loaded with new data every clock pulse. The effect of each clock is shown in Table

below.

The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2. The third clock

pulse operates on all three segments simultaneously. It places A3 and B3 into R1 and R2, transfers the

product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three

clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock produces

a new output and moves the data one step down the pipeline. This happens as long as new input data

flow into the system. When no more input data are available, the clock must continue until the last

output emerges out of the pipeline.
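To make the clock-by-clock behavior concrete, the following minimal Python sketch simulates the three-segment multiply-and-add pipeline described above and prints the register contents after each clock pulse. The seven (Ai, Bi, Ci) operand sets are illustrative values, not data from the text.

# Minimal simulation of the three-segment pipeline for Ai * Bi + Ci (illustrative values).
A = [2, 3, 4, 5, 6, 7, 8]
B = [1, 2, 3, 4, 5, 6, 7]
C = [9, 8, 7, 6, 5, 4, 3]

R1 = R2 = R3 = R4 = R5 = None            # segment registers, empty before the first clock
print("clock   R1 R2 |  R3 R4 |  R5")

for clock in range(1, len(A) + 3):       # extra clocks let the last result drain out
    # All registers are loaded simultaneously on the clock edge,
    # so compute the new values from the old ones before updating.
    new_R5 = R3 + R4 if R3 is not None else None              # segment 3: add
    new_R3 = R1 * R2 if R1 is not None else None              # segment 2: multiply
    new_R4 = C[clock - 2] if 2 <= clock <= len(C) + 1 else None
    if clock <= len(A):                                        # segment 1: input Ai, Bi
        new_R1, new_R2 = A[clock - 1], B[clock - 1]
    else:
        new_R1 = new_R2 = None
    R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
    print(f"{clock:5}   {R1} {R2} |  {R3} {R4} |  {R5}")

Running the sketch shows the pipe filling for the first three clocks and then producing one new result per clock, exactly as described above.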

Arithmetic Pipeline

Pipeline arithmetic units are usually found in very high-speed computers. They are used to implement floating-

point operations, multiplication of fixed-point numbers, and similar computations encountered in

scientific problems. A pipeline multiplier is essentially an array multiplier as described in figure below,

with special adders designed to minimize the carry propagation time through the partial products. We

will now show an example of a pipeline unit for floating-point addition and subtraction.

The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers.

X = A × 2^a

Y = B × 2^b

A and B are two fractions that represent the mantissas, and a and b are the exponents. Floating-

point addition and subtraction can be performed in four segments. The registers labeled R are placed between the segments to store intermediate results.

The sub operations that are performed in the four segments are:

1. Compare the exponents.

2. Align the mantissas.

3. Add or subtract the mantissas.

4. Normalize the result.
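The four segments can be sketched in software as a chain of small functions. The decimal mantissa/exponent handling below is purely for illustration (the text describes binary operands and hardware registers), and the example operands are arbitrary values chosen so the normalization step is visible.

# Sketch of the four floating-point adder/subtractor segments (illustrative, decimal mantissas).
def segment1_compare(a, b):
    # Choose the larger exponent and the difference used for alignment.
    return (max(a, b), abs(a - b), a >= b)

def segment2_align(A, B, diff, a_is_larger):
    # Shift the mantissa of the number with the smaller exponent to the right.
    if a_is_larger:
        return A, B / (10 ** diff)
    return A / (10 ** diff), B

def segment3_addsub(A, B, subtract=False):
    return A - B if subtract else A + B

def segment4_normalize(mantissa, exponent):
    # Restore the mantissa to the range [0.1, 1) by adjusting the exponent.
    while abs(mantissa) >= 1.0:
        mantissa /= 10.0
        exponent += 1
    while 0 < abs(mantissa) < 0.1:
        mantissa *= 10.0
        exponent -= 1
    return mantissa, exponent

# Example operands: X = 0.9504 * 10^3 and Y = 0.8200 * 10^2.
A, a = 0.9504, 3
B, b = 0.8200, 2
exp, diff, a_is_larger = segment1_compare(a, b)
A2, B2 = segment2_align(A, B, diff, a_is_larger)
s = segment3_addsub(A2, B2)
mantissa, exponent = segment4_normalize(s, exp)
print(mantissa, exponent)   # approximately 0.10324 and 4, i.e. 1032.4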

Instruction Pipeline

Pipeline processing can occur not only in the data stream but in the instruction stream as well. An

instruction pipeline reads consecutive instructions from memory while previous instructions are being

executed in other segments. This causes the instruction fetch and execute phases to overlap and

perform simultaneous operations. One possible complication associated with such a scheme is that an

instruction may cause a branch out of sequence. In that case the pipeline must be emptied and all the

instructions that have been read from memory after the branch instruction must be discarded.

Consider a computer with an instruction fetch unit and an instruction execution unit designed to provide

a two-segment pipeline. The instruction fetch segment can be implemented by means of a first-in, first-

out (FIFO) buffer. This is a type of unit that forms a queue rather than a stack. Whenever the execution

unit is not using memory, the control increments the program counter and uses its address value to read

consecutive instructions from memory. The instructions are inserted into the FIFO buffer so that they

can be executed on a first-in, first-out basis. Thus an instruction stream can be placed in a queue, waiting

for decoding and processing by the execution segment. The instruction stream queuing mechanism

provides an efficient way for reducing the average access time to memory for reading instructions.

Whenever there is space in the FIFO buffer, the control unit initiates the next instruction fetch phase.

The buffer acts as a queue from which control then extracts the instructions for the execution unit.

Figure: Instruction Pipeline
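A two-segment fetch/execute pipeline with a FIFO instruction queue can be sketched as below. The queue depth, the example instruction names, and the simple "fetch whenever a slot is free" policy are illustrative assumptions, not details from the text.

# Sketch of a two-segment pipeline: a fetch unit fills a FIFO buffer while the
# execution unit drains it (the queue depth of 4 is an assumption).
from collections import deque

program = ["LOAD R1", "ADD R2", "STORE R3", "SUB R4", "JMP L1"]   # example instruction stream
fifo = deque(maxlen=4)             # instruction queue between the two segments
pc = 0                             # program counter used by the fetch segment

for clock in range(1, 10):
    executed = fifo.popleft() if fifo else None          # execute segment: first-in, first-out
    if len(fifo) < fifo.maxlen and pc < len(program):    # fetch segment: fill free slots
        fifo.append(program[pc])
        pc += 1
    print(f"clock {clock}: executed={executed}, queue={list(fifo)}")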

Computers with complex instructions require other phases in addition to fetch and execute. Such a

computer processes an instruction with the following sequence of steps:

1. Fetch the instruction from memory.

2. Decode the instruction.

3. Calculate the effective address.

4. Fetch the operands from memory.

5. Execute the instruction.

6. Store the result in the proper place.
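These phases can overlap across consecutive instructions. The short sketch below prints a space-time diagram for an ideal six-segment pipeline, assuming every phase takes one clock cycle and ignoring branch and memory conflicts; the two-letter phase abbreviations are illustrative labels for the six steps above.

# Space-time diagram for an ideal six-segment instruction pipeline
# (assumes one clock per phase and no conflicts; purely illustrative).
PHASES = ["FI", "DI", "CA", "FO", "EX", "ST"]   # fetch, decode, calc. address, fetch operands, execute, store
NUM_INSTRUCTIONS = 4

total_clocks = len(PHASES) + NUM_INSTRUCTIONS - 1
header = "        " + " ".join(f"t{c:<2}" for c in range(1, total_clocks + 1))
print(header)
for i in range(NUM_INSTRUCTIONS):
    row = []
    for clock in range(1, total_clocks + 1):
        phase_index = clock - 1 - i              # instruction i enters the pipe at clock i + 1
        row.append(PHASES[phase_index] if 0 <= phase_index < len(PHASES) else "  ")
    print(f"instr {i+1} " + "  ".join(row))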

RISC Pipeline

Instruction pipelining is often used to enhance performance. Let us consider this in the context of RISC

architecture. In RISC machines most of the operations are register-to-register.

Therefore, the instructions can be executed in two phases:

F: Instruction Fetch to get the instruction.

E: Instruction Execute on register operands and store the results in register.

In general, memory access in RISC is performed through LOAD and STORE operations. For such

instructions the following steps may be needed:

F: Instruction Fetch to get the instruction

E: Effective address calculation for the desired memory operand

D: Memory to register or register to memory data transfer through bus.

Let us explain pipelining in RISC with an example of program execution. Take the following program

(R indicates register).

LOAD RA (Load from memory location A)

LOAD RB (Load from memory location B)

ADD RC, RA, RB (RC = RA + RB)

SUB RD, RA, RB (RD = RA - RB)

MUL RE, RC, RD (RE = RC × RD)

STOR RE (Store in memory location C)

Return to main.

Figure: Sequential Execution of Instructions
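Under the F/E/D view above, the instructions of this program can overlap in a pipeline. The sketch below prints one naive overlapped schedule, assuming one clock per step, that LOAD and STOR need F, E, and D steps while register-to-register instructions need only F and E, and ignoring the data dependencies (for example, ADD needing the loaded RA and RB) and memory conflicts that a real pipeline must resolve with interlocks or delayed loads.

# Naive overlapped schedule for the example program (one clock per step;
# hazards and memory conflicts are ignored for simplicity).
program = [
    ("LOAD RA",        ["F", "E", "D"]),   # D: memory-to-register transfer
    ("LOAD RB",        ["F", "E", "D"]),
    ("ADD RC, RA, RB", ["F", "E"]),
    ("SUB RD, RA, RB", ["F", "E"]),
    ("MUL RE, RC, RD", ["F", "E"]),
    ("STOR RE",        ["F", "E", "D"]),   # D: register-to-memory transfer
]

total_clocks = len(program) + 2            # enough columns for the longest tail
for i, (name, phases) in enumerate(program):
    cells = [" ."] * total_clocks
    for step, phase in enumerate(phases):
        cells[i + step] = f" {phase}"      # instruction i starts at clock i + 1
    print(f"{name:<16}" + "".join(cells))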

Vector Processing

Vector processing is arithmetic or logical computation applied to vectors, whereas in scalar

processing only one item or one pair of data is processed at a time. Therefore, vector processing is faster than

scalar processing. Converting scalar code to vector form is called vectorization. A

vector processor is a special coprocessor, which is designed to handle the vector computations. A vector  

is an ordered set of the same type of scalar data items. The scalar item can be a floating-point number, an

integer or a logical value.

Vector instructions can be classified as below:

(1) Vector-Vector Instructions

In this type, vector operands are fetched from the vector register and stored in another vector register.

These instructions are denoted with the following function mappings:

F1 : V → V

F2 : V × V → V

(2) Vector-Scalar Instructions

In this type, a combination of scalar and vector operands is fetched and the result is stored in a vector register. These

instructions are denoted with the following function mapping:

F3 : S × V → V, where S is a scalar item

(3) Vector-Reduction Instructions

When operations on vectors are reduced to scalar items as the result, these are vector

reduction instructions. These instructions are denoted with the following function mappings:

F4 : V → S

F5 : V × V → S

(4) Vector-Memory Instructions

When vector operations involve memory M, these are vector-memory instructions.

These instructions are denoted with the following function mappings:

F6 : M → V (vector load)

F7 : V → M (vector store)
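These classes can be mimicked with NumPy operations as a rough analogy; the array values and the mapping of each class onto a particular NumPy call are illustrative choices, not definitions from the text.

# Rough NumPy analogies for the classes of vector instructions (illustrative values).
import numpy as np

V1 = np.array([1.0, 2.0, 3.0, 4.0])
V2 = np.array([10.0, 20.0, 30.0, 40.0])
s = 2.5

f1 = np.sqrt(V1)        # F1 : V -> V       (vector-vector, unary)
f2 = V1 + V2            # F2 : V x V -> V   (vector-vector, binary)
f3 = s * V1             # F3 : S x V -> V   (vector-scalar)
f4 = V1.sum()           # F4 : V -> S       (vector reduction)
f5 = np.dot(V1, V2)     # F5 : V x V -> S   (vector reduction, e.g. dot product)

# F6 (M -> V) and F7 (V -> M) correspond to vector load and store between memory
# and vector registers; in NumPy terms, copying data into and out of an array.
memory = np.zeros(4)
f6 = memory.copy()      # "load" a vector from memory
memory[:] = f2          # "store" a vector back to memory

print(f1, f2, f3, f4, f5, memory)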

Array Processing

There is another method for vector operations. If we have an array of n processing elements (PEs), i.e.,

multiple ALUs holding multiple operands of the vector, then an instruction, for example vector

addition, is broadcast to all PEs such that they add all operands of the vector at the same time. That

means all PEs will perform computation in parallel. All PEs are synchronized under one control unit. This

organization of a synchronous array of PEs for vector operations is called an array processor. The organization

is the same as in SIMD. An array processor can handle one instruction and multiple data streams, as we have seen in the case of the SIMD organization. Therefore, array processors are also called SIMD array computers.

The organization of an array processor is shown in Figure below. The following components are

organized in an array processor:

Figure: Organization of SIMD Array Processor

Control Unit (CU): All PEs are under the control of one control unit. The CU controls the intercommunication

between the PEs. There is also a local memory of the CU, called the CU memory. The user programs are loaded

into the CU memory. The vector instructions in the program are decoded by the CU and broadcast to the

array of PEs. Instruction fetch and decoding is done by the CU only.

Processing Elements (PEs): Each processing element consists of an ALU, its registers, and a local memory for

storage of distributed data. The PEs are interconnected via an interconnection network. All PEs

receive the instructions from the control unit and the different component operands are fetched from

their local memory. Thus, all PEs perform the same function synchronously in a lock-step fashion under the control of the CU.

Not all PEs need to participate in the execution of a given vector instruction. Therefore, it

is required to adopt a masking scheme to control the status of each PE. A masking vector is used to

control the status of all PEs such that only enabled PEs are allowed to participate in the execution and

others are disabled.
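A masking vector can be pictured as a Boolean flag per PE. The NumPy sketch below applies a broadcast operation only where the mask enables a "PE"; all values are chosen purely for illustration.

# Illustration of a masking vector: only "enabled" PEs apply the broadcast operation.
import numpy as np

data = np.array([3.0, 6.0, 9.0, 12.0, 15.0])        # one operand per processing element
mask = np.array([True, False, True, True, False])   # True = PE enabled, False = PE disabled

# Broadcast instruction: "add 100 to your operand", executed only by the enabled PEs.
result = np.where(mask, data + 100.0, data)
print(result)   # [103.   6. 109. 112.  15.]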

Interconnection Network (IN): The IN performs data exchange among the PEs, data routing, and data manipulation

functions. The IN is under the control of the CU.

Host Computer: An array processor may be attached to a host computer through the control unit. The

purpose of the host computer is to broadcast a sequence of vector instructions through the CU to the PEs.

Thus, the host computer is a general-purpose machine that acts as a manager of the entire system.

Array processors are special-purpose computers which have been adopted for the following:

• various scientific applications,

• matrix algebra,

• matrix eigenvalue calculations,

• real-time scene analysis.

A large-scale SIMD array processor was developed by NASA for Earth-resources satellite image

processing. This computer was named the Massively Parallel Processor (MPP) because it contains

16,384 processors that work in parallel. The MPP provides real-time, time-varying scene analysis.

However, array processors are not commercially popular and are not commonly used. The reasons are that array processors are difficult to program compared to pipelined machines and there are problems with

vectorization.

Array processing is used in radar, sonar, seismic exploration, anti-jamming, and wireless

communications.

References:

J. P. Hayes, “Computer Architecture and Organization”, McGraw Hill, 3rd Ed., 1998.

M. Morris Mano, “Computer System Architecture”, Pearson, 3rd Ed., 2004.

www.google.com

en.wikipedia.com

Assignments:

(1) Differentiate between an instruction pipeline and an arithmetic pipeline.

(2) Identify the factors that limit the speed of pipelining.

(3) What is the difference between scalar and vector processing?

A Gentle Advice: Please go through your textbooks and reference books for detailed study! Thank you all.

Notes Compiled By: Bijay Mishra

biizay.blogspot.com

9813911076 or 9841695609