Cache and Pipeline


DESCRIPTION

Cache addressing.

TRANSCRIPT

  • Microprocessors & Microcontrollers 1 RN Biswas

    Cache and Pipeline

    Prof. R. N. Biswas

  • Microprocessors & Microcontrollers 2 RN Biswas

    Improvement of Speed by Cache

    A Cache is a high-speed memory interposed between the processor and the slower Main Memory, enabling faster access to data/code.

    Primary or L1 cache is at the chip level; Secondary or L2 cache is at the board level.

    Cache reduces access time by exploiting Locality of Reference.

    Holds the more frequently used data/code.

    Frees the external bus for other operations.
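
    One way to see the benefit quantitatively (the formula and the figures below are an illustration, not from the slides) is the average access time with hit ratio h: t_avg = h * t_cache + (1 - h) * t_main. A minimal Python sketch:

      # Illustrative only: the hit ratio and access times are assumed values,
      # not figures from the slides.
      def average_access_time(hit_ratio, t_cache_ns, t_main_ns):
          """Average memory access time for a single-level cache."""
          return hit_ratio * t_cache_ns + (1.0 - hit_ratio) * t_main_ns

      # With a 95% hit ratio, a 10 ns cache and a 100 ns main memory:
      print(average_access_time(0.95, 10, 100))   # 14.5 ns vs. 100 ns without a cache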

  • Microprocessors & Microcontrollers 3 RN Biswas

    [Figure: Direct-mapped Cache. Structure of Main Memory Address: Tag (5 bits) | Block (7 bits) | Word (4 bits). Main memory blocks M0-M4095 map onto cache blocks C0-C127: block Mi goes to cache block C(i mod 128), so M0, M128, ..., M3968 all compete for C0 and are distinguished by the stored tag (tag values 0-31).]
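
    The field widths above imply a 16-bit address split as follows; the Python below is an assumed illustration of that split (the function name is made up), not code from the lecture:

      # Direct-mapped cache addressing with the slide's field widths:
      # 5-bit tag | 7-bit block index | 4-bit word offset (16-bit address).
      TAG_BITS, BLOCK_BITS, WORD_BITS = 5, 7, 4

      def split_direct_mapped(address):
          """Return (tag, cache_block_index, word_offset) for a 16-bit address."""
          word  = address & ((1 << WORD_BITS) - 1)
          block = (address >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
          tag   = address >> (WORD_BITS + BLOCK_BITS)
          return tag, block, word

      # Main-memory block M3968 starts at address 3968 * 16; it lands in
      # cache block C0 with tag 31, as in the slide's figure.
      print(split_direct_mapped(3968 * 16))   # (31, 0, 0)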

  • Microprocessors & Microcontrollers 4 RN Biswas

    [Figure: Fully Associative Cache. Structure of Main Memory Address: Tag (12 bits) | Word (4 bits). Any main memory block M0-M4095 may be placed in any cache block C0-C127; the full 12-bit block number is stored as the tag (tag values 0-4095).]
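
    With no block-index field, the controller must compare the 12-bit tag against every stored tag (in hardware this comparison is done in parallel). A minimal sketch, assuming a simple list of stored tags; all names are illustrative:

      TAG_BITS, WORD_BITS = 12, 4

      def split_fully_associative(address):
          word = address & ((1 << WORD_BITS) - 1)
          tag  = address >> WORD_BITS
          return tag, word

      def lookup(cache_tags, address):
          """cache_tags: list of 128 stored tags (or None if the block is empty)."""
          tag, _word = split_fully_associative(address)
          for block_index, stored_tag in enumerate(cache_tags):
              if stored_tag == tag:
                  return block_index          # cache hit
          return None                         # cache miss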

  • Microprocessors & Microcontrollers 5 RN Biswas

    [Figure: Set-associative Cache (4 blocks/set). Structure of Main Memory Address: Tag (7 bits) | Set (5 bits) | Word (4 bits). The 128 cache blocks C0-C127 are grouped into 32 sets (Set 0-Set 31) of 4 blocks each; main memory block Mi maps to set (i mod 32) and may occupy any of that set's 4 blocks, distinguished by the stored tag (tag values 0-127).]
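
    Here the 5-bit set field selects one of the 32 sets and only that set's four tags are compared. A minimal sketch under those assumptions (names are illustrative):

      # 4-way set-associative addressing: 7-bit tag | 5-bit set | 4-bit word.
      TAG_BITS, SET_BITS, WORD_BITS = 7, 5, 4
      WAYS = 4

      def split_set_associative(address):
          word = address & ((1 << WORD_BITS) - 1)
          set_index = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
          tag = address >> (WORD_BITS + SET_BITS)
          return tag, set_index, word

      def lookup(sets, address):
          """sets: list of 32 lists, each holding the 4 tags stored in that set."""
          tag, set_index, _word = split_set_associative(address)
          for way, stored_tag in enumerate(sets[set_index]):
              if stored_tag == tag:
                  return set_index, way       # cache hit
          return None                         # cache miss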

  • Microprocessors & Microcontrollers 6 RN Biswas

    Cache Access and Update Sequence

    The CPU places the memory address on the bus.

    The Cache Controller compares the tag field of the address with the tags in the selected set:

    Cache miss: main memory is accessed and the fetched contents are stored in the cache.

    Cache hit: the cache itself is accessed.

    A cache write requires a memory update:

    Write-back - memory is updated only when the cache block is replaced by a new one from memory.

    Write-through - memory is updated on every write.
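
    A toy model of this sequence for a direct-mapped cache, reusing the field widths from the earlier slides; the class and its behaviour are an assumed illustration, not the lecture's code:

      TAG_BITS, BLOCK_BITS, WORD_BITS = 5, 7, 4

      class Cache:
          def __init__(self, write_back=True):
              self.tags  = [None] * (1 << BLOCK_BITS)    # stored tag per cache block
              self.dirty = [False] * (1 << BLOCK_BITS)
              self.write_back = write_back

          def access(self, address, is_write=False):
              block = (address >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
              tag   = address >> (WORD_BITS + BLOCK_BITS)
              hit = self.tags[block] == tag
              if not hit:
                  # Cache miss: fetch the block from main memory; with write-back,
                  # a dirty block being replaced is written to memory first.
                  if self.write_back and self.dirty[block]:
                      pass                     # (write the old block back to memory here)
                  self.tags[block] = tag
                  self.dirty[block] = False
              if is_write:
                  if self.write_back:
                      self.dirty[block] = True # memory updated only on replacement
                  else:
                      pass                     # write-through: update memory immediately
              return "hit" if hit else "miss"

      c = Cache(write_back=True)
      print(c.access(0x1230))                  # first access to the block: miss
      print(c.access(0x1234, is_write=True))   # same block: hit, block becomes dirty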

  • Microprocessors & Microcontrollers 7 RN Biswas

    Speed Improvement by Pipelining

    Processor speed can be enhanced by having separate hardware units for the different functional blocks, with buffers between the successive units.

    The number of unit operations into which the instruction cycle of a processor can be divided for this purpose defines the number of stages in the pipeline.

    A processor with an n-stage pipeline can have up to n instructions being processed simultaneously by its different functional units.

    Effective processor speed increases ideally by a factor equal to the number of pipelining stages.
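
    As a worked check of that last claim, using the standard ideal-speedup expression (which the slide does not state explicitly): with k stages and n instructions the pipelined execution takes k + n - 1 stage times instead of n * k, so the speedup n * k / (k + n - 1) approaches k for large n.

      # Ideal pipeline speedup, assuming equal stage times and no stalls.
      def ideal_speedup(stages, instructions):
          return (stages * instructions) / (stages + instructions - 1)

      print(ideal_speedup(4, 1000))   # ~3.99, close to the 4x ideal for a 4-stage pipeline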

  • Microprocessors & Microcontrollers 8 RN Biswas

    Typical Pipeline Organisation

    A common choice is to have four such units:

    Fetch: Fetch the instruction code from the memory;

    Decode: Decode the Op Code and fetch operand(s);

    Operate: Perform operation required by the op code;

    Write: Store the result in the destination location.

    A four-stage pipeline would require three buffers, each separating two functional units of the processor.

    The Write cycle of I1, the Operate cycle of I2, the Decode cycle of I3 and the Fetch cycle of I4 take place in the same time slot, and each must be completed within the stage time prescribed by the pipeline design.

  • Microprocessors & Microcontrollers 9 RN Biswas

    A Four-stage Pipeline
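
    The slide's diagram is not reproduced in the transcript; the sketch below (an assumed illustration) prints the kind of overlap chart such a diagram shows, with instruction Ii advancing one stage per time slot:

      # Print an ideal 4-stage pipeline schedule: instruction i enters Fetch at
      # time slot i and advances one stage per slot.
      STAGES = ["F", "D", "O", "W"]

      def print_schedule(num_instructions):
          slots = num_instructions + len(STAGES) - 1
          for i in range(num_instructions):
              row = ["  "] * slots
              for s, stage in enumerate(STAGES):
                  row[i + s] = stage + str(i + 1)
              print(" ".join(row))

      print_schedule(4)
      # F1 D1 O1 W1
      #    F2 D2 O2 W2
      #       F3 D3 O3 W3
      #          F4 D4 O4 W4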

  • Microprocessors & Microcontrollers 10 RN Biswas

    Data Dependency in Pipelining

    If the input data for an instruction depends on the outcome of the previous instruction, the Write cycle of the previous instruction has to be over before the Operate cycle of the next instruction can start. The pipeline effectively idles through one instruction, creating a bubble in the pipeline which persists for several instructions.

    [Figure: pipeline bubble caused by the data dependency between I1 and I2 (I2's Operate cycle must wait for I1's Write cycle):

      Slot:  1   2   3    4     5   6   7   8
      I1:    F1  D1  O1   W1
      I2:        F2  D2   idle  O2  W2
      I3:            F3   idle  D3  O3  W3
      I4:                       F4  D4  O4  W4   <- bubble ends here]

  • Microprocessors & Microcontrollers 11 RN Biswas

    Branch Dependency in Pipelining

    A Branch instruction can cause a pipeline stall if the branch is taken, as the next instruction has to be aborted in that case. If I1 is an unconditional branch instruction, the next Fetch cycle (F2) can start after D1. But if I1 is a conditional branch instruction, F2 has to wait until O1 for the decision as to whether the branch will be taken or not.

    [Figure: Fetch timing of the instruction following a branch I1:

      I1:  F1  D1  O1  W1              (branch instruction)
      I2:      F2  D2  O2  W2          (executed if the branch is not taken)
      I2:          F2  D2  O2  W2      (executed for an unconditional branch)
      I2:              F2  D2  O2  W2  (for a conditional branch, if taken)]

  • Microprocessors & Microcontrollers 12 RN Biswas

    Avoidance of Pipeline Bubbles

    Data Dependency - an instruction unaffected by the write operation has to be placed in the Load Delay Slot (a toy reordering example is sketched below).

    Branch Dependency - the branch instruction has to perform a delayed branch, with instructions preceding the branch placed in the Branch Delay Slots.

    This requires optimising compilers to be written along with the design of the microprocessor.
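
    A toy sketch of the compiler's job for the data-dependency case: move an instruction that does not use the load result into the Load Delay Slot. The instruction format and the scheduling rule here are simplified assumptions, not any real ISA or compiler:

      # Hypothetical instruction tuples: (text, destination register, source registers).
      program = [
          ("LOAD  R1, [A]",    "R1", []),
          ("ADD   R2, R1, R3", "R2", ["R1"]),   # depends on the load result
          ("MOV   R5, R6",     "R5", ["R6"]),   # independent of the load
      ]

      def fill_load_delay_slot(prog):
          """Move the first instruction that neither reads nor writes the load's
          destination into the slot right after the load."""
          _load_text, load_dest, _ = prog[0]
          for i, (_text, dest, srcs) in enumerate(prog[1:], start=1):
              if load_dest not in srcs and dest != load_dest:
                  prog.insert(1, prog.pop(i))
                  break
          return prog

      for text, _, _ in fill_load_delay_slot(program):
          print(text)
      # LOAD  R1, [A]
      # MOV   R5, R6       <- now fills the load delay slot
      # ADD   R2, R1, R3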