
CS 152 Computer Architecture

Final Project Report

Spring 2001
Prof. Kubiatowicz

Section: 2-4 pm Wednesday
TA: Ed Liao

Team members:

Lin Zhou 14255590
Sonkai Lao 14429797
Wen Hui Guan 12270914
Wilma Yeung 14499996


Table of Contents:

Abstract
Division of Labor
Detailed Strategy
Results
Conclusion
Appendix I (notebooks)
Appendix II (schematics)
Appendix III (VHDL files)
Appendix IV (testing)
Appendix V (delay time table)


Abstract:

We implemented a deep-pipelined processor with seven stages, along with a branch predictor, separate instruction and data caches, and a stream buffer. Starting from the original 5-stage pipeline, we divided the instruction fetch stage into two, and did the same with the execution stage. In the first fetch stage, the instruction is read from the instruction cache at the PC, while the branch decision is predicted in the second fetch stage. Since the ALU is the functional unit with the worst delay (15ns), we broke it into two 16-bit adders to achieve better performance; each adder is paired with a logic unit and placed in a different execution stage. We also added a branch predictor in the second stage. With this implementation, our pipelined processor can sustain a cycle time of 22.5ns with an ideal memory system.

To optimize our memory system, we implemented two cache subsystems, one for data and the other for instructions. Because memory accesses to the instruction cache are frequently sequential, we also attached a stream buffer to it. The performance metrics we used for the caches are hit time and miss penalty. The hit time for both the data and instruction caches is 7ns. The miss penalty for the data cache is estimated at 124.5ns, and that for the instruction cache at 422ns, based on results from running the mystery program with the ideal memory.

Division of Labor:

Project Design Areas                      Team Member(s) Involved
Processor datapath and controllers        Sonkai Lao, Lin Zhou
Cache system, controllers, and arbiter    Wilma Yeung, Wen Hui Guan
Branch target predictor                   Sonkai Lao
Tracer monitor                            Wen Hui Guan
Forwarding unit                           Wilma Yeung
Report write-up                           Lin Zhou, Wen Hui Guan

Detailed Strategy:

a. General Design Strategy

Memory System

(1) Instruction Cache and Stream Buffer


A stream buffer is added to reduce compulsory and capacity misses, since instructions tend to execute sequentially in runs of four to five instructions between branches and jumps. We design the instruction cache and stream buffer with a fully associative lookup method; thus, each cache and buffer block is associated with a comparator.

For the instruction cache, we implement a FIFO replacement policy, enforced by a counter that selects a block for replacement in the fashion of a clock hand. When a read miss occurs on both the cache and the buffer, the buffer is flushed. The counter advances after the instruction cache reads in 2 sequential words for a block, and the buffer controller then fills the buffer with the next 4 sequential words. This sequence of reads corresponds to two memory accesses under our implementation of 3-word burst requests. Based on this implementation, the miss penalty is about 422ns (see Results). When a hit is found in the cache, the hit time is 7ns. If the requested word misses in the cache but hits in the buffer, it costs only one extra ns to bring the word to the NPC.

(2) Data Cache Re-design (with fully associative, burst request, and write-back)

We implement a fully associative access method for the data cache, along with a write-back write policy. Since we employ two 32-bit DRAM banks in parallel, we connect the data cache to memory via a 64-bit bus. We also implement the FIFO replacement policy, enforced by a counter that selects a block for replacement like a clock hand, as in the instruction cache. The counter advances only when there is a read miss or a write miss.
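As a sketch of this replacement mechanism, the following minimal VHDL process implements the clock-hand counter, assuming 8 blocks per cache and an advance pulse asserted by the cache controller on a miss. Entity and port names here are illustrative, not the project's actual Counter_3bit.vhd.

-- Sketch only: names are illustrative, not the actual Counter_3bit.vhd.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fifo_replace_counter is
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;
    advance : in  std_logic;                    -- pulse on read/write miss
    victim  : out std_logic_vector(2 downto 0)  -- block chosen for replacement
  );
end entity;

architecture rtl of fifo_replace_counter is
  signal count : unsigned(2 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        count <= (others => '0');
      elsif advance = '1' then
        count <= count + 1;  -- wraps mod 8, sweeping the blocks like a clock hand
      end if;
    end if;
  end process;
  victim <= std_logic_vector(count);
end architecture;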

To increase the performance of the cache, we design it so that when a write to the DRAM is required, the cache writes only one word, instead of two, to memory. However, on a read miss, when a block is selected for replacement, two words are read into the block to take advantage of spatial locality. This implementation also reduces the write overhead, since a word is written back to the DRAM only if its corresponding dirty bit is set. Therefore, we implement each block with two dirty bits.

Write misses and write hits involve a little complication, since we need to update only one word (32 bits) via a 64-bit bus while keeping the implementation simple. For a write miss, the block selected for replacement is handled in the same manner as for a read miss: we request a read of two words from the DRAMs and set the valid bit after writing any dirty word(s) back to memory.


However, to put these two scenarios together and fit them onto a 64-bit input bus, we need to consider several write cases. Our approach is to keep two valid bits and two dirty bits for each block. For example, on a write hit, we simply update the appropriate word and set its dirty bit. For a compulsory write miss, we need to update the tag, the cache word, and the corresponding valid bit; by keeping two valid bits, we can do no-allocate-on-write to reduce the overhead. Because the data source for a write can be either the CPU or the DRAM, the target block may be empty (no valid bit set), and the input data arrives on a 64-bit bus, we must choose the 64-bit input data appropriately. The data cache controller therefore distinguishes the input source and generates a 3-bit source-selection signal to choose the input data.
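The per-word bookkeeping can be sketched as follows for a single two-word block. This is a hedged illustration of the scheme described above, not the project's actual VHDL; all port names are assumed.

-- Sketch only: a single block's valid/dirty state, port names assumed.
library ieee;
use ieee.std_logic_1164.all;

entity block_state is
  port (
    clk       : in  std_logic;
    fill      : in  std_logic;   -- 2-word refill arriving from the DRAMs
    write_hit : in  std_logic;   -- CPU write found this block
    word_sel  : in  std_logic;   -- which 32-bit half of the 64-bit bus
    valid     : out std_logic_vector(1 downto 0);
    dirty     : out std_logic_vector(1 downto 0)
  );
end entity;

architecture rtl of block_state is
  signal v, d : std_logic_vector(1 downto 0) := "00";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if fill = '1' then
        v <= "11";  -- both words valid after the 2-word read
        d <= "00";  -- freshly read data is clean
      elsif write_hit = '1' then
        if word_sel = '0' then
          v(0) <= '1'; d(0) <= '1';  -- only the written word becomes dirty
        else
          v(1) <= '1'; d(1) <= '1';
        end if;
      end if;
    end if;
  end process;
  valid <= v;
  dirty <= d;
end architecture;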

(3) Data Cache controller design

As described in the data cache design section, write hits and read hits are handled similarly: a read hit selects the right word to output to the CPU, while a write hit updates the appropriate word. Both cases are handled in the same state by recognizing the type of hit and generating the corresponding selection signals.

On a miss of either kind, the cache requests permission to access the DRAM through the arbiter (described below) and must wait for the arbiter's grant signal before performing the write or read. As the state diagram shows, hits are handled in the START state. When a miss is triggered, the controller goes to the GRANT_W or GRANT_R state until write or read permission is granted. Once permission is granted, the controller transitions to the write or read state and waits until the DRAM is idle. When the operation completes, the controller returns to the initial state.

It is possible for a read miss to occur right after a write miss is serviced. As shown in the diagram, in this case the state machine can request the read immediately after finishing the write to the DRAM, which saves one cycle. The state diagram is in Appendix II (Data_cache_fsm.ppt).
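A minimal VHDL sketch of this state machine, including the write-to-read shortcut, might look as follows. State and signal names are illustrative; the real controller is Data_cache_ctrl.vhd.

-- Sketch only: names are illustrative, not the actual Data_cache_ctrl.vhd.
library ieee;
use ieee.std_logic_1164.all;

entity dcache_ctrl is
  port (
    clk, reset       : in  std_logic;
    miss_rd, miss_wr : in  std_logic;
    grant, dram_idle : in  std_logic;
    pending_rd       : in  std_logic;  -- read miss queued behind a write miss
    req_rd, req_wr   : out std_logic
  );
end entity;

architecture rtl of dcache_ctrl is
  type state_t is (START, GRANT_W, GRANT_R, WR, RD);
  signal st : state_t := START;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        st <= START;
      else
        case st is
          when START =>                 -- hits are serviced here in one cycle
            if miss_wr = '1' then
              st <= GRANT_W;
            elsif miss_rd = '1' then
              st <= GRANT_R;
            end if;
          when GRANT_W => if grant = '1' then st <= WR; end if;
          when GRANT_R => if grant = '1' then st <= RD; end if;
          when WR =>
            if dram_idle = '1' then
              if pending_rd = '1' then
                st <= GRANT_R;          -- request the read right after the write
              else
                st <= START;
              end if;
            end if;
          when RD => if dram_idle = '1' then st <= START; end if;
        end case;
      end if;
    end if;
  end process;
  req_wr <= '1' when st = GRANT_W else '0';
  req_rd <= '1' when st = GRANT_R else '0';
end architecture;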

(4) Instruction and Stream Buffer controllers design

The instruction cache and stream buffer controllers are two communicating state machines. On a hit in either the cache or the buffer, both machines stay in the START state. On a buffer hit, the buffer controller, while in START, directs the buffer to output the requested word instead.

On misses, the buffer controller takes the active role of communicating with the arbiter and memory controller to request a sequence of 6 words (burst read 3 times, two words each). Once the arbiter grants the memory access request to the buffer controller, the first two words are written into the selected cache block while the next 4 sequential words are put into the buffer. After each burst read from memory, the buffer controller waits one cycle to let the DRAM recover. It is also quite possible for a write miss to be followed by a read miss; we handle this case by requesting memory access again immediately after the write miss is serviced, because the availability of instruction data is critical to processor performance. The instruction cache and stream buffer controller state diagrams are in Appendix II (Instr_cache_fsm.ppt and Stream_buffer.fsm.ppt).

(5) Memory controller design

The memory controller (memctrl) is a synchronous process containing two asynchronous components: a memory read component (mem_read) and a memory write component (mem_write). The synchronous inputs of memctrl trigger mem_read or mem_write to perform the operation, and mem_read and mem_write send the result signals back as memctrl outputs after a certain delay. Separating mem_read and mem_write makes it easy for memctrl to distinguish read and write operations. To increase the throughput of the DRAMs, we implemented a 3-word burst read request: within the RAS and CAS timing constraints, for each RAS assertion we toggle the CAS signal three times.
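The burst sequencing can be sketched as a small FSM that asserts RAS once and pulses CAS three times. This is a simplified illustration with row/column addresses and data omitted for brevity; signal names are assumptions, not the actual Memctrl.vhd interface.

-- Sketch only: one RAS assertion, three CAS pulses; names assumed.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity burst_read is
  port (
    clk, start   : in  std_logic;
    ras_n, cas_n : out std_logic;  -- active-low DRAM strobes
    done         : out std_logic
  );
end entity;

architecture rtl of burst_read is
  type state_t is (IDLE, ROW, CAS_LO, CAS_HI, RECOVER);
  signal st    : state_t := IDLE;
  signal burst : unsigned(1 downto 0) := "00";  -- counts the three CAS pulses
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case st is
        when IDLE =>
          if start = '1' then st <= ROW; burst <= "00"; end if;
        when ROW    => st <= CAS_LO;        -- row latched, RAS held low
        when CAS_LO => st <= CAS_HI;        -- column latched on falling CAS
        when CAS_HI =>
          if burst = 2 then
            st <= RECOVER;                  -- three columns read
          else
            burst <= burst + 1; st <= CAS_LO;
          end if;
        when RECOVER => st <= IDLE;         -- let the DRAM recover before the next access
      end case;
    end if;
  end process;
  ras_n <= '0' when (st = ROW or st = CAS_LO or st = CAS_HI) else '1';
  cas_n <= '0' when st = CAS_LO else '1';
  done  <= '1' when st = RECOVER else '0';
end architecture;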

(6) Memory arbiter

We design the arbiter as a state machine with 3 states: Start, Instruction, and Data. From the Start state, any request from a data or instruction memory access moves the machine to the respective state and asserts the appropriate memory request signal. When requests arrive from both instruction and data at the same time, the data request takes priority and the instruction fetch register is stalled until the data access completes, at which point the instruction request is serviced.
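A minimal sketch of this three-state arbiter follows; port names are illustrative, not the actual Final_mem_arbiter.vhd interface.

-- Sketch only: names are illustrative, not the actual Final_mem_arbiter.vhd.
library ieee;
use ieee.std_logic_1164.all;

entity mem_arbiter is
  port (
    clk, reset              : in  std_logic;
    req_data, req_instr     : in  std_logic;
    busy                    : in  std_logic;  -- memory access still in progress
    grant_data, grant_instr : out std_logic
  );
end entity;

architecture rtl of mem_arbiter is
  type state_t is (START, DATA, INSTR);
  signal st : state_t := START;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        st <= START;
      else
        case st is
          when START =>
            if req_data = '1' then
              st <= DATA;                    -- data always wins a tie
            elsif req_instr = '1' then
              st <= INSTR;
            end if;
          when DATA =>
            if busy = '0' then
              if req_instr = '1' then
                st <= INSTR;                 -- stalled fetch is serviced next
              else
                st <= START;
              end if;
            end if;
          when INSTR =>
            if busy = '0' then st <= START; end if;
        end case;
      end if;
    end if;
  end process;
  grant_data  <= '1' when st = DATA  else '0';
  grant_instr <= '1' when st = INSTR else '0';
end architecture;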

Datapath


(7) Main structure of the 7-stage deep pipeline (overview).

(8) Branch Target Predictor and controller

To implement the branch target predictor, we put a branch prediction table in the second instruction fetch stage. The prediction table maintains 16 entries of PC, predicted PC, and next-state bits. The prediction is based on the input PC from the next-PC logic. The prediction table is associated with a 2-bit state machine, which always predicts "not taken" the first time. The predicted NPC is latched into the next-PC logic, but the actual decision is made in the first EXE stage. If the prediction is wrong, the instruction registers are flushed, costing two extra cycles of delay.
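The 2-bit decision logic for a single table entry can be sketched as follows. This is a hedged illustration assuming a saturating-style state machine that resets to strong not-taken, consistent with the "always predicts not taken the first time" behavior above; names are illustrative, not the actual B_table_ctrl.vhd.

-- Sketch only: 2-bit predictor for one entry, names assumed.
library ieee;
use ieee.std_logic_1164.all;

entity predict_fsm is
  port (
    clk, reset    : in  std_logic;
    update        : in  std_logic;  -- actual outcome resolved in EXE1
    taken         : in  std_logic;  -- resolved branch direction
    predict_taken : out std_logic
  );
end entity;

architecture rtl of predict_fsm is
  -- Reset to strong not-taken so the first prediction is "not taken".
  type state_t is (STRONG_NT, WEAK_NT, WEAK_T, STRONG_T);
  signal st : state_t := STRONG_NT;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        st <= STRONG_NT;
      elsif update = '1' then
        case st is
          when STRONG_NT => if taken = '1' then st <= WEAK_NT;   end if;
          when WEAK_NT   => if taken = '1' then st <= WEAK_T;   else st <= STRONG_NT; end if;
          when WEAK_T    => if taken = '1' then st <= STRONG_T; else st <= WEAK_NT;   end if;
          when STRONG_T  => if taken = '0' then st <= WEAK_T;    end if;
        end case;
      end if;
    end if;
  end process;
  predict_taken <= '1' when (st = WEAK_T or st = STRONG_T) else '0';
end architecture;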

[Figure: overview of the 7-stage pipeline (IF1, IF2, DEC, EXE1, EXE2, MEM, WB). In IF2, the PC indexes the instruction cache and the branch predict table, and the next-PC logic selects the predicted PC or the correct PC. The 2-bit predictor state diagram has Predict Taken and Predict Not Taken states with taken (T) / not-taken (NT) transitions; a wrong decision detected in EXE1 flushes the fetched instructions.]

(9) ALU Modification

To reduce the cycle time, we divide the original 32-bit ALU into two 16-bit adders. Addition and subtraction are handled by the adders across the two execution stages: the adder in EXE1 calculates the lower 16 bits of the operands, while the upper 16 bits are calculated by the other adder in EXE2. Each execution stage also has a 32-bit logic unit, which handles the logic operations and shifts. The reason for having two logic units is to avoid stalls between arithmetic and shift/logic operations. The two logic units operate in different stages, and the output of the first logic unit is forwarded to the EXE2 stage. If there is a RAW hazard, the controller detects it and selects the right logic unit output through the MUX controls; forwarding then completes as usual. For example:

Choose the output of the first logic unit:
  sll   $1, $2, $3
  addiu $4, $1, 3

Choose the output of the second logic unit:
  addiu $1, $2, $3
  sll   $4, $1, 2

In addition, gate-level logic blocks, such as the SLT logic unit, are built to reduce the cycle time of the critical path.
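The split addition can be sketched as follows, with the lower half computed in EXE1 and the carry latched into EXE2. This is a simplified illustration of the scheme; the actual implementation is in Half_alu.vhd, and all names here are assumed.

-- Sketch only: 32-bit add split across EXE1/EXE2, names assumed.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity split_add is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(31 downto 0);
    sum  : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of split_add is
  signal lo_q           : unsigned(15 downto 0);  -- EXE1/EXE2 pipeline register
  signal carry_q        : unsigned(0 downto 0);
  signal a_hi_q, b_hi_q : unsigned(15 downto 0);
begin
  -- EXE1: add the lower 16 bits and latch the carry for the next stage.
  process (clk)
    variable lo : unsigned(16 downto 0);
  begin
    if rising_edge(clk) then
      lo := resize(unsigned(a(15 downto 0)), 17)
          + resize(unsigned(b(15 downto 0)), 17);
      lo_q    <= lo(15 downto 0);
      carry_q <= lo(16 downto 16);
      a_hi_q  <= unsigned(a(31 downto 16));  -- upper halves ride along to EXE2
      b_hi_q  <= unsigned(b(31 downto 16));
    end if;
  end process;
  -- EXE2: add the upper 16 bits plus the latched carry.
  sum(15 downto 0)  <= std_logic_vector(lo_q);
  sum(31 downto 16) <= std_logic_vector(a_hi_q + b_hi_q + resize(carry_q, 16));
end architecture;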

(10) Forwarding Unit

[Figure (section 9): the split ALU datapath. In EXE1, a 16-bit adder takes A[15:0] and B[15:0] while a 32-bit logic unit takes A[31:0] and B[31:0]; the carry and the upper halves A_Half[31:16], B_Half[31:16] pass through a register to EXE2, where the second 16-bit adder, the second logic unit, and the SLT logic complete the result. MUXes select between Out_1[31:0] and Out_2[31:0], and results are forwarded to DEC through eight 32-bit muxes (Mux32 x8).]

The forwarding unit (hazard controller) is written in VHDL. It compares the register numbers among the decode, 1st execution, 2nd execution, and memory stages. When a data dependency is detected, it forwards the needed value to the earlier stage, thereby avoiding RAW hazards. To implement the forwarding unit, we code the VHDL to generate the selection signals for the various MUXes that steer the correct data flow based on the comparison results. There are 2 muxes at the decode stage, 4 muxes at the 1st execution stage, and 5 muxes at the 2nd execution stage. The unit also controls which logic unit to use for each shift/logic operation.
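For one operand, the comparison logic reduces to a priority select over the producing stages; the nearest younger producer wins. A minimal sketch follows, in which the write-enable inputs and mux encodings are assumptions, not the actual Hazardcontroller.vhd interface.

-- Sketch only: forwarding select for one source operand, names assumed.
library ieee;
use ieee.std_logic_1164.all;

entity forward_unit is
  port (
    rs_dec                    : in  std_logic_vector(4 downto 0); -- source reg in DEC
    rd_ex1, rd_ex2, rd_mem    : in  std_logic_vector(4 downto 0); -- dest regs downstream
    wen_ex1, wen_ex2, wen_mem : in  std_logic;                    -- stages that will write back
    fwd_sel                   : out std_logic_vector(1 downto 0)  -- mux select for the operand
  );
end entity;

architecture rtl of forward_unit is
begin
  -- Priority order EXE1, EXE2, MEM: the nearest younger producer wins.
  -- $0 never forwards, since it is hardwired to zero.
  fwd_sel <= "01" when (wen_ex1 = '1' and rd_ex1 = rs_dec and rs_dec /= "00000") else
             "10" when (wen_ex2 = '1' and rd_ex2 = rs_dec and rs_dec /= "00000") else
             "11" when (wen_mem = '1' and rd_mem = rs_dec and rs_dec /= "00000") else
             "00";  -- no hazard: take the register-file value
end architecture;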

b. Testing Strategy

Delay modeling of major components:

(1) Forwarding Unit

The forwarding unit generates the selection signals for the muxes by comparing registers from different stages. In hardware it can be built from muxes and simple logic gates, so we model its delay as 5ns.

(2) Predict entry

The predict entry keeps the PC, the NPC, and the state bits of the prediction state machine. It can be implemented as a register, so a delay of 3ns is reasonable.

(3) Branch Table Controller

The state transitions of the controller follow edge-triggered flip-flops. Since all the output signals result from simple combinational logic, we model the controller with the same delay as a register, which is 6ns.

[Figure: forwarding paths among the DEC, ALU1 (EXE1), ALU2 (EXE2), MEM, and WB stages. Forwarding muxes at each stage select among the Comp, Arith, Logic, Shift, and lw results.]

(4) Comparator

The 9-bit comparator used to compare tags can be implemented with 9 2-input NAND gates and 9 inverters. However, ANDing the 9 single-bit results would require a 9-input AND gate, whose high fan-in increases the delay; breaking it into multiple levels instead adds the delay of 3 levels of gates, excluding inverters. Therefore, we give it a delay of 6ns.

(5) 3-bit counter

The 3-bit counter can be implemented with 3 toggle flip-flops in parallel and a modest amount of combinational logic to select the output bits, so the counter's delay is tied to the flip-flop delay. It is reasonable to model a toggle flip-flop's delay as 2ns, so a 2ns delay for the 3-bit counter seems justifiable.

(6) Cache block select decoder

The decoder takes a 3-bit value as input and outputs 8 bits in parallel with at most one bit asserted, based on that value. This can be realized as one level of 8 NAND gates in parallel, excluding inverters, so we assign it a delay of 2ns.
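Behaviorally, the decoder is a one-line VHDL process. This is a sketch; the project's actual file is Block_sel_decoder.vhd, and its port names may differ.

-- Sketch only: 3-to-8 one-hot decoder, names assumed.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity block_sel_decoder is
  port (
    sel : in  std_logic_vector(2 downto 0);
    en  : out std_logic_vector(7 downto 0)  -- one-hot block enables
  );
end entity;

architecture rtl of block_sel_decoder is
begin
  process (sel)
  begin
    en <= (others => '0');
    en(to_integer(unsigned(sel))) <= '1';  -- assert exactly one enable line
  end process;
end architecture;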

(7) Instruction, Data Cache controllers

The cache controllers can be implemented with pure combinational logic for the next-state and output functions, with state transitions triggered by flip-flops. Consistent with the delay table in Appendix V, we model them with a 6ns delay.

(8) Memory controller

The state transitions of the controller follow edge-triggered flip-flops. Since all the output signals result from simple combinational logic, we model the controller with the same delay as a register, which is 6ns.

(9) Memory arbiter

The memory arbiter is designed as a state machine with direct state-bit assignment, so its structure is two-level combinational logic. As for the memory controller, we use a 6ns delay to model the arbiter.

(10) Other common components are modeled with the same values as in labs 4 through 6.

Memory System Testing Methodology

Cache, stream buffer, and controllers:

(1) We first test the individual cache and controller components; the individual testing scripts (.cmd files) can be found in Appendix IV. Then we do a comprehensive test of the memory cache system as a whole. Read hits are tested in a straightforward manner by addressing cache contents and verifying the values.

(2) In particular, we check the correct handling of the replacement policy by first performing write misses on all blocks. An additional write miss then chooses the first block for replacement because of the FIFO policy; as a result, the dirty word is written back to the DRAM before the new word is written into the cache block. We check the memory contents for the correct update. Read misses are tested similarly.

(3) For the instruction cache, we particularly check a read miss on the cache with a first hit on the buffer. Reading the next 3 sequential words should then cause no unnecessary delay or stall, because the buffer already holds the 4 sequential words. After that, a miss on both the cache and the buffer should flush the buffer and bring in another 6 sequential words.

Memory controller:

(1) We first test the basic write and read operations and verify the corresponding memory contents and outputs.

(2) To test the 3-word burst read from memory, we input a row address and then lower the RAS signal. During this period, we supply the controller with three consecutive column addresses, each followed by lowering the CAS signal. After one DRAM access, we expect 6 sequential words to be output, since we have two DRAMs in parallel and access both at the same time.

Comprehensive testing:

(3) Our testing strategy follows a hierarchical order. We first break the pipeline into its 7 stages and test it with the mystery program used in the previous labs to make sure the pipeline works. After testing all the basic components (cache blocks, caches, controllers, forwarding unit, branch predictor, arbiter, etc.), we proceed to add one functional unit at a time to the pipeline and run it with an ideal memory. Once those functional units work properly, we attach the memory and cache system to the pipeline and run a series of intensive tests to further confirm the functionality of each unit and of the whole processor. A monitor is also added to the datapath to watch the committed results at the write-back stage.

Results:

a. Critical path

(4) The processor critical path:

[Figure: the processor critical path through the execution datapath: operand muxes, the 16-bit adder and SLT logic, and the forwarding muxes back to DEC (Mux32 x8), ending at the pipeline register.]

Time delay for the critical path:

Mux16x2       1.5 ns
Mux32x8       3.5 ns
Mux32x5       3.5 ns
16-bit adder  8 ns
SLT logic     3 ns
32-bit reg    3 ns

Total = 22.5 ns


Compared with the 37ns cycle time obtained for the 5-stage pipeline, the deep-pipelined processor's cycle time is a 39.2% improvement.

(5) Cache performance metrics:

Data and instruction hit time (on cache or buffer):

9-bit comparator (6ns) + 32-bit tristate buffer (1ns) = 7ns

[Figure: hit path. Tag[8:0] is compared against I_addr[9:1] by the 9-bit comparator (6ns); on a hit, the 32-bit tristate (1ns) drives the selected word to the PC.]

Instruction cache miss penalty = 422ns:

9-bit comparator           6 ns
Arbiter                    2 ns
Stream buffer controller   6 ns
Memory controller          6 ns
DRAM accesses              400 ns
Word selection (tristate)  1 ns
Tristate to PC             1 ns

Miss penalty = (6 + 2 + 6 + 6 + 400 + 1 + 1) ns = 422 ns

Data cache miss penalty = 124.5ns:

9-bit comparator   6 ns
Arbiter            2 ns
Cache controller   6 ns
Memory controller  6 ns
DRAM               100 ns
Mux                3.5 ns
Tristate           1 ns

Total delay = (6 + 2 + 6 + 6 + 100 + 3.5 + 1) ns = 124.5 ns

[Figure: miss paths from the comparator through the arbiter, cache/stream-buffer controller, memory controller, and DRAM, returning data over the 64-bit bus through the word-selection tristates and mux.]

The miss penalties for the data and instruction caches differ significantly because the instruction cache burst-reads 3 times on a miss while the data cache reads only once.

b. Cycle time for the pipelined processor

The calculated delay for the critical path is 22.5ns, so the clock rate of the new processor is 1/22.5ns = 44.44MHz.

Conclusion:

We have implemented a deep-pipelined processor running at 44.44MHz with branch prediction capability. We also added a stream buffer to the instruction cache. Both the data and instruction caches use a fully associative access method with a write-back write policy and FIFO replacement policy. Furthermore, we implemented 3-word burst memory read requests for each DRAM, which work well with the memory controller, DRAM, cache, and cache controller as a whole via a 64-bit bus. An arbiter is constructed to prioritize simultaneous data and instruction requests; the data cache request always takes priority. If time permitted, to further enhance the cache system, we would implement a victim cache for the data cache.

Appendix I (notebooks):

a. Lin Zhou
b. Sonkai Lao
c. Wilma Yeung
d. Wen Hui Guan

Appendix II (schematics):

The schematics shown in this report are contained in the following files:

Component                     Schematic file
Data cache controller         Data_cache_fsm.ppt
Instruction cache controller  Instr_cache_fsm.ppt
Cache and memory system       Mem_cache_system.ppt
Arbiter controller            Fsm_for_arbiter.ppt
Stream buffer controller      Stream_buffer.fsm.ppt
Presentation schematics       Sum.ppt

Appendix III (VHDL files):


VHDL files for all components and blocks used:

Component                       VHDL file
Data cache controller           Data_cache_ctrl.vhd
Predict entry                   Predict_entry.vhd
Instruction cache controller    Instr_cache_ctrl.vhd
Stream buffer controller        Stream_buf_ctrl.vhd
Memory controller               Memctrl.vhd
Branch table controller         B_table_ctrl.vhd
Memory arbiter                  Final_mem_arbiter.vhd
Hazard controller (forwarding)  Hazardcontroller.vhd
16-bit ALU                      Half_alu.vhd
Debugging tracer (monitor)      Display.vhd
3-bit counter                   Counter_3bit.vhd
m1x8 mux                        Mux1x8.vhd
m10x2 mux                       M10x2.vhd
m32x2 mux                       M32x2.vhd
m8x2 mux                        M8x2.vhd
10-bit tristate                 Tristate_10bit.vhd
32-bit tristate                 Tristate.vhd
9-bit comparator                Comparator9.vhd
9-bit register                  Register9int.vhd
1-bit register                  Reg1.vhd
Block select decoder            Block_sel_decoder.vhd
Memory read component           Mem_read.vhd
Memory write component          Mem_write.vhd

Note: All other components, such as the extenders, shifters, and muxes, are the same as those used in labs 4 through 6, so they are not listed here again.

Appendix IV (testing):

Component testing:

Tests are performed in hierarchical order:

Component                Command file(s)
Overall system test      Final.cmd, Quicksort.cmd
Cache system             Cachesystem.cmd, Test_cache_sys.cmd, A_partial.cmd
Pipeline datapath        Test_datapathfinal.cmd
Forwarding unit          Forward.cmd, Forward_no_sort.cmd
Forwarding on quicksort  Forward_sort.cmd
Instruction cache        Instr_cache
Memory system            Test_mem_sys.cmd
DRAM bank                Test_memoryBank.cmd
Data cache               Test_datacache.cmd
Replacement policy       Replace.cmd
Arbiter                  Test_arbiter.cmd

All testbenches are saved in folder: finalproj/behv/testbnch/

Log files:

(1) For the system test: finalproj/fsimtemp.log

Final Datapath testing:

Datapath schematic file: ../finalproj/sch/datapathhazard.1
Datapath command file: ../finalproj/cmd/final.cmd
Monitor output file: ../finalproj/disassembly.out

Appendix V (delay time table):

Component                        Delay time
Stream buffer controller         6 ns
Predict entry                    3 ns
Hazard controller (forwarding)   5 ns
Branch table controller          6 ns
Data cache controller            6 ns
Instruction cache controller     6 ns
Block select decoder             2 ns
3-bit counter                    2 ns
m1x8 mux                         3.5 ns
m32x8 mux                        3.5 ns
m10x2 mux                        1.5 ns
m32x2 mux                        1.5 ns
m8x2 mux                         1.5 ns
10-bit tristate                  1 ns
32-bit tristate                  2 ns
9-bit comparator                 6 ns
9-bit register                   3 ns
32-bit register                  3 ns
Logic unit (shifter/logic unit)  10 ns
16-bit adder                     8 ns
SLT logic                        3 ns
1-bit register                   3 ns
Memory controller                6 ns
Memory arbiter                   6 ns
Memory read component            part of the memory controller
Memory write component           part of the memory controller