


Burst Mode Memories Improve Cache Design

Zwie Amitai, Product Planning and Applications Manager
David C. Wyland, Vice President of Engineering
Quality Semiconductor, Inc.

851 Martin Avenue, Santa Clara, CA 95050-2903
Tel: (408) 86-8326
Fax: (408) 96-0591

ABSTRACT

Burst mode memories improve cache design by improving refill time on cache misses. Burst mode RAMs allow refill of a four word cache line in five clock cycles at 50 MHz rather than the eight clock cycles that would be required for a conventional SRAM. Burst mode RAMs also have clock synchronous interfaces, which make them easier to design into systems, particularly at clock rates of 25 MHz and above.

A burst mode RAM provides high speed transfer of a block of sequential words, called a burst. A block diagram of a burst mode SRAM is shown in Figure 1. A burst mode RAM consists of a conventional SRAM plus an address counter, a read/write flip flop and a write register. Read and write timing is controlled by a clock in combination with the address counter load and read/write signals. In this configuration, random access to a word in the SRAM requires two clock cycles, with successive words being read or written at one clock cycle per word. This is shown in the timing diagrams of Figures 2 and 3.

Figure 1: Burst RAM Block Diagram
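This structure, together with the read and write timing described below, can be summarised in a short software model. The sketch below is a hypothetical Python approximation (the class and method names are ours, not a vendor interface): a plain memory array plus an internal address counter, a read/write flip flop and a write register, with one cycle to load the counter and one cycle per word thereafter.

class BurstModeSRAM:
    """Toy model of a burst mode SRAM: a conventional SRAM array plus an
    internal address counter, read/write flip flop and write register."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.counter = 0          # internal address counter
        self.write_mode = False   # read/write flip flop
        self.write_reg = 0        # first/next data word for a write burst

    def load(self, address, write=False, data=None):
        # Cycle 1 of a burst: load the address counter and the read/write
        # flip flop; for a write burst, also latch the first data word.
        self.counter = address
        self.write_mode = write
        if write:
            self.write_reg = data

    def step(self, next_data=None):
        # Each following cycle: access the word at the counter, then
        # increment the counter at the end of the cycle.
        result = None
        if self.write_mode:
            self.mem[self.counter] = self.write_reg   # write latched word
            if next_data is not None:
                self.write_reg = next_data            # latch next word
        else:
            result = self.mem[self.counter]           # read current word
        self.counter += 1
        return result

With this model, a four word read burst costs one load() cycle plus four step() cycles, i.e. five clocks, matching the five-cycle line refill quoted in the abstract.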


In the read timing diagram of Figure 2, the first clock cycle is used to load the address counter and the read/write flip flop for random access to the first word. Read data comes out of the SRAM before the end of the second clock cycle. The address counter is incremented at the end of the second clock cycle, and the next word is read from the SRAM. This allows one clock cycle per successive word read following the initial random access.

Figure 2: Burst RAM Read Timing


For write operations, the first word of data to be written is clocked into the write register at the same time the address counter and the read/write flip flop are loaded, as shown in Figure 3. Data from the write register is written into the SRAM during the second clock cycle. At the end of the second clock cycle, new data is clocked into the write register and the address counter is incremented to the next location to write the next sequential word.
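As a rough check on the numbers quoted in the abstract, the sketch below counts clock cycles for a four word burst and converts them to time at 50 MHz. The two-cycles-per-word figure for a conventional (non-burst) SRAM interface is our reading of the eight-cycle number, not something stated explicitly.

def burst_cycles(words):
    # one cycle to load the address counter, then one cycle per word
    return 1 + words

def conventional_cycles(words, cycles_per_word=2):
    # assumed: every access pays the full random-access cost
    return words * cycles_per_word

CLOCK_MHZ = 50.0
NS_PER_CYCLE = 1000.0 / CLOCK_MHZ
print("burst:", burst_cycles(4), "cycles,", burst_cycles(4) * NS_PER_CYCLE, "ns")
print("conventional:", conventional_cycles(4), "cycles,",
      conventional_cycles(4) * NS_PER_CYCLE, "ns")
# burst: 5 cycles (100 ns); conventional: 8 cycles (160 ns) at 50 MHz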

Figure 3: Burst RAM Write Timing


The burst mode memory is capable of high speed operation after the initial access because the sequential addresses are generated internally by the address counter. This greatly reduces the read and write cycle times for sequential data following the first access. Clock speeds of up to 50 MHz are possible in a TTL system, making the burst mode memory particularly well suited to the newer generations of high speed RISC and CISC chips.

Burst mode RAMs are faster than SRAM based memory systems because the address counter is integrated into their design. In a burst mode SRAM, the minimum cycle time of the burst operation is approximately the same as the address access time of an equivalent SRAM. This can be as low as 20 ns. In a conventional burst mode memory system design using an SRAM and an address counter, the minimum cycle time is determined by the sum of the clock to output delay of the counter plus the address access time of the SRAM. The cycle time is therefore increased by the delay of the address counter. This adds 6.2 ns to the memory cycle time using the QSFCT161A, one of the fastest counters commercially available. If a 20 ns SRAM is used, the minimum cycle is 26.2 ns. Alternately, a 13.8 ns SRAM would be required to achieve the 20 ns cycle time of a burst mode RAM.
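The cycle time comparison above is a one-line sum; the sketch below simply restates it with the values from the text.

# Burst mode SRAM: burst cycle time is roughly the SRAM address access time
burst_cycle_ns = 20.0

# Discrete design: an external counter's clock-to-output delay adds directly
# to the SRAM access time in every cycle
counter_clk_to_out_ns = 6.2
sram_access_ns = 20.0
discrete_cycle_ns = counter_clk_to_out_ns + sram_access_ns   # 26.2 ns

# SRAM speed a discrete design would need to match the 20 ns burst cycle
required_sram_ns = burst_cycle_ns - counter_clk_to_out_ns    # 13.8 ns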

CACHE MEMORY IN RISC AND CISC PROCESSORS

The use of cache memories has become a standard feature of high performance processor design. Indeed, RISC design is based on cache memory. The function of a cache memory is to improve the effective access time of the main memory, usually medium speed DRAM, by eliminating processor wait states. The cache does this by keeping copies of the most frequently read words from main memory in a small, high speed buffer memory. When the processor attempts to read a word from main memory, the cache checks to see if it has a copy. If it does, it responds immediately. If not, the main memory is started on a normal read cycle, and the processor waits for it to respond. The cache therefore speeds up the system by reducing the average amount of time the processor has to wait to read a word from memory. Caches are effective because, in typical programs, most of the memory accesses are read cycles from a relatively small cluster of memory locations.

Cache performance can be defined in terms of effective wait states with a cache relative to the number of wait states without one. A 33 MHz processor with medium speed DRAM memory may require three wait states without a cache and 0.5 wait states with a cache. The three wait states without a cache are determined by the timing requirements of the main memory. The 0.5 wait states is a statistical average. It can be estimated by the product of the cache miss rate and the number of wait states required for cache refill on a miss.
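The 0.5 wait state figure follows directly from that estimate. The sketch below applies it; the 12.5% miss rate and four refill wait states are illustrative values chosen to reproduce the number in the text (the four-wait-state refill matches the figure given later for a miss).

def effective_wait_states(miss_rate, refill_wait_states):
    # statistical average: cache hits are assumed to add no wait states
    return miss_rate * refill_wait_states

print(effective_wait_states(0.125, 4))   # -> 0.5 effective wait states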

A direct mapped cache for a 32-bit processor is shown in Figure 4. A direct mapped cache consists of a cache tag RAM, a cache data RAM and a small amount of logic to control events when a cache hit or a cache miss occurs. A cache hit is said to occur if a requested word is found in the cache. A miss occurs when the word is not found in the cache.

Figure 4: Cache Block Diagram


The cache stores copies of words read from main memory in the cache data RAM and stores the locations these words were read from in the cache tag RAM. In the direct mapped cache, the least significant bits of the address bus are sent to both the tag and data RAMs, while the most significant bits are stored in the tag RAM when data is stored in the cache data RAM. In the example shown, both the tag and data RAMs are 8K words deep.

When a read request is made to main memory, the least significant bits of the address are used to select one of the 8K words in both memories. The most significant bits of the address are compared against the bits stored in the tag RAM. If there is a match between the two, then the data stored in the data RAM is a copy of the data at the requested location and can be immediately supplied to the processor. This is a cache hit. If the upper address bits do not match, the data stored came from a different location. This is a cache miss.

Direct mapped caches work because most accesses to main memory are typically to a small cluster of a few thousand words located somewhere in the memory space. If the cache is larger than this cluster size, most of the read data will be provided by the cache. The least significant bits of the address bus are used to index within this cluster of words, and the most significant bits identify the region of memory that they came from. (Cache theory is a little more subtle than this. It treats the least significant bits of the address as a hashing function for a hash indexed buffer.)
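A minimal software model of this lookup, assuming the 8K-entry depth of the example and a word-addressed memory (everything else here is illustrative, not part of the hardware described):

DEPTH = 8 * 1024                 # 8K entries, as in the example
INDEX_BITS = 13                  # log2(8K) index bits from the low address

tag_ram  = [None] * DEPTH        # holds the upper address bits
data_ram = [0] * DEPTH           # holds copies of main memory words

def split(word_address):
    index = word_address & (DEPTH - 1)       # least significant bits
    tag = word_address >> INDEX_BITS         # most significant bits
    return index, tag

def cache_read(word_address, main_memory):
    index, tag = split(word_address)
    if tag_ram[index] == tag:                # hit: cached copy is valid
        return data_ram[index], True
    word = main_memory[word_address]         # miss: read main memory
    tag_ram[index] = tag                     # refill this cache entry
    data_ram[index] = word
    return word, False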


Figure 6: 80486 32K Byte Cache Block Diagram

Figure 8: 80486 Cache Timing Diagram


The design of Figure 6 uses one QS8813 8Kx18 Tag RAM and two QS8811 8Kx18 Burst Mode RAMs for the tag and data memories respectively. The QS8813 is an 8Kx18 Tag SRAM with built-in match enable logic that allows it to directly drive the BRDY input of the 80486. This eliminates the need for additional logic in the propagation delay path between the Tag SRAM and the microprocessor. This can save five or more nanoseconds in match time. Only 2K of the 8K tag entries are used; however, the QS8813 provides a single chip design solution for the tag RAM. The complete design requires only three RAM chips.

Figure 7: 80486 128K Byte Cache Block Diagram


The design of Figure 7 uses one QS8813 8Kx18 Tag RAM and four QS8839 32Kx9 Burst Mode RAMs for the tag and data memories respectively. The full 8K words of the 8813 are used to support the 32K words of the 8839. Both the 8811 and 8839 Burst Mode RAM chips provide an on-chip address counter and logic for burst mode operation. The address counter provides for bursts of up to four words using the 80486 address counting algorithm. Also, the burst counter on the 8811 counts in either binary or 80486 counting mode, selectable by pin.
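For reference, the 80486 counting mode referred to above does not increment linearly: within a 16-byte line the i486 visits the four doublewords in an interleaved order that depends on the starting address (0-4-8-C, 4-0-C-8, 8-C-0-4, C-8-4-0 hex). The sketch below expresses that order by XORing the starting doubleword offset with a linear count, alongside the plain binary mode; the function names are ours, not part of the chips described.

def burst_order_486(start_byte_addr):
    # i486 interleaved order: XOR the starting dword offset with 0..3
    line_base = start_byte_addr & ~0xF          # 16-byte line boundary
    start_off = (start_byte_addr >> 2) & 0x3    # starting dword within the line
    return [line_base + ((start_off ^ i) << 2) for i in range(4)]

def burst_order_binary(start_byte_addr):
    # plain binary mode: increment the dword address, wrapping within the line
    line_base = start_byte_addr & ~0xF
    start_off = (start_byte_addr >> 2) & 0x3
    return [line_base + (((start_off + i) & 0x3) << 2) for i in range(4)]

print([hex(a) for a in burst_order_486(0x4)])     # ['0x4', '0x0', '0xc', '0x8']
print([hex(a) for a in burst_order_binary(0x4)])  # ['0x4', '0x8', '0xc', '0x0']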

CONCLUSION

Burst mode memories provide a performance improvement for the cache systems used in high speed CISC and RISC systems which use multiple words per cache line. They are particularly useful at CPU clock speeds above 25 MHz due to their higher performance and simpler interface. Because of these advantages, burst mode memories are becoming a standard component for cache design of high speed systems.


Cache Performance vs. Reload Time

Cache performance is defined by miss rate and reload time. Miss rate is the percentage of accesses that miss, and reload time is the number of wait states required to get the data for the processor and reload the cache on a miss. The miss rate of a cache is a function of cache size, cache organization and the statistics of the program running on the processor. Miss rates are like EPA gas mileage estimates: with different programs, your miss rate will vary from benchmark estimates. Generally, caches range from 16 KBytes to 256 KBytes in size, with larger caches having lower miss rates. Target miss rates are in the 2-20% range. Cache reload time for the cache in Figure 4 is the time to access one word out of main memory. This may require three wait states in a conventional access and four wait states with a cache. The cache system has an extra wait state because one clock cycle is required to determine if the data is in the cache before the main memory access can be started on a miss.

A FOUR WORD PER LINE CACHE

Cache refill performance can be improved by loading more than one word on a miss. A cache using this approach is shown in Figure 5. In this design, the data cache is four times as deep as the cache tag memory. The two least significant bits of the address bus go to the cache data memory but do not go to the tag memory. On a cache miss, four words are loaded into the cache data memory, and a single tag - the common tag for the four locations - is written at the same time. This is called a four word per line cache memory, where a line refers to the amount of data fetched on a cache miss.
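The address partitioning this implies can be sketched as below, extending the earlier direct mapped lookup model. The 8K-entry tag depth is reused from that example purely for illustration; the text does not give sizes for the Figure 5 design.

TAG_ENTRIES = 8 * 1024            # illustrative depth only
TAG_INDEX_BITS = 13
WORDS_PER_LINE = 4

def split_four_word_line(word_address):
    word_in_line = word_address & 0x3                  # to the data RAM only
    line_index = (word_address >> 2) & (TAG_ENTRIES - 1)
    tag = word_address >> (2 + TAG_INDEX_BITS)
    return tag, line_index, word_in_line

def refill_line(word_address, main_memory, tag_ram, data_ram):
    # On a miss, all four words of the line are written into the data RAM
    # and a single common tag is written for the line.
    tag, line_index, _ = split_four_word_line(word_address)
    base = word_address & ~0x3                         # line-aligned address
    for i in range(WORDS_PER_LINE):
        data_ram[(line_index << 2) | i] = main_memory[base + i]
    tag_ram[line_index] = tag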

Figure 5: Four Word/Line Cache Block Diagram


Four Word/Line Cache Performance

Changing the cache from one word per line to four words per line does not change performance significantly if the reload timing - i.e., the number of wait states per word - is not changed. If all four words are eventually used by the processor and if four wait states are required per word, a total of 16 wait states will be used by either cache to load the four words from main memory. In some cases, not all four words will be used, so the one word per line cache has a small advantage for the same reload timing.

Performance of the four word per line cache of Figure 5 can be improved, however, by reducing the number of wait states required to load the four words. The main memory can be designed using interleaving techniques to provide the first word in four wait states and the next three words at one wait state each, for a total of 7 wait states rather than 16. This approximately doubles the performance of the cache.

The four word per line cache has an implied requirement that the cache data memory must be capable of absorbing data at one clock cycle per word. This is not easy at 33-50 MHz clock rates. The burst mode memory provides a natural advantage at these speeds. A burst mode memory with two cycle first access and one cycle per word thereafter can accept data at the rates capable of being generated by the interleaved DRAM main memory.

The burst access memory is particularly useful for cache memory reload because the interleaving techniques that can be applied in main memory using static column or nibble mode access generally result in unacceptable chip count and propagation delay when attempted in the cache. This is because the cache memory must be capable of two cycle first access in normal operation as well as burst mode operation for refill on a miss.
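The wait state totals above are simple sums; the helper below restates them for the two refill schemes.

def refill_wait_states(first_word_waits, later_word_waits, words=4):
    # wait states to load one cache line of the given number of words
    return first_word_waits + (words - 1) * later_word_waits

print(refill_wait_states(4, 4))   # four independent accesses -> 16 wait states
print(refill_wait_states(4, 1))   # interleaved main memory   ->  7 wait states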

BURST MODE IN SECONDARY CACHES

Burst mode operation is becoming a widely used standard in both RISC and CISC processors. For example, in the Intel 80486, the small, on-board cache uses a four word per line refill which is typically supplied from a larger, off-chip secondary cache. In this case, burst mode operation is used by the secondary cache both in its normal operating mode of supplying data to the 80486 as well as in the reload-on-a-miss mode, when it receives data from main memory.

Figure 6 shows a four word per line 32 KByte secondary cache for an 80486, using 8Kx18 burst mode SRAMs for the data portion of the cache and an 8K x 18 tag RAM with an on-board comparator for the tag memory. A 128 KByte cache using this architecture is shown in Figure 7. A timing diagram for both designs is shown in Figure 8. This architecture provides a 32 KByte cache in three chips, expandable to 128 KBytes in nine chips using the same tag RAM.
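The capacities quoted for the two designs can be sanity-checked from the RAM organisations given earlier; the sketch below works through the data size and the number of tag entries each design needs, assuming a 16-byte (four 32-bit word) line.

WORDS_PER_LINE = 4                       # 16-byte line: four 32-bit words

# Figure 6 design: the data RAMs hold 8K 32-bit words
words_32k = 8 * 1024
print(words_32k * 4 // 1024)             # -> 32 KBytes of data
print(words_32k // WORDS_PER_LINE)       # -> 2048 lines, so only 2K of the
                                         #    8K tag entries are needed

# Figure 7 design: the data RAMs hold 32K 32-bit words
words_128k = 32 * 1024
print(words_128k * 4 // 1024)            # -> 128 KBytes of data
print(words_128k // WORDS_PER_LINE)      # -> 8192 lines, using all 8K tag entries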
