cache data

8/2/2019 Cache Data


In computer engineering, a cache (/kæʃ/ kash, or Aust/NZ: /keɪʃ/ kaysh) is a component

that transparently stores data so that future requests for that data can be served faster. The data

that is stored within a cache might be values that have been computed earlier or duplicates of

original values that are stored elsewhere. If requested data is contained in the cache (cache hit),

this request can be served by simply reading the cache, which is comparatively faster. Otherwise

(cache miss), the data has to be recomputed or fetched from its original storage location, which is

comparatively slower. Hence, the more requests that can be served from the cache, the better the

overall system performance.

To be cost efficient and to enable an efficient use of data, caches are relatively small. Nevertheless,

caches have proven themselves in many areas of computing because access patterns in

typical computer applications have locality of reference. References exhibit temporal locality if data

is requested again that has been recently requested already. References exhibit spatial locality if

data is requested that is physically stored close to data that has been requested already.

Diagram of a CPU memory cache

Contents

1 Operation

2 Applications

    2.1 CPU cache

    2.2 Disk cache

    2.3 Web cache

    2.4 Other caches

    2.5 The difference between buffer and cache

3 See also

4 Further reading

5 References

Operation

Hardware implements cache as a block of memory for temporary storage of data likely to be used

again. CPUs and hard drives frequently use a cache, as do web browsers and web servers.

A cache is made up of a pool of entries. Each entry holds a datum (a piece of data) that is a copy of the

same datum in some backing store. Each entry also has a tag, which specifies the identity of the

datum in the backing store of which the entry is a copy.

When the cache client (a CPU, web browser, operating system) needs to access a datum

presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag

matching that of the desired datum, the datum in the entry is used instead. This situation is known

as a cache hit. So, for example, a web browser program might check its local cache on disk to see

if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is

the tag, and the contents of the web page is the datum. The percentage of accesses that result in

cache hits is known as the hit rate or hit ratio of the cache.

The alternative situation, when the cache is consulted and found not to contain a datum with the

desired tag, is known as a cache miss. The previously uncached datum fetched from

the backing store during miss handling is usually copied into the cache, ready for the next access.
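
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real implementation: the class name SimpleCache, its method names, and the use of a dictionary as the backing store are assumptions made for the example, and entries are never evicted.

```python
class SimpleCache:
    """Toy cache: a pool of entries (tag -> datum) in front of a backing store."""

    def __init__(self, backing_store):
        # backing_store is any mapping; it stands in for the slower storage.
        self.backing_store = backing_store
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, tag):
        if tag in self.entries:
            # Cache hit: the datum in the entry is used instead of the store.
            self.hits += 1
            return self.entries[tag]
        # Cache miss: fetch from the backing store and copy into the cache.
        self.misses += 1
        datum = self.backing_store[tag]
        self.entries[tag] = datum
        return datum

    def hit_rate(self):
        """Fraction of accesses so far that were cache hits."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In the web browser example, the tag passed to get would be the URL and the returned datum would be the page contents.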

During a cache miss, the cache usually evicts some other entry in order to make room for the

previously uncached datum. The heuristic used to select the entry to evict is known as

the replacement policy. One popular replacement policy, "least recently used" (LRU), replaces the

least recently used entry (see cache algorithms). More sophisticated caches weigh use frequency

against the size of the stored contents, as well as the latencies and throughputs of both the cache

and the backing store. This works well for larger amounts of data, long latencies and slow

throughputs, such as experienced with a hard drive or the Internet, but it is not efficient for use with a

CPU cache.
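
As an illustration of a replacement policy, the following Python sketch implements LRU eviction on top of collections.OrderedDict. The class name LRUCache and its interface are hypothetical, chosen only for this example; hardware caches typically use cheaper approximations of LRU.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity, backing_store):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.backing_store = backing_store  # any mapping; stands in for slow storage
        self.entries = OrderedDict()        # least recently used first

    def get(self, tag):
        if tag in self.entries:
            # Hit: mark the entry as the most recently used.
            self.entries.move_to_end(tag)
            return self.entries[tag]
        # Miss: fetch the datum, evicting the LRU entry if the cache is full.
        datum = self.backing_store[tag]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[tag] = datum
        return datum
```

With a capacity of two, reading tags 0, 1, 0 and then 2 evicts tag 1, because tag 0 was used more recently.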

When a system writes a datum to the cache, it must at some point write that datum to the backing

store as well. The timing of this write is controlled by what is known as the write policy.

In a write-through cache, every write to the cache causes a synchronous write to the backing

store.

Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the

store. Instead, the cache tracks which of its locations have been written over and marks these

locations as dirty. The data in these locations are written back to the backing store when those

data are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss

in a write-back cache (which requires a block to be replaced by another) will often require two

memory accesses to service: one to retrieve the needed datum, and one to write replaced data

from the cache to the store.

Other policies may also trigger data write-back. The client may make many changes to a datum in

the cache, and then explicitly notify the cache to write back the datum.
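
The difference between the two write policies can be sketched as follows. The class and method names are assumptions made for this example, and the backing store is modelled as a plain dictionary; the sketch shows only the timing of writes, not a complete cache.

```python
class WriteThroughCache:
    """Every write to the cache is mirrored synchronously to the backing store."""

    def __init__(self, store):
        self.store = store
        self.entries = {}

    def write(self, tag, datum):
        self.entries[tag] = datum
        self.store[tag] = datum  # synchronous write to the backing store


class WriteBackCache:
    """Writes stay in the cache and are marked dirty until eviction or flush."""

    def __init__(self, store):
        self.store = store
        self.entries = {}
        self.dirty = set()  # tags whose cached datum differs from the store

    def write(self, tag, datum):
        self.entries[tag] = datum
        self.dirty.add(tag)  # lazy write: the store is not touched yet

    def evict(self, tag):
        datum = self.entries.pop(tag)
        if tag in self.dirty:
            # A dirty entry must be written back before it is discarded.
            self.store[tag] = datum
            self.dirty.discard(tag)

    def flush(self):
        """Explicit write-back requested by the client; entries stay cached."""
        for tag in self.dirty:
            self.store[tag] = self.entries[tag]
        self.dirty.clear()
```

Several writes to the same tag in the write-back cache reach the store only once, when the entry is evicted or flushed.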

No-write allocation (a.k.a. write-no-allocate) is a cache policy which caches only processor reads,

i.e. on a write-miss:

Datum is written directly to memory,

Datum at the missed-write location is not added to cache.

This avoids the need for write-back or write-through when the old value of the datum was absent

from the cache prior to the write.
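
A no-write-allocate policy might look like the sketch below. It is combined here with write-through on write hits, which is an assumption of the example rather than a requirement of the policy; the class name and the dictionary standing in for memory are likewise hypothetical.

```python
class NoWriteAllocateCache:
    """Only reads allocate cache entries; a write miss goes straight to memory."""

    def __init__(self, memory):
        self.memory = memory  # any mapping; stands in for main memory
        self.entries = {}

    def read(self, address):
        if address not in self.entries:
            # Read miss: allocate an entry for the datum.
            self.entries[address] = self.memory[address]
        return self.entries[address]

    def write(self, address, datum):
        # The datum is always written directly to memory.
        self.memory[address] = datum
        if address in self.entries:
            # Write hit: keep the cached copy consistent (write-through assumed).
            self.entries[address] = datum
        # Write miss: the datum is deliberately not added to the cache.
```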

Entities other than the cache may change the data in the backing store, in which case the copy in

the cache may become out-of-date or stale. Alternatively, when the client updates the data in the

cache, copies of those data in other caches will become stale. Communication protocols between

the cache managers which keep the data consistent are known as coherency protocols.

Applications

CPU cache

Main article: CPU cache

Small memories on or close to the CPU can operate faster than the much larger main memory.

Most CPUs since the 1980s have used one or more caches, and modern high-end embedded,

desktop and server microprocessors may have as many as half a dozen, each specialized for a

specific function. Examples of caches with a specific function are the D-cache and I-cache (data

cache and instruction cache).

Disk cache

Main article: Page cache

While CPU caches are generally managed entirely by hardware, a variety of software manages

other caches. The page cache in main memory, which is an example of disk cache, is managed by

the operating system kernel.

While the hard drive's hardware disk buffer is sometimes misleadingly referred to as "disk cache",

its main functions are write sequencing and read prefetching. Repeated cache hits are relatively

rare, due to the small size of the buffer in comparison to the drive's capacity. However, high-end

disk controllers often have their own on-board cache of hard disk data blocks.

Finally, a fast local hard disk can also cache information held on even slower data storage devices,

such as remote servers (web cache) or local tape drives or optical jukeboxes. Such a scheme is the

main concept of hierarchical storage management.

Web cache

Main article: Web cache

Web browsers and web proxy servers employ web caches to store previous responses from web

servers, such as web pages. Web caches reduce the amount of information that needs to be

transmitted across the network, as information previously stored in the cache can often be re-used.

This reduces bandwidth and processing requirements of the web server, and helps to

improve responsiveness for users of the web.

Web browsers employ a built-in web cache, but some internet service providers or organizations

also use a caching proxy server, which is a web cache that is shared among all users of that

network.

Another form of cache is P2P caching, where the files most sought after by peer-to-peer applications

are stored in an ISP cache to accelerate P2P transfers. Similarly, decentralised equivalents exist,

which allow communities to perform the same task for P2P traffic, e.g. Corelli.[1]

Other caches

The BIND DNS daemon caches a mapping of domain names to IP addresses, as does a resolver

library.

Write-through operation is common when operating over unreliable networks (like an Ethernet

LAN), because of the enormous complexity of the coherency protocol required between multiple

write-back caches when communication is unreliable. For instance, web page caches and client-side

network file system caches (like those in NFS or SMB) are typically read-only or write-through

specifically to keep the network protocol simple and reliable.

Search engines also frequently make web pages they have indexed available from their cache. For

example, Google provides a "Cached" link next to each search result. This can prove useful when

web pages from a web server are temporarily or permanently inaccessible.

Another type of caching is storing computed results that will likely be needed again,

or memoization. ccache, a program that caches the output of compilation to speed up

subsequent compilations, exemplifies this type.
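
Memoization can be illustrated with Python's standard functools.lru_cache decorator, which stores the results of earlier calls keyed by their arguments. The Fibonacci function below is only a convenient example of a computation with many repeated sub-results; it is unrelated to how ccache works internally.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache of previously computed results
def fib(n):
    """Return the n-th Fibonacci number, reusing memoized sub-results."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)
```

Calling fib.cache_info() afterwards reports how many calls were served from the cache (hits) and how many had to be computed (misses).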

Database caching can substantially improve the throughput of database applications, for example

in the processing of indexes, data dictionaries, and frequently used subsets of data.

Distributed caching[2] uses caches spread across different networked hosts, e.g. Corelli.

The difference between buffer and cache

The terms "buffer" and "cache" are not mutually exclusive and the functions are frequently

combined; however, there is a difference in intent.

A buffer is a temporary memory location that is traditionally used because CPU instructions cannot

directly address data stored in peripheral devices. Thus, addressable memory is used as an

intermediate stage. Additionally, such a buffer may be useful when a large block of data is

assembled or disassembled (as required by a storage device), or when data may be delivered in a

different order than that in which it is produced. Also, a whole buffer of data is usually transferred

sequentially (for example to hard disk), so buffering itself sometimes increases transfer

performance or reduces the variation or jitter of the transfer's latency, as opposed to caching, where

the intent is to reduce the latency. These benefits are present even if the buffered data are written

to the buffer once and read from the buffer once.

A cache also increases transfer performance. A part of the increase similarly comes from the

possibility that multiple small transfers will combine into one large block. But the main performance

gain occurs because there is a good chance that the same datum will be read from cache multiple

times, or that written data will soon be read. A cache's sole purpose is to reduce accesses to the

underlying slower storage. A cache is also usually an abstraction layer that is designed to be invisible

from the perspective of neighbouring layers.

A CPU cache is a cache used by the central processing unit of a computer to reduce the average

time to access memory. The cache is a smaller, faster memory which stores copies of the data

from the most frequently used main memory locations. As long as most memory accesses are to

cached memory locations, the average latency of memory accesses will be closer to the cache

latency than to the latency of main memory.
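
The effect of the hit rate on average latency can be made concrete with a small calculation. The model below is a deliberate simplification (a hit costs the cache latency, a miss costs the main memory latency), and the latencies used in the usage note are illustrative figures, not measurements of any particular processor.

```python
def average_access_time(hit_rate, cache_latency, memory_latency):
    """Expected latency per access under a simple two-level model.

    hit_rate is the fraction of accesses served by the cache (0.0 to 1.0);
    both latencies must be given in the same unit.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_latency + (1.0 - hit_rate) * memory_latency
```

Assuming a cache latency of 1 ns and a main memory latency of 100 ns, a hit rate of 0.95 gives an average of 0.95 × 1 + 0.05 × 100 = 5.95 ns, far closer to the cache latency than to that of main memory.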

When the processor needs to read from or write to a location in main memory, it first checks

whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to

the cache, which is much faster than reading from or writing to main memory.

Most modern desktop and server CPUs have at least three independent caches: an instruction

cache to speed up executable instruction fetch, a data cache to speed up data fetch and store,

and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for

both executable instructions and data. The data cache is usually organized as a hierarchy of several

cache levels (L1, L2, etc.; see Multi-level caches).

Details of operation

This section describes a typical data cache and some instruction caches; a TLB may have more

complexity and an instruction cache may be simpler. The diagram on the right shows two

memories. Each location in each memory contains data (a cache line), which in different designs

may range in size from 8 to 512 bytes. The size of the cache line is usually larger than the

size of the usual access requested by a CPU instruction, which ranges from 1 to 16

bytes (the largest addresses and data handled by current 32-bit and 64-bit architectures

being 128 bits long, i.e. 16 bytes). Each location in each memory also has an index, which

is a unique number used to refer to that location. The index for a location in main memory is called

an address. Each location in the cache has a tag that contains the index of the datum in main

memory that has been cached. In a CPU's data cache these entries are called cache lines or cache

blocks.

When the processor needs to read or write a location in main memory, it first checks whether that

memory location is in the cache. This is accomplished by comparing the address of the memory

location to all tags in the cache that might contain that address. If the processor finds that the

memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of

a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the

cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a

measure of the effectiveness of the cache for a given program or algorithm.

In the case of a miss, the cache allocates a new entry, which comprises the tag just missed and a

copy of the data. The reference can then be applied to the new entry just as in the case of a hit.

Read misses delay execution because they require data to be transferred from a much slower

memory than the cache itself. Write misses may occur without such penalty since the data can be

copied in the background. Instruction caches are similar to data caches but the CPU only performs

read accesses (instruction fetch) to the instruction cache. Instruction and data caches can be

separated for higher performance with Harvard CPUs, but they can also be combined to reduce the

hardware overhead.

In order to make room for the new entry on a cache miss, the cache has to evict one of the existing

entries. The heuristic that it uses to choose the entry to evict is called the replacement policy. The

fundamental problem with any replacement policy is that it must predict which existing cache entry

is least likely to be used in the future. Predicting the future is difficult, especially for hardware

caches that use simple rules amenable to implementation in circuitry, so there are a variety of

replacement policies to choose from and no perfect way to decide among them. One popular

replacement policy, LRU, replaces the least recently used entry. Marking some memory ranges as

non-cacheable avoids the performance cost of filling the cache with information that is never or

seldom reused. Cache misses are simply ignored for non-cacheable data. Cache entries may

also be disabled or locked depending on the context.

If data are written to the cache, they must at some point be written to main memory as well. The

timing of this write is controlled by what is known as the write policy. In a write-through cache, every

write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache,

writes are not immediately mirrored to the main memory. Instead, the cache tracks which locations

have been written over (these locations are marked dirty). The data in these locations are written

back to the main memory when those data are evicted from the cache. For this reason, a miss in a

write-back cache may sometimes require two memory accesses to service: one to first write the

dirty location to memory and then another to read the new location from memory.

There are intermediate policies as well. The cache may be write-through, but the writes may be

held in a store data queue temporarily, usually so that multiple stores can be processed together

(which can reduce bus turnarounds and so improve bus utilization).

The data in main memory being cached may be changed by other entities (e.g. peripherals

using direct memory access or another core of a multi-core processor), in which case the copy in the cache may

become out-of-date or stale. Alternatively, when the CPU in a multi-core processor updates the

data in the cache, copies of data in caches associated with other cores will become stale.

Communication protocols between the cache managers which keep the data consistent are known

as cache coherence protocols. Another possibility is to make shared data non-cacheable.

The time taken to fetch one datum from memory (read latency) matters because the CPU will run

out of things to do while waiting for the datum. When a CPU reaches this state, it is called a stall.

As CPUs become faster, stalls due to cache misses displace more potential computation; modern

CPUs can execute hundreds of instructions in the time taken to fetch a single datum from the main

memory. Various techniques have been employed to keep the CPU busy during this time. Out-of-order

CPUs (Pentium Pro and later Intel designs, for example) attempt to execute independent

instructions after the instruction that is waiting for the cache miss data. Another technology, used by

many processors, is simultaneous multithreading (SMT), or, in Intel's terminology, hyper-threading

(HT), which allows an alternate thread to use the CPU core while a first thread waits for

data to come from main memory.

Cache entry structure

Cache row entries usually have the following structure:

tag | data blocks | valid bit

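The data blocks (cache line) contain the actual data fetched from the main memory. The valid bit

denotes that this particular entry holds valid data; it is distinct from the dirty bit, which marks

an entry that has been modified but not yet written back to main memory.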

An effective memory address is split (MSB to LSB) into the tag, the index and the displacement

(offset):

tag | index | displacement

The index length is log2(number of cache rows) bits and describes which row the data has been put

in. The displacement length is log2(number of data blocks per row) bits and specifies which block of

the ones we have stored we need. The tag length is

address_length - index_length - displacement_length and contains the most significant bits of

the address, which are checked against the current row (the row has been retrieved by index) to

see if it is the one we need or another, irrelevant memory location that happened to have the same

index bits as the one we want.
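
As a rough illustration of the address split described above, the following Python sketch decomposes an address into tag, index and displacement. The geometry (128 rows of 64 bytes each), the function name `split_address`, and the choice to let the displacement address bytes within a row are all assumptions made for the example, not properties of any particular CPU.

```python
def split_address(address, num_rows=128, row_bytes=64):
    """Split an address into (tag, index, displacement).

    num_rows and row_bytes are assumed to be powers of two
    (a hypothetical cache geometry chosen for illustration).
    """
    displacement_bits = row_bytes.bit_length() - 1   # log2(row_bytes)
    index_bits = num_rows.bit_length() - 1           # log2(num_rows)

    displacement = address & (row_bytes - 1)         # least significant bits
    index = (address >> displacement_bits) & (num_rows - 1)
    tag = address >> (displacement_bits + index_bits)  # remaining high bits
    return tag, index, displacement


tag, index, displacement = split_address(0x12345678)
print(hex(tag), index, displacement)
```

Two addresses with the same index but different tags compete for the same row, which is why the stored tag must be compared on every lookup.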

Associativity

Which memory locations can be cached by which cache locations

Associativity is a trade-off. If there are ten places to which the replacement policy could have

mapped a memory location, then to check if that location is in the cache, ten cache entries must be

searched. Checking more places takes more power, chip area, and potentially time. On the other

hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the

CPU wastes less time reading from the slow main memory. The rule of thumb is that doubling the

associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on

hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on

the hit rate,[1] and are generally done for other reasons (see virtual aliasing, below).

In order of increasing (worse) hit times and decreasing (better) miss rates:

• direct mapped cache: the best (fastest) hit times, and so the best tradeoff for "large" caches

• 2-way set associative cache

• 2-way skewed associative cache: "the best tradeoff for .... caches whose sizes are in the range 4K-8K bytes" (André Seznec)[2]

• 4-way set associative cache

• fully associative cache: the best (lowest) miss rates, and so the best tradeoff when the miss penalty is very high

2-way set associative cache

If each location in main memory can be cached in either of two locations in the cache, one logical

question is: which two? The simplest and most commonly used scheme, shown in the right-hand

diagram above, is to use the least significant bits of the memory location's index as the index for the

cache memory, and to have two entries for each index. One benefit of this scheme is that the tags

stored in the cache do not have to include that part of the main memory address which is implied by

the cache memory's index. Since the cache tags are fewer bits, they take less area on the

microprocessor chip and can be read and compared faster.
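
The scheme described above can be sketched as a toy model in Python. This is only an illustration under assumed parameters (4 sets, 16-byte blocks, LRU replacement within each set); the class name `TwoWaySetAssociativeCache` is hypothetical, and the model tracks tags only, not data.

```python
class TwoWaySetAssociativeCache:
    """Toy 2-way set-associative cache that tracks tags only (no data)."""

    def __init__(self, num_sets=4, block_bytes=16):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        # Each set holds up to two tags, ordered least to most recently used.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        block = address // self.block_bytes
        index = block % self.num_sets      # low-order bits pick the set
        tag = block // self.num_sets       # only the remaining bits are stored
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag)
            ways.append(tag)               # mark as most recently used
            return True
        if len(ways) == 2:
            ways.pop(0)                    # evict the least recently used way
        ways.append(tag)
        return False


cache = TwoWaySetAssociativeCache()
# 0x000 and 0x040 share set 0 but can coexist in its two ways;
# 0x080 also maps to set 0 and evicts the least recently used of them.
results = [cache.access(a) for a in (0x000, 0x040, 0x000, 0x080, 0x040)]
print(results)
```

Note that the stored tag omits the index bits, which is the space saving the paragraph above refers to.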

Speculative execution

One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once

the address has been computed, the one cache index which might have a copy of that datum is

known. That cache entry can be read, and the processor can continue to work with that data before

it finishes checking that the tag actually matches the requested address.

The idea of having the processor use the cached data before the tag match completes can be

applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just

one of the possible cache entries mapping to the requested address. This datum can then be used

in parallel with checking the full tag. The hint technique works best when used in the context of

address translation, as explained below.

2-way skewed associative cache

Other schemes have been suggested, such as the skewed cache,[2] where the index for way 0 is

direct, as above, but the index for way 1 is formed with a hash function. A good hash function has

the property that addresses which conflict with the direct mapping tend not to conflict when mapped

with the hash function, and so it is less likely that a program will suffer from an unexpectedly large

number of conflict misses due to a pathological access pattern. The downside is extra latency from

computing the hash function.[3] Additionally, when it comes time to load a new line and evict an old

line, it may be difficult to determine which existing line was least recently used, because the new

line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is

usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages

over conventional set-associative ones.[4]
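
To make the idea of a skewed index concrete, here is a minimal Python sketch. The XOR-based hash in `way1_index` is an assumption chosen for simplicity and is not the function proposed in the cited work; the set count and block size are likewise hypothetical.

```python
NUM_SETS = 8          # assumed number of sets per way
BLOCK_BYTES = 16      # assumed block size in bytes


def way0_index(address):
    """Direct index: the low-order bits of the block number."""
    return (address // BLOCK_BYTES) % NUM_SETS


def way1_index(address):
    """Illustrative hash: XOR the direct index with the next-higher bits."""
    block = address // BLOCK_BYTES
    return (block ^ (block // NUM_SETS)) % NUM_SETS


# Three addresses spaced 128 bytes apart all collide under direct indexing,
# but the hashed index spreads them over different sets in way 1.
addresses = [0x000, 0x080, 0x100]
print([way0_index(a) for a in addresses])
print([way1_index(a) for a in addresses])
```

With this particular hash, the three addresses share one set in way 0 but land in three distinct sets in way 1, so at most one of them needs to occupy the contested way-0 entry.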

Pseudo-associative cache

A true set-associative cache tests all the possible ways simultaneously, using something like

a content addressable memory. A pseudo-associative cache tests each possible way one at a time.

A hash-rehash cache is one kind of pseudo-associative cache.

In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as

a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache,

closer to the miss rate of a fully associative cache.[3]
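
A sequential probe of this kind might look like the following Python sketch. It is written in the spirit of a hash-rehash cache rather than as a description of any real design: the rehash (flipping the top index bit), the swap on a second-probe hit, and the class name `HashRehashCache` are all assumptions made for illustration.

```python
class HashRehashCache:
    """Toy pseudo-associative cache: probe a primary slot, then one alternate."""

    def __init__(self, num_slots=8):
        self.num_slots = num_slots           # assumed to be a power of two
        self.slots = [None] * num_slots      # each slot stores a block number

    def access(self, block):
        """Return the number of probes used on a hit (1 or 2), or 0 on a miss."""
        first = block % self.num_slots
        # Assumed rehash: flip the top index bit to find the alternate slot.
        second = first ^ (self.num_slots // 2)
        if self.slots[first] == block:
            return 1
        if self.slots[second] == block:
            # Swap so the next access to this block hits on the first probe.
            self.slots[first], self.slots[second] = (
                self.slots[second], self.slots[first])
            return 2
        # Miss: demote the primary occupant to the alternate slot, then fill.
        self.slots[second] = self.slots[first]
        self.slots[first] = block
        return 0


cache = HashRehashCache()
# Blocks 0 and 8 share a primary slot; a direct-mapped cache of this size
# would miss on every one of these accesses after the first.
print([cache.access(b) for b in (0, 8, 0, 0, 8)])
```

A first-probe hit costs the same as a direct-mapped lookup; only second-probe hits pay the extra step.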

Cache misses

A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in

a main memory access with much longer latency. There are three kinds of cache misses:

instruction read miss, data read miss, and data write miss.

A cache read miss from an instruction cache generally causes the most delay, because the

processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from

main memory.

A cache read miss from a data cache usually causes less delay, because instructions not

dependent on the cache read can be issued and continue execution until the data is returned from

main memory, and the dependent instructions can resume execution.

A cache write miss to a data cache generally causes the least delay, because the write can be

queued and there are few limitations on the execution of subsequent instructions. The processor

can continue until the queue is full.

In order to lower cache miss rate, a great deal of analysis has been done on cache behavior in an

attempt to find the best combination of size, associativity, block size, and so on. Sequences of

memory references performed by benchmark programs are saved as address traces. Subsequent

analyses simulate many different possible cache designs on these long address traces. Making

sense of how the many variables affect the cache hit rate can be quite confusing. One significant

contribution to this analysis was made by Mark Hill, who separated misses into three categories

(known as the Three Cs):

• Compulsory misses are those misses caused by the first reference to a datum. Cache size

and associativity make no difference to the number of compulsory misses. Prefetching can help

here, as can larger cache block sizes (which are a form of prefetching). Compulsory misses are

sometimes referred to as cold misses.

• Capacity misses are those misses that occur regardless of associativity or block size,

solely due to the finite size of the cache. The curve of capacity miss rate versus cache size

gives some measure of the temporal locality of a particular reference stream. Note that there is

no useful notion of a cache being "full" or "empty" or "near capacity": CPU caches almost

always have nearly every line filled with a copy of some line in main memory, and nearly every

allocation of a new line requires the eviction of an old line.

• Conflict misses are those misses that could have been avoided, had the cache not evicted

an entry earlier. Conflict misses can be further broken down into mapping misses, which are

unavoidable given a particular amount of associativity, and replacement misses, which are due

to the particular victim choice of the replacement policy.
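
The three categories can be counted on a small trace with a sketch like the one below. It classifies the misses of a direct-mapped cache by comparing it against an unbounded cache and against a fully associative cache of the same size. Using LRU for the fully associative reference is an approximation (a true capacity miss count would need an optimal replacement policy), and the function name and trace are hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_blocks):
    """Classify the misses of a direct-mapped cache holding num_blocks blocks.

    Returns a dict with counts of compulsory, capacity and conflict misses.
    """
    seen = set()                       # unbounded cache: every block ever touched
    fully_assoc = OrderedDict()        # same-size fully associative cache, LRU order
    direct = [None] * num_blocks       # direct-mapped cache under study
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Look up (and update) the fully associative LRU reference cache.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == num_blocks:
                fully_assoc.popitem(last=False)   # evict least recently used
            fully_assoc[block] = True

        # Look up (and update) the direct-mapped cache.
        slot = block % num_blocks
        dm_hit = direct[slot] == block
        direct[slot] = block

        if not dm_hit:
            if block not in seen:
                counts["compulsory"] += 1         # first reference ever
            elif not fa_hit:
                counts["capacity"] += 1           # would miss even if fully associative
            else:
                counts["conflict"] += 1           # only the mapping caused the miss
        seen.add(block)
    return counts


# Blocks 0 and 4 fight over slot 0 of a 4-block direct-mapped cache.
print(classify_misses([0, 4, 0, 4, 1, 2, 3, 5, 0], num_blocks=4))
```

On this trace the repeated references to blocks 0 and 4 are conflict misses, while the final reference to block 0 is a capacity miss because five other blocks have been touched in between.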

Miss rate versus cache size on the Integer portion of SPEC CPU2000

The graph to the right summarizes the cache performance seen on the Integer portion of the SPEC

CPU2000 benchmarks, as collected by Hill and Cantin.[5] These benchmarks are intended to

represent the kind of workload that an engineering workstation computer might see on any given

day. The reader should keep in mind that finding benchmarks which are even usefully

representative of many programs has been very difficult, and there will always be important

programs with very different behavior than what is shown here.

We can see the different effects of the three Cs in this graph.

At the far right, with cache size labelled "Inf", we have the compulsory misses. If we wish to improve

a machine's performance on SPECint2000, increasing the cache size beyond 1 MB is essentially

futile. That's the insight given by the compulsory misses.

The fully associative cache miss rate here is almost representative of the capacity miss rate. The

difference is that the data presented is from simulations assuming an LRU replacement policy.

Showing the capacity miss rate would require a perfect replacement policy, i.e. an oracle that looks

into the future to find a cache entry which is actually not going to be hit.

Note that our approximation of the capacity miss rate falls steeply between 32 KB and 64 KB. This

indicates that the benchmark has a working set of roughly 64 KB. A CPU cache designer examining

this benchmark will have a strong incentive to set the cache size to 64 KB rather than 32 KB. Note

that, on this benchmark, no amount of associativity can make a 32 KB cache perform as well as a

64 KB 4-way, or even a direct-mapped 128 KB cache.

Finally, note that between 64 KB and 1 MB there is a large difference between direct-mapped and

fully associative caches. This difference is the conflict miss rate. The insight from looking at conflict

miss rates is that secondary caches benefit a great deal from high associativity.

This benefit was well known in the late 1980s and early 1990s, when CPU designers could not fit large

caches on-chip, and could not get sufficient bandwidth to either the cache data memory or cache

tag memory to implement high associativity in off-chip caches. Desperate hacks were attempted:

the MIPS R8000 used expensive off-chip dedicated tag SRAMs, which had embedded tag

comparators and large drivers on the match lines, in order to implement a 4 MB 4-way associative

cache. The MIPS R10000 used ordinary SRAM chips for the tags. Tag access for both ways took

two cycles. To reduce latency, the R10000 would guess which way of the cache would hit on each

access.

Address translation

Main article: Translation lookaside buffer

Most general-purpose CPUs implement some form of virtual memory. To summarize, each program

running on the machine sees its own simplified address space, which contains code and data for

that program only. Each program uses this virtual address space without regard for where it exists

in physical memory.

Virtual memory requires the processor to translate virtual addresses generated by the program into

physical addresses in main memory. The portion of the processor that does this translation is

known as the memory management unit (MMU). The fast path through the MMU can perform those

translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the

operating system's page table.

For the purposes of the present discussion, there are three important features of address

translation:

• Latency: The physical address is available from the MMU some time, perhaps a few

cycles, after the virtual address is available from the address generator.

• Aliasing: Multiple virtual addresses can map to a single physical address. Most processors

guarantee that all updates to that single physical address will happen in program order. To

deliver on that guarantee, the processor must ensure that only one copy of a physical address

resides in the cache at any given time.

• Granularity: The virtual address space is broken up into pages. For instance, a 4 GB

virtual address space might be cut up into 1048576 pages of 4 KB size, each of which can be

independently mapped. There may be multiple page sizes supported; see virtual memory for

elaboration.
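
The page arithmetic and the notion of aliasing from the list above can be illustrated with a short sketch. The page table here is a plain dictionary with made-up entries, and the function names are hypothetical; real MMUs use multi-level tables and a TLB rather than anything this simple.

```python
PAGE_SIZE = 4 * 1024            # 4 KB pages, as in the example above
ADDRESS_SPACE = 4 * 1024 ** 3   # 4 GB virtual address space

num_pages = ADDRESS_SPACE // PAGE_SIZE
print(num_pages)                # number of independently mappable pages


def split_virtual_address(vaddr):
    """Return (virtual page number, offset within the page)."""
    return vaddr // PAGE_SIZE, vaddr % PAGE_SIZE


def translate(vaddr, page_table):
    """Translate using a dict from virtual page number to physical page number."""
    vpn, offset = split_virtual_address(vaddr)
    return page_table[vpn] * PAGE_SIZE + offset


# Two virtual pages aliased to the same physical page (hypothetical mapping).
page_table = {0x10: 0x2A, 0x55: 0x2A}
print(hex(translate(0x10123, page_table)), hex(translate(0x55123, page_table)))
```

Both virtual addresses resolve to the same physical address, which is exactly the situation a cache must handle so that only one copy of the data is resident.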

A historical note: some early virtual memory systems were very slow, because they required an

access to the page table (held in main memory) before every programmed access to main memory.

[NB 1] With no caches, this effectively cut the speed of the machine in half. The first hardware cache

used in a computer system was not actually a data or instruction cache, but rather a TLB.

Caches can be divided into four types, based on whether the index or tag corresponds to physical or

virtual addresses:

• Physically indexed, physically tagged (PIPT) caches use the physical address for both

the index and the tag. While this is simple and avoids problems with aliasing, it is also slow, as

the physical address must be looked up (which could involve a TLB miss and access to main

memory) before that address can be looked up in the cache.

• Virtually indexed, virtually tagged (VIVT) caches use the virtual address for both the

index and the tag. This caching scheme can result in much faster lookups, since the MMU

doesn't need to be consulted first to determine the physical address for a given virtual address.

However, VIVT suffers from aliasing problems, where several different virtual addresses may

refer to the same physical address. The result is that such addresses would be cached

separately despite referring to the same memory, causing coherency problems. Another

problem is homonyms, where the same virtual address maps to several different physical

addresses. It is not possible to distinguish these mappings by only looking at the virtual index,

though potential solutions include: flushing the cache after a context switch, forcing address

spaces to be non-overlapping, tagging the virtual address with an address space ID (ASID), or

using physical tags. Additionally, there is a problem that virtual-to-physical mappings can

change, which would require flushing cache lines, as the VAs would no longer be valid.

• Virtually indexed, physically tagged (VIPT) caches use the virtual address for the index

and the physical address in the tag. The advantage over PIPT is lower latency, as the cache

line can be looked up in parallel with the TLB translation; however, the tag cannot be compared

until the physical address is available. The advantage over VIVT is that since the tag has the

physical address, the cache can detect homonyms. VIPT requires more tag bits, as the index

bits no longer represent the same address.

• Physically indexed, virtually tagged caches are only theoretical as they would basically

be useless.[8]

The speed of this recurrence (the load latency) is crucial to CPU performance, and so most

modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to

proceed in parallel with fetching the data from the cache RAM.

But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases

grows with cache size, and as a result most level-2 and larger caches are physically indexed.

Caches have historically used both virtual and physical addresses for the cache tags, although

virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then

the physical address is available in time for tag compare, and there is no need for virtual tagging.

Large caches, then, tend to be physically tagged, and only small, very low latency caches are

virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints,

as described below.

Virtual indexing and virtual aliases

The usual way the processor guarantees that virtually aliased addresses act as a single storage

location is to arrange that only one virtual alias can be in the cache at any given time.

Whenever a new entry is added to a virtually indexed cache, the processor searches for any virtual

aliases already resident and evicts them first. This special handling happens only during a cache

miss. No special work is necessary during a cache hit, which helps keep the fast path fast.

The most straightforward way to find aliases is to arrange for them all to map to the same location

in the cache. This happens, for instance, if the TLB has e.g. 4 KB pages, and the cache is direct

mapped and 4 KB or less.

Modern level-1 caches are much larger than 4 KB, but virtual memory pages have stayed that size.

If the cache is e.g. 16 KB and virtually indexed, for any virtual address there are four cache

locations that could hold the same physical location, but aliased to different virtual addresses. If the

cache misses, all four locations must be probed to see if their corresponding physical addresses

match the physical address of the access that generated the miss.

These probes are the same checks that a set associative cache uses to select a particular match.

So if a 16 KB virtually indexed cache is 4-way set associative and used with 4 KB virtual memory

pages, no special work is necessary to evict virtual aliases during cache misses because the

checks have already happened while checking for a cache hit.

Using the AMD Athlon as an example again, it has a 64 KB level-1 data cache, 4 KB pages, and 2-way

set associativity. When the level-1 data cache suffers a miss, 2 of the 16 (= 64 KB / 4 KB)

possible virtual aliases have already been checked, and seven more cycles through the tag check

hardware are necessary to complete the check for virtual aliases.
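
The arithmetic in these examples can be checked with a short sketch. It assumes that the number of candidate alias locations is the cache size divided by the page size, and that each pass through the tag-check hardware examines one set (all of its ways), with the first pass already done as part of the normal hit check. The function name is hypothetical.

```python
def alias_probe_cycles(cache_bytes, page_bytes, ways):
    """Return (possible alias locations, extra tag-check cycles on a miss).

    Assumes one set (all of its ways) is checked per cycle, and that the
    first cycle was already spent checking for the cache hit.
    """
    locations = max(1, cache_bytes // page_bytes)
    total_cycles = -(-locations // ways)     # ceiling division
    return locations, total_cycles - 1


# 64 KB, 2-way cache with 4 KB pages (the figures quoted in the text).
print(alias_probe_cycles(64 * 1024, 4 * 1024, 2))
# 16 KB, 4-way cache with 4 KB pages: the hit check already covers all aliases.
print(alias_probe_cycles(16 * 1024, 4 * 1024, 4))
```

Under these assumptions the extra work vanishes whenever the cache size divided by the associativity does not exceed the page size.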

Homonym and synonym problems

A cache that relies on virtual indexing and tagging becomes inconsistent after the same

virtual address is mapped to different physical addresses (homonym). This can be solved by

using the physical address for tagging or by storing the address space ID in the cache line. However,

the latter of these two approaches does not help against the synonym problem, where several

cache lines end up storing data for the same physical address. Writing to such a location may update

only one location in the cache, leaving the others with inconsistent data. The problem might be solved by

using non-overlapping memory layouts for different address spaces; otherwise the cache (or part

of it) must be flushed when the mapping changes.[9]

Virtual tags and vhints

Virtual tagging is possible too. The great advantage of virtual tags is that, for associative caches,

they allow the tag match to proceed before the virtual-to-physical translation is done. However:

• Coherence probes and evictions present a physical address for action. The hardware must

have some means of converting the physical addresses into a cache index, generally by storing

physical tags as well as virtual tags. For comparison, a physically tagged cache does not need

to keep virtual tags, which is simpler.

• When a virtual-to-physical mapping is deleted from the TLB, cache entries with those virtual

addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on

pages not mapped by the TLB, then those entries will have to be flushed when the access

rights on those pages are changed in the page table.

It is also possible for the operating system to ensure that no virtual aliases are simultaneously

resident in the cache. The operating system makes this guarantee by enforcing page coloring,

which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It

has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen

and the software complexity and performance penalty of perfect page coloring has risen.

It can be useful to distinguish the two functions of tags in an associative cache: they are used to

determine which way of the entry set to select, and they are used to determine if the cache hit or

A programmer attempting to make maximum use of the cache may arrange the program's access

patterns so that only 1 MB of data need be cached at any given time, thus avoiding capacity

misses. But the programmer should also ensure that the access patterns do not have conflict misses. One way to

think about this problem is to divide up the virtual pages the program uses and assign them virtual

colors in the same way as physical colors were assigned to physical pages before. The

programmer can then arrange the access patterns of the code so that no two pages with the same

virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. loop

nest optimization), largely coming from the High Performance Computing (HPC) community.

The snag is that while all the pages in use at any given moment may have different virtual colors,

some may have the same physical colors. In fact, if the operating system assigns physical pages to

virtual pages randomly and uniformly, it is extremely likely that some pages will have the same

physical color, and then locations from those pages will collide in the cache (this is the birthday

paradox).
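
The birthday-paradox effect mentioned above can be quantified with a small calculation. The sketch assumes each page's physical color is chosen independently and uniformly at random; the figure of 256 colors (for example, a 1 MB direct-mapped cache with 4 KB pages) is a hypothetical choice for illustration.

```python
def collision_probability(num_pages, num_colors):
    """Probability that at least two of num_pages pages share a physical color,
    assuming each page's color is chosen independently and uniformly."""
    if num_pages > num_colors:
        return 1.0                       # pigeonhole: a collision is certain
    p_all_distinct = 1.0
    for i in range(num_pages):
        p_all_distinct *= (num_colors - i) / num_colors
    return 1.0 - p_all_distinct


# Hypothetical figures: 256 colors, with a growing number of pages in use.
for pages in (8, 32, 128):
    print(pages, round(collision_probability(pages, 256), 3))
```

Even when the pages in use occupy only a small fraction of the available colors, the chance that at least two of them collide rises quickly.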

The solution is to have the operating system attempt to assign different physical color pages to

different virtual colors, a technique called page coloring. Although the actual mapping from virtual to

physical color is irrelevant to system performance, odd mappings are difficult to keep track of and

have little benefit, so most approaches to page coloring simply try to keep physical and virtual page

colors the same.

If the operating system can guarantee that each physical page maps to only one virtual color, then

there are no virtual aliases, and the processor can use virtually indexed caches with no need for

extra virtual alias probes during miss handling. Alternatively, the operating system can flush a page from the

cache whenever it changes from one virtual color to another. As mentioned above, this approach

was used for some early SPARC and RS/6000 designs.

Cache hierarchy in a modern processor

Modern processors have multiple interacting caches on chip.

Specialized caches

Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical

address translation, and data fetch (see classic RISC pipeline). The natural design is to

use different physical caches for each of these points, so that no one physical resource has to be

scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least

three separate caches (instruction, TLB, and data), each specialized to its particular role.

Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard

architecture. Originally, this phrase referred to machines with separate instruction and data

memories, which proved not at all popular. Most modern CPUs have a single-memory von

Neumann architecture.

Victim cache

A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The

victim cache lies between the main cache and its refill path, and only holds blocks that were evicted

from the main cache. The victim cache is usually fully associative, and is intended to reduce the

number of conflict misses. Many commonly used programs do not require an associative mapping

for all the accesses. In fact, only a small fraction of the memory accesses of the program require

high associativity. The victim cache exploits this property by providing high associativity to only

these accesses. It was introduced by Norman Jouppi in 1990.
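
The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any real processor: the class name `DirectMappedWithVictim`, the block-address interface, and the oldest-first replacement of victim entries are all assumptions made for the example.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped main cache backed by a small fully associative victim cache."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one block address per set
        self.victim = OrderedDict()         # evicted blocks, oldest first
        self.victim_entries = victim_entries
        self.main_hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, block):
        index = block % self.num_sets
        if self.lines[index] == block:
            self.main_hits += 1
            return "main"
        evicted = self.lines[index]
        if block in self.victim:
            # Conflict miss caught by the victim cache: swap the two blocks.
            del self.victim[block]
            self.victim_hits += 1
            result = "victim"
        else:
            self.misses += 1
            result = "miss"
        self.lines[index] = block
        if evicted is not None:
            self._insert_victim(evicted)
        return result

    def _insert_victim(self, block):
        # Only blocks evicted from the main cache ever enter the victim cache.
        self.victim[block] = True
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)  # drop the oldest entry


# Blocks 0 and 4 map to the same set of a 4-set direct-mapped cache.
with_victim = DirectMappedWithVictim(num_sets=4, victim_entries=2)
results = [with_victim.access(b) for b in (0, 4, 0, 4)]

without_victim = DirectMappedWithVictim(num_sets=4, victim_entries=0)
for b in (0, 4, 0, 4):
    without_victim.access(b)
```

In this toy run the two conflicting blocks ping-pong between the main cache and the victim cache after their first (compulsory) misses, whereas the configuration with no victim entries misses on every access.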

Trace cache

One of the more extreme examples of cache specialization is the trace cache found in the Intel

Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch

bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces

of instructions that have already been fetched and decoded.

The earliest widely acknowledged academic publication of trace cache was by Eric

Rotenberg, Steve Bennett, and Jim Smith in their 1996 paper "Trace Cache: a Low Latency

Approach to High Bandwidth Instruction Fetching."

An earlier publication is US Patent 5,381,533, "Dynamic flow instruction cache memory organized

around trace segments independent of virtual address line", by Alex Peleg and Uri Weiser of Intel

Corp., patent filed March 30, 1994, a continuation of an application filed in 1992, later abandoned.

A trace cache stores instructions either after they have been decoded, or as they are retired.

Generally, instructions are added to trace caches in groups representing either individual basic

blocks or dynamic instruction traces. A dynamic trace ("trace path") contains only instructions

whose results are actually used, and eliminates instructions following taken branches (since they

are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows the

instruction fetch unit of a processor to fetch several basic blocks, without having to worry about

branches in the execution flow.

Trace lines are stored in the trace cache based on the program counter of the first instruction in the

trace and a set of branch predictions. This allows for storing different trace paths that start on the

same address, each representing different branch outcomes. In the instruction fetch stage of

a pipeline, the current program counter along with a set of branch predictions is checked in the

trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to a

regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit

until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new

trace starts to be built.
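
As a rough illustration of the lookup just described, a trace cache can be modelled as a table keyed by the starting program counter together with a set of branch predictions. The sketch below is hypothetical: the `TraceCache` class, its `fill` and `lookup` methods, and the instruction strings are invented for the example, and it assumes the predictions must match a stored trace exactly.

```python
class TraceCache:
    """Toy trace cache: trace lines keyed by start PC plus branch outcomes."""

    def __init__(self):
        # (start_pc, tuple of branch outcomes) -> list of instructions
        self.lines = {}

    def fill(self, start_pc, branch_outcomes, instructions):
        """Store a newly built trace under its start address and branch path."""
        self.lines[(start_pc, tuple(branch_outcomes))] = list(instructions)

    def lookup(self, pc, predictions):
        """Return the matching trace line, or None on a trace cache miss."""
        return self.lines.get((pc, tuple(predictions)))


tc = TraceCache()
# Two traces start at the same address but follow different branch outcomes.
tc.fill(0x400, [True], ["cmp", "jne L1", "add", "ret"])
tc.fill(0x400, [False], ["cmp", "jne L1", "sub", "mul"])

taken_path = tc.lookup(0x400, [True])
not_taken_path = tc.lookup(0x400, [False])
missing = tc.lookup(0x500, [True])   # no trace starts here: a miss
```

Keying on the predictions as well as the address is what lets several trace paths that begin at the same instruction coexist in the cache.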

Trace caches are also used in processors like the Intel Pentium 4 to store already decoded

micro-operations, or translations of complex x86 instructions, so that the next time an instruction is

needed, it does not have to be decoded again.

See the full text of Smith, Rotenberg and Bennett's paper at Citeseer.

Multi-level caches

Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have

better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of

cache, with small fast caches backed up by larger slower caches.

Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the

processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is

checked, and so on, before external memory is checked.
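
The latency tradeoff can be made concrete with the usual average-access-time calculation: every access pays the L1 lookup, the accesses that miss also pay the L2 lookup, and so on down to memory. The sketch below assumes that simple serial model; the function name `average_access_time` and all latencies and hit rates are made-up illustrative values, not measurements of any processor.

```python
def average_access_time(levels, memory_latency):
    """Average latency of a serially checked cache hierarchy.

    levels: list of (hit_latency, local_hit_rate) pairs, smallest cache first.
    memory_latency: cost of going to external memory after the last miss.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_latency, hit_rate in levels:
        total += reach * hit_latency     # everyone reaching this level pays its lookup
        reach *= (1.0 - hit_rate)        # only the misses continue downward
    return total + reach * memory_latency


# Hypothetical numbers: 4-cycle L1 (90% hits), 12-cycle L2 (80% hits), 200-cycle memory.
l1_only = average_access_time([(4, 0.9)], memory_latency=200)
l1_and_l2 = average_access_time([(4, 0.9), (12, 0.8)], memory_latency=200)
```

With these assumed figures the single-level hierarchy averages about 24 cycles per access, while adding the L2 brings the average down to about 9.2 cycles.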

As the latency difference between main memory and the fastest cache has become larger, some

processors have begun to utilize as many as three levels of on-chip cache. For example, the Alpha

21164 (1995) had 1 to 64 MB off-chip L3 cache; the IBM POWER4 (2001) had a 256[citation needed] MB

L3 cache off-chip, shared among several processors; the Itanium 2 (2003) had a 6 MB unified level

3 (L3) cache on-die; the Itanium 2 (2003) MX 2 Module incorporates two Itanium 2 processors along

with a shared 64 MB L4 cache on an MCM that was pin compatible with a Madison processor;

Intel's Xeon MP product code-named "Tulsa" (2006) features 16 MB of on-die L3 cache shared

between two processor cores; the AMD Phenom II (2008) has up to 6 MB on-die unified L3 cache;

and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is inclusive, shared by all

cores. The benefits of an L3 cache depend on the application's access patterns.

Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the

smallest, fastest cache in the system, with the special characteristic that it is scheduled in software,

typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See

especially loop nest optimization.) Register files sometimes also have hierarchy: The Cray-1 (circa

1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also

a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster

than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a

data cache. (The Cray-1 did, however, have an instruction cache.)

Exclusive versus inclusive

Multi-level caches introduce new design decisions. For instance, in some processors, all data in the

L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive.

Other processors (like the AMD Athlon) have exclusive caches: data is guaranteed to be in at

most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III,

and 4) do not require that data in the L1 cache also reside in the L2 cache, although it may often

do so. There is no universally accepted name for this intermediate policy, although the term mainly

inclusive has been used.[citation needed]

The advantage of exclusive caches is that they store more data. This advantage is larger when the

exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times

larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line

in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just

copying a line from L2 to L1, which is what an inclusive cache does.
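
A small sketch may help show why an exclusive hierarchy holds more distinct lines and why an L2 hit involves an exchange. It is a simplification under stated assumptions: both levels are modelled as fully associative with least-recently-used replacement, and the `ExclusiveHierarchy` class and its method names are hypothetical.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair: a line lives in at most one of the two levels."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()   # least recently used line first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        if line in self.l2:
            # Exchange: the line leaves L2 and the displaced L1 line takes its place.
            del self.l2[line]
            self._fill_l1(line)
            return "L2 hit"
        self._fill_l1(line)       # fetched from memory straight into L1
        return "miss"

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)   # L2 is full: drop its oldest line
            self.l2[victim] = True            # L1 victim moves down into L2
        self.l1[line] = True


h = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
first_pass = [h.access(n) for n in range(6)]   # six distinct lines
distinct_lines = len(h.l1) + len(h.l2)
revisit = h.access(0)                          # line 0 was pushed down to L2
```

After the first pass the two levels together hold all six lines, which an inclusive pair of the same sizes could not do because every L1 line would also occupy an L2 slot.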

One advantage of strictly inclusive caches is that when external devices or other processors in a

multiprocessor system wish to remove a cache line from the processor, they need only have the

processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache

must be checked as well. As a drawback, there is a correlation between the associativities of L1

and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the

effective associativity of the L1 caches is restricted.

Another advantage of inclusive caches is that the larger cache can use larger cache lines, which

reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the

same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit.) If the secondary

cache is an order of magnitude larger than the primary, and the cache data is an order of

magnitude larger than the cache tags, this tag area saved can be comparable to the incremental

area needed to store the L1 cache data in the L2.

Example: the K8

To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core

in the AMD Athlon 64 CPU.[10]

Example of hierarchy, the K8

The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data

cache. Each of these caches is specialized:

The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each

cycle. Each byte in this cache is stored in ten bits rather than 8, with the extra bits marking the

boundaries of instructions (this is an example of predecoding). The cache has

only parity protection rather than ECC, because parity is smaller and any damaged data can be

replaced by fresh data fetched from memory (which always has an up-to-date copy of

instructions).

The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction

fetch has its virtual address translated through this TLB into a physical address. Each entry is

either 4 or 8 bytes in memory. Because the K8 has a variable page size, each of the TLBs is

split into two sections, one to keep PTEs that map 4 KB pages, and one to keep PTEs that map

4 MB or 2 MB pages. The split allows the fully associative match circuitry in each section to be

simpler. The operating system maps different sections of the virtual address space with

different size PTEs.

The data TLB has two copies which keep identical entries. The two copies allow two data

accesses per cycle to translate virtual addresses to physical addresses. Like the instruction

TLB, this TLB is split into two kinds of entries.

The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each

storing 8 KB of data), and can fetch two 8-byte data items each cycle so long as they are in

different banks. There are two copies of the tags, because each 64-byte line is spread among

all 8 banks. Each tag copy handles one of the two accesses per cycle.
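
The bank restriction on the data cache can be illustrated with a short sketch. The text does not give the K8's actual bank-selection function, so the mapping below (consecutive 8-byte chunks of a line go to consecutive banks) is an assumption, as are the names `bank_of` and `can_issue_together`.

```python
LINE_BYTES = 64     # line size stated in the text
NUM_BANKS = 8       # number of banks stated in the text
ACCESS_BYTES = 8    # width of each data access stated in the text


def bank_of(address):
    """Bank holding the 8-byte chunk at this address (assumed interleaving)."""
    return (address // ACCESS_BYTES) % NUM_BANKS


def can_issue_together(addr_a, addr_b):
    """Two accesses can proceed in the same cycle only if they use different banks."""
    return bank_of(addr_a) != bank_of(addr_b)


# Neighbouring 8-byte words fall in different banks under this mapping...
neighbours_ok = can_issue_together(0x1000, 0x1008)
# ...but the same offset in two different lines collides on one bank.
same_offset_ok = can_issue_together(0x1000, 0x1000 + LINE_BYTES)
```

Under this assumed interleaving, a 64-byte line covers each of the 8 banks exactly once, which is consistent with the statement that each line is spread among all 8 banks.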

The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which

store only PTEs mapping 4 KB pages. Both instruction and data caches, and the various TLBs, can fill

from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches,

which means that any 64-byte line can only be in one of the L1 instruction cache, the L1 data cache,

or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in

one of the TLBs; the operating system is responsible for keeping the TLBs coherent by flushing

portions of them when the page tables in memory are updated.

The K8 also caches information that is never stored in memory: prediction information. These

caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly

complex branch prediction, with tables that help predict whether branches are taken and other

tables which predict the targets of branches and jumps. Some of this information is associated with

instructions, in both the level 1 instruction cache and the unified secondary cache.

The K8 uses an interesting trick to store prediction information with instructions in the secondary

cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by

an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted

from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC

code, lines from the instruction cache have a few spare bits. These bits are used to cache branch

prediction information associated with those instructions. The net result is that the branch predictor

has a larger effective history table, and so has better accuracy.
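
To show why parity is the cheaper of the two codes, here is a minimal sketch of a single even-parity bit. It only illustrates the general idea: the granularity at which the K8 computes parity is not stated in the text, and the helper names `parity_bit` and `is_corrupted` are hypothetical.

```python
def parity_bit(value):
    """Even-parity bit: 1 if the value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") % 2


def is_corrupted(value, stored_parity):
    """Detect (but not locate or correct) an odd number of flipped bits."""
    return parity_bit(value) != stored_parity


original = 0b10110010
stored = parity_bit(original)

one_bit_flip = original ^ 0b00000001    # a single bit flipped
two_bit_flip = original ^ 0b00000011    # two bits flipped

single_detected = is_corrupted(one_bit_flip, stored)
double_detected = is_corrupted(two_bit_flip, stored)
```

One extra bit is enough to detect a single flipped bit, but it cannot say which bit flipped, so the line has to be refetched; that is acceptable for instructions because memory always holds a good copy. A double flip goes unnoticed in this scheme.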

More hierarchies

Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in

the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.

These predictors are caches in that they store information that is costly to compute. Some of the

terminology used when discussing predictors is the same as that for caches (one speaks of a hit in

a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.

Read path for a 2-way associative cache

The diagram to the right is intended to clarify the manner in which the various fields of the address

are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the

SRAMs, indexing, and multiplexing for a 4 KB, 2-way
