cache data
8/2/2019 Cache Data
1/30
In computer engineering, a cache (/kæʃ/ kash, or Aust/NZ: /keɪʃ/ kaysh) is a component
that transparently stores data so that future requests for that data can be served faster. The data
that is stored within a cache might be values that have been computed earlier or duplicates of
original values that are stored elsewhere. If requested data is contained in the cache (cache hit),
this request can be served by simply reading the cache, which is comparatively faster. Otherwise
(cache miss), the data has to be recomputed or fetched from its original storage location, which is
comparatively slower. Hence, the more requests that can be served from the cache, the faster the
overall system performs.
To be cost efficient and to enable an efficient use of data, caches are relatively small. Nevertheless,
caches have proven themselves in many areas of computing because access patterns in
typical computer applications have locality of reference. References exhibit temporal locality if data
is requested again that has been recently requested already. References exhibit spatial locality if
data is requested that is physically stored close to data that has been requested already.
Diagram of a CPU memory cache
Contents
1 Operation
2 Applications
o 2.1 CPU cache
o 2.2 Disk cache
o 2.3 Web cache
o 2.4 Other caches
o 2.5 The difference between buffer and cache
3 See also
4 Further reading
5 References
Operation
Hardware implements cache as a block of memory for temporary storage of data likely to be used
again. CPUs and hard drives frequently use a cache, as do web browsers and web servers.
A cache is made up of a pool of entries. Each entry holds a datum (a piece of data) that is a copy of the
same datum in some backing store. Each entry also has a tag, which specifies the identity of the
datum in the backing store of which the entry is a copy.
When the cache client (a CPU, web browser, operating system) needs to access a datum
presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag
matching that of the desired datum, the datum in the entry is used instead. This situation is known
as a cache hit. So, for example, a web browser program might check its local cache on disk to see
if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is
the tag, and the contents of the web page is the datum. The percentage of accesses that result in
cache hits is known as the hit rate or hit ratio of the cache.
The alternative situation, when the cache is consulted and found not to contain a datum with the
desired tag, has become known as a cache miss. The previously uncached datum fetched from
the backing store during miss handling is usually copied into the cache, ready for the next access.
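The lookup just described can be sketched in a few lines of Python. This is only an illustration: the class name `SimpleCache`, the dictionary standing in for the backing store, and the example URLs are assumptions made for the sketch, not part of any real browser or library.

```python
class SimpleCache:
    """Toy cache: maps tags to data and counts hits and misses."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # slower source of truth (here a plain dict)
        self.entries = {}                   # tag -> cached copy of the datum
        self.hits = 0
        self.misses = 0

    def read(self, tag):
        if tag in self.entries:             # cache hit: serve from the cache
            self.hits += 1
            return self.entries[tag]
        self.misses += 1                    # cache miss: go to the backing store
        datum = self.backing_store[tag]
        self.entries[tag] = datum           # keep a copy, ready for the next access
        return datum

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical web pages: the URL is the tag, the page contents are the datum.
pages = {
    "http://example.org/a": "<html>A</html>",
    "http://example.org/b": "<html>B</html>",
}
cache = SimpleCache(pages)
for url in ["http://example.org/a", "http://example.org/b",
            "http://example.org/a", "http://example.org/a"]:
    cache.read(url)
print(cache.hit_rate())  # 0.5: two misses followed by two hits
```

The first access to each URL misses and fills the cache; the two repeated accesses hit, giving a hit rate of one half.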
During a cache miss, the cache usually evicts some other entry in order to make room for the
previously uncached datum. The heuristic used to select the entry to evict is known as
the replacement policy. One popular replacement policy, "least recently used" (LRU), replaces the
least recently used entry (see cache algorithms). More elaborate policies weigh use frequency
against the size of the stored contents, as well as the latencies and throughputs of both the cache
and the backing store. This works well for larger amounts of data, long latencies and slow
throughputs, such as those experienced with a hard drive or the Internet, but it is not efficient for
a CPU cache.
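As an illustration of the LRU policy, the following sketch keeps entries in an ordered dictionary so that the least recently used one is always at the front. The class name `LRUCache` and the tiny capacity are assumptions for the example; hardware caches typically approximate LRU with much simpler circuitry.

```python
from collections import OrderedDict


class LRUCache:
    """Toy software cache with a least-recently-used replacement policy."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store
        self.entries = OrderedDict()           # least recently used entry comes first

    def read(self, tag):
        if tag in self.entries:
            self.entries.move_to_end(tag)      # hit: mark as most recently used
            return self.entries[tag]
        datum = self.backing_store[tag]        # miss: fetch from the backing store
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[tag] = datum
        return datum


store = {"a": 1, "b": 2, "c": 3}
lru = LRUCache(capacity=2, backing_store=store)
for tag in ["a", "b", "a", "c"]:
    lru.read(tag)
print(list(lru.entries))  # ['a', 'c']: "b" was the least recently used entry
```

Re-reading "a" before "c" arrives is what saves it from eviction; with a first-in, first-out policy "a" would have been evicted instead.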
When a system writes a datum to the cache, it must at some point write that datum to the backing
store as well. The timing of this write is controlled by what is known as the write policy.
In a write-through cache, every write to the cache causes a synchronous write to the backing
store.
Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the
store. Instead, the cache tracks which of its locations have been written over and marks these
locations as dirty. The data in these locations are written back to the backing store when those
data are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss
in a write-back cache (which requires a block to be replaced by another) will often require two
memory accesses to service: one to retrieve the needed datum, and one to write replaced data
from the cache to the store.
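The difference between the two write policies can be sketched as follows. Both classes are hypothetical illustrations that use a dictionary as the backing store; only the write path and the eviction path are modelled.

```python
class WriteThroughCache:
    """Every write updates the cache and, synchronously, the backing store."""

    def __init__(self, store):
        self.store = store
        self.entries = {}

    def write(self, tag, datum):
        self.entries[tag] = datum
        self.store[tag] = datum          # synchronous write to the backing store


class WriteBackCache:
    """Writes stay in the cache until the entry is evicted."""

    def __init__(self, store):
        self.store = store
        self.entries = {}
        self.dirty = set()               # tags modified in the cache but not yet in the store

    def write(self, tag, datum):
        self.entries[tag] = datum
        self.dirty.add(tag)              # defer the write to the backing store

    def evict(self, tag):
        datum = self.entries.pop(tag)
        if tag in self.dirty:            # lazy write: only dirty data is written back
            self.store[tag] = datum
            self.dirty.discard(tag)


store_a, store_b = {"x": 0}, {"x": 0}
wt = WriteThroughCache(store_a)
wb = WriteBackCache(store_b)
wt.write("x", 1)
wb.write("x", 1)
before_evict = store_b["x"]              # still the old value: the write was deferred
wb.evict("x")
print(store_a["x"], before_evict, store_b["x"])  # 1 0 1
```

The write-back store only sees the new value once the dirty entry is evicted, which is why replacing a dirty block costs an extra access to the backing store.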
Other policies may also trigger data write-back. The client may make many changes to a datum in
the cache, and then explicitly notify the cache to write back the datum.
No-write allocation (a.k.a. write-no-allocate) is a cache policy which caches only processor reads,
i.e. on a write-miss:
Datum is written directly to memory,
Datum at the missed-write location is not added to cache.
This avoids the need for write-back or write-through when the old value of the datum was absent
from the cache prior to the write.
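A minimal sketch of no-write allocation, under the assumption that write hits are handled write-through (the text above does not fix that choice). The class name `NoWriteAllocateCache` is hypothetical.

```python
class NoWriteAllocateCache:
    """Toy cache that allocates entries on read misses only."""

    def __init__(self, store):
        self.store = store
        self.entries = {}

    def read(self, tag):
        if tag not in self.entries:
            self.entries[tag] = self.store[tag]   # read miss: allocate an entry
        return self.entries[tag]

    def write(self, tag, datum):
        self.store[tag] = datum                   # the datum is written directly to memory
        if tag in self.entries:
            # Write hit: keep the cached copy current (assumed write-through behaviour).
            self.entries[tag] = datum
        # Write miss: the datum is deliberately not added to the cache.


memory = {"a": 1, "b": 2}
nwa = NoWriteAllocateCache(memory)
nwa.write("a", 10)        # write miss: memory is updated, the cache stays empty
nwa.read("b")             # read miss: "b" is now cached
nwa.write("b", 20)        # write hit: both copies are updated
print(memory, nwa.entries)  # {'a': 10, 'b': 20} {'b': 20}
```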
Entities other than the cache may change the data in the backing store, in which case the copy in
the cache may become out-of-date or stale. Alternatively, when the client updates the data in the
cache, copies of those data in other caches will become stale. Communication protocols between
the cache managers which keep the data consistent are known as coherency protocols.
Applications
CPU cache
Main article: CPU cache
Small memories on or close to the CPU can operate faster than the much larger main memory.
Most CPUs since the 1980s have used one or more caches, and modern high-end embedded,
desktop and server microprocessors may have as many as half a dozen, each specialized for a
specific function. Examples of caches with a specific function are the D-cache and I-cache (data
cache and instruction cache).
Disk cache
Main article: Page cache
While CPU caches are generally managed entirely by hardware, a variety of software manages
other caches. The page cache in main memory, which is an example of disk cache, is managed by
the operating system kernel.
While the hard drive's hardware disk buffer is sometimes misleadingly referred to as "disk cache",
its main functions are write sequencing and read prefetching. Repeated cache hits are relatively
rare, due to the small size of the buffer in comparison to the drive's capacity. However, high-end
disk controllers often have their own on-board cache of hard disk data blocks.
Finally, a fast local hard disk can also cache information held on even slower data storage devices,
such as remote servers (web cache) or local tape drives or optical jukeboxes. Such a scheme is the
main concept of hierarchical storage management.
Web cache
Main article: Web cache
Web browsers and web proxy servers employ web caches to store previous responses from web
servers, such as web pages. Web caches reduce the amount of information that needs to be
transmitted across the network, as information previously stored in the cache can often be re-used.
This reduces bandwidth and processing requirements of the web server, and helps to
improve responsiveness for users of the web.
Web browsers employ a built-in web cache, but some internet service providers or organizations
also use a caching proxy server, which is a web cache that is shared among all users of that
network.
Another form of cache is P2P caching, where the files most sought for by peer-to-peer applications
are stored in an ISP cache to accelerate P2P transfers. Similarly, decentralised equivalents exist,
which allow communities to perform the same task for P2P traffic, e.g. Corelli [1].
Other caches
The BIND DNS daemon caches a mapping of domain names to IP addresses, as does a resolver
library.
Write-through operation is common when operating over unreliable networks (like an Ethernet
LAN), because of the enormous complexity of the coherency protocol required between multiple
write-back caches when communication is unreliable. For instance, web page caches and client-side
network file system caches (like those in NFS or SMB) are typically read-only or write-through
specifically to keep the network protocol simple and reliable.
Search engines also frequently make web pages they have indexed available from their cache. For
example, Google provides a "Cached" link next to each search result. This can prove useful when
web pages from a web server are temporarily or permanently inaccessible.
Another type of caching is storing computed results that will likely be needed again,
or memoization. ccache, a program that caches the output of compilation to speed up
subsequent compilations, exemplifies this type.
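Memoization is easy to demonstrate with Python's standard `functools.lru_cache` decorator. The function `slow_square` and the `calls` counter are hypothetical stand-ins for an expensive computation; the sketch is unrelated to how ccache itself is implemented.

```python
from functools import lru_cache

calls = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def slow_square(n):
    """Stand-in for an expensive computation whose result is worth caching."""
    global calls
    calls += 1
    return n * n


slow_square(12)   # computed and stored
slow_square(12)   # served from the memoization cache
print(calls)                           # 1
print(slow_square.cache_info().hits)   # 1
```

The second call never executes the function body; the stored result is returned instead.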
Database caching can substantially improve the throughput of database applications, for example
in the processing of indexes, data dictionaries, and frequently used subsets of data.
Distributed caching[2] uses caches spread across different networked hosts, e.g. Corelli.
The difference between buffer and cache
The terms "buffer" and "cache" are not mutually exclusive and the functions are frequently
combined; however, there is a difference in intent.
A buffer is a temporary memory location that is traditionally used because CPU instructions cannot
directly address data stored in peripheral devices. Thus, addressable memory is used as an
intermediate stage. Additionally, such a buffer may be useful when a large block of data is
assembled or disassembled (as required by a storage device), or when data may be delivered in a
different order than that in which it is produced. Also, a whole buffer of data is usually transferred
sequentially (for example to hard disk), so buffering itself sometimes increases transfer
performance or reduces the variation or jitter of the transfer's latency, as opposed to caching, where
the intent is to reduce the latency. These benefits are present even if the buffered data are written
to the buffer once and read from the buffer once.
A cache also increases transfer performance. A part of the increase similarly comes from the
possibility that multiple small transfers will combine into one large block. But the main performance-
gain occurs because there is a good chance that the same datum will be read from cache multiple
times, or that written data will soon be read. A cache's sole purpose is to reduce accesses to the
underlying slower storage. Cache is also usually an abstraction layer that is designed to be invisible
from the perspective of neighbouring layers.
A CPU cache is a cache used by the central processing unit of a computer to reduce the average
time to access memory. The cache is a smaller, faster memory which stores copies of the data
from the most frequently used main memory locations. As long as most memory accesses are
to cached memory locations, the average latency of memory accesses will be closer to the cache
latency than to the latency of main memory.
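The effect of the hit rate on average latency can be sketched with a simple weighted average. This is a simplified model that assumes a hit costs the cache latency and a miss costs the main-memory latency; the latencies of 1 and 100 time units are hypothetical figures chosen only for illustration.

```python
def average_access_time(hit_rate, cache_latency, memory_latency):
    """Average latency when hits cost cache_latency and misses cost memory_latency."""
    return hit_rate * cache_latency + (1.0 - hit_rate) * memory_latency


# Hypothetical latencies: 1 time unit for the cache, 100 for main memory.
for rate in (0.50, 0.90, 0.99):
    print(rate, average_access_time(rate, cache_latency=1.0, memory_latency=100.0))
```

Under these assumptions the average latency falls from about 50 units at a 50% hit rate to about 2 units at a 99% hit rate, which is why even a small cache pays off when most accesses hit.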
When the processor needs to read from or write to a location in main memory, it first checks
whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to
the cache, which is much faster than reading from or writing to main memory.
Most modern desktop and server CPUs have at least three independent caches: an instruction
cache to speed up executable instruction fetch, a data cache to speed up data fetch and store,
and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for
both executable instructions and data. The data cache is usually organized as a hierarchy of more
cache levels (L1, L2, etc.; see Multi-level caches).
Details of operation
This section describes a typical data cache and some instruction caches; a TLB may have more
complexity and an instruction cache may be simpler. The diagram on the right shows two
memories. Each location in each memory contains data (a cache line), which in different designs
may range in size from 8 to 512 bytes. The size of the cache line is usually larger than the
size of the usual access requested by a CPU instruction, which ranges from 1 to 16
bytes (the largest addresses and data handled by current 32 bit and 64 bit architectures
being 128 bits long, i.e. 16 bytes). Each location in each memory also has an index, which
is a unique number used to refer to that location. The index for a location in main memory is called
an address. Each location in the cache has a tag that contains the index of the datum in main
memory that has been cached. In a CPU's data cache these entries are called cache lines or cache
blocks.
When the processor needs to read or write a location in main memory, it first checks whether that
memory location is in the cache. This is accomplished by comparing the address of the memory
location to all tags in the cache that might contain that address. If the processor finds that the
memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of
a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the
cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a
measure of the effectiveness of the cache for a given program or algorithm.
In the case of a miss, the cache allocates a new entry, which comprises the tag just missed and a
copy of the data. The reference can then be applied to the new entry just as in the case of a hit.
Read misses delay execution because they require data to be transferred from a much slower
memory than the cache itself. Write misses may occur without such penalty since the data can be
copied in the background. Instruction caches are similar to data caches but the CPU only performs
read accesses (instruction fetch) to the instruction cache. Instruction and data caches can be
separated for higher performance with Harvard CPUs but they can also be combined to reduce the
hardware overhead.
In order to make room for the new entry on a cache miss, the cache has to evict one of the existing
entries. The heuristic that it uses to choose the entry to evict is called the replacement policy. The
fundamental problem with any replacement policy is that it must predict which existing cache entry
is least likely to be used in the future. Predicting the future is difficult, especially for hardware
caches that use simple rules amenable to implementation in circuitry, so there are a variety of
replacement policies to choose from and no perfect way to decide among them. One popular
replacement policy, LRU, replaces the least recently used entry. Marking some memory ranges
as non-cacheable avoids degrading performance by filling the cache with information that is never or
seldom re-used. Cache misses are simply ignored for non-cacheable data. Cache entries may
also be disabled or locked depending on the context.
If data are written to the cache, they must at some point be written to main memory as well. The
timing of this write is controlled by what is known as the write policy. In a write-through cache, every
write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache,
writes are not immediately mirrored to the main memory. Instead, the cache tracks which locations
have been written over (these locations are marked dirty). The data in these locations are written
back to the main memory when that data is evicted from the cache. For this reason, a miss in a
write-back cache may sometimes require two memory accesses to service: one to first write the
dirty location to memory and then another to read the new location from memory.
There are intermediate policies as well. The cache may be write-through, but the writes may be
held in a store data queue temporarily, usually so that multiple stores can be processed together
(which can reduce bus turnarounds and so improve bus utilization).
The data in main memory being cached may be changed by other entities (e.g. peripherals
using direct memory access or another core of a multi-core processor), in which case the copy in the cache may
become out-of-date or stale. Alternatively, when the CPU in a multi-core processor updates the
data in the cache, copies of data in caches associated with other cores will become stale.
Communication protocols between the cache managers which keep the data consistent are known
as cache coherence protocols. Another possibility is to share non-cacheable data.
The time taken to fetch one datum from memory (read latency) matters because the CPU will run
out of things to do while waiting for the datum. When a CPU reaches this state, it is called a stall.
As CPUs become faster, stalls due to cache misses displace more potential computation; modern
CPUs can execute hundreds of instructions in the time taken to fetch a single datum from the main
memory. Various techniques have been employed to keep the CPU busy during this time. Out-of-order
CPUs (Pentium Pro and later Intel designs, for example) attempt to execute independent
instructions after the instruction that is waiting for the cache miss data. Another technology, used by
many processors, is simultaneous multithreading (SMT), or, in Intel's terminology, hyper-threading
(HT), which allows an alternate thread to use the CPU core while a first thread waits for
data to come from main memory.
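Cache entry structure
Cache row entries usually have the following structure:
tag | data blocks | valid bit
The data blocks (cache line) contain the actual data fetched from the main memory. The valid bit
denotes that this particular entry has valid data. It is distinct from the dirty bit, which some caches
also keep to mark an entry whose data has been modified but not yet written back to main memory.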
An effective memory address is split (MSB to LSB) into the tag, the index and the displacement
(offset):
tag | index | displacement
The index length is $\lceil \log_2(\text{cache\_rows}) \rceil$ bits and describes which row the data has been put
in. The displacement length is $\lceil \log_2(\text{data\_blocks}) \rceil$ bits and specifies which block of the ones we
have stored we need. The tag length
is $\text{address\_length} - \text{index\_length} - \text{displacement\_length}$ and contains the most significant bits of
the address, which are checked against the current row (the row has been retrieved by index) to
see if it is the one we need or another, irrelevant memory location that happened to have the same
index bits as the one we want.
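The address split described above can be sketched in a few lines of Python. This is only an illustration: it assumes the index length is the ceiling of log2 of the number of rows and the displacement length is the ceiling of log2 of the number of blocks per row. The function name `split_address`, its parameter names, and the 32-bit address length are hypothetical choices made for the example.

```python
from math import ceil, log2


def split_address(address, cache_rows, data_blocks, address_length=32):
    """Split an address into (tag, index, displacement), MSB to LSB.

    cache_rows, data_blocks and address_length are illustrative parameters.
    """
    index_length = ceil(log2(cache_rows))
    displacement_length = ceil(log2(data_blocks))
    tag_length = address_length - index_length - displacement_length

    # Lowest bits: which block within the row.
    displacement = address & ((1 << displacement_length) - 1)
    # Middle bits: which row of the cache.
    index = (address >> displacement_length) & ((1 << index_length) - 1)
    # Remaining high bits: compared against the tag stored in the row.
    tag = (address >> (displacement_length + index_length)) & ((1 << tag_length) - 1)
    return tag, index, displacement


# Example: 256 rows and 64 blocks per row give an 8-bit index,
# a 6-bit displacement and an 18-bit tag.
tag, index, displacement = split_address(0x12345678, cache_rows=256, data_blocks=64)
```

Two addresses with the same index but different tags compete for the same row, which is why the tag comparison is needed.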
Associativity
Figure: Which memory locations can be cached by which cache locations
Associativity is a trade-off. If there are ten places to which the replacement policy could have
mapped a memory location, then to check if that location is in the cache, ten cache entries must be
searched. Checking more places takes more power, chip area, and potentially time. On the other
hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the
CPU wastes less time reading from the slow main memory. The rule of thumb is that doubling the
associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on
hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on
the hit rate,[1] and are generally done for other reasons (see virtual aliasing, below).
In order of increasing (worse) hit times and decreasing (better) miss rates:
1. direct mapped cache: the best (fastest) hit times, and so the best tradeoff for "large" caches
2. 2-way set associative cache
3. 2-way skewed associative cache: "the best tradeoff for .... caches whose sizes are in the range 4K-8K bytes" (André Seznec[2])
4. 4-way set associative cache
5. fully associative cache: the best (lowest) miss rates, and so the best tradeoff when the miss penalty is very high
2-way set associative cache
If each location in main memory can be cached in either of two locations in the cache, one logical
question is: which two? The simplest and most commonly used scheme, shown in the right-hand
diagram above, is to use the least significant bits of the memory location's index as the index for the
cache memory, and to have two entries for each index. One benefit of this scheme is that the tags
stored in the cache do not have to include that part of the main memory address which is implied by
the cache memory's index. Since the cache tags are fewer bits, they take less area on the
microprocessor chip and can be read and compared faster.
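A small Python model may make the scheme concrete. It is a sketch under stated assumptions: the class name `TwoWaySetAssociativeCache` is hypothetical, the cache is addressed by block number, the number of sets is a power of two (so the modulo picks the least significant bits), and replacement within a set is least recently used, which the text above does not prescribe.

```python
class TwoWaySetAssociativeCache:
    """Toy 2-way set associative cache addressed by block number."""

    def __init__(self, num_sets):
        self.num_sets = num_sets  # assumed to be a power of two
        # Each set holds at most two tags, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, block_address):
        """Return True on a hit; on a miss, load the block and return False."""
        index = block_address % self.num_sets   # least significant bits
        tag = block_address // self.num_sets    # bits not implied by the index
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag)
            ways.append(tag)                    # mark as most recently used
            return True
        if len(ways) == 2:
            ways.pop(0)                         # evict the least recently used way
        ways.append(tag)
        return False


cache = TwoWaySetAssociativeCache(num_sets=4)
# Blocks 0, 4 and 8 all map to set 0; only two of them fit at a time.
results = [cache.access(block) for block in [0, 4, 0, 4, 8, 0]]
```

Note that only the tag is stored in each way; the index is implied by the set in which the tag sits.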
Speculative execution
One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once
the address has been computed, the one cache index which might have a copy of that datum is
known. That cache entry can be read, and the processor can continue to work with that data before
it finishes checking that the tag actually matches the requested address.
The idea of having the processor use the cached data before the tag match completes can be
applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just
one of the possible cache entries mapping to the requested address. This datum can then be used
in parallel with checking the full tag. The hint technique works best when used in the context of
address translation, as explained below.
2-way skewed associative cache
Other schemes have been suggested, such as the skewed cache,[2] where the index for way 0 is
direct, as above, but the index for way 1 is formed with a hash function. A good hash function has
the property that addresses which conflict with the direct mapping tend not to conflict when mapped
with the hash function, and so it is less likely that a program will suffer from an unexpectedly large
number of conflict misses due to a pathological access pattern. The downside is extra latency from
computing the hash function.[3] Additionally, when it comes time to load a new line and evict an old
line, it may be difficult to determine which existing line was least recently used, because the new
line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is
usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages
over conventional set-associative ones.[4]
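The effect of skewing can be illustrated with a toy pair of index functions. The XOR-based hash below is a hypothetical stand-in chosen for brevity, not the skewing function from the cited work, and `NUM_SETS` is an arbitrary example size.

```python
NUM_SETS = 16  # sets per way; hypothetical size, assumed to be a power of two


def index_way0(block_address):
    """Direct index: the least significant bits of the block address."""
    return block_address % NUM_SETS


def index_way1(block_address):
    """Hashed index: XOR the low index bits with the next group of bits.

    This is an illustrative hash only.
    """
    low = block_address % NUM_SETS
    high = (block_address // NUM_SETS) % NUM_SETS
    return low ^ high


# Four blocks that all collide under the direct mapping of way 0 ...
conflicting = [0, 16, 32, 48]
way0 = [index_way0(a) for a in conflicting]
# ... are spread over four different sets by the hash used for way 1.
way1 = [index_way1(a) for a in conflicting]
```

Because the blocks land in different sets of way 1, a strided access pattern that thrashes way 0 need not thrash the whole cache.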
Pseudo-associative cache
A true set-associative cache tests all the possible ways simultaneously, using something like
a content addressable memory. A pseudo-associative cache tests each possible way one at a time.
A hash-rehash cache is one kind of pseudo-associative cache.
In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as
a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache,
closer to the miss rate of a fully associative cache.[3]
Cache misses
A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in
a main memory access with much longer latency. There are three kinds of cache misses:
instruction read miss, data read miss, and data write miss.
A cache read miss from an instruction cache generally causes the most delay, because the
processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from
main memory.
A cache read miss from a data cache usually causes less delay, because instructions not
dependent on the cache read can be issued and continue execution until the data is returned from
main memory, at which point the dependent instructions can resume execution.
A cache write miss to a data cache generally causes the least delay, because the write can be
queued and there are few limitations on the execution of subsequent instructions. The processor
can continue until the queue is full.
In order to lower cache miss rate, a great deal of analysis has been done on cache behavior in an
attempt to find the best combination of size, associativity, block size, and so on. Sequences of
memory references performed by benchmark programs are saved as address traces. Subsequent
analyses simulate many different possible cache designs on these long address traces. Making
sense of how the many variables affect the cache hit rate can be quite confusing. One significant
contribution to this analysis was made by Mark Hill, who separated misses into three categories
(known as the Three Cs):
Compulsory misses are those misses caused by the first reference to a datum. Cache size
and associativity make no difference to the number of compulsory misses. Prefetching can help
here, as can larger cache block sizes (which are a form of prefetching). Compulsory misses are
sometimes referred to as cold misses.
Capacity misses are those misses that occur regardless of associativity or block size,
solely due to the finite size of the cache. The curve of capacity miss rate versus cache size
gives some measure of the temporal locality of a particular reference stream. Note that there is
no useful notion of a cache being "full" or "empty" or "near capacity": CPU caches almost
always have nearly every line filled with a copy of some line in main memory, and nearly every
allocation of a new line requires the eviction of an old line.
Conflict misses are those misses that could have been avoided, had the cache not evicted
an entry earlier. Conflict misses can be further broken down into mapping misses, which are
unavoidable given a particular amount of associativity, and replacement misses, which are due
to the particular victim choice of the replacement policy.
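One common way to separate the three categories in a trace-driven simulation is sketched below. It makes several assumptions: the cache under study is direct mapped and addressed by block number, and a fully associative LRU cache of the same capacity serves as the reference for capacity misses, which is only an approximation of a perfect replacement policy. The function name `classify_misses` is hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, num_lines):
    """Classify each access of a direct-mapped cache with num_lines lines."""
    direct = [None] * num_lines   # direct-mapped cache under study
    fully = OrderedDict()         # fully associative LRU cache, same capacity
    seen = set()                  # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        # Update the fully associative reference cache.
        fully_hit = block in fully
        if fully_hit:
            fully.move_to_end(block)
        else:
            if len(fully) == num_lines:
                fully.popitem(last=False)   # evict least recently used
            fully[block] = True

        line = block % num_lines
        if direct[line] == block:
            counts["hit"] += 1
        else:
            direct[line] = block
            if block not in seen:
                counts["compulsory"] += 1   # first reference ever
            elif fully_hit:
                counts["conflict"] += 1     # only the mapping caused the miss
            else:
                counts["capacity"] += 1     # would miss at any associativity
        seen.add(block)
    return counts


# Blocks 0 and 4 share a line in a 4-line direct-mapped cache.
ping_pong = classify_misses([0, 4, 0, 4], num_lines=4)
# Five distinct blocks do not fit in four lines.
too_big = classify_misses([0, 1, 2, 3, 4, 0], num_lines=4)
```

In the first trace the repeated misses count as conflict misses, because the reference cache holds both blocks; in the second the final miss is a capacity miss, because even the fully associative cache has already evicted block 0.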
Figure: Miss rate versus cache size on the Integer portion of SPEC CPU2000
The graph summarizes the cache performance seen on the Integer portion of the SPEC
CPU2000 benchmarks, as collected by Hill and Cantin.[5] These benchmarks are intended to
represent the kind of workload that an engineering workstation computer might see on any given
day. The reader should keep in mind that finding benchmarks which are even usefully
representative of many programs has been very difficult, and there will always be important
programs with very different behavior than what is shown here.
We can see the different effects of the three Cs in this graph.
At the far right, with cache size labelled "Inf", we have the compulsory misses. If we wish to improve
a machine's performance on SpecInt2000, increasing the cache size beyond 1 MB is essentially
futile. That's the insight given by the compulsory misses.
The fully associative cache miss rate here is almost representative of the capacity miss rate. The
difference is that the data presented is from simulations assuming an LRU replacement policy.
Showing the capacity miss rate would require a perfect replacement policy, i.e. an oracle that looks
into the future to find a cache entry which is actually not going to be hit.
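The gap between LRU and such an oracle can be seen on a small trace. The sketch below assumes a fully associative cache addressed by block number and implements the oracle as the usual theoretically optimal policy (evict the block whose next use lies farthest in the future, or never comes). The function names are hypothetical, and the oracle is only usable in simulation, since it needs the whole trace in advance.

```python
def misses_lru(trace, capacity):
    """Count misses of a fully associative cache with LRU replacement."""
    cache = []  # least recently used block first
    misses = 0
    for block in trace:
        if block in cache:
            cache.remove(block)
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)
        cache.append(block)
    return misses


def misses_oracle(trace, capacity):
    """Count misses when the victim is chosen by looking into the future."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]

            def next_use(candidate):
                # Distance to the next reference; "never" sorts last.
                return future.index(candidate) if candidate in future else len(future)

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


# A cyclic pattern over three blocks defeats LRU in a two-entry cache.
trace = [0, 1, 2, 0, 1, 2, 0, 1, 2]
lru = misses_lru(trace, capacity=2)
oracle = misses_oracle(trace, capacity=2)
```

On this trace LRU misses on every access, while the oracle keeps one block resident and misses less often, which is why an LRU simulation only approximates the capacity miss rate.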
Note that our approximation of the capacity miss rate falls steeply between 32 KB and 64 KB. This
indicates that the benchmark has a working set of roughly 64 KB. A CPU cache designer examining
this benchmark will have a strong incentive to set the cache size to 64 KB rather than 32 KB. Note
that, on this benchmark, no amount of associativity can make a 32 KB cache perform as well as a
64 KB 4-way, or even a direct-mapped 128 KB cache.
Finally, note that between 64 KB and 1 MB there is a large difference between direct-mapped and
fully associative caches. This difference is the conflict miss rate. The insight from looking at conflict
miss rates is that secondary caches benefit a great deal from high associativity.
This benefit was well known in the late 1980s and early 1990s, when CPU designers could not fit large
caches on-chip, and could not get sufficient bandwidth to either the cache data memory or cache
tag memory to implement high associativity in off-chip caches. Desperate hacks were attempted:
the MIPS R8000 used expensive off-chip dedicated tag SRAMs, which had embedded tag
comparators and large drivers on the match lines, in order to implement a 4 MB 4-way associative
cache. The MIPS R10000 used ordinary SRAM chips for the tags. Tag access for both ways took
two cycles. To reduce latency, the R10000 would guess which way of the cache would hit on each
access.
Address translation
Main article: Translation lookaside buffer
Most general purpose CPUs implement some form of virtual memory. To summarize, each program
running on the machine sees its own simplified address space, which contains code and data for
that program only. Each program uses this virtual address space without regard for where it exists
in physical memory.
Virtual memory requires the processor to translate virtual addresses generated by the program into
physical addresses in main memory. The portion of the processor that does this translation is
known as the memory management unit (MMU). The fast path through the MMU can perform those
translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the
operating system's page table.
For the purposes of the present discussion, there are three important features of address
translation:
Latency: The physical address is available from the MMU some time, perhaps a few
cycles, after the virtual address is available from the address generator.
Aliasing: Multiple virtual addresses can map to a single physical address. Most processors
guarantee that all updates to that single physical address will happen in program order. To
deliver on that guarantee, the processor must ensure that only one copy of a physical address
resides in the cache at any given time.
Granularity: The virtual address space is broken up into pages. For instance, a 4 GB
virtual address space might be cut up into 1048576 pages of 4 KB size, each of which can be
independently mapped. There may be multiple page sizes supported; see virtual memory for
elaboration.
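The page count in the granularity example follows directly from the two sizes, and the same constants show how a virtual address splits into a page number (which is translated) and an offset within the page (which is not). The sketch below assumes a single 4 KB page size; the names are hypothetical.

```python
PAGE_SIZE = 4 * 1024             # 4 KB pages
ADDRESS_SPACE = 4 * 1024 ** 3    # 4 GB virtual address space

# Number of independently mappable pages.
num_pages = ADDRESS_SPACE // PAGE_SIZE


def split_virtual_address(vaddr):
    """Return (virtual page number, offset within the page)."""
    return vaddr // PAGE_SIZE, vaddr % PAGE_SIZE


page_number, offset = split_virtual_address(0x12345678)
```

Only the page number passes through the TLB; the low 12 bits of the address are identical in the virtual and the physical address.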
A historical note: some early virtual memory systems were very slow, because they required an
access to the page table (held in main memory) before every programmed access to main memory.
[NB 1] With no caches, this effectively cut the speed of the machine in half. The first hardware cache
used in a computer system was not actually a data or instruction cache, but rather a TLB.
Caches can be divided into four types, based on whether the index or tag corresponds to physical or
virtual addresses:
Physically indexed, physically tagged (PIPT) caches use the physical address for both
the index and the tag. While this is simple and avoids problems with aliasing, it is also slow, as
the physical address must be looked up (which could involve a TLB miss and access to main
memory) before that address can be looked up in the cache.
Virtually indexed, virtually tagged (VIVT) caches use the virtual address for both the
index and the tag. This caching scheme can result in much faster lookups, since the MMU
doesn't need to be consulted first to determine the physical address for a given virtual address.
However, VIVT suffers from aliasing problems, where several different virtual addresses may
refer to the same physical address. The result is that such addresses would be cached
separately despite referring to the same memory, causing coherency problems. Another
problem is homonyms, where the same virtual address maps to several different physical
addresses. It is not possible to distinguish these mappings by only looking at the virtual index,
though potential solutions include: flushing the cache after a context switch, forcing address
spaces to be non-overlapping, tagging the virtual address with an address space ID (ASID), or
using physical tags. Additionally, there is a problem that virtual-to-physical mappings can
change, which would require flushing cache lines, as the VAs would no longer be valid.
Virtually indexed, physically tagged (VIPT) caches use the virtual address for the index
and the physical address in the tag. The advantage over PIPT is lower latency, as the cache
line can be looked up in parallel with the TLB translation; however, the tag cannot be compared
until the physical address is available. The advantage over VIVT is that since the tag has the
physical address, the cache can detect homonyms. VIPT requires more tag bits, as the index
bits no longer represent the same address.
Physically indexed, virtually tagged caches are only theoretical as they would basically
be useless.[8]
The speed of this recurrence (the load latency) is crucial to CPU performance, and so most
modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to
proceed in parallel with fetching the data from the cache RAM.
But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases
grows with cache size, and as a result most level-2 and larger caches are physically indexed.
Caches have historically used both virtual and physical addresses for the cache tags, although
virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then
the physical address is available in time for tag compare, and there is no need for virtual tagging.
Large caches, then, tend to be physically tagged, and only small, very low latency caches are
virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints,
as described below.
Virtual indexing and virtual aliases
The usual way the processor guarantees that virtually aliased addresses act as a single storage
location is to arrange that only one virtual alias can be in the cache at any given time.
Whenever a new entry is added to a virtually indexed cache, the processor searches for any virtual
aliases already resident and evicts them first. This special handling happens only during a cache
miss. No special work is necessary during a cache hit, which helps keep the fast path fast.
The most straightforward way to find aliases is to arrange for them all to map to the same location
in the cache. This happens, for instance, if the TLB has e.g. 4 KB pages, and the cache is direct
mapped and 4 KB or less.
Modern level-1 caches are much larger than 4 KB, but virtual memory pages have stayed that size.
If the cache is e.g. 16 KB and virtually indexed, for any virtual address there are four cache
locations that could hold the same physical location, but aliased to different virtual addresses. If the
cache misses, all four locations must be probed to see if their corresponding physical addresses
match the physical address of the access that generated the miss.
These probes are the same checks that a set associative cache uses to select a particular match.
So if a 16 KB virtually indexed cache is 4-way set associative and used with 4 KB virtual memory
pages, no special work is necessary to evict virtual aliases during cache misses because the
checks have already happened while checking for a cache hit.
Using the AMD Athlon as an example again, it has a 64 KB level-1 data cache, 4 KB pages, and 2-way
set associativity. When the level-1 data cache suffers a miss, 2 of the 16 (= 64 KB / 4 KB)
possible virtual aliases have already been checked, and seven more cycles through the tag check
hardware are necessary to complete the check for virtual aliases.
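The arithmetic behind these figures can be written out as a short calculation. It is a sketch that assumes one pass through the tag check hardware compares all ways of one set, and that the pass made while checking for a hit has already covered one set. The function name `alias_probe_cost` is hypothetical.

```python
def alias_probe_cost(cache_bytes, page_bytes, ways):
    """Return (possible alias locations, extra tag-check cycles on a miss)."""
    way_bytes = cache_bytes // ways
    # Sets that differ only in index bits above the page offset.
    sets_to_probe = max(1, way_bytes // page_bytes)
    locations = sets_to_probe * ways
    # One set was already checked while looking for a hit.
    extra_cycles = sets_to_probe - 1
    return locations, extra_cycles


athlon = alias_probe_cost(64 * 1024, 4 * 1024, ways=2)        # 64 KB, 2-way
four_way = alias_probe_cost(16 * 1024, 4 * 1024, ways=4)      # 16 KB, 4-way
direct_mapped = alias_probe_cost(16 * 1024, 4 * 1024, ways=1) # 16 KB, direct mapped
```

When each way is no larger than a page, as in the 16 KB 4-way case, no index bit lies above the page offset and no extra probing is needed.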
Homonym and synonym problems
A cache that relies on virtual indexing and tagging becomes inconsistent after the same
virtual address is mapped into different physical addresses (homonym). This can be solved by
using the physical address for tagging or by storing the address space ID in the cache line. However,
the latter of these two approaches does not help against the synonym problem, where several
cache lines end up storing data for the same physical address. Writing to such a location may update
only one location in the cache, leaving the others with inconsistent data. This problem might be solved by
using non-overlapping memory layouts for different address spaces; otherwise the cache (or part
of it) must be flushed when the mapping changes.[9]
Virtual tags and vhints
Virtual tagging is possible too. The great advantage of virtual tags is that, for associative caches,
they allow the tag match to proceed before the virtual to physical translation is done. However,
virtual tags bring complications:
Coherence probes and evictions present a physical address for action. The hardware must
have some means of converting the physical addresses into a cache index, generally by storing
physical tags as well as virtual tags. For comparison, a physically tagged cache does not need
to keep virtual tags, which is simpler.
When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual
addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on
pages not mapped by the TLB, then those entries will have to be flushed when the access
rights on those pages are changed in the page table.
It is also possible for the operating system to ensure that no virtual aliases are simultaneously
resident in the cache. The operating system makes this guarantee by enforcing page coloring,
which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It
has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen
and the software complexity and performance penalty of perfect page coloring has risen.
It can be useful to distinguish the two functions of tags in an associative cache: they are used to
determine which way of the entry set to select, and they are used to determine if the cache hit or
A programmer attempting to make maximum use of the cache may arrange his program's access
patterns so that only 1 MB of data need be cached at any given time, thus avoiding capacity
misses. But he should also ensure that the access patterns do not have conflict misses. One way to
think about this problem is to divide up the virtual pages the program uses and assign them virtual
colors in the same way as physical colors were assigned to physical pages before. The
programmer can then arrange the access patterns of his code so that no two pages with the same
virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. loop
nest optimization), largely coming from the High Performance Computing (HPC) community.
The snag is that while all the pages in use at any given moment may have different virtual colors,
some may have the same physical colors. In fact, if the operating system assigns physical pages to
virtual pages randomly and uniformly, it is extremely likely that some pages will have the same
physical color, and then locations from those pages will collide in the cache (this is the birthday
paradox).
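The size of this effect can be estimated with the standard birthday calculation. The sketch assumes that each page in use receives its physical color independently and uniformly at random; the function name and the example counts of pages and colors are hypothetical.

```python
def probability_shared_color(pages_in_use, num_colors):
    """Probability that at least two pages receive the same physical color."""
    if pages_in_use > num_colors:
        return 1.0  # more pages than colors: a collision is certain
    p_all_distinct = 1.0
    for i in range(pages_in_use):
        # The i-th page must avoid the i colors already taken.
        p_all_distinct *= (num_colors - i) / num_colors
    return 1.0 - p_all_distinct


# Classic check: 23 people and 365 birthdays give a bit over 50%.
classic = probability_shared_color(23, 365)
# Hypothetical cache: 256 colors with half as many pages in use.
half_full = probability_shared_color(128, 256)
```

Even when the pages in use would fit comfortably if placed with distinct colors, a random assignment almost always produces at least one pair with the same color.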
The solution is to have the operating system attempt to assign different physical color pages to
different virtual colors, a technique called page coloring. Although the actual mapping from virtual to
physical color is irrelevant to system performance, odd mappings are difficult to keep track of and
have little benefit, so most approaches to page coloring simply try to keep physical and virtual page
colors the same.
If the operating system can guarantee that each physical page maps to only one virtual color, then
there are no virtual aliases, and the processor can use virtually indexed caches with no need for
extra virtual alias probes during miss handling. Alternatively, the operating system can flush a page from the
cache whenever it changes from one virtual color to another. As mentioned above, this approach
was used for some early SPARC and RS/6000 designs.
Cache hierarchy in a modern processor
Modern processors have multiple interacting caches on chip.
Specialized caches
Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical
address translation, and data fetch (see classic RISC pipeline). The natural design is to
use different physical caches for each of these points, so that no one physical resource has to be
scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least
three separate caches (instruction, TLB, and data), each specialized to its particular role.
Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard
architecture. Originally, this phrase referred to machines with separate instruction and data
memories, which proved not at all popular. Most modern CPUs have a single-memory von
Neumann architecture.
Victim cache
A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The
victim cache lies between the main cache and its refill path, and only holds blocks that were evicted
from the main cache. The victim cache is usually fully associative, and is intended to reduce the
number of conflict misses. Many commonly used programs do not require an associative mapping
for all the accesses. In fact, only a small fraction of the memory accesses of the program require
high associativity. The victim cache exploits this property by providing high associativity to only
these accesses. It was introduced by Norman Jouppi in 1990.
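The behaviour described above can be sketched as a small simulation. This is a minimal illustration, not a model of any real processor: the class name, the four-set direct-mapped main cache, the two-entry victim cache, and the oldest-first victim replacement are all assumptions chosen for brevity.

```python
from collections import OrderedDict


class VictimCachedDirectMapped:
    """Direct-mapped cache backed by a small fully associative victim cache.

    All sizes and names are illustrative assumptions.
    """

    def __init__(self, num_sets=4, victim_entries=2):
        self.num_sets = num_sets
        self.main = {}                # set index -> resident block address
        self.victim = OrderedDict()   # block address -> None, oldest first
        self.victim_entries = victim_entries
        self.stats = {"main_hit": 0, "victim_hit": 0, "miss": 0}

    def _evict_to_victim(self, block):
        # Only blocks evicted from the main cache ever enter the victim cache.
        self.victim[block] = None
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)  # drop the oldest victim

    def access(self, block):
        index = block % self.num_sets
        resident = self.main.get(index)
        if resident == block:
            self.stats["main_hit"] += 1
            return "main_hit"
        if block in self.victim:
            # The block returns to the main cache; the displaced block
            # takes its place in the victim cache below.
            del self.victim[block]
            outcome = "victim_hit"
        else:
            outcome = "miss"
        self.stats[outcome] += 1
        if resident is not None:
            self._evict_to_victim(resident)
        self.main[index] = block
        return outcome


cache = VictimCachedDirectMapped()
# Blocks 0 and 4 map to the same set and would evict each other on
# every access in a plain direct-mapped cache.
outcomes = [cache.access(b) for b in [0, 4, 0, 4, 0, 4]]
print(outcomes)
```

In this sketch only the first two accesses go to memory; the remaining four conflict misses are absorbed by the victim cache, which is the effect the text attributes to providing high associativity only where it is needed.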
Trace cache
One of the more extreme examples of cache specialization is the trace cache found in the Intel
Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch
bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces
of instructions that have already been fetched and decoded.
The earliest widely acknowledged academic publication of trace cache was by Eric
Rotenberg, Steve Bennett, and Jim Smith in their 1996 paper "Trace Cache: a Low Latency
Approach to High Bandwidth Instruction Fetching."
An earlier publication is US Patent 5,381,533, "Dynamic flow instruction cache memory organized
around trace segments independent of virtual address line", by Alex Peleg and Uri Weiser of Intel
Corp., patent filed March 30, 1994, a continuation of an application filed in 1992, later abandoned.
A trace cache stores instructions either after they have been decoded, or as they are retired.
Generally, instructions are added to trace caches in groups representing either individual basic
blocks or dynamic instruction traces. A dynamic trace ("trace path") contains only instructions
whose results are actually used, and eliminates instructions following taken branches (since they
are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows the
instruction fetch unit of a processor to fetch several basic blocks, without having to worry about
branches in the execution flow.
Trace lines are stored in the trace cache based on the program counter of the first instruction in the
trace and a set of branch predictions. This allows for storing different trace paths that start on the
same address, each representing different branch outcomes. In the instruction fetch stage of
a pipeline, the current program counter along with a set of branch predictions is checked in the
trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to a
regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit
until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new
trace starts to be built.
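The lookup described above, in which a trace line is identified by its starting program counter together with a set of branch predictions, can be sketched as a dictionary keyed on both. This is a toy illustration only: the class and method names, the instruction strings, and the use of a Python dictionary in place of tag-matching hardware are all assumptions.

```python
class TraceCache:
    """Toy trace cache keyed by (start PC, predicted branch outcomes)."""

    def __init__(self):
        self.lines = {}  # (pc, predictions) -> list of instructions

    def fill(self, pc, predictions, instructions):
        # Store a completed trace under its start address and the branch
        # outcomes that were followed while it was built.
        self.lines[(pc, tuple(predictions))] = list(instructions)

    def fetch(self, pc, predictions):
        # A hit requires both the start address and the predicted branch
        # outcomes to match; otherwise None signals a miss.
        return self.lines.get((pc, tuple(predictions)))


tc = TraceCache()
# Two different trace paths that start at the same (hypothetical) address,
# one for each outcome of the branch inside the trace.
tc.fill(0x100, [True], ["add", "cmp", "branch", "mul", "ret"])
tc.fill(0x100, [False], ["add", "cmp", "branch", "sub", "ret"])

taken_path = tc.fetch(0x100, [True])
not_taken_path = tc.fetch(0x100, [False])
unknown_start = tc.fetch(0x200, [True])   # miss: a new trace would be built
```

Keying on the predictions as well as the address is what lets several traces with the same starting instruction coexist, each covering a different path through the branches.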
Trace caches are also used in processors like the Intel Pentium 4 to store already decoded micro-
operations, or translations of complex x86 instructions, so that the next time an instruction is
needed, it does not have to be decoded again.
See the full text of Smith, Rotenberg and Bennett's paper at Citeseer.
Multi-level caches
Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have
better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of
cache, with small fast caches backed up by larger slower caches.
Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the
processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is
checked, and so on, before external memory is checked.
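The tradeoff can be made concrete with the usual expected-latency calculation: every access pays the L1 latency, the fraction that misses L1 also pays the L2 latency, and so on down to main memory. The function below is a sketch under stated assumptions; the function name, the latencies in cycles, and the hit rates are hypothetical round numbers, not measurements of any processor.

```python
def average_access_time(levels, memory_latency):
    """Expected latency of one access in a multi-level hierarchy.

    levels: list of (latency_cycles, local_hit_rate), smallest cache first.
    memory_latency: cycles for an access that misses every cache level.
    """
    expected = 0.0
    reach = 1.0  # probability that an access gets as far as this level
    for latency, hit_rate in levels:
        expected += reach * latency   # every access reaching a level pays for it
        reach *= 1.0 - hit_rate       # only the misses continue downward
    return expected + reach * memory_latency


# Hypothetical numbers: 4-cycle L1 with 90% hits, 12-cycle L2 with 80%
# local hits, 200-cycle main memory.
l1_only = average_access_time([(4, 0.90)], 200)
l1_and_l2 = average_access_time([(4, 0.90), (12, 0.80)], 200)
print(l1_only, l1_and_l2)
```

With these assumed figures the L1-only system averages about 24 cycles per access and the two-level system about 9.2, which illustrates why a larger, slower cache behind a small fast one can pay for itself.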
As the latency difference between main memory and the fastest cache has become larger, some
processors have begun to utilize as many as three levels of on-chip cache. For example, the Alpha
21164 (1995) had 1 to 64 MB of off-chip L3 cache; the IBM POWER4 (2001) had a 256 MB[citation needed]
L3 cache off-chip, shared among several processors; the Itanium 2 (2003) had a 6 MB unified level
3 (L3) cache on-die; the Itanium 2 (2003) MX 2 module incorporates two Itanium 2 processors along
with a shared 64 MB L4 cache on an MCM that was pin-compatible with a Madison processor;
Intel's Xeon MP product code-named "Tulsa" (2006) features 16 MB of on-die L3 cache shared
between two processor cores; the AMD Phenom II (2008) has up to 6 MB on-die unified L3 cache;
and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is inclusive, shared by all
cores. The benefits of an L3 cache depend on the application's access patterns.
Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the
smallest, fastest cache in the system, with the special characteristic that it is scheduled in software,
typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See
especially loop nest optimization.) Register files sometimes also have hierarchy: The Cray-1 (circa
1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also
a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster
than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a
data cache. (The Cray-1 did, however, have an instruction cache.)
Exclusive versus inclusive
Multi-level caches introduce new design decisions. For instance, in some processors, all data in the
L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive.
Other processors (like the AMD Athlon) have exclusive caches: data is guaranteed to be in at
most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III,
and 4) do not require that data in the L1 cache also reside in the L2 cache, although it may often
do so. There is no universally accepted name for this intermediate policy, although the term mainly
inclusive has been used.[citation needed]
The advantage of exclusive caches is that they store more data. This advantage is larger when the
exclusive L1 cache is comparable in size to the L2 cache, and diminishes if the L2 cache is many times
larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line
in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just
copying a line from L2 to L1, which is what an inclusive cache does.
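The exchange on an L1 miss that hits in the L2 can be sketched as follows. This is a simplified model under stated assumptions: both levels are fully associative with LRU replacement, capacities are counted in lines, and the class and method names are hypothetical. Real exclusive hierarchies are set-associative and handle the swap in hardware.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Two fully associative LRU levels that never hold the same line."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()   # line address -> None, least recent first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)
            return "l1_hit"
        if line in self.l2:
            del self.l2[line]          # the hitting line leaves the L2 ...
            outcome = "l2_hit"
        else:
            outcome = "miss"
        self.l1[line] = None           # ... and is installed in the L1
        if len(self.l1) > self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            self.l2[victim] = None     # the displaced L1 line moves down
            if len(self.l2) > self.l2_lines:
                self.l2.popitem(last=False)
        return outcome

    def distinct_lines(self):
        # The levels are disjoint, so capacities simply add.
        return len(self.l1) + len(self.l2)


h = ExclusiveHierarchy(l1_lines=2, l2_lines=2)
for line in [0, 1, 2, 3]:
    h.access(line)
capacity_used = h.distinct_lines()
swap_outcome = h.access(0)   # L1 miss, L2 hit: lines are exchanged
```

With equally sized levels, as assumed here, the exclusive hierarchy holds four distinct lines where a strictly inclusive one of the same sizes could hold only two, which is the case in which the text says the advantage is largest.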
One advantage of strictly inclusive caches is that when external devices or other processors in a
multiprocessor system wish to remove a cache line from the processor, they need only have the
processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache
must be checked as well. As a drawback, there is a correlation between the associativities of L1
and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the
effective associativity of the L1 caches is restricted.
Another advantage of inclusive caches is that the larger cache can use larger cache lines, which
reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the
same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit.) If the secondary
cache is an order of magnitude larger than the primary, and the cache data is an order of
magnitude larger than the cache tags, this tag area saved can be comparable to the incremental
area needed to store the L1 cache data in the L2.
Example: the K8
To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core
in the AMD Athlon 64 CPU.[10]
Example of hierarchy, the K8
The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data
cache. Each of these caches is specialized:
The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each
cycle. Each byte in this cache is stored in ten bits rather than 8, with the extra bits marking the
boundaries of instructions (this is an example of predecoding). The cache has
only parity protection rather than ECC, because parity is smaller and any damaged data can be
replaced by fresh data fetched from memory (which always has an up-to-date copy of
instructions).
The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction
fetch has its virtual address translated through this TLB into a physical address. Each entry is
either 4 or 8 bytes in memory. Because the K8 has a variable page size, each of the TLBs is
split into two sections, one to keep PTEs that map 4 KB pages, and one to keep PTEs that map
4 MB or 2 MB pages. The split allows the fully associative match circuitry in each section to be
simpler. The operating system maps different sections of the virtual address space with
different size PTEs.
The data TLB has two copies which keep identical entries. The two copies allow two data
accesses per cycle to translate virtual addresses to physical addresses. Like the instruction
TLB, this TLB is split into two kinds of entries.
The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each
storing 8 KB of data), and can fetch two 8-byte data each cycle so long as those data are in
different banks. There are two copies of the tags, because each 64-byte line is spread among
all 8 banks. Each tag copy handles one of the two accesses per cycle.
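The banking rule for the data cache can be illustrated with a short check. The text states only that each 64-byte line is spread among all 8 banks and that two 8-byte accesses can proceed together when they fall in different banks; the specific mapping used here, in which consecutive 8-byte chunks of a line go to consecutive banks, is an assumption made for the sketch, as are the function names.

```python
LINE_BYTES = 64
NUM_BANKS = 8
BANK_WIDTH = LINE_BYTES // NUM_BANKS   # 8 bytes of each line per bank


def bank_of(address):
    # Assumed mapping: consecutive 8-byte chunks go to consecutive banks,
    # so the bank is selected by the address bits just above the low 3 bits.
    return (address // BANK_WIDTH) % NUM_BANKS


def can_issue_together(addr_a, addr_b):
    """True if two 8-byte accesses fall in different banks."""
    return bank_of(addr_a) != bank_of(addr_b)


same_line_neighbours = can_issue_together(0x1000, 0x1008)   # banks 0 and 1
same_offset_other_line = can_issue_together(0x1000, 0x1040) # both bank 0
```

Under this assumed mapping, two accesses to adjacent 8-byte chunks can be serviced in the same cycle, while two accesses at the same offset within different lines collide in one bank.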
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which
store only PTEs mapping 4 KB pages. Both instruction and data caches, and the various TLBs, can fill
from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches,
which means that any 64-byte line can only be in one of the L1 instruction cache, the L1 data cache,
or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in
one of the TLBs; the operating system is responsible for keeping the TLBs coherent by flushing
portions of them when the page tables in memory are updated.
The K8 also caches information that is never stored in memory: prediction information. These
caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly
complex branch prediction, with tables that help predict whether branches are taken and other
tables which predict the targets of branches and jumps. Some of this information is associated with
instructions, in both the level 1 instruction cache and the unified secondary cache.
The K8 uses an interesting trick to store prediction information with instructions in the secondary
cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by
an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted
from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC
code, lines from the instruction cache have a few spare bits. These bits are used to cache branch
prediction information associated with those instructions. The net result is that the branch predictor
has a larger effective history table, and so has better accuracy.
More hierarchies
Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in
the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.
These predictors are caches in that they store information that is costly to compute. Some of the
terminology used when discussing predictors is the same as that for caches (one speaks of a hit in
a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
Read path for a 2-way associative cache
The diagram to the right is intended to clarify the manner in which the various fields of the address
are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the
SRAMs, indexing, and multiplexing for a 4 KB, 2-way