cache - swarthmore collegecpu chip cache recall: how memory read works (1)cpu places address a on...
TRANSCRIPT
Cache10/27/16
TheMemoryHierarchy
Localsecondarystorage(disk)
LargerSlowerCheaperperbyte
Remotesecondarystorage(tapes,thecloud)
~100Mcyclestoaccess
OnChip
Storage
SmallerFasterCostlierperbyte
Mainmemory(DRAM)
~100cyclestoaccess
CPUinstrscan
directlyaccess
evenslowerthandisk
Registers1cycletoaccess
Cache(s)(SRAM)
~10’sofcyclestoaccess
FlashSSD/Localnetwork
0.0
0.1
1.0
10.0
100.0
1,000.0
10,000.0
100,000.0
1,000,000.0
10,000,000.0
100,000,000.0
1980 1985 1990 1995 2000 2003 2005 2010
ns (1
0-9 s
ec)
Year
Disk seek timeFlash SSD access timeDRAM access timeSRAM access timeCPU cycle timeEffective CPU cycle time
3
DataAccessTimeoverYearsOvertime,gapwidensbetweenDRAM,disk,andCPUspeeds.
Disk
DRAM
CPU
SSD
SRAM
multicore
Reallywanttoavoidgoingtodiskfordata
WanttoavoidgoingtoMainMemoryfordata
Recall
• Acacheisasmaller,fastermemory,thatholdsasubsetofalarger(slower)memory
• Wetakeadvantageoflocality tokeepdataincacheasoftenaswecan!
• Whenaccessingmemory,wecheckcachetoseeifithasthedatawe’relookingfor.
Whycachemissesoccur
• Compulsory(cold-start)miss:• Firsttimeweusedata,loaditintocache.
• Capacitymiss:• Cacheistoosmalltostoreallthedatawe’reusing.
• Conflictmiss:• Tobringinnewdatatothecache,weevictedotherdatathatwe’restillusing.
Cachedesign
Questions:• Whatdatashouldbebroughtintothecache?• Whereinthecacheshoulditgo?• Whatdatashouldbeevictedfromthecache?
Goals:• Maximizehitrate.• Takeadvantageoftemporalandspatiallocality.• Minimizehardwarecomplexity.
CachingTerminology• Block:thesizeofasinglecachedatastorageunit
• Datagetstransferredintocacheinentireblocks(nopartialblocks).• Lowerlevelsmayhavelargerblocksizes.
• Line:asinglecacheentry:• data(block)+identifyinginformation+otherstate
• Hit:thesoughtdataarefoundinthecache.• L1:typically~95%hitrate
• Miss:thesoughtdataarenotfoundinthecache.• Fetchfromlowerlevels.
• Replacement:Movingavalueoutofacachetomakeroomforanewvalueinitsplace
7
Blockissome#ofbytes
(fromcontiguousmem.addrs)
CachebasicsLine metadata addressinfo datablock
0
1
2
3
… …
1021
1022
1023
Eachlinestoressomedata,plusinformationaboutwhatmemoryaddressthedatacamefrom.
SupposetheCPUasksfordata,it’snotincache.Weneedtomoveinintocachefrommemory.Whereinthecacheshoulditbeallowedtogo?
A. Inexactlyoneplace.
B. Inafewplaces.
C. Inmostplaces,butnotall.
D. Anywhereinthecache.
ALURegs
Cache
MainMemory
MemoryBus
CPU
? ?
?
A. Inexactlyoneplace.(“Direct-mapped”)• Everylocationinmemoryisdirectlymappedtooneplace
inthecache.Easytofinddata.
B. Inafewplaces.(“Setassociative”)• Amemorylocationcanbemappedto(2,4,8)locationsin
thecache.Middleground.
C. Inmostplaces,butnotall.
D. Anywhereinthecache.(“Fullyassociative”)• Norestrictionsonwherememorycanbeplacedinthe
cache.Fewerconflictmisses,moresearching.
Alargerblocksize(cachingmemoryinlargerchunks)islikelytoexhibit…A. Bettertemporallocality
B. Betterspatiallocality
C. Fewermisses(betterhitrate)
D. Moremisses(worsehitrate)
E. Morethanoneoftheabove.(Which?)
BlockSizeImplications• Smallblocks
• Roomformoreblocks• Fewerconflictmisses
• Largeblocks• Fewertripstomemory• Longertransfertime• Fewercold-startmisses
MainMemory MainMemory
Cache Cache
ALURegs ALURegs
Trade-offs
• Thereisnosinglebestdesignforallpurposes!
• Commonsystemsquestion:whichpointinthedesignspaceshouldwechoose?
• Givenaparticularscenario:• Analyzeneeds• Choosedesignthatfitsthebill
RealCPUs• Goals:generalpurposeprocessing
• balanceneedsofmanyusecases• middleoftheroad:jackofalltrades,masterofnone
• Someassociativity• 8-wayassociative(memoryinoneofeightplaces)
• Mediumsizeblocks• 16or32-byteblocks
Whatshouldweusetodeterminewhetherornotdataisinthecache?A. Thememoryaddressofthedata.
B. Thevalueofthedata.
C. Thesizeofthedata.
D. Someotheraspectofthedata.
Recall:HowMemoryReadWorks
(1) CPUplacesaddressAonthememorybus.
ALU
Registerfile
Businterface
0
Ax
MainmemoryI/Obridge
%eax
Loadoperation: movl (A), %eaxCPUchip
Cache
Recall:HowMemoryReadWorks
(1) CPUplacesaddressAonthememorybus.(2)Memorysendsbackthevalue
ALU
Registerfile
Businterface
0
Ax
MainmemoryI/Obridge
%eax
Loadoperation: movl (A), %eaxCPUchip
Cache
MemoryAddressTellsUs…
• Istheblockcontainingthebyte(s)youwantalreadyinthecache?
• Ifnot,whereshouldweputthatblock?• Doweneedtokickout(“evict”)anotherblock?
• Whichbyte(s)withintheblockdoyouwant?
MemoryAddresses
• Likeeverythingelse:seriesofbits(32or64)
• Keepinmind:• Nbitsgivesus2N uniquevalues.
• 32-bitaddress:• 10110001011100101101010001010110
Divideintoregions,eachwithdistinctmeaning.
FirstDirect-Mapped
• Oneplacedatacanbe.
• Example:let’sassumesomeparameters:• 1024cachelocations(everyblockmappedtoone)• Blocksizeof8bytes
Direct-MappedLine V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020
1021
1022
1023
Metadata
CacheMetadata
• Validbit:istheentryvalid?• Ifset:dataiscorrect,useitifwe‘hit’incache• Ifnot set:ignore‘hits’,thedataisgarbage
• Dirtybit:hasthedatabeenwritten?• Usedbywrite-backcaches• Ifset,needtoupdatememorybeforeeviction
Direct-Mapped• Addressdivision:
• Identifybyteinblock• Howmanybits?
• Identifywhichrow(line)• Howmanybits?
Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020
1021
1022
1023
Direct-Mapped• Addressdivision:
• Identifybyteinblock• Howmanybits?3
• Identifywhichrow(line)• Howmanybits?10
Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020
1021
1022
1023
Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020
1021
1022
1023
Index:Whichline(row)shouldwecheck?Wherecoulddatabe?
Tag(19bits) Index(10bits) Byteoffset (3bits)
Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020
1021
1022
1023
Index:Whichline(row)shouldwecheck?Wherecoulddatabe?
Tag(19bits) Index(10bits) Byteoffset (3bits)
4
Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4 1 4217
… …
1020
1021
1022
1023
Inparallel,check:
Tag:Doesthecacheholdthedatawe’relookingfor,orsomeotherblock?
Validbit:Ifentryisnotvalid,don’ttrustgarbageinthatline(row).
Tag(19bits) Index(10bits) Byteoffset (3bits)
4217 4
Iftagdoesn’tmatch,orlineisinvalid,it’samiss!
Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4 1 4217
… …
1020
1021
1022
1023
Byteoffsettellsuswhichsubsetofblocktoretrieve.
Tag(19bits) Index(10bits) Byteoffset (3bits)
4217 4
0 1 2 3 4 5 6 7
Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4 1 4217
… …
1020
1021
1022
1023
Byteoffsettellsuswhichsubsetofblocktoretrieve.
Tag(19bits) Index(10bits) Byteoffset (3bits)
4217 4 2
0 1 2 3 4 5 6 7
V D Tag Data
…
=
Tag Index Byteoffset
0:miss1:hit
SelectByte(s)
Data
Input:MemoryAddress
Direct-MappedExample• Supposeouraddressesare16bitslong.
• Ourcachehas16entries,blocksizeof16bytes• 4bitsinaddressfortheindex• 4bitsinaddressforbyteoffset• Remainingbits(8):tag
Direct-MappedExample
• Let’ssayweaccessmemoryataddress:
• 0110101100110100
• Step1:• Partitionaddressintotag,index,offset
Line V D Tag Data(16Bytes)
0
1
2
3
4
5
…
15
Direct-MappedExample
• Let’ssayweaccessmemoryataddress:
• 0110101100110100
• Step1:• Partitionaddressintotag,index,offset
Line V D Tag Data(16Bytes)
0
1
2
3
4
5
…
15
Direct-MappedExample
• Let’ssayweaccessmemoryataddress:
• 011010110011 0100
• Step2:• Useindextofindline(row)
• 0011->3
Line V D Tag Data(16Bytes)
0
1
2
3
4
5
…
15
Line V D Tag Data(16Bytes)
0
1
2
3
4
5
…
15
Direct-MappedExample
• Let’ssayweaccessmemoryataddress:
• 011010110011 0100
• Step2:• Useindextofindline(row)
• 0011->3
Line V D Tag Data(16Bytes)
0
1
2
3
4
5
…
15
Direct-MappedExample
• Let’ssayweaccessmemoryataddress:
• 011010110011 0100
• Note:• ANYaddresswith0011(3)asthemiddlefourindexbitswillmaptothiscacheline.
• e.g.111111110011 0000
So,whichdataishere?
Datafromaddress0110101100110100OR1111111100110000?
Usetagtostorehigh-orderbits.Let’susdeterminewhichdataishere!(manyaddressesmaphere)
Line V D Tag Data(16Bytes)
0
1
2
3 01101011
4
5
…
15
Direct-MappedExample
• Let’ssayweaccessmemoryataddress:
• 011010110011 0100
• Step3:• Checkthetag• Isit01101011(hit)?• Somethingelse(miss)?• (Mustalsoensurevalid)
Eviction
• Ifwedon’tfindwhatwe’relookingfor(miss),weneedtobringinthedatafrommemory.
• Makeroombykickingsomethingout.• Iflinetobeevictedisdirty,writeittomemoryfirst.
• Anotherimportantsystemsdistinction:• Mechanism:Anabilityorfeatureofthesystem.Whatyoucan do.
• Policy:Governsthedecisionsmakingforusingthemechanism.Whatyoushould do.
Evictionfordirect-mappedcache
• Mechanism:overwritebitsincacheline,updating• Validbit• Tag• Data
• Policy:notmanyoptionsfordirect-mapped• Overwriteattheonlylocationitcouldbe!
Eviction:Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020 1 0 1323 57883
1021
1022
1023
Findline:
Tagdoesn’tmatch,bringinfrommemory.
Ifdirty,writebackfirst!
Tag(19bits) Index(10bits) Byteoffset (3bits)
3941 1020
Eviction:Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020 1 0 1323 57883
1021
1022
1023
Tag(19bits) Index(10bits) Byteoffset (3bits)
3941 1020
MainMemory
1.Sendaddresstoreadmainmemory.
Eviction:Direct-Mapped• Addressdivision: Line V D Tag Data(8Bytes)
0
1
2
3
4
… …
1020 1 0 3941 92
1021
1022
1023
Tag(19bits) Index(10bits) Byteoffset (3bits)
3941 1020
MainMemory
1.Sendaddresstoreadmainmemory.
2.Copydatafrommemory.Updatetag.
Supposewehad8-bitaddresses,acachewith8lines,andablocksizeof4bytes.
• Howmanybitswouldweusefor:• Tag?• Index?• Offset?
Howmanyoftheseoperationschangethecache?Howmanyaccessmemory?
Read01000100(Value:5)Read11100010(Value:17)Write01110000(Value:7)Read10101010(Value:12)Write01101100(Value:2)
Line V D Tag Data(4Bytes)
01 0 111 17
11 0 011 9
20 0 101 15
31 1 001 8
41 0 011 4
50 0 111 6
60 0 101 32
71 0 110 3
A. 1B. 2C. 3
D. 4E. 5
Steppingthrough…
Read01000100(Value:5)Read11100010(Value:17)Write01110000(Value:7)Read10101010(Value:12)Write01101100(Value:2)
Line V D Tag Data(4Bytes)
01 0 111 17
11 0 011 010 9 5
20 0 101 15
31 1 001 8
41 0 011 4
50 0 111 6
60 0 101 32
71 0 110 3
Steppingthrough…
Read01000100(Value:5)Read11100010(Value:17)Write01110000(Value:7)Read10101010(Value:12)Write01101100(Value:2)
Line V D Tag Data(4Bytes)
01 0 111 17
11 0 011 010 9 5
20 0 101 15
31 1 001 8
41 0 011 4
50 0 111 6
60 0 101 32
71 0 110 3Nochangenecessary.
Steppingthrough…
Read01000100(Value:5)Read11100010(Value:17)Write01110000(Value:7)Read10101010(Value:12)Write01101100(Value:2)
Line V D Tag Data(4Bytes)
01 0 111 17
11 0 011 010 9 5
20 0 101 15
31 1 001 8
41 0
1011 4 7
50 0 111 6
60 0 101 32
71 0 110 3
Steppingthrough…
Read01000100(Value:5)Read11100010(Value:17)Write01110000(Value:7)Read10101010(Value:12)Write01101100(Value:2)
Line V D Tag Data(4Bytes)
01 0 111 17
11 0 011 010 9 5
201
0 101 101 15 12
31 1 001 8
41 0
1011 4 7
50 0 111 6
60 0 101 32
71 0 110 3
Note:taghappenedtomatch,butlinewasinvalid.
Steppingthrough…
Read01000100(Value:5)Read11100010(Value:17)Write01110000(Value:7)Read10101010(Value:12)Write01101100(Value:2)
Line V D Tag Data(4Bytes)
01 0 111 17
11 0 011 010 9 5
201
0 101 101 15 12
31 1
1001 011 8 2
41 0
1011 4 7
50 0 111 6
60 0 101 32
71 0 110 3
1. Writedirtylinetomemory.2. Loadnewvalue,setitto2,
markitdirty(write).
Question…
Whenmightdirect-mappedcachebeabadidea?
Whentwoblocksweusealothavethesameindex.
Theotherextreme:fullyassociative
+Anyblockcangoinanycacheline.+Reducescachemisses.
- Havetocheckeverylineformatchingaddress.- Needtostoremorebitsoftheaddress.- Evictiondecisionsareharder.
Compromise:setassociative
• EachlinecanholdNblocks.• Addressesaremappedtoaline,butcangoinanyofthatline’sNblocks.
Comparison:1024Lines(Forthesamecachesize,inbytesofdata.)
Direct-mapped
1024indices(10bits)
2-waysetassociative
512sets(9bits)Tagis1bitlarger.V D Tag Data(8Bytes)
…
Set # V D Tag Data(8Bytes)
0
1
2
3
4
… …
508
509
510
511
2-WaySetAssociative
V D Tag Data(8Bytes)
1 0 3941
…
Set # V D Tag Data(8Bytes)
0
1
2
3
4 1 1 4063
… …
508
509
510
511
Tag(20bits) Set(9bits) Byteoffset (3bits)
3941 4Samecapacityaspreviousexample:1024rowswith1entryvs.512rowswith2entries
2-WaySetAssociative
V D Tag Data(8Bytes)
1 0 3941
…
Set # V D Tag Data(8Bytes)
0
1
2
3
4 1 1 4063
… …
508
509
510
511
Tag(20bits) Set(9bits) Byteoffset (3bits)
3941 4
Checkalllocationsintheset,inparallel.
2-WaySetAssociative
V D Tag Data(8Bytes)
1 0 3941
…
Set # V D Tag Data(8Bytes)
0
1
2
3
4 1 1 4063
… …
508
509
510
511
Tag(20bits) Set(9bits) Byteoffset (3bits)
3941 4
0 1 2 3 4 5 6 70 1 2 3 4 5 6 7
Multiplexer Selectcorrectvalue.
4-WaySetAssociativeCache
Clearly,morecomplexityhere!
Eviction• Mechanismisthesame…
• Overwritebitsincacheline:updatetag,valid,data
• Policy:choosewhichlineinthesettoevict• Option1:Pickarandomlineinset• Option2:Chooseaninvalidlinefirst• Option3:Choosetheleastrecentlyusedblock
• Hasexhibitedtheleastlocality,kickitout!• Option4:first2then3
LeastRecentlyUsed(LRU)
• Intuition:ifithasn’tbeenusedinawhile,wehavenoreasontobelieveitwillbeusedsoon.
• NeedextrastatetokeeptrackofLRUinfo.
V D Tag Data(8Bytes)
1 0 3941
…
Set # LRU V D Tag Data(8Bytes)
0 0
1 1
2 1
3 0
4 1 1 1 4063
… …
LeastRecentlyUsed(LRU)• Intuition:ifithasn’tbeenusedinawhile,wehavenoreasontobelieveitwillbeusedsoon.
• NeedextrastatetokeeptrackofLRUinfo.
• ForperfectLRUinfo:• 2-way:1bit• 4-way:8bits• N-way:N*log2 Nbits
Anotherreasonwhyassociativityoftenmaxesoutat8or16.
Thesearemetadatabits,not“useful”programdatastorage.
(Approximationsmakeitnotquiteasbad.)
Howwouldthecachechangeifweperformedthefollowingmemoryoperations?(2-wayset)Read01000100(Value:5)Read11100010(Value:17)Write01100100(Value:7)Read01000110(Value:5)Write01100000(Value:2)
V D Tag Data(4Bytes)
1 0 001 17
1 0 010 5
… …
Set # LRU V D Tag Data(4Bytes)
0 1 0 0 111 4
1 0 1 1 111 9
2 … …
3
4
5
6
7
LRUof0meanstheleftlineinthesetwasleastrecentlyused.1meanstherightlinewasusedleastrecently.
CacheConsciousProgramming• Knowingaboutcachinganddesigningcodearounditcansignificantlyeffectperformance(ex)2Darrayaccesses
Algorithmically,bothO(N*M).
Isonefasterthantheother?
for(i=0; i < N; i++) {for(j=0; j< M; j++) {
sum += arr[i][j];}}
for(j=0; j < M; j++) {for(i=0; i< N; i++) {
sum += arr[i][j];}}
A.isfaster. B.isfaster.
C.Bothwouldexhibitroughlyequalperformance.
CacheConsciousProgrammingThefirstnestedloopismoreefficientifthecacheblocksizeislargerthanasinglearraybucket(forarraysofbasicCtypes,itwillbe).
(ex)1missevery4bucketsvs.1misseverybucket
for(i=0; i < N; i++) {for(j=0; j< M; j++) {
sum += arr[i][j];}}
for(j=0; j < M; j++) {for(i=0; i< N; i++) {
sum += arr[i][j];}}
1 2 3 4 5 6 7 8 9 10
11
12
13
14
15
16
...
.
.
.
1 . ..
2
3
4
.
.
.
Acaveat:Amdahl’sLawIdea:anoptimizationcanimprovetotalruntimeatmostbythefractionitcontributestototalruntime
Ifprogramtakes100secs torun,andyouoptimizeaportionofthecodethataccountsfor2%oftheruntime,thebestyouroptimizationcandoisimprovetheruntimeby2secs.
Amdahl’sLawtellsustofocusouroptimizationeffortsonthecodethatmatters:
Speed-upwhatisaccountingforthelargestportionofruntime togetthelargestbenefit.And,don’twastetimeonthesmallstuff.
“Prematureoptimizationistherootofallevil.”–DonaldKnuth