Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12.

Concurrent and Multicore Programming
Department of Computer Science and Engineering
Yonghong Yan
[email protected] www.secs.oakland.edu/~yan


Parallelism in Hardware

•  Instruction-Level Parallelism
   –  Pipeline
   –  Out-of-order execution, and
   –  Superscalar
•  Thread-Level Parallelism
   –  Chip multithreading, multicore
   –  Coarse-grained and fine-grained multithreading
   –  SMT
•  Data-Level Parallelism
   –  SIMD/Vector
   –  GPU/SIMT

Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 30, 2011, by John L. Hennessy and David A. Patterson

Topics (Part 2)

•  Parallel architectures and hardware
   –  Parallel computer architectures
   –  Memory hierarchy and cache coherency
•  Manycore GPU architectures and programming
   –  GPU architectures
   –  CUDA programming
   –  Introduction to offloading model in OpenMP and OpenACC
•  Programming on large scale systems (Chapter 6)
   –  MPI (point to point and collectives)
   –  Introduction to PGAS languages, UPC and Chapel
•  Parallel algorithms (Chapter 8, 9 & 10)
   –  Dense matrix, and sorting

Outline

•  Memory, Locality of reference and Caching
•  Cache coherence in shared memory system

Memory until now…

•  We've relied on a very simple model of memory for most of this class
   –  Main memory is a linear array of bytes that can be accessed given a memory address
   –  Also used registers to store values
•  Reality is more complex. There is an entire memory system.
   –  Different memories exist at different levels of the computer
   –  Each varies in speed, size, and cost

Random-Access Memory (RAM)

•  Key features
   –  RAM is packaged as a chip.
   –  Basic storage unit is a cell (one bit per cell).
   –  Multiple RAM chips form a memory.
•  Static RAM (SRAM)
   –  Each cell stores a bit with a six-transistor circuit.
   –  Retains value indefinitely, as long as it is kept powered.
   –  Relatively insensitive to disturbances such as electrical noise.
   –  Faster and more expensive than DRAM.

Random-Access Memory (RAM)

•  Dynamic RAM (DRAM)
   –  Each cell stores a bit with a capacitor and transistor.
   –  Value must be refreshed every 10-100 ms.
   –  Sensitive to disturbances.
   –  Slower and cheaper than SRAM.

Memory Modules… real-life DRAM

•  In reality,
   –  Several DRAM chips are bundled into Memory Modules
      •  SIMM - Single Inline Memory Module
      •  DIMM - Dual Inline Memory Module
      •  DDR - Double Data Rate
         –  Transfers data twice every clock cycle
      •  Quad Pump: Simultaneous R/W

Source for Pictures: http://en.kioskea.net/contents/pc/ram.php3

SDR, DDR, Quad Pump

Memory Speeds

•  Processor speeds: a 1 GHz processor has a 1 nsec cycle time.
•  Memory speeds (50 nsec)
•  Access speed gap
   –  Instructions that store or load from memory

DIMM Module   Chip Type   Clock Speed (MHz)   Bus Speed (MHz)   Transfer Rate (MB/s)
PC1600        DDR200      100                 200               1600
PC2100        DDR266      133                 266               2133
PC2400        DDR300      150                 300               2400

Memory Hierarchy (Review)

Smaller, faster, and costlier (per byte) storage devices are at the top; larger, slower, and cheaper (per byte) storage devices are at the bottom.

L0: registers. CPU registers hold words retrieved from L1 cache.
L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory.
L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).

Cache Memories (SRAM)

•  Cache memories are small, fast SRAM-based memories managed automatically in hardware.
   –  Hold frequently accessed blocks of main memory
•  CPU looks first for data in L1, then in L2, then in main memory.
•  Typical bus structure:

[Figure labels: register file, ALU, L1 cache, cache bus, L2 cache, bus interface, system bus, I/O bridge, memory bus, main memory]

How to Exploit Memory Hierarchy

•  Availability of memory
   –  Cost, size, speed
•  Principle of locality
   –  Memory references are bunched together
   –  A small portion of address space is accessed at any given time
•  Keep this space in high-speed memory
   –  Problem: not all of it may fit

Types of locality

•  Temporal locality
   –  Tendency to access locations recently referenced
•  Spatial locality
   –  Tendency to reference locations around those recently referenced
   –  If location x is referenced, then the others will be x-k or x+k

Sources of locality

•  Temporal locality
   –  Code within a loop
   –  Same instructions fetched repeatedly
•  Spatial locality
   –  Data arrays
   –  Local variables in stack
   –  Data allocated in chunks (contiguous bytes)

for (i=0; i<N; i++) {
    A[i] = B[i] + C[i] * a;
}

What does locality buy?

•  Addresses the gap between CPU speed and RAM speed
•  Spatial and temporal locality imply that a subset of instructions can fit in high-speed memory from time to time
•  CPU can access instructions and data from this high-speed memory
•  Small high-speed memory can make the computer faster and cheaper
•  Speed of 1-20 nsec at a cost of $50 to $100 per Mbyte
•  This is Caching!!

Inserting an L1 Cache Between CPU and Main Memory

•  The tiny, very fast CPU register file has room for four 4-byte words.
•  The transfer unit between the CPU register file and the cache is a 4-byte block.
•  The small fast L1 cache has room for two 4-word blocks (line 0 and line 1).
•  The transfer unit between the cache and main memory is a 4-word block (16 bytes).
•  The big slow main memory has room for many 4-word blocks (in the figure: block 10 holds a b c d, block 21 holds p q r s, block 30 holds w x y z).

What info does a cache need?

•  Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
•  You essentially allow a smaller region of memory to hold data from a larger region. Not a 1-1 mapping.
•  What kind of information do we need to keep:
   –  The actual data
   –  Where the data actually comes from
   –  Whether the data is even considered valid

Cache Organization

•  Map each region of memory to a smaller region of cache
•  Discard address bits
   –  Discard lower-order bits (a)
   –  Discard higher-order bits (b)
•  Cache address size is 4 bits
•  Memory address size is 8 bits
•  In case of (a)
   –  0000xxxx is mapped to 0000 in cache
•  In case of (b)
   –  xxxx0001 is mapped to 0001 in cache

[Figure: memory-to-cache mappings (a) and (b)]

Finding data in cache

•  Part of memory address applied to cache
•  Remaining bits are stored as tag in cache
•  Lower-order bits discarded
•  Need to check whether address 00010011 is cached
   –  Cache index is 0001
   –  Tag is 0011
•  If tag matches: hit, use data
•  No match: miss, fetch data from memory
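The lookup for address 00010011 can be sketched in C. This is a minimal illustration only, assuming the 8-bit address and 4-bit cache index from the previous slide; the names `entry_t`, `cache_index`, `cache_tag`, and `lookup` are hypothetical and not part of the lecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy cache: 8-bit addresses, 16 entries (4-bit index).
 * The upper 4 address bits select the entry; the lower 4 bits are kept as the tag. */
#define NUM_ENTRIES 16

typedef struct {
    bool    valid;  /* does this entry hold data at all? */
    uint8_t tag;    /* the address bits not used as index */
    uint8_t data;   /* the cached byte */
} entry_t;

static uint8_t cache_index(uint8_t addr) { return (uint8_t)(addr >> 4); }
static uint8_t cache_tag(uint8_t addr)   { return (uint8_t)(addr & 0x0F); }

/* Returns true on a hit and stores the cached byte in *out. */
static bool lookup(const entry_t cache[NUM_ENTRIES], uint8_t addr, uint8_t *out) {
    assert(cache && out);
    const entry_t *e = &cache[cache_index(addr)];
    if (e->valid && e->tag == cache_tag(addr)) {
        *out = e->data;
        return true;
    }
    return false;   /* miss: the caller fetches the data from memory */
}
```

For address 0x13 (binary 00010011) the index is 0001 and the tag is 0011, as on the slide.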

General Organization of a Cache Memory

•  Cache is an array of sets.
•  Each set contains one or more lines.
•  Each line holds a block of data.

Parameters:
   –  S = 2^s sets
   –  E lines per set
   –  B = 2^b bytes per cache block
   –  t tag bits per line
   –  1 valid bit per line

Cache size: C = B x E x S data bytes

[Figure: sets 0 to S-1; each line has a valid bit, a tag, and a block of bytes 0 to B-1]

Addressing Caches

Address A (bits m-1 down to 0): <tag> (t bits), <set index> (s bits), <block offset> (b bits)

•  The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
•  The word contents begin at offset <block offset> bytes from the beginning of the block.

[Figure: sets 0 to S-1; each line has a valid bit (v), a tag, and a block of bytes 0 to B-1]

Direct-Mapped Cache

•  Simplest kind of cache
•  Characterized by exactly one line per set (E = 1 lines per set).

[Figure: sets 0 to S-1; each set has one line with a valid bit, a tag, and a cache block]

Accessing Direct-Mapped Caches

•  Set selection
   –  Use the set index bits to determine the set of interest.

[Figure: address (bits m-1 down to 0) split into tag (t bits), set index (s bits), and block offset (b bits); set index 00001 selects set 1]

Accessing Direct-Mapped Caches

•  Line matching and word selection
   –  Line matching: find a valid line in the selected set with a matching tag
   –  Word selection: then extract the word

(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and block offset selects starting byte.

[Figure: selected set (i): valid bit 1, tag 0110, block bytes 0 to 7 with w0 w1 w2 w3; address: tag 0110, set index i, block offset 100]

Example: Direct mapped cache

•  32-bit address, 64 KB cache, 32-byte block
•  How many sets, how many bits for the tag, how many bits for the offset?

[Figure: sets 0 to n-1; each set has one line with a valid bit, a tag, and a cache block]
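One way to work this example is to apply C = B x E x S and the t/s/b address split from the earlier slides. The sketch below assumes power-of-two sizes; the names `geometry_t`, `log2u`, and `cache_geometry` are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry of a cache: S sets, and the t/s/b split of an m-bit address. */
typedef struct {
    uint32_t sets;         /* S */
    unsigned tag_bits;     /* t */
    unsigned index_bits;   /* s */
    unsigned offset_bits;  /* b */
} geometry_t;

/* log2 of a power of two. */
static unsigned log2u(uint32_t x) {
    unsigned n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* C = B x E x S, so S = C / (B x E); then t = m - s - b. */
static geometry_t cache_geometry(unsigned addr_bits, uint32_t cache_bytes,
                                 uint32_t block_bytes, uint32_t lines_per_set) {
    geometry_t g;
    g.sets        = cache_bytes / (block_bytes * lines_per_set);
    g.offset_bits = log2u(block_bytes);
    g.index_bits  = log2u(g.sets);
    g.tag_bits    = addr_bits - g.index_bits - g.offset_bits;
    return g;
}
```

For the parameters on the slide (32-bit address, 64 KB cache, 32-byte blocks, E = 1), `cache_geometry(32, 64 * 1024, 32, 1)` gives 2048 sets, 5 offset bits, 11 set index bits, and 16 tag bits.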

Write-through vs write-back

•  What to do when an update occurs?
•  Write-through: write to the lower level immediately
   –  Simple to implement, synchronous write
   –  Uniform latency on misses
•  Write-back: write when the block is replaced
   –  Requires additional dirty bit or modified bit
   –  Asynchronous writes
   –  Non-uniform miss latency
   –  Clean miss: read from lower level
   –  Dirty miss: write to lower level and read (fill)

Writes and Cache

•  Reading information from a cache is straightforward.
•  What about writing?
   –  What if you're writing data that is already cached (write-hit)?
   –  What if the data is not in the cache (write-miss)?
•  Dealing with a write-hit.
   –  Write-through - immediately write data back to memory
   –  Write-back - defer the write to memory for as long as possible
•  Dealing with a write-miss.
   –  write-allocate - load the block into the cache and update it
   –  no-write-allocate - writes directly to memory
•  Benefits? Disadvantages?
•  Write-through caches are typically no-write-allocate.
•  Write-back caches are typically write-allocate.

Multi-Level Caches

•  Options: separate data and instruction caches, or a unified cache

Levels get larger, slower, and cheaper from the processor toward the disk:

Level                     Size           Speed   $/Mbyte    Line size
Regs                      200 B          3 ns               8 B
L1 d-cache / L1 i-cache   8-64 KB        3 ns               32 B
Unified L2 cache          1-4 MB SRAM    6 ns    $100/MB    32 B
Memory                    128 MB DRAM    60 ns   $1.50/MB   8 KB
Disk                      30 GB          8 ms    $0.05/MB

Cache Performance Metrics

•  Miss Rate
   –  Fraction of memory references not found in cache (misses/references)
   –  Typical numbers:
      •  3-10% for L1
      •  can be quite small (e.g., < 1%) for L2, depending on size, etc.
•  Hit Time
   –  Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
   –  Typical numbers:
      •  1 clock cycle for L1
      •  3-8 clock cycles for L2
•  Miss Penalty
   –  Additional time required because of a miss
      •  Typically 25-100 cycles for main memory
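The three metrics are commonly combined into an average access time: hit time + miss rate x miss penalty. The slide does not state this formula, so the sketch below is an addition; the function name `avg_access_cycles` is hypothetical, and any numbers passed to it are illustrative picks from the typical ranges above, not measurements.

```c
#include <assert.h>

/* Average access time in cycles = hit time + miss rate * miss penalty. */
static double avg_access_cycles(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty average out to about 6 cycles per access, which shows how a small miss rate can still dominate the cost.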

Writing Cache Friendly Code

•  Repeated references to variables are good (temporal locality)
•  Stride-1 reference patterns are good (spatial locality)
•  Examples:
   –  cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%
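The two miss rates can be checked with a small simulation. This is a sketch under stated assumptions: a cold, direct-mapped cache with 64 sets (1 KB, an assumed size), the 4-byte words and 4-word blocks from the slide, and a 64 x 64 array starting at address 0. The names `sim_t`, `sim_access`, and `traversal_miss_rate` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ROWS 64
#define COLS 64
#define WORD_BYTES 4     /* 4-byte words, as on the slide */
#define BLOCK_BYTES 16   /* 4-word cache blocks, as on the slide */
#define NUM_SETS 64      /* assumption: 1 KB direct-mapped cache */

typedef struct {
    bool   valid[NUM_SETS];
    size_t tag[NUM_SETS];
    size_t accesses, misses;
} sim_t;

/* Simulate one read of the byte address addr. */
static void sim_access(sim_t *c, size_t addr) {
    assert(c != NULL);
    size_t block = addr / BLOCK_BYTES;
    size_t set   = block % NUM_SETS;
    size_t tag   = block / NUM_SETS;
    c->accesses++;
    if (!c->valid[set] || c->tag[set] != tag) {  /* cold or conflict miss */
        c->misses++;
        c->valid[set] = true;
        c->tag[set]   = tag;
    }
}

/* Miss rate of traversing a ROWS x COLS int array, starting from a cold cache.
 * by_rows != 0: j is the inner loop (row-wise); otherwise i is the inner loop. */
static double traversal_miss_rate(int by_rows) {
    sim_t c = {0};
    for (size_t outer = 0; outer < (by_rows ? ROWS : COLS); outer++) {
        for (size_t inner = 0; inner < (by_rows ? COLS : ROWS); inner++) {
            size_t i = by_rows ? outer : inner;
            size_t j = by_rows ? inner : outer;
            sim_access(&c, (i * COLS + j) * WORD_BYTES);  /* row-major address */
        }
    }
    return (double)c.misses / (double)c.accesses;
}
```

Under these assumptions the row-wise traversal misses once per 4-word block (25%), while the column-wise traversal misses on every access (100%) because each block is evicted before its neighbouring words are used.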

Matrix Multiplication Example

•  Major Cache Effects to Consider
   –  Total cache size
      •  Exploit temporal locality and blocking
   –  Block size
      •  Exploit spatial locality
•  Description:
   –  Multiply N x N matrices
   –  O(N^3) total operations
   –  Accesses
      •  N reads per source element
      •  N values summed per destination
         –  but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;    /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Miss Rate Analysis for Matrix Multiply

•  Assume:
   –  Line size = 32 bytes (big enough for four 64-bit words)
   –  Matrix dimension (N) is very large
      •  Approximate 1/N as 0.0
   –  Cache is not even big enough to hold multiple rows
•  Analysis Method:
   –  Look at access pattern of inner loop

Layout of C Arrays in Memory (review)

•  C arrays allocated in row-major order
   –  each row in contiguous memory locations
•  Stepping through columns in one row:
   –  for (i = 0; i < N; i++) sum += a[0][i];
   –  accesses successive elements
   –  if block size (B) > 4 bytes, exploit spatial locality
      •  compulsory miss rate = 4 bytes / B
•  Stepping through rows in one column:
   –  for (i = 0; i < n; i++) sum += a[i][0];
   –  accesses distant elements
   –  no spatial locality!
      •  compulsory miss rate = 1 (i.e. 100%)
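Row-major layout means the byte offset of a[i][j] in an array declared as int a[rows][cols] is (i * cols + j) * sizeof(int). A minimal sketch; the helper name `row_major_offset` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a[i][j] from the start of an array declared as int a[rows][cols]. */
static size_t row_major_offset(size_t i, size_t j, size_t cols) {
    assert(j < cols);
    return (i * cols + j) * sizeof(int);
}
```

Neighbours within a row are sizeof(int) bytes apart, so several of them share a cache block; neighbours within a column are cols * sizeof(int) bytes apart, so for large rows each one lands in a different block.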

Matrix Multiplication (ijk)

Matrix Multiplication (jik)

Matrix Multiplication (kij)

Matrix Multiplication (ikj)

Matrix Multiplication (jki)

Matrix Multiplication (kji)

Summary of Matrix Multiplication

ijk (& jik):
•  2 loads, 0 stores
•  misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj):
•  2 loads, 1 store
•  misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji):
•  2 loads, 1 store
•  misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
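The slides name blocking as a way to exploit temporal locality but do not show it. The following is a hedged sketch of one common form of blocked (tiled) matrix multiplication, not the lecture's own code; the function name `matmul_blocked`, the flat row-major storage, and the tile size parameter `bsize` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices, processed in bsize x bsize tiles.
 * bsize is a tuning assumption: pick it so the tiles in use fit in the cache. */
static void matmul_blocked(size_t n, size_t bsize,
                           const double *a, const double *b, double *c) {
    assert(bsize > 0);
    for (size_t ii = 0; ii < n; ii += bsize)
        for (size_t kk = 0; kk < n; kk += bsize)
            for (size_t jj = 0; jj < n; jj += bsize) {
                size_t i_end = ii + bsize < n ? ii + bsize : n;
                size_t k_end = kk + bsize < n ? kk + bsize : n;
                size_t j_end = jj + bsize < n ? jj + bsize : n;
                /* ikj order inside the tile: b and c are walked with stride 1 */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double r = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += r * b[k * n + j];
                    }
            }
}
```

Each tile of b and c is reused many times while it is still cached, so the working set stays small even when n is large. The caller must zero c first if a plain product is wanted.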

Outline

•  Memory, Locality of reference and Caching
•  Cache coherence in shared memory system

Shared memory systems

•  All processes have access to the same address space
   –  E.g. PC with more than one processor
•  Data exchange between processes by writing/reading shared variables
   –  Shared memory systems are easy to program
   –  Current standard in scientific programming: OpenMP
•  Two versions of shared memory systems available today
   –  Centralized Shared Memory Architectures
   –  Distributed Shared Memory Architectures

Centralized Shared Memory Architecture

•  Also referred to as Symmetric Multi-Processors (SMP)
•  All processors share the same physical main memory
•  Memory bandwidth per processor is the limiting factor for this type of architecture
•  Typical size: 2-32 processors

[Figure: four CPUs attached to one shared memory]

Centralized shared memory system (I)

•  Intel X7350 quad-core (Tigerton)
   –  Private L1 cache: 32 KB instruction, 32 KB data
   –  Shared L2 cache: 4 MB unified cache

[Figure: four cores, each with its own L1; each pair of cores has a shared L2; 1066 MHz FSB]

Centralized shared memory systems (II)

•  Intel X7350 quad-core (Tigerton) multi-processor configuration

[Figure: four sockets, each with two core pairs and one L2 per pair. Socket 0: C0, C1, C8, C9. Socket 1: C2, C3, C10, C11. Socket 2: C4, C5, C12, C13. Socket 3: C6, C7, C14, C15. Four links labeled 8 GB/s, a Memory Controller Hub (MCH), and four memories]

Distributed Shared Memory Architectures

•  Also referred to as Non-Uniform Memory Architectures (NUMA)
•  Some memory is closer to a certain processor than other memory
   –  The whole memory is still addressable from all processors
   –  Depending on what data item a processor retrieves, the access time might vary strongly

[Figure: four nodes, each with two CPUs and its own memory]

NUMA architectures (II)

•  Reduces the memory bottleneck compared to SMPs
•  More difficult to program efficiently
   –  E.g. first touch policy: a data item will be located in the memory of the processor which uses it first
•  To reduce effects of non-uniform memory access, caches are often used
   –  ccNUMA: cache-coherent non-uniform memory access architectures
•  Largest example as of today: SGI Origin with 512 processors

Distributed Shared Memory Systems

Cache Coherence

•  Real-world shared memory systems have caches between memory and CPU
•  Copies of a single data item can exist in multiple caches
•  Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU

[Figure: memory holds the original data item; the caches of CPU 0 and CPU 1 each hold a copy of the data item]

Cache coherence (II)

•  Typical solution:
   –  Caches keep track of whether a data item is shared between multiple processes
   –  Upon modification of a shared data item, 'notification' of other caches has to occur
   –  Other caches will have to reload the shared data item into their cache on the next access
•  Cache coherence is only an issue in case multiple tasks access the same item
   –  Multiple threads
   –  Multiple processes have a joint shared memory segment
   –  Process is being migrated from one CPU to another
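The notify-and-reload behaviour can be sketched as a toy write-invalidate model. This is an illustration under simplifying assumptions (one shared data item, two CPUs, write-through to memory), not a real protocol; the names `system_t`, `cpu_read`, and `cpu_write` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CPUS 2

/* One shared data item, with one private cached copy per CPU. */
typedef struct {
    int  memory;                     /* the original data item */
    int  cached[NUM_CPUS];           /* each CPU's copy */
    bool valid[NUM_CPUS];            /* is that copy usable? */
    bool was_invalidated[NUM_CPUS];  /* copy was dropped because another CPU wrote */
    int  coherence_misses;           /* reloads caused by another CPU's write */
} system_t;

static int cpu_read(system_t *s, int cpu) {
    assert(cpu >= 0 && cpu < NUM_CPUS);
    if (!s->valid[cpu]) {                      /* miss: reload from memory */
        if (s->was_invalidated[cpu]) s->coherence_misses++;
        s->cached[cpu] = s->memory;
        s->valid[cpu] = true;
        s->was_invalidated[cpu] = false;
    }
    return s->cached[cpu];
}

static void cpu_write(system_t *s, int cpu, int value) {
    assert(cpu >= 0 && cpu < NUM_CPUS);
    s->cached[cpu] = value;
    s->valid[cpu] = true;
    s->memory = value;                         /* write-through keeps the sketch short */
    for (int other = 0; other < NUM_CPUS; other++)
        if (other != cpu && s->valid[other]) { /* 'notification' of the other caches */
            s->valid[other] = false;
            s->was_invalidated[other] = true;
        }
}
```

After CPU 0 writes, CPU 1's copy is marked invalid, so its next read reloads the new value and counts as a coherence miss.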

Cache Coherence Protocols

•  Snooping Protocols
   –  Send all requests for data to all processors
   –  Processors snoop a bus to see if they have a copy and respond accordingly
   –  Requires broadcast, since caching information is at processors
   –  Works well with bus (natural broadcast medium)
   –  Dominates for centralized shared memory machines
•  Directory-Based Protocols
   –  Keep track of what is being shared in a centralized location
   –  Distributed memory => distributed directory for scalability (avoids bottlenecks)
   –  Send point-to-point requests to processors via network
   –  Scales better than Snooping
   –  Commonly used for distributed shared memory machines

Categories of cache misses

•  Up to now:
   –  Compulsory Misses: first access to a block cannot be in the cache (cold start misses)
   –  Capacity Misses: cache cannot contain all blocks required for the execution
   –  Conflict Misses: cache block has to be discarded because of block replacement strategy
•  In multi-processor systems:
   –  Coherence Misses: cache block has to be discarded because another processor modified the content
      •  true sharing miss: another processor modified the content of the requested element
      •  false sharing miss: another processor invalidated the block, although the actual item of interest is unchanged.

Bus Snooping Topology

Larger Shared Memory Systems

•  Typically Distributed Shared Memory Systems
•  Local or remote memory access via memory controller
•  Directory per cache that tracks state of every block in every cache
   –  Which caches have a copy of block, dirty vs. clean, ...
•  Info per memory block vs. per cache block?
   –  PLUS: In memory => simpler protocol (centralized/one location)
   –  MINUS: In memory => directory is ƒ(memory size) vs. ƒ(cache size)
•  Prevent directory as bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks

Distributed Directory MPs

False Sharing in OpenMP

•  False sharing
   –  When at least one thread writes to a cache line while others access it
      •  Thread 0: = A[1] (read)
      •  Thread 1: A[0] = … (write)
•  Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i=0; i<max_threads; i++)
    a[i] += i;

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i=0; i<max_threads; i++)
    a[i][0] += i;

False Sharing (from Getting OpenMP Up To Speed, RvdP/V1, Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)

•  A store into a shared cache line invalidates the other copies of that line.
•  The system is not able to distinguish between changes within one individual line.

[Figure: CPUs, caches, and memory; threads T0 and T1 access the same cache line of A]
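A complete variant of the padding idea can be sketched with a padded struct instead of a two-dimensional array. This is an illustration only: the 64-byte line size is an assumption that should be checked for the target platform, and the names `padded_int_t`, `counters`, and `bump_counters` are hypothetical.

```c
#include <assert.h>

#define MAX_THREADS 8
#define CACHE_LINE_BYTES 64   /* assumption: a common line size; check your platform */

/* Each counter is padded out to a full cache line so that two threads
 * never write to the same line. */
typedef struct {
    int  value;
    char pad[CACHE_LINE_BYTES - sizeof(int)];
} padded_int_t;

static padded_int_t counters[MAX_THREADS];

/* Add i to counter i, repeated `rounds` times, one counter per loop iteration. */
static void bump_counters(int rounds) {
    assert(rounds >= 0);
    /* Without OpenMP enabled at compile time the pragma is ignored
     * and the loop simply runs serially. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < MAX_THREADS; i++)
        for (int r = 0; r < rounds; r++)
            counters[i].value += i;
}
```

Because consecutive `value` fields are a full line apart, they can never fall into the same cache line, even if the array itself is not aligned to a line boundary. The price is memory: each 4-byte counter now occupies 64 bytes.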

NUMA and First Touch Policy

•  Data placement policy on NUMA architectures
•  First Touch Policy
   –  The process that first touches a page of memory causes that page to be allocated in the node on which the process is running

[Figure: a generic cc-NUMA architecture]

NUMA First-touch placement/1

int a[100];            (only reserves the virtual memory address)

for (i=0; i<100; i++)
    a[i] = 0;          (first touch)

All array elements (a[0] : a[99]) are in the memory of the processor executing this thread.

NUMA First-touch placement/2

#pragma omp parallel for num_threads(2)
for (i=0; i<100; i++)
    a[i] = 0;          (first touch)

Both memories each have "their half" of the array (a[0] : a[49] and a[50] : a[99]).

Work with First-Touch in OpenMP

•  First-touch in practice
   –  Initialize data consistently with the computations

#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);

#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

Concluding Observations

•  Programmer can optimize for cache performance
   –  How data structures are organized
   –  How data are accessed
      •  Nested loop structure
      •  Blocking is a general technique
•  All systems favor "cache friendly code"
   –  Getting absolute optimum performance is very platform specific
      •  Cache sizes, line sizes, associativities, etc.
   –  Can get most of the advantage with generic code
      •  Keep working set reasonably small (temporal locality)
      •  Use small strides (spatial locality)
   –  Work with cache coherence protocol and NUMA first touch policy

References

•  John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 5th Edition. Morgan Kaufmann, September 30, 2011.
•  Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture (Mark D. Hill, Series Editor), 2011.