TRANSCRIPT
Practically Costless Coherence for GPUs and Manycore Accelerators
PEGPUM/HiPEAC 2013
Stefanos Kaxiras & Alberto Ros
2/3/13 Uppsala University/Uppsala Innovation
1. Ros & Kaxiras, "Complexity-Effective Multicore Coherence," PACT 2012
2. Kaxiras & Ros, "Efficient, Snoopless, System-On-Chip Coherence," IEEE System on Chip Conference (SOCC) 2012
Motivation: Why Simplify?

Directory/snooping coherence is a relic of the NUMA/SMP systems of the past:
• "Invisible" to memory consistency → supports SC in the presence of data races (which may not be needed)
  – Invalidations, directories, snoops, broadcasts …
• Does not take the LLC into account – e.g., the Owned state exists for cache-to-cache transfers
• MESI is already complex (30+ hidden and stable states) – any new optimization → verify from scratch
  – but what the heck, we know how to do it …
• Cost (area + power):
  – Directories, indirection, …
  – Snooping on buses, dual tags, …
Motivation: Why Simplify?

• Not a big issue for a few fat cores …
• But: a new class of manycores with many simple cores: GPUs, accelerators, …
  – Desire to make them coherent with GP cores on the same chip (coherent Shared Virtual Memory – cSVM)
  – Coherence overhead (area, power) is a big issue → don't want to pay much per core
  – Coherence is sporadically needed → why pay always?
• Simple coherence → simple memory system & NoC
Simplifying Coherence: Write-Through

• Write-through protocols are simple
  – Only Valid and Invalid states in the caches
  – BUT they are terrible because of many write misses
• Most of the write misses are due to private data (≈ 90%)
[Figure: Percentage of write misses to Private vs. Shared data for Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Raytrace, Volrend, Water-Nsq, Water-Sp, Em3d, Tomcatv, Swaptions, x264, and the average; y-axis 0–100%.]
VIPS

• Dynamic write policy in the L1s (private caches, in general), based on a classification of data into Private/Shared:
• Write-back (WB) for Private lines
  – Simple (no coherence required), as in uniprocessors
  – Efficient: no extra misses
• Write-through (WT) for Shared lines
  – Simple (only two states, VI)
  – Efficient: WT → correct shared data always at the LLC
• VIPS: Valid/Invalid Private/Shared
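The dynamic policy above can be sketched in a few lines. This is an illustrative model only (the class and function names are not from the talk): writes to Private lines stay dirty in the L1 and reach the LLC on eviction, while writes to Shared lines go through immediately.

```python
# Minimal sketch of the VIPS dynamic write policy, assuming an external
# Private/Shared classifier. Illustrative names; not the real hardware.

class L1Cache:
    def __init__(self, llc, is_shared):
        self.llc = llc              # backing LLC, modeled as addr -> value
        self.is_shared = is_shared  # addr -> True if the data is Shared
        self.lines = {}             # addr -> (value, dirty)

    def write(self, addr, value):
        if self.is_shared(addr):
            # Shared: write-through -> the LLC always holds correct data
            self.lines[addr] = (value, False)
            self.llc[addr] = value
        else:
            # Private: write-back -> dirty in L1, as in a uniprocessor
            self.lines[addr] = (value, True)

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:                   # only Private lines can be dirty
            self.llc[addr] = value

llc = {}
l1 = L1Cache(llc, is_shared=lambda a: a >= 0x1000)  # toy classifier
l1.write(0x10, 1)     # private: stays dirty in the L1
l1.write(0x1000, 2)   # shared: written through immediately
assert llc.get(0x10) is None and llc[0x1000] == 2
l1.evict(0x10)        # write-back on replacement
assert llc[0x10] == 1
```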
Private/Shared Classification

• Classify data (cache lines) into Private and Shared
  – Private data are accessed by one core only and DO NOT NEED COHERENCE
  – Shared data are (potentially) accessed by more than one core and need coherence
  – Not a temporal classification, but one for the lifetime of the program (think of a variable declared as private or shared)
• The Private/Shared classification is orthogonal to the rest of the protocol
  – Several ways to do it; here: based on the Page Table/TLBs
Private/Shared Classification (cont.)

• Page-level classification by the OS:
  – Used in various contexts (Hardavellas et al., ISCA'09, for NUCA; Cuesta et al., ISCA'11, for directory reduction)
• Technique:
  – Each page (PTE/TLB entry) has a P/S bit
  – The first access to a page by a core sets it to P
  – A subsequent access by another core sets it to S, interrupting the single core that first set it to P so it can correct its state
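The state machine behind this technique is small enough to model directly. A hedged sketch (the real mechanism lives in the page table and TLBs; `PageTable` and `interrupt` are illustrative names):

```python
# Sketch of OS page-level Private/Shared classification: first toucher
# marks the page P; a second core promotes it to S (one-way) after
# interrupting the first core so it can correct its cached state.

class PageTable:
    def __init__(self):
        self.entries = {}  # page -> ("P", owner_core) or ("S", None)

    def access(self, page, core):
        entry = self.entries.get(page)
        if entry is None:
            self.entries[page] = ("P", core)       # first access: Private
        elif entry[0] == "P" and entry[1] != core:
            self.interrupt(entry[1], page)         # correct the owner
            self.entries[page] = ("S", None)       # promote to Shared

    def interrupt(self, owner, page):
        pass  # placeholder for the shootdown-like correction of the owner

pt = PageTable()
pt.access(0, core=0)
assert pt.entries[0] == ("P", 0)
pt.access(0, core=0)           # same core: stays Private
assert pt.entries[0] == ("P", 0)
pt.access(0, core=1)           # second core: one-way promotion to Shared
assert pt.entries[0] == ("S", None)
```

Note the classification only moves one way (P → S), matching the evaluation setup described later.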
Delayed Write-Throughs

• Still some write misses for shared blocks (~20% of a full WT policy)
• Delay WTs to coalesce as many writes as possible to the same cache line, for performance
• WTs are delayed in the MSHRs
• End result:
  – MESI WB traffic ≈ 6.5% of full-WT traffic
  – VIPS (Private WB + Shared delayed WT) ≈ 9% of full-WT traffic (a difference of 2.5%)
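The coalescing idea can be sketched as follows. The talk only states that WTs are delayed in the MSHRs; the `DelayedWT` model below is an assumption-laden illustration of that behavior, not the real microarchitecture:

```python
# Sketch of delayed write-throughs: writes to the same shared line
# coalesce in one MSHR entry and leave as a single write-through when
# the entry retires, cutting WT traffic versus a naive full-WT policy.

class DelayedWT:
    def __init__(self, llc):
        self.llc = llc      # LLC: line addr -> {offset: word}
        self.mshrs = {}     # line addr -> {offset: word} pending words

    def write(self, line, offset, word):
        self.mshrs.setdefault(line, {})[offset] = word  # coalesce

    def retire(self, line):
        # One write-through carries all coalesced words of the line
        self.llc.setdefault(line, {}).update(self.mshrs.pop(line))

llc = {}
wt = DelayedWT(llc)
wt.write(0x40, 0, 11)
wt.write(0x40, 1, 22)   # coalesces: still a single MSHR entry
assert len(wt.mshrs) == 1
wt.retire(0x40)         # one write-through instead of two
assert llc[0x40] == {0: 11, 1: 22}
```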
VIPS Protocol (cont.)

• Simplifies the protocol to just two states (VI)
• Write-throughs eliminate the need to track writers at the directory → area reduction
• No indirection for read misses (a performance win) → correct shared data are always at the LLC
• Supports sequential consistency
  – Same consistency model as the more complex MESI
  – But still … requires invalidations, a directory, directory blocking …
VIPS-M: Directoryless/Invalidationless

• But what if we have weak consistency?
  – Provide SC for DATA-RACE-FREE (DRF) programs [Adve & Hill]
  – Significant simplifications are possible!
• Self-Invalidation (SI) for the shared data:
  – Selectively self-invalidate shared data at synchronization points (fences, atomic instructions, LL/SC)
  – Eliminates INVALIDATION
  – No need to track readers anymore → no directory
  – BUT it is only compatible with Data-Race-Free (DRF) operation: offers SC for DRF
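Self-invalidation is purely local, which is the whole point: no messages, no reader tracking. A minimal sketch, assuming an external Private/Shared classifier (names are illustrative):

```python
# Sketch of Self-Invalidation (SI): at a synchronization point the core
# discards only its *shared* L1 lines; private lines survive. No
# invalidation traffic and no directory are involved.

class SICache:
    def __init__(self, is_shared):
        self.is_shared = is_shared  # addr -> True if Shared
        self.lines = {}             # addr -> value

    def fill(self, addr, value):
        self.lines[addr] = value

    def sync(self):
        # Selectively self-invalidate shared lines; done entirely locally
        self.lines = {a: v for a, v in self.lines.items()
                      if not self.is_shared(a)}

c = SICache(is_shared=lambda a: a >= 0x1000)
c.fill(0x10, "private")
c.fill(0x1000, "shared")
c.sync()  # e.g., at a fence, atomic instruction, or LL/SC
assert 0x10 in c.lines and 0x1000 not in c.lines
```

Under DRF, discarding shared lines at synchronization is safe: any value the core may legally read after the sync will be re-fetched fresh from the LLC.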
VIPS-M: WT Diffs

• SI works for DRF @ cache-line granularity
  – Does not work with false sharing, because there are no invalidations
• Solved by writing through DIFFs
  – Send only the words (or bytes) that are modified
  – Merge them in the LLC → Multiple Writers & Merge (–M)
• NO NEED for per-word (per-byte) dirty bits in the cache!
  – Dirty bits are only needed while a delayed WT is outstanding (V* state) → delayed WTs & dirty bits exist only while in the MSHRs
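The diff mechanism is what makes false sharing harmless: two cores writing different words of the same line each send only their own words, and the LLC merges them. A sketch (illustrative word-array model, not the real line format):

```python
# Sketch of a write-through DIFF: only the words modified since the line
# was fetched are sent; the LLC merges them, so falsely-sharing writers
# never clobber each other's words (Multiple Writers & Merge).

def make_diff(original, current):
    """Return {index: word} for the words that changed in the line."""
    return {i: w for i, (o, w) in enumerate(zip(original, current)) if o != w}

def llc_merge(llc_line, diff):
    for i, w in diff.items():
        llc_line[i] = w

fetched = [0, 0, 0, 0]               # line as fetched by both cores
llc = list(fetched)
core0 = list(fetched); core0[0] = 7  # core 0 writes word 0
core1 = list(fetched); core1[3] = 9  # core 1 writes word 3 (false sharing)
llc_merge(llc, make_diff(fetched, core0))
llc_merge(llc, make_diff(fetched, core1))
assert llc == [7, 0, 0, 9]           # both writes survive after the merge
```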
VIPS-M: Multiple-Writer-Merge (cont.)

• Eliminate invalidations/the directory, allow multiple writers → magic happens
• The shared read/write protocol is the SAME as for private data
  – The only difference is WHEN data are put back in the LLC
    • Private data → on replacement (write-BACK)
    • Shared data → on writes (write-THROUGH) / at synchronization
[Diagram: a Valid (V) line is modified by a processor write (PrW); its DIFF is written through and the line remains Valid (V).]
What about data races?

• Private & DRF data are OK. What about synchronization data accessed through atomic RMWs (Test&Set, Compare&Swap, LL/SC, etc.)?
• The problem is that readers are NOT invalidated
  – Readers never "see" a new write (unless they get flushed from the cache)
• A synchronization (intentional) data race must "see" new writes. For data accessed via atomic instructions:
  1. Do not create an L1 copy → go directly to the LLC, where the write-throughs are visible
  2. For an atomic RMW, block the LLC line until the write-through completes
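The two rules above can be sketched as a toy LLC. This is a deliberately simplified single-threaded model of the hand-shake (class and method names are not from the talk): the atomic executes at the LLC, and the line is blocked for the duration of one RMW so concurrent atomics serialize.

```python
# Sketch of the synchronization (data-race) protocol: atomics bypass the
# L1 and execute at the LLC; the LLC blocks the line during one RMW, so
# blocking is bounded by the number of simultaneously-RMWing cores.

class LLC:
    def __init__(self):
        self.data = {}
        self.blocked = set()   # lines held by an in-flight RMW

    def test_and_set(self, addr):
        if addr in self.blocked:
            return None                  # line blocked: caller retries
        self.blocked.add(addr)           # block until the WT completes
        old = self.data.get(addr, 0)
        self.data[addr] = 1              # the RMW's own write-through
        self.blocked.discard(addr)       # unblock on completion
        return old

llc = LLC()
assert llc.test_and_set(0x80) == 0   # core 0: lock acquired (saw 0)
assert llc.test_and_set(0x80) == 1   # core 1: lock already held
llc.data[0x80] = 0                   # core 0 releases (write-through of 0)
assert llc.test_and_set(0x80) == 0   # core 1 now acquires
```

Because the readers of the lock word never cache it, the release (a write-through of 0) is immediately visible at the LLC without any invalidation.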
Synchronization (data race) protocol

• Blocking in the LLC is finite: bounded by the number of cores that can RMW simultaneously
[Diagram: Core0/L1 and Core1/L1 each acquire and release a lock through the LLC.]
Evaluation (does it work?)

• Simulated a 16-tile multicore (16K–64K L1s, 8MB L2 in 512K tiles) with SIMICS/GEMS
  – NoC-based design
  – Also have bus results
• Automatic classification of data into private and shared (& read-only) through the page table/TLBs
  – Page granularity (1 word shared → whole page shared)
  – One way only (Private → Shared, RO → Written)
  – Better results with finer-grain classification
Performance Results

• VIPS-M is better than MESI (~5%)
[Figure: Normalized execution time for Hammer, Directory, Write-through, VIPS, and VIPS-M across Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Raytrace, Volrend, Water-Nsq, Water-Sp, Em3d, Tomcatv, Canneal, x264, and the average; write-through bars truncated at 2.0 (values 2.12–2.85).]
VIPS-M gets faster writes (no write misses), faster reads (no directory indirection), and less traffic in the NoC.
Power Results

• Power for the LLC & NoC
• VIPS-M is better than MESI (25%): no superfluous coherence traffic, no directory accesses, WT DIFFs
[Figure: Normalized energy consumption (LLC and Network) for 1. Hammer, 2. Directory, 3. Write-through, 4. VIPS, and 5. VIPS-M across the same benchmarks; several bars are truncated at 3.0 (values 3.3–56.2).]
VIPS Unique Properties

• VIPS-M → Self-Invalidation AND Self-Downgrade:
  – VIPS coherence is truly distributed. Coherence decisions are taken independently, without any inter-core interaction → simplifies whole-system design
• Strictly request-response from the L1s to the LLC
  – No requests from the LLC to the L1s
  – No traffic among L1s
  – Only L1 → LLC
• Implications …
Implications

• NoC simplification:
  – No need to talk to other nodes, only to the LLC
  – Simple topologies: cores "around" the LLC, fat trees with the LLC at the root, …
• Seamless scaling to multiple buses
• Simplifies all aspects of the memory hierarchy, e.g., the NoC or virtual caches
Multiple Buses

[Diagram: four cores (C), each with a private L1, sharing a single bus to the LLC.]
Multiple Buses (cont.)

[Diagram: four cores with L1s connected by four buses to four LLC banks (LL0–LL3).]
• Snooping on 4 buses? → 5 tag arrays, or → 5-ported tags …
• VIPS does not have snooping … → multiple buses for free!
• BW bottleneck → snooping bottleneck
Physical Address Coherence

[Diagram: four cores (C), each with a TLB (T) in front of its L1; virtual addresses above the TLBs, physical addresses below; the physically-addressed L1s connect through a network to the LLC/Directory (directory for MESI).]
Virtual Address Coherence

[Diagram: four cores with virtually-addressed L1s; between each L1 and the network sits a T/B (TLB plus BLT, a reverse TLB) performing VA → PA and reverse PA → VA translation on the way to the LLC/Directory (directory for MESI).]
• VIPS-M has only L1 → LLC request-response traffic
Virtual Address Coherence

ARM, Nov. 2 2012, Uppsala University/Uppsala Innovation
[Diagram: four cores with virtually-addressed L1s; a single TLB (T) sits between the network and the LLC.]
• VIPS allows a single TLB (not practical for MESI)
• NO TLB COHERENCE NEEDED!
Virtual Address Coherence

[Diagram: the same system plus an ARM core with its own L1 and TLB, attached via a system network.]
• Can freely mix VA & PA coherence
Virtual Address Coherence

[Diagram: as before, with two ARM cores (each with its own L1 and TLB) on the system network.]
Summary

• The simplest, costless protocols, exceeding MESI:
  – VIPS: simpler than MESI, a directory protocol, SC
  – VIPS-M: directory-less (broadcast-less/snoop-less), DRF
  – Support for synchronization (data races) without invalidations
  – On a NoC: eliminates directories, (coherent) write misses, and read indirection (always find the correct data in the LLC)
  – On a bus: eliminates all snooping
• Significant implications for the NoC, buses, and VM
  – Allows coherence of virtual caches (even with synonyms)
• À la carte coherence: there when you need it (multithreaded workloads); gone when you don't (multiprogrammed workloads or message passing)
Thank you!
Memory Consistency Models

Protocol                 | Non-DRF (data races) | DRF-word (non-DRF cache line) | DRF-cache-line
VIPS                     | SC, Coherent         | SC, Coherent                  | SC, Coherent
VIPS-SF                  | Incoherent           | Incoherent                    | SC for DRF-cache-line
VIPS-M                   | Incoherent           | SC for DRF-word               | SC for DRF-cache-line
Synchronization protocol | SC, Coherent         | SC, Coherent                  | SC, Coherent
Selective Flush

[Figure: Percentage of blocks classified as Invalid, Private, Shared-Read-Only, and Shared-Written across Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Raytrace, Volrend, Water-Nsq, Water-Sp, Em3d, Tomcatv, Canneal, x264, and the average; y-axis 0–100%.]
Misses

[Figure: Cache miss classification (cold/capacity/conflict, coherence, selective-flushing, write-through), normalized, for 1. Directory and 2. VIPS-M across the benchmarks; two bars are truncated at 1.6 (values 5.2 and 1.7).]
In directory-based coherence:

• WT eliminates the need to track the writer at the directory
  – The correct data are always at the LLC
• SF eliminates the need to track readers at the directory
  – No need to invalidate anyone
⇒ The directory is gone
(Data classification is handled by the page table & TLBs, so no directory is needed)
In snooping-based coherence:

• Data classification + WT eliminate snoops on reads:
  – No need to see who else is reading to determine the degree of sharing (the S state, from M/E)
  – No need to see who is reading in order to supply it the data → no need to snoop on reads
• SF eliminates invalidations → no need to snoop on writes
⇒ Snooping is gone