
Page 1

MOSAIC: The Case for a Scalable Coherence Protocol for Complex On-Chip Cache Hierarchies in Many-Core Systems

Lucía G. Menezo, Valentín Puente, José Ángel Gregorio
University of Cantabria (Spain)

Page 2: Outline

University of Cantabria · Edinburgh · PACT 2013

• Motivation
• Directory Schemas
  ◦ In-cache
  ◦ Sparse
• MOSAIC Coherence Protocol
  ◦ Examples
• Evaluation Results
• Conclusions

Page 3: Motivation

• Performance improvement: more processors per chip
• Major challenge: the off-chip bandwidth wall
  ◦ Introduce more cache into the chip
  ◦ Result: complex on-chip cache hierarchies
• The coherence protocol has a fundamental role to play

Page 4: Motivation

• Which coherence protocol should be used with a large number of cores?
  ◦ Broadcast-based protocols: high energy requirements
  ◦ Directory-based protocols: more storage needed for sharing information
• MOSAIC: a new coherence protocol
  ◦ Directory without inclusiveness
  ◦ Token Coherence to guarantee correctness

Page 6: Directory schemas: In-cache

• Each block in the LLC includes its tag, data, and sharer information
• The LLC receives the requests, so it needs precise sharing knowledge
• Inclusiveness is necessary: any block in the private levels must also be allocated in the LLC
• Advantage: a less complex coherence protocol
• Disadvantage: every LLC block carries the storage overhead
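A back-of-the-envelope sketch of that overhead, assuming a full-map sharer vector with one presence bit per core (the slides do not state the sharer encoding). The sizes come from Config 1 of the evaluation (16MB LLC, 64B blocks, 8 cores); the function names are illustrative.

```python
MiB = 1 << 20


def in_cache_overhead_bytes(llc_bytes: int, block_bytes: int, cores: int) -> int:
    """Sharer storage when every LLC block carries one presence bit per core.

    Assumption: full-map bit vector; real designs may use coarser encodings.
    """
    llc_blocks = llc_bytes // block_bytes
    return llc_blocks * cores // 8  # bits -> bytes


def overhead_fraction(block_bytes: int, cores: int) -> float:
    """Sharer bits relative to the data bits of a single block."""
    return cores / (block_bytes * 8)


# Config 1: 16MB shared LLC, 64B blocks, 8 cores
print(in_cache_overhead_bytes(16 * MiB, 64, 8) // 1024, "KiB")  # 256 KiB
print(f"{overhead_fraction(64, 8):.2%}")                        # 1.56%
```

Because the cost scales with the number of LLC blocks, whether or not a block is cached privately, moving to Config 2 (32MB LLC, 16 cores) quadruples it to 1 MiB under the same assumption.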

Page 7: Directory schemas: In-cache

[Figure: processors and private caches (P, @ data) connected through the interconnection network to the LLC with an in-cache directory; every LLC entry holds @, data and sharers. Annotation: "Overhead!!!"]

Page 8: Directory schemas: In-cache

[Figure: the same LLC + in-cache directory diagram, highlighting the sharers field (@, data, sharers) of each LLC entry as overhead. Annotation: "Overhead!!!"]

Page 9: Directory schemas: Sparse

• Directory entries are separated from the data and allocated on demand
• The overhead is proportional to the aggregate size of the private levels (not the LLC)
• Capacity and associativity have to be sufficient to keep the private-level cache tags
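Sizing a sparse directory depends only on how many blocks the private levels can hold. A minimal sketch, assuming one entry per private block (1× coverage) and the Config 1 private hierarchy (32KB L1I + 32KB L1D + 64KB L2, 64B blocks); the 8-byte entry size is borrowed from the 128KB / 16K-entry directory evaluated later in the talk, and the function names are illustrative.

```python
def private_blocks_per_core(l1i_kb: int, l1d_kb: int, l2_kb: int,
                            block_bytes: int = 64) -> int:
    # L2 is exclusive with L1, so the capacities add up without double counting
    return (l1i_kb + l1d_kb + l2_kb) * 1024 // block_bytes


def sparse_dir_bytes(cores: int, blocks_per_core: int, entry_bytes: int = 8) -> int:
    # Assumption: one directory entry per block that can be held privately
    return cores * blocks_per_core * entry_bytes


per_core = private_blocks_per_core(32, 32, 64)
print(per_core, 8 * per_core)                        # 2048 16384
print(sparse_dir_bytes(8, per_core) // 1024, "KiB")  # 128 KiB
```

The LLC size never enters the calculation: growing the LLC from 16MB to 32MB leaves this directory unchanged, while adding cores grows it linearly.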

Page 10: Directory schemas: Sparse

[Figure: processors and private caches (P, @ data) connected through the interconnection network to the LLC (@, data) and to a separate sparse directory (@, sharers).]

Page 11: Directory schemas: Sparse

• Duplicate-tag directory: holds all the tags of the private levels
  ◦ Associativity = # cores × private cache associativity
  ◦ # sets = # private cache sets
• Example: 16 cores with 4-way 32KB L1 caches require a 64-way directory

[Figure: duplicate-tag directory drawn as a grid of tags.]
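The duplicate-tag geometry follows directly from the two formulas on the slide. A minimal sketch that reproduces the 16-core example, assuming 64B blocks as in the evaluation setup; `duplicate_tag_geometry` is an illustrative name.

```python
def duplicate_tag_geometry(cores: int, cache_bytes: int, ways: int,
                           block_bytes: int = 64) -> tuple[int, int]:
    """Return (sets, associativity) of a duplicate-tag directory for one private cache level."""
    private_sets = cache_bytes // (ways * block_bytes)
    # same number of sets as each private cache; all cores' ways side by side
    return private_sets, cores * ways


sets, assoc = duplicate_tag_geometry(cores=16, cache_bytes=32 * 1024, ways=4)
print(sets, assoc)  # 128 64
```

Every lookup has to compare against `assoc` tags, so it is the associativity, rather than the raw storage, that keeps this organization from scaling with the core count.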

Page 12: Directory schemas: Sparse

• Decrease associativity: now much lower than # cores × private cache associativity
• Increase the number of sets
• One tag may be in various private caches, so each entry keeps a sharers field
• More than one tag competes for each entry, causing conflicts
• Inclusiveness is needed: evicting an entry must invalidate the private copies (recall messages)

[Figure: sparse directory with fewer ways and more sets; each entry holds a tag and its sharers.]
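The conflict problem can be reproduced with a toy model. A minimal sketch, not the paper's simulator: a set-associative sparse directory with LRU replacement that counts recall (invalidation) messages when inclusion must be enforced, versus dropping the victim silently as MOSAIC permits. The class name, set count, ways and access pattern are arbitrary illustrative choices.

```python
from collections import OrderedDict


class SparseDirectory:
    """Toy set-associative sparse directory with LRU replacement (illustrative only).

    inclusive=True: evicting an entry recalls the block from every sharer.
    inclusive=False: the entry is dropped silently, as MOSAIC allows.
    """

    def __init__(self, sets: int, ways: int, inclusive: bool):
        self.sets = [OrderedDict() for _ in range(sets)]  # block -> set of sharer ids
        self.ways = ways
        self.inclusive = inclusive
        self.recall_messages = 0

    def record_sharer(self, block: int, core: int) -> None:
        entries = self.sets[block % len(self.sets)]
        if block in entries:
            entries.move_to_end(block)  # mark as most recently used
        else:
            if len(entries) == self.ways:
                _victim, sharers = entries.popitem(last=False)  # evict LRU entry
                if self.inclusive:
                    self.recall_messages += len(sharers)  # one invalidation per sharer
            entries[block] = set()
        entries[block].add(core)


# Blocks 0, 4 and 8 all map to set 0 of a 4-set, 2-way directory; 4 cores share each.
for inclusive in (True, False):
    d = SparseDirectory(sets=4, ways=2, inclusive=inclusive)
    for block in (0, 4, 8):
        for core in range(4):
            d.record_sharer(block, core)
    print("inclusive" if inclusive else "silent", d.recall_messages)  # 4, then 0
```

With three blocks competing for two ways, the inclusive directory has to invalidate all four private copies of the victim; in the silent variant those copies stay valid, and MOSAIC recovers their sharing information later through broadcast reconstruction.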

Page 14: MOSAIC Protocol

• Works with either an in-cache or a sparse directory
• No inclusiveness: no invalidations of data in private caches
• Sharing information is reconstructed on demand
• Uses token counting to avoid extra traffic and guarantee correctness
• Token Coherence protocol:
  ◦ Initially, each block has # tokens (= # processors)
  ◦ Read request: data and 1 token
  ◦ Write request: data and all tokens
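The token rules can be captured in a few lines. A minimal sketch, assuming (as in the original Token Coherence proposal) that the block's home, here taken to be the LLC, starts with all tokens; the owner token and data-validity details of the full protocol are omitted, and the class and method names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TokenState:
    """Token counts for one block (illustrative model of Token Coherence rules)."""
    total: int                                # T = number of processors
    held: dict = field(default_factory=dict)  # holder name -> tokens held

    def check_conservation(self) -> None:
        # tokens are never created or destroyed, only moved
        assert sum(self.held.values()) == self.total

    def can_read(self, holder: str) -> bool:
        return self.held.get(holder, 0) >= 1  # at least one token

    def can_write(self, holder: str) -> bool:
        return self.held.get(holder, 0) == self.total  # all tokens

    def transfer(self, src: str, dst: str, n: int) -> None:
        if self.held.get(src, 0) < n:
            raise ValueError(f"{src} holds fewer than {n} tokens")
        self.held[src] = self.held.get(src, 0) - n
        self.held[dst] = self.held.get(dst, 0) + n


s = TokenState(total=4, held={"LLC": 4})  # assumption: home starts with all tokens
s.transfer("LLC", "P0", 1)                # read by P0: data + 1 token
s.transfer("LLC", "P1", 1)                # read by P1: data + 1 token
print(s.can_read("P0"), s.can_write("P0"))  # True False
for holder in ("LLC", "P1"):                # write by P0: gather every remaining token
    s.transfer(holder, "P0", s.held[holder])
print(s.can_write("P0"), s.can_read("P1"))  # True False
s.check_conservation()
```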

Page 15: MOSAIC Conceptual Approach

[Figure: private caches P0, P1 and P2 connected through the on-chip network to the last-level cache (data slice and directory slice) and the memory controller, with numbered steps 1-5. Each entry records its state, number of tokens and data:]

Cache              State   Num. tokens   Data
P0                 I       0             N/A
P1                 O       2             DATA
P2                 S       1             DATA
Last Level Cache   I       0             N/A

Page 16: MOSAIC Key Facts

• When data is not present in the LLC, a broadcast is issued for reconstruction
• Private caches report the number of tokens they hold
• Token counting avoids negative acknowledgements and timeouts
• The reconstruction message piggybacks the request type and the requestor
• Key: the directory may replace entries silently, with no invalidations
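Why token counting removes the need for negative acknowledgements can be sketched as follows. This is one plausible reading of the slide, not the paper's implementation: the directory knows the total number of tokens, so it can stop waiting as soon as the reported counts add up to that total, and caches holding no tokens never have to answer. The usage line takes its counts from the conceptual-approach example (P1 as owner with 2 tokens, P2 as sharer with 1), assuming 3 tokens in total; `reconstruct` is a hypothetical helper.

```python
def reconstruct(total_tokens: int, llc_tokens: int, replies):
    """Rebuild a directory entry from token-count replies to a broadcast.

    replies: (cache_id, tokens, is_owner) tuples in arrival order. Assumption:
    only caches that hold tokens reply, so no negative acknowledgements exist.
    Returns (sharers, owner) once every token has been accounted for.
    """
    counted = llc_tokens
    sharers, owner = [], None
    for cache_id, tokens, is_owner in replies:
        if counted == total_tokens:
            break  # all tokens located: nothing else can be outstanding
        counted += tokens
        sharers.append(cache_id)
        if is_owner:
            owner = cache_id
    if counted != total_tokens:
        raise RuntimeError(f"only {counted} of {total_tokens} tokens reported")
    return sharers, owner


print(reconstruct(3, 0, [("P2", 1, False), ("P1", 2, True)]))  # (['P2', 'P1'], 'P1')
```

The intermediate state after the first reply (sharers [P2], owner unknown) and the final state ([P2, P1], owner P1) mirror the directory progression shown in the read-request example.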

Page 17: MOSAIC Read Request

[Figure: message sequence among P0, P1, P2, P3, the directory and the LLC. Labels include the states Invalid, IS, S, O, C and A; the messages Read, Reconstruction, Info 1 token, Info 2 tokens (Owner), Forward GETS to Owner, Data + token and Unblock (info 1 token); and the token counts 3 tokens and 1 token. The directory entry evolves from Sharers [P2], Owner: unknown, to Sharers [P2, P1], Owner: P1, then Sharers [P2, P1, P0], Owner: P1, and finally Sharers [P2, P1, P0, P3], Owner: P1.]

Page 18: MOSAIC Write Request

[Figure: message sequence among P0, P1, P2, P3, the directory and the LLC. Labels include the states Invalid, IS, IM, M, S, O, C and A; the messages Write, Reconstruction, Data + 3 tokens, 1 token and Unblock (info all tokens); the token counts 3 tokens and 1 token; a directory eviction; and the final directory entry Sharers [P0], Owner: P0.]
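A write needs every token, so all the holders found by reconstruction must surrender theirs. A minimal sketch of that bookkeeping with hypothetical token counts; `handle_write` is an illustrative helper, and message ordering, data forwarding and the intermediate states shown on the slide are not modelled.

```python
def handle_write(requester: str, held: dict, total: int) -> dict:
    """Give the writer every token of a block and return the new directory entry.

    held maps holder name -> tokens (e.g. as located by reconstruction); it is
    updated in place. Hypothetical helper: ordering and data transfer are ignored.
    """
    collected = held.get(requester, 0)
    for holder in list(held):
        if holder != requester and held[holder] > 0:
            collected += held[holder]  # holder sends its tokens to the writer
            held[holder] = 0           # without tokens, the holder's copy is invalid
    held[requester] = collected
    if collected != total:
        raise RuntimeError("write cannot complete without all tokens")
    # the writer unblocks reporting all tokens: it is now sole sharer and owner
    return {"sharers": [requester], "owner": requester}


held = {"P0": 0, "P1": 3, "P2": 1}  # hypothetical counts, 4 tokens in total
print(handle_write("P0", held, 4))   # {'sharers': ['P0'], 'owner': 'P0'}
print(held)                          # {'P0': 4, 'P1': 0, 'P2': 0}
```

Once the entry holds only P0, it can later be evicted without invalidating anything, because the writer's token count, not the directory entry, is what grants write permission.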

Page 20: Evaluation methodology

                                  Config 1         Config 2
Number of cores                   8 @ 3GHz         16 @ 3GHz
Shared L3 size / associativity    16MB, 16-way     32MB, 16-way
Topology                          4×4 mesh         6×6 mesh

Common to both configurations:
• IWin size / issue width: 128, 4-way
• Block size: 64B
• Private L1 size / associativity: 32KB I/D, 2-way
• Private L2 size / associativity: 64KB, 4-way (exclusive with L1)
• NUCA mapping: static, interleaved across slices
• Memory capacity: 4GB
• Max. outstanding memory operations: 16

[Figure: floorplans. Config 1: 4×4 mesh of routers (R) with Cores 0-7 and LLC Slices 0-15. Config 2: 6×6 mesh with Cores 0-15 and LLC Slices 0-31.]
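The table only says that the NUCA mapping is static and interleaved across slices. A minimal sketch, assuming block-granularity interleaving on the low-order bits of the block address (the slides do not state the interleaving unit); `home_slice` and `slice_capacity_mib` are illustrative names.

```python
def home_slice(address: int, num_slices: int, block_bytes: int = 64) -> int:
    """Static NUCA mapping (assumed): consecutive blocks go to consecutive slices."""
    return (address // block_bytes) % num_slices


def slice_capacity_mib(llc_mib: int, num_slices: int) -> float:
    return llc_mib / num_slices


# Config 1: 16MB over Slices 0-15; Config 2: 32MB over Slices 0-31
print(slice_capacity_mib(16, 16), slice_capacity_mib(32, 32))  # 1.0 1.0
print([home_slice(a, 16) for a in range(0, 5 * 64, 64)])       # [0, 1, 2, 3, 4]
print(home_slice(0x12345, 16))                                 # 13
```

Under this assumption both configurations keep 1MB per slice, and each block has a single, fixed home slice, which is where its directory information would live.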

Page 21: Simulation stack and workloads

• GEMS: full-system evaluation
  ◦ SLICC: Specification Language for Implementing Cache Coherence
• Multithreaded workloads
  ◦ 4 Wisconsin Commercial Workloads (Apache, Jbb, OLTP, Zeus)
  ◦ 3 NAS Parallel Benchmarks (FT, IS, LU)
• Multiprogrammed workloads
  ◦ 3 SPEC 2006 benchmarks in rate mode (Astar, Hmmer, Omnetpp)

Page 22: MOSAIC Performance: Reducing associativity

[Chart: normalized execution time (y-axis 0.5-1.1) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and the geometric mean, with 64w128KB, 32w128KB, 2w128KB and 1w128KB sparse directories. Directory size: 128KB, 16K entries (8 bytes per entry).]
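With the 16K entries held fixed, lowering associativity simply trades ways for sets. A minimal sketch of that arithmetic for the four configurations on this slide; `sparse_dir_sets` is an illustrative name.

```python
def sparse_dir_sets(entries: int, ways: int) -> int:
    if entries % ways:
        raise ValueError("entries must be a multiple of the associativity")
    return entries // ways


ENTRIES = 128 * 1024 // 8  # 128KB directory, 8 bytes per entry -> 16K entries
for ways in (64, 32, 2, 1):
    # tags compared per lookup = ways; sets grow as ways shrink
    print(f"{ways}-way: {sparse_dir_sets(ENTRIES, ways)} sets")
```

The 1-way case is a direct-mapped directory with 16384 sets and a single tag comparison per lookup, the cheapest organization to build.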

Page 23: Number of misses

[Chart: normalized number of misses (y-axis 0-2), split into L2, L1I and L1D misses, for BASE and MOSAIC on Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus, with bars for 64-, 32-, 2- and 1-way directories. Annotation: "x2".]

Page 24: MOSAIC Performance: Reducing associativity and capacity

[Chart: normalized execution time (y-axis 0.4-1.1) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and the geometric mean, with 64w16KB, 32w16KB, 2w16KB and 1w16KB sparse directories. Directory reduced from 128KB (16K entries, 8 bytes per entry) to 16KB (2K entries).]
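Relating directory entries to the blocks the private levels can hold shows how aggressive this reduction is. A minimal sketch, assuming 8-byte entries and 128KB of private cache per core (32KB L1I + 32KB L1D + 64KB exclusive L2, from the evaluation table) for both the 8-core and 16-core configurations; `coverage` is an illustrative name.

```python
KB = 1024


def coverage(dir_bytes: int, cores: int, private_kb_per_core: int = 128,
             entry_bytes: int = 8, block_bytes: int = 64) -> float:
    """Directory entries divided by the number of blocks the private levels can hold."""
    entries = dir_bytes // entry_bytes
    private_blocks = cores * private_kb_per_core * KB // block_bytes
    return entries / private_blocks


print(coverage(128 * KB, 8))   # 1.0    (16K entries, 8 cores)
print(coverage(16 * KB, 8))    # 0.125  (2K entries, 8 cores)
print(coverage(256 * KB, 16))  # 1.0    (largest 16-core directory on the scalability slide)
print(coverage(32 * KB, 16))   # 0.125  (smallest one)
```

Under these assumptions the 16KB directory can track only one in eight of the blocks that may sit in the private caches; a conventional inclusive directory would have to recall the rest.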

Page 25: MOSAIC Latency

[Chart: latency in processor cycles (y-axis 0-12), split into L3, other L2, other L1, private L2 and local L1 components, for BASE and MOSAIC on Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus, with bars for 64-, 32-, 2- and 1-way directories. Directory size: 16KB, 2K entries.]

Page 26: MOSAIC Link Utilization

[Chart: average network link utilization (y-axis 0-1.4) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and the geometric mean, with 64w128KB, 64w64KB, 64w32KB, 64w8KB, 2w128KB, 2w64KB and 2w16KB sparse directories.]

Page 27: MOSAIC Link Utilization vs. Dir

[Chart: normalized network link utilization (y-axis 0-1.4) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and the geometric mean, with 2w128KB, 2w64KB and 2w16KB sparse directories. Annotation: "40%!!"]

Page 28: MOSAIC Scalability

[Chart: normalized link utilization (y-axis 0-2) in the 16-core configuration for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and the geometric mean, with 128w256KB, 128w128KB, 128w64KB, 128w32KB, 2w256KB, 2w128KB, 2w64KB and 2w32KB sparse directories.]

Page 29: Conclusions

• MOSAIC Coherence Protocol: the bandwidth scalability of a directory combined with the elegance of Token Coherence
• Low complexity and great scalability
• Very low storage overhead
• No noticeable energy cost
• An alternative for future many-core cache-coherent CMPs

Page 30

Thank you for your attention

Page 32: Realistic Cache Configuration

• Private caches: L1 4-way 32KB; L2 8-way 256KB×2
• Annotations mark a full directory and 1/10 of a full directory
• The same experiment with BASE shows a 20% impact in some cases

[Chart: normalized execution time (y-axis 0-1.2) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and the geometric mean, with 16w512KB, 16w256KB, 16w128KB, 16w64KB and 16w32KB sparse directories.]

Page 33: MOSAIC Energy

[Chart: normalized dynamic energy (y-axis 0-1.8), split into network, sparse directory, L3, L2 and L1 components, for BASE and MOSAIC on Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus, with bars labelled 128, 64 and 16 for each.]