scorpio: 36-core shared memory processor - hot · pdf filescorpio: 36-core shared memory...

20
SCORPIO: 36-Core Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett Wilkerson, John Arends, Anantha Chandrakasan, Li-Shiuan Peh Chia-Hsin Owen Chen MTL Contributions: Core integration (Bhavya and Owen), Cache coherence protocol design (Bhavya and Woo Cheol) L2 cache controller implementation (Bhavya) Memory interface controller implementation (Owen) High-level idea of notification network (Woo-Cheol) Network architecture (Woo-Cheol, Bhavya, Owen, Tushar, Suvinay) Network implementation (Suvinay) DDR2 and PHY integration (Sunghyun and Owen) Backend of entire chip (Owen) FPGA interfaces, on-chip testers and scan chains (Tushar) RTL functional simulations (Bhavya, Owen, Suvinay) Full-system GEMS simulations (Woo-Cheol) Board Design (Sunghyun) Software Stack (Bhavya and Owen) Package Design (Freescale)

Upload: phambao

Post on 06-Mar-2018

243 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

SCORPIO: 36-Core Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect

Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett Wilkerson, John Arends, Anantha Chandrakasan, Li-Shiuan Peh

Chia-Hsin Owen Chen

MTL

Contributions: Core integration (Bhavya and Owen), Cache coherence protocol design (Bhavya and Woo Cheol) L2 cache controller implementation (Bhavya) Memory interface controller implementation (Owen) High-level idea of notification network (Woo-Cheol) Network architecture (Woo-Cheol, Bhavya, Owen, Tushar, Suvinay) Network implementation (Suvinay)

DDR2 and PHY integration (Sunghyun and Owen) Backend of entire chip (Owen) FPGA interfaces, on-chip testers and scan chains (Tushar) RTL functional simulations (Bhavya, Owen, Suvinay) Full-system GEMS simulations (Woo-Cheol) Board Design (Sunghyun) Software Stack (Bhavya and Owen) Package Design (Freescale)

Page 2: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

SCORPIO Overview

Owen Chen / MIT 2

Memory controller 1

Memory controller 0

Tile7

Tile8

Tile9

Tile6

Tile 10

Tile 11

Tile 13

Tile 14

Tile 15

Tile 12

Tile 16

Tile 17

Tile 19

Tile 20

Tile 21

Tile 18

Tile 22

Tile 23

Tile 25

Tile 26

Tile 27

Tile 24

Tile 28

Tile 29

Tile 31

Tile 32

Tile 33

Tile 30

Tile 34

Tile 35

Tile0

Tile1

Tile3

Tile4

Tile5

Tile2

11 mm

13

mm

Dual channel DDR2 memory controller

IBM 45nm SOI, 143mm2

600M transistors

36 cores with total 4.5MB L2

6×6 mesh on-chip network supporting snoopy coherence

Focus on how snoopy coherence is enabled on a mesh interconnect

Page 3: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Tile7

Tile8

Tile9

Tile6

Tile 10

Tile 11

Tile 13

Tile 14

Tile 15

Tile 12

Tile 16

Tile 17

Tile 19

Tile 20

Tile 21

Tile 18

Tile 22

Tile 23

Tile 25

Tile 26

Tile 27

Tile 24

Tile 28

Tile 29

Tile 31

Tile 32

Tile 33

Tile 30

Tile 34

Tile 35

Tile0

Tile1

Tile3

Tile4

Tile5

Tile2

Core

L1-I L1-D

L2

Network

Tile Architecture

Owen Chen / MIT 3

Core • Freescale e200 z760n3 • In-order • Dual-issue

Private L1 cache • Split 16KB for Inst / Data • 4-way set associative

Private L2 cache • 128KB • 4-way set associative • Inclusive

Write-through

Back- invalidate

Page 4: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

Tile 0 Tile 1 Tile 35

REQX DATA

Back- invalidate

Scalable On-Chip Network

Write-through

Low overhead directory entry(2 bits per cacheline)

Directory Cache

Memory Controller

Network

Directory Cache

Memory Controller

Network

Snoopy Coherence

Owen Chen / MIT 4

MOSI coherence protocol • Additional Odirty state • Data forwarding list • Destination filtering

Page 5: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Scalable On-Chip Network

Owen Chen / MIT 5

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

21 43

6

5

87 109

12

11

1413 1615

18

17

2019 2221

24

23

2625 2827

30

29

3231 3433 35

0

6×6 mesh interconnect • 137b wide data-path • One network node / tile

Page 6: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

6×6 mesh interconnect • 137b wide data-path • One network node / tile

Scalable On-Chip Network

Owen Chen / MIT 6

5×5 Crossbar

East Input Port

Req Virtual Network

Resp Virtual Network

VC Allocators

Switch Allocators

East

Local

East

South

West

North

Other Controls

Router

Deadlock avoidance • Two virtual networks • Dimensional X-Y routing

Optimizations • Multiple virtual channels / VN • In-network broadcast support • Virtual router pipeline bypass Buffer Write (BW)

Switch Arbitration Inport (SA-I)

Buffer Read (BR)Switch Allocation Outport (SA-O)

VC Allocation (VA)Lookahead/Header Generation

Switch Traversal(ST)

RegularPipeline Stages

Bypass Intermediate PipelinesBypass Pipeline Stages

Switch Traversal(ST)

East

Local

South

West

North

East

North

3 cycle 1 cycle

Page 7: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Globally-Ordered Virtual Network

Owen Chen / MIT 7

Problem: Broadcast Messages delivered to different nodes in different orders on unordered networks We want: Every node to see all messages in the same global order

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

Unordered Interconnect

Tile 0 Tile 1 Tile 35

Req1

Req2

Req1

Req2 Req1

Req2

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

Core

L1-I L1-D

L2

Network

Unordered Interconnect

Tile 0 Tile 1 Tile 35

Req1

Req2

Req1

Req2

Req1

Req2

Solution: Decouple message delivery from ordering

Page 8: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Globally-Ordered Virtual Network

Owen Chen / MIT 8

Problem: Broadcast Messages delivered to different nodes in different orders on unordered networks We want: Every node to see all messages in the same global order

Solution: Decouple message delivery from ordering

21 43

6

5

87 109

12

11

1413 1615

18

17

2019 2221

24

23

2625 2827

30

29

3231 3433 35

0

Main network • Message delivery

Notification network • Message ordering

Page 9: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Notification Network

Owen Chen / MIT 9

Bounded latency ( ≤ 12 cycle ) • Non-blocking • 1 cycle / hop broadcast mesh • Dedicated 1 bit / tile

Low cost • Only DFF + ORs

All tiles determine the global order locally

TimelineTime Window

Inject corresponding notifications

All tiles receive the same notifications

Broadcast messages on main network

Notification

Bitwise-OR

DFF

Outeast

Ineast Insouth Inwest Innorth Innic

Outsouth

Outwest

Outnorth

Notification Router

Outlocal

Page 10: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Walkthrough

Owen Chen / MIT 10

21 43

65 87

109 1211

1413 1615Main network

Notification network

M1T1

GETX Addr1

M2T2

GETS Addr2

Timeline

M2

M2

M2 M2

M2

M2

M1

M1 M1 M1

M1

Core 1, 2, 3, 5, 6, 9 receive M2

T1. Core 11 injects M1

T2. Core 1 injects M2

T3. Both cores injectnotification N1 N2

N1Broadcast

notification for M1

0 0 0 ••• •••1 0 0

1 2 3 ••• ••• 11 15 16

N2

N1

T3

T3Broadcast

notification for M2

1 0 0 ••• •••0 0 0

1 2 3 ••• ••• 11 15 16

N2

Page 11: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Walkthrough

Owen Chen / MIT 11

21 43

65 87

109 1211

1413 1615

Core

Timeline

Core 1, 2, 3, 5, 6, 9 receive M2

T3. Both cores injectnotification N1 N2

T4. Notifications guaranteed to reach all nodes now

N1 N2

M2

M2

M2 M2

M2

M2

M1

M1 M1 M1

M1

T5. Core 1, 2, 3, 5, 6, 9 processed M2

Merged Notification

1 0 0 ••• •••1 0 0

1 2 3 ••• ••• 11 15 16

N1 N2T4

T5

Page 12: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Walkthrough

Owen Chen / MIT 12

Timeline

T5. Core 1, 2, 3, 5, 6, 9 processed M2

21 43

65 87

109 1211

1413 1615M2

M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

M2M1

Cores receive in any order, and process followed by

M2M1

M2 M1

R2

R1

Data Addr2

Data Addr1

T6

T7

T6. Core 13, owner of Addr2,responds with data to Core 1 R2

T7. Core 6, owner of Addr1,responds with data to Core 11 R1

Page 13: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Synchronization Primitives

Owen Chen / MIT 13

lwarx, stwcx • Link in L2 cacheline granularity • Detect modifications after load-link using coherence protocol

msync • Broadcast sync requests • Gather acks from all cores when they complete the sync request

Page 14: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Evaluation Setup

Owen Chen / MIT 14

Simulator GEMS + GARNET

Access times L1 – 1 cycle; L2 – 10 cycles; DRAM 90 cycles

LPD Limited Pointer Directory Coherence

HT AMD HyperTransport Coherence

SCORPIO Snoopy Coherence: MOSI

LPD HT SCORPIO

What is tracked? Few sharers Presence of

owner Presence of

owner

Who orders requests?

Directory Directory Network

Storage overhead

Indirection latency

Isolate

Page 15: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Runtime Comparison

Owen Chen / MIT 15

24% better than Limited Pointer Directory

13% better than Hyper-Transport

0

0.2

0.4

0.6

0.8

1

1.2

1.4

No

rmal

ize

d R

un

tim

e

LPD HT SCORPIO

Page 16: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

7.5% lower than LPD

4.2% higher than HT

L2 Service Latency

Owen Chen / MIT 16

Requests served by other caches

Requests served by directory -- MC

19% lower than LPD

18% lower than HT

90% requests served by other

caches

0

20

40

60

80

100

120

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

barnes fft lu blackscholes canneal fluidanimate average

Cyc

les

Network: Req to Dir Dir Access Network: Dir to Sharer Network: Bcast Req Req Ordering Sharer Access Network: Resp

0

50

100

150

200

250

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

LPD HT

SCO

RP

IO

barnes fft lu blackscholes canneal fluidanimate average

Cyc

les

Network: Req to Dir Network: Bcast Req Dir Access Req Ordering Network: Resp

Average L2 service latency

17% lower than LPD

14% lower than HT

Page 17: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Network Cost

Owen Chen / MIT 17

Power

Network consumes 20% of the power

Area

Network occupies only 10% of the area

Post-layout frequency: 833 MHz

Page 18: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

• SCORPIO: A 36-core shared-memory processor Snoopy coherency on a mesh interconnect:

– Runtime: 24% better than LPD, 13% better than HT

– Cost: 28.8W @ 833MHz

• Novel network-on-chip for scalable snoopy coherence New ideas:

– Distributed in-network ordering mechanism

– Decouple message delivery from message ordering

Contributions

Owen Chen / MIT 18

Page 19: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

Ongoing Work

Owen Chen / MIT 19

Software stack development • Boot Linux • Run PARSEC, SPLASH, …, etc

Chip measurement • Power, timing • Performance

Page 20: SCORPIO: 36-Core Shared Memory Processor - Hot  · PDF fileSCORPIO: 36-Core Shared Memory Processor ... on a Mesh Interconnect Collaborators: Sunghyun Park, ... 3 cycle 1 cycle

THANK YOU! MTL