SCORPIO: 36-Core Shared Memory Processor
Demonstrating Snoopy Coherence on a Mesh Interconnect
Chia-Hsin Owen Chen
MTL
Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett Wilkerson, John Arends, Anantha Chandrakasan, Li-Shiuan Peh
Contributions:
• Core integration (Bhavya and Owen)
• Cache coherence protocol design (Bhavya and Woo Cheol)
• L2 cache controller implementation (Bhavya)
• Memory interface controller implementation (Owen)
• High-level idea of notification network (Woo Cheol)
• Network architecture (Woo Cheol, Bhavya, Owen, Tushar, Suvinay)
• Network implementation (Suvinay)
• DDR2 and PHY integration (Sunghyun and Owen)
• Backend of entire chip (Owen)
• FPGA interfaces, on-chip testers and scan chains (Tushar)
• RTL functional simulations (Bhavya, Owen, Suvinay)
• Full-system GEMS simulations (Woo Cheol)
• Board design (Sunghyun)
• Software stack (Bhavya and Owen)
• Package design (Freescale)
SCORPIO Overview
Owen Chen / MIT 2
[Die photo: 6×6 grid of Tiles 0–35 with Memory controllers 0 and 1; die is 11 mm × 13 mm]
• IBM 45nm SOI, 143mm², 600M transistors
• 36 cores with total 4.5MB L2
• 6×6 mesh on-chip network supporting snoopy coherence
• Dual-channel DDR2 memory controllers
Focus on how snoopy coherence is enabled on a mesh interconnect
Tile Architecture
Owen Chen / MIT 3
[Tile block diagram: Core with split L1-I / L1-D, private L2, and network interface; L1 is write-through to L2, and L2 back-invalidates L1]
Core
• Freescale e200 z760n3
• In-order
• Dual-issue
Private L1 cache
• Split 16KB for Inst / Data
• 4-way set associative
Private L2 cache
• 128KB
• 4-way set associative
• Inclusive
Snoopy Coherence
Owen Chen / MIT 4
[Diagram: Tiles 0, 1, …, 35 and two directory-cache / memory-controller nodes connected by the scalable on-chip network, exchanging REQX and DATA messages; write-through L1s and back-invalidation shown per tile]
MOSI coherence protocol
• Additional Odirty state
• Data forwarding list
• Destination filtering
Low-overhead directory entry (2 bits per cacheline)
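As a back-of-the-envelope sketch of why 2 bits per cacheline is low overhead (not from the slides: the cache-line size and the full-map baseline here are assumptions for illustration), compare against a directory that keeps one presence bit per core per line:

```python
# Rough directory storage for one 128KB L2's worth of cache lines.
# LINE_BYTES and the full-map baseline are assumed, for illustration only.
CORES = 36
LINE_BYTES = 32
L2_BYTES = 128 * 1024
lines = L2_BYTES // LINE_BYTES          # 4096 cache lines

full_map_bits = lines * CORES           # one presence bit per core, per line
scorpio_bits = lines * 2                # 2 bits per line, as on the slide

print(full_map_bits // 8, "B full-map vs", scorpio_bits // 8, "B SCORPIO-style")
```

Under these assumed parameters the full-map directory costs 18× more storage, and the gap widens with core count since the 2-bit entry is independent of it.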
Scalable On-Chip Network
Owen Chen / MIT 5
[Diagram: 6×6 mesh of network nodes 0–35, one per tile (Core, L1-I / L1-D, L2, Network)]
6×6 mesh interconnect
• 137b wide data-path
• One network node / tile
Scalable On-Chip Network
Owen Chen / MIT 6
[Router microarchitecture: 5×5 crossbar; East, South, West, North, and Local input and output ports; each input port holds a Req virtual network and a Resp virtual network; VC allocators, switch allocators, and other controls]
Deadlock avoidance
• Two virtual networks
• Dimension-ordered X-Y routing
Optimizations
• Multiple virtual channels / VN
• In-network broadcast support
• Virtual router pipeline bypass
Regular pipeline stages (3 cycles):
1. Buffer Write (BW) / Switch Arbitration Inport (SA-I)
2. Buffer Read (BR) / Switch Allocation Outport (SA-O) / VC Allocation (VA) / Lookahead & Header Generation
3. Switch Traversal (ST)
Bypass pipeline stages (1 cycle): Switch Traversal (ST) only, bypassing the intermediate stages.
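A minimal sketch of the dimension-ordered X-Y routing used for deadlock avoidance (the row-major node numbering and port names are illustrative, not the RTL): a packet first corrects its X coordinate, then its Y coordinate, so Y-to-X turns never occur and the channel-dependency graph stays acyclic.

```python
K = 6  # mesh width (6x6), nodes numbered row-major 0..35 (assumed layout)

def next_hop(cur, dst):
    """Output port taken at node `cur` toward node `dst` under X-Y routing."""
    cx, cy = cur % K, cur // K
    dx, dy = dst % K, dst // K
    if cx != dx:                         # resolve the X dimension first
        return "E" if dx > cx else "W"
    if cy != dy:                         # then the Y dimension
        return "S" if dy > cy else "N"
    return "LOCAL"                       # arrived: eject to the tile

# Example: route from node 0 to node 14 (x=2, y=2)
path, node = [], 0
while node != 14:
    port = next_hop(node, 14)
    path.append(port)
    node += {"E": 1, "W": -1, "S": K, "N": -K}[port]
# path is ["E", "E", "S", "S"]: all X hops before any Y hop
```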
Globally-Ordered Virtual Network
Owen Chen / MIT 7
Problem: broadcast messages are delivered to different nodes in different orders on unordered networks
[Diagram: Tiles 0, 1, …, 35 on an unordered interconnect, observing Req1 and Req2 in different orders]
We want: every node to see all messages in the same global order
Solution: decouple message delivery from ordering
Globally-Ordered Virtual Network
Owen Chen / MIT 8
Solution: decouple message delivery from ordering
[Diagram: two overlaid 6×6 meshes spanning nodes 0–35]
• Main network: message delivery
• Notification network: message ordering
Notification Network
Owen Chen / MIT 9
Bounded latency (≤ 12 cycles)
• Non-blocking
• 1 cycle / hop broadcast mesh
• Dedicated 1 bit / tile
Low cost
• Only DFFs + OR gates
[Notification router: inputs In_east, In_south, In_west, In_north, In_nic are bitwise-ORed into a DFF that drives Out_east, Out_south, Out_west, Out_north, Out_local]
Timeline (per time window): broadcast messages on the main network and inject the corresponding notifications; by the end of the window, all tiles receive the same notifications
All tiles determine the global order locally
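The DFF-plus-OR flow above can be sketched as a cycle-level simulation (an illustrative model, not the RTL). It shows that two notification bits raised in the same window merge as they propagate and reach every tile within the mesh diameter, consistent with the slide's ≤ 12-cycle bound (which also covers inject/eject):

```python
# Each router holds one 36-bit register (1 bit per tile) and, every cycle,
# ORs in its mesh neighbors' registers -- just DFFs and OR gates, so it
# never blocks and its latency is bounded by the mesh diameter.
K, N = 6, 36

def neighbors(i):
    x, y = i % K, i // K
    if x > 0:     yield i - 1
    if x < K - 1: yield i + 1
    if y > 0:     yield i - K
    if y < K - 1: yield i + K

# Tiles 1 and 11 raise their notification bits in the same time window.
regs = [0] * N
regs[1] = 1 << 1
regs[11] = 1 << 11

cycles = 0
while len(set(regs)) > 1:        # run until every tile holds the same vector
    nxt = regs[:]
    for i in range(N):
        for j in neighbors(i):
            nxt[i] |= regs[j]    # synchronous OR of neighbor registers
    regs = nxt
    cycles += 1
# All 36 tiles now hold the merged vector, within the diameter (10 hops).
```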
Walkthrough
Owen Chen / MIT 10
[Example on a 4×4 mesh of nodes 1–16, showing the main network and the notification network]
Timeline:
T1. Core 11 injects M1 (GETX Addr1)
T2. Core 1 injects M2 (GETS Addr2)
T3. Both cores inject notifications N1 and N2
• N1, broadcast notification for M1: bit 11 set
• N2, broadcast notification for M2: bit 1 set
Meanwhile, M1 and M2 spread on the main network; Cores 1, 2, 3, 5, 6, 9 receive M2
Walkthrough
Owen Chen / MIT 11
Timeline (continued):
T4. Notifications guaranteed to have reached all nodes: every tile holds the merged notification N1 | N2, with bits 1 and 11 set
T5. Cores 1, 2, 3, 5, 6, 9 have processed M2
Walkthrough
Owen Chen / MIT 12
Timeline (continued):
T5. Cores receive M1 and M2 in any order, but all process M2 followed by M1
T6. Core 13, owner of Addr2, responds with the data for Addr2 to Core 1 (R2)
T7. Core 6, owner of Addr1, responds with the data for Addr1 to Core 11 (R1)
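The local ordering step in the walkthrough can be sketched as follows. The ascending-source-ID tie-break is an assumption for illustration, consistent with M2 (from Core 1) being processed everywhere before M1 (from Core 11):

```python
# Every tile expands the same merged notification vector with the same
# deterministic rule, so all tiles derive the same global order without
# any central ordering point. The tie-break rule here is assumed.
merged = (1 << 1) | (1 << 11)          # notification bits from cores 1 and 11

def local_order(vector, n_tiles=36):
    """Set bits of the merged notification, in ascending source-ID order."""
    return [i for i in range(n_tiles) if vector >> i & 1]

order = local_order(merged)            # [1, 11]: process M2, then M1
# A tile buffers broadcasts that arrive out of order on the main network
# and dequeues each one only when its source reaches the head of `order`.
```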
Synchronization Primitives
Owen Chen / MIT 13
lwarx, stwcx
• Link at L2 cacheline granularity
• Detect modifications after load-link using the coherence protocol
msync
• Broadcast sync requests
• Gather acks from all cores when they complete the sync request
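A behavioral sketch of the lwarx/stwcx semantics described above, where a coherence write to the linked L2 line kills the reservation. The flat memory model and all names here are illustrative, not the chip's RTL:

```python
# Load-link / store-conditional with the link tracked at cache-line
# granularity: any store to the linked line clears other cores' links,
# so their pending store-conditionals fail and must retry.
memory = {}
links = {}                              # core -> linked line address

LINE = 32                               # assumed L2 line size in bytes
def line_of(addr): return addr // LINE

def lwarx(core, addr):
    links[core] = line_of(addr)         # establish the reservation
    return memory.get(addr, 0)

def stwcx(core, addr, value):
    if links.get(core) != line_of(addr):
        return False                    # reservation lost: caller retries
    # A successful store is itself a coherence write: it invalidates
    # every reservation on this line (including the writer's own).
    for c, l in list(links.items()):
        if l == line_of(addr):
            links.pop(c)
    memory[addr] = value
    return True

# Two cores race on an atomic increment of address 0x100:
v = lwarx(0, 0x100)                     # core 0 links the line, reads 0
_ = lwarx(1, 0x100)                     # core 1 links the same line
assert stwcx(1, 0x100, 1)               # core 1's store succeeds
assert not stwcx(0, 0x100, v + 1)       # core 0's reservation was killed
```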
Evaluation Setup
Owen Chen / MIT 14
Simulator: GEMS + GARNET
Access times: L1 – 1 cycle; L2 – 10 cycles; DRAM – 90 cycles
LPD: Limited Pointer Directory coherence
HT: AMD HyperTransport coherence
SCORPIO: Snoopy coherence (MOSI)

                       LPD            HT                  SCORPIO
What is tracked?       Few sharers    Presence of owner   Presence of owner
Who orders requests?   Directory      Directory           Network

These baselines isolate the two costs of directory coherence: storage overhead (what is tracked) and indirection latency (who orders requests).
Runtime Comparison
Owen Chen / MIT 15
[Bar chart: normalized runtime (0–1.4) for LPD, HT, and SCORPIO]
• 24% better than Limited Pointer Directory
• 13% better than HyperTransport
L2 Service Latency
Owen Chen / MIT 16
[Bar charts: L2 service latency in cycles for LPD, HT, and SCORPIO on barnes, fft, lu, blackscholes, canneal, fluidanimate, and their average, broken into Network: Req to Dir, Dir Access, Network: Dir to Sharer, Network: Bcast Req, Req Ordering, Sharer Access, and Network: Resp]
Requests served by other caches (90% of requests)
• 19% lower than LPD
• 18% lower than HT
Requests served by the directory / memory controller
• 7.5% lower than LPD
• 4.2% higher than HT
Average L2 service latency
• 17% lower than LPD
• 14% lower than HT
Network Cost
Owen Chen / MIT 17
Area
• Network occupies only 10% of the area
• Post-layout frequency: 833 MHz
Power
• Network consumes 20% of the power
Contributions
Owen Chen / MIT 18
• SCORPIO: a 36-core shared-memory processor demonstrating snoopy coherence on a mesh interconnect
  – Runtime: 24% better than LPD, 13% better than HT
  – Cost: 28.8W @ 833MHz
• Novel network-on-chip for scalable snoopy coherence
  – Distributed in-network ordering mechanism
  – Decouples message delivery from message ordering
Ongoing Work
Owen Chen / MIT 19
Software stack development
• Boot Linux
• Run PARSEC, SPLASH, etc.
Chip measurement
• Power, timing
• Performance

THANK YOU!
MTL