3d winoc architecturesmatutani/papers/matsuta... · 2014. 9. 19. · – each tile has router and...

Post on 03-Mar-2021

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D

3D WiNoC Architectures

Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14 1

Hiroki Matsutani Keio University, Japan

Outline: Wireless 3D NoC architectures

• Wireless 3D NoCs (3D WiNoCs) [8min]

– Wireless 3D IC technology – 3D WiNoC design example (65nm)

• Routing & topology exploration [8min]

– Spanning tree optimization for irregular WiNoCs – Power management via vertical ON/OFF links

• Adding random NoC chip for 3D WiNoCs [5min]

– Adding randomness induces small-world effects – Adding random NoC chip to NoC-less 3D ICs

• Summary [1min] 2 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

[Matsutani,ASPDAC’13]

[Matsutani,DATE’14]

[Miura,IEEE Micro 13] [Matsutani,NOCS’11]

Design cost of LSI is increasing

3

By changing the chips in a package, we can provide a wider range of chip family with modest design cost

• System-on-Chip (SoC) – Required components are integrated on a single chip – Different LSI must be developed for each application

• System-in-Package (SiP) or 3D IC – Required components are stacked for each application

3D IC technology for going vertical

4

Two

chip

s (f

ace-

to-f

ace)

Microbump

Through silicon via

Capacitive coupling

Inductive coupling

Wired Wireless M

ore

than

th

ree

chip

s

Scalability

Flexibility

Wireless 3D NoC (3D WiNoC) Three chips are stacked wirelessly

Each router has a wireless transceiver

Each router has TX/RX coils (inductors)

Coils are implemented with metal layers

5

Chip#2

Chip#1

Chip#0

TX

RX

Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

An example: Cube-1 (2012)

• Test chips for building-block 3D systems – Two chip types: Host CPU chip & Accelerator chip – We can customize number & types of chips in SiP

• Cube-1 Host CPU chip – Two 3D wireless routers – MIPS-like CPU

• Cube-1 Accelerator chip – Two 3D wireless routers – Processing element array

6

[Miura, IEEE Micro 13]

MIP CPU Core

8x8 PE Array

Inductors

Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

An example: Cube-1 (2012)

7

• Microphotographs of test chips

Host CPU Chip

Accelerator Chip

Host CPU + 3 Accelerators

[Miura, IEEE Micro 13]

We can change the number and types of chips in a package according to application requirements

An example: Cube-0 (2010)

• Test chip for vertical communication schemes – Vertical point-to-point link between adjacent chips – Vertical shared bus (broadcast)

8

2.1mm x 2.1mm

Core 0 & 1

Inductors (bus)

Inductors (P2P)

Router 0 & 1

TX

Stacking for Ring network

RX

TX/RX

Stacking for Vertical bus

[Matsutani, NOCS’11]

Either vertical P2P links or broadcast bus can be formed for 3D WiNoC

Outline: Wireless 3D NoC architectures

9 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

• Wireless 3D NoCs (3D WiNoCs) [8min]

– Wireless 3D IC technology – 3D WiNoC design example (65nm)

• Routing & topology exploration [8min]

– Spanning tree optimization for irregular WiNoCs – Power management via vertical ON/OFF links

• Adding random NoC chip for 3D WiNoCs [5min]

– Adding randomness induces small-world effects – Adding random NoC chip to NoC-less 3D ICs

• Summary [1min]

[Matsutani,ASPDAC’13]

[Matsutani,DATE’14]

[Miura,IEEE Micro 13] [Matsutani,NOCS’11]

Big picture: Wireless 3D NoC

• Arbitrary chips are stacked after fabrication – Each chip has vertical links at pre-specified locations, but

we do not know internal topology of each chip – Some chips may not have horizontal NoC (vertical link only)

10

Memory chip from C

CPU chip from A

Application chip from B

Required chips are stacked for each application

An example (4 chips)

Inductor (vertical link)

Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14 11

Chip 0

Chip 1

Chip 3

Chip 2

6 7

4 5

2 3

0 1

Root

[Schroeder, JSAC’91]

Wireless 3D NoC: Up*/down* routing

• Up*/down* (UD) routing – Irregular network routing – A root node is selected – Direction (up or down) is assigned – Packets go up and then go down

• Example: 4 chips

NG

OK

The best spanning tree root is selected by exhaustive or heuristic using communication traces (9sec for 64-tile)

• UD routing with multiple VCs – Each layer (VC) has its own spanning tree – Packets can transit multiple layers in descent order

12

Chip 0

Chip 1

Chip 3

Chip 2

6 7

4 5

2 3

0 1

Root

OK

Chip 0

Chip 1

Chip 3

Chip 2

Root’

[Koibuchi,ICPP’03] [Lysne,TPDS’06]

VC1 VC0

You can use either VC0 or VC1

6 7

4 5

2 3

0 1

OK

Wireless 3D NoC: UD routing w/ VCs

Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

Full-system simulations: Topology

• We assume 64 tiles (four 4x4 chips) • The following iteration is performed 1,000 times

– Each tile has router and core (e.g., processor or caches) – Each horizontal link appears with 50% (H50) – Each vertical link appears with 100% (V100)

13

Among 1,000 random topologies, one with the most typical hop count value is selected for the full-system evaluation

14

Processor architecture X86-64

L1$ size & latency 32K / 1cycle

L2$ size & latency 256K / 6cycle

Memory size & latency 4G / 160cycle

Router latency [BW] [VSA] [ST] [LT]

Router buffer size 5-flit per VC

Protocol MOESI directory (3VC)

Table 1: Simulation parameters

NPB (BT, CG, EP, FT, IS, LU, MG, SP, UA) Table 2: Application programs

CPU CPU

L2 cache banks

Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

Full-system simulations: Gem5

• H50-V100 topology (most typical one among 1,000) – Each horizontal link appears with 50% (H50) – Each vertical link appears with 100% (V100)

8 CPUs, 32 L2$ banks, and 4 memory controllers

15 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

Full-system simulations: Gem5

• H50-V100 topology (most typical one among 1,000) – Each horizontal link appears with 50% (H50) – Each vertical link appears with 100% (V100)

• The best spanning tree root is selected for each message class (i.e., 3 roots for 3 message classes)

Message class 0 Message class 1

Hop count: H50-V100 topology

16

• Hop count w/ the worst case spanning tree root • Hop count w/ the best case spanning tree root • Hop count w/ two spanning tree roots (2 VCs) • Ideal hop count (not deadlock-free)

Spanning tree optimization also improves performance by 12.0% compared to the worst

Spanning tree optimization always takes the best case

App exec time: H50-V100 topology

17

• H50-V100 w/ the worst case • H50-V100 w/ the best case • H100-V100 (all horizontal links available = 3D Mesh)

(50% horizontal links)

Spanning tree optimization improves performance by 12.0% compared to the worst

Spanning tree optimization always takes the best case

Energy per flit: H50-V100 topology

18

• H50-V100 w/ the worst case • H50-V100 w/ the best case • H100-V100 (all horizontal links available = 3D Mesh)

(50% horizontal links)

Energy for routers, horizontal links, and vertical links

The next slides will explore vertical ON/OFF links (e.g., H100V50)

Power management: Vertical links

• 5.8mW per 2Gbps channel • Stop power supply to inductors for low-power

– “H100-Vn” topology (n =100, 50, 25, 15) – Performance/power is improved by spanning tree

optimization

19 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

Cube-1

Motherboard

5.8mW/ch @0.92V

[Miura, IEEE Micro 13]

Fujitsu 65nm CMOS

• H100-Vn topology (n =100, 50, 25, 15) – All horizontal links are available (H100) – (100-n)% of vertical links are power-gated (Vn) – Pick up most typical H100-Vn among 1,000 trials

20 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

Chip#3

Chip#2

Chip#1

Chip#0

H100-V100

Chip#3

Chip#2

Chip#1

Chip#0

H100-V50

Power mgmt: Vertical ON/OFF links

• Random+SPopt (computation cost: low) – Stop (100-n)% of vertical links randomly – Then, spanning tree optimization is performed to mitigate

the performance penalty • Exhaustive search (computation cost: HUGE)

– Find the least important vertical links and stop them until (100-n)% of vertical links are removed

21

Random+SPopt Exhaustive

Random+SPopt Exhaustive

Hop count (Average) App exec time (Average)

Performance penalty is 2.6% w/ Random+SPopt on H100V50

+2.6%

Power mgmt: Vertical off link selection

Outline: Wireless 3D NoC architectures

22 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

• Wireless 3D NoCs (3D WiNoCs) [8min]

– Wireless 3D IC technology – 3D WiNoC design example (65nm)

• Routing & topology exploration [8min]

– Spanning tree optimization for irregular WiNoCs – Power management via vertical ON/OFF links

• Adding random NoC chip for 3D WiNoCs [5min]

– Adding randomness induces small-world effects – Adding random NoC chip to NoC-less 3D ICs

• Summary [1min]

[Matsutani,ASPDAC’13]

[Matsutani,DATE’14]

[Miura,IEEE Micro 13] [Matsutani,NOCS’11]

Big picture: 3D WiNoC w/ Random • Router IP macros have the same # of ports

– Unused ports can be used for long-range links • In addition, redundant links can be implemented

– Statically multiplexed by FPGA-like switch boxes – By reconfiguring the switch boxes, an unique random

topology is generated

23

Adding randomness induces small-world effects

Case 1: Random NoC to NoC-less 3D ICs

• Each chip has inductors but does not have 2D NoC – Inductors in the same pillar form a vertical broadcast bus – Horizontal connectivity is not provided at all (i.e., NoC-less)

24

Chip#0 NoC-less

Chip#1 NoC-less

Chip#2 NoC-less

Side view (3 chips)

Chip#2

Chip#1

Chip#0

“― ― ―” Configuration

Vertical broadcast bus

Case 1: Random NoC to NoC-less 3D ICs

• Each chip has inductors but does not have 2D NoC – Inductors in the same pillar form a vertical broadcast bus – Adding a 2D Mesh NoC to such NoC-less 3D IC

25

Chip#0 NoC-less

Chip#1 NoC-less

Chip#2 2D Mesh

Side view (3 chips)

Chip#2

Chip#1

Chip#0

“m ― ―” Configuration

Source

Destination

Case 1: Random NoC to NoC-less 3D ICs

• Each chip has inductors but does not have 2D NoC – Inductors in the same pillar form a vertical broadcast bus – Adding a Random NoC to such NoC-less 3D IC

26

Chip#0 NoC-less

Chip#1 NoC-less

Side view (3 chips)

Chip#2

Chip#1

Chip#0

“r ― ―” Configuration

Chip#2 Random

Source

Destination

Case 2: Replacing regular NoC w/ random

• 3D WiNoC that consists of three 2D Mesh NoC layers – Inductors in neighboring chips form a vertical P2P link – E.g., regular 3D Mesh topology (i.e., regular 3D NoC)

27

Chip#0 2D Mesh

Chip#1 2D Mesh

Chip#2 2D Mesh

Side view (3 chips)

Chip#2

Chip#1

Chip#0

“m m m” Configuration

Vertical point-to-point link

Case 2: Replacing regular NoC w/ random

• 3D WiNoC that consists of three 2D Mesh NoC layers – Inductors in neighboring chips form a vertical P2P link – Replacing regular 2D Mesh with random NoC

28

Chip#0 2D Mesh

Chip#1 2D Mesh

Chip#2 2D Mesh

“m m m” Configuration

Chip#0 2D Mesh

Chip#1 Random

Chip#2 2D Mesh

“m r m” Configuration

Q1: How many random chips do we need?

Number of random chips vs. Average latency

29 # of random chips (16-node) # of random chips (64-node)

P2P Bus

P2P Bus

1 random chip drastically reduces latency

1 or 2 random chips are enough in the P2P case

Q2: How should we design random chip?

30 Max random link length [tile] # of Horizontal router ports

P2P Bus

P2P Bus

Double-length link is enough (equivalent to folded torus)

Max. link length (left) & Horizontal degree (right)

4 ports are enough (equivalent to 2D mesh/torus)

Packet latency: P2P (mmmm mrrm rrrr)

31

m m m m Mesh Mesh Mesh Mesh P2P m r r m Mesh Rand Rand Mesh P2P r r r r Rand Rand Rand Rand P2P - - - m None None None Mesh Bus - - - r None None None Rand Bus m - - r Mesh None None Rand Bus

Aver

age

pack

et la

tenc

y [c

ycle

s]

m m m m m r r m r r r r

rrrr reduces latency by 20.7% compared to mmmm

Packet latency: Bus (---m ---r m--r)

32

m m m m Mesh Mesh Mesh Mesh P2P m r r m Mesh Rand Rand Mesh P2P r r r r Rand Rand Rand Rand P2P - - - m None None None Mesh Bus - - - r None None None Rand Bus m - - r Mesh None None Rand Bus

Aver

age

pack

et la

tenc

y [c

ycle

s]

- - - m - - - r m - - r

---r (adding a random NoC) reduces latency by 26.2% compared to ---m (adding a mesh NoC)

Outline: Wireless 3D NoC architectures

33 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

• Wireless 3D NoCs (3D WiNoCs) [8min]

– Wireless 3D IC technology – 3D WiNoC design example (65nm)

• Routing & topology exploration [8min]

– Spanning tree optimization for irregular WiNoCs – Power management via vertical ON/OFF links

• Adding random NoC chip for 3D WiNoCs [5min]

– Adding randomness induces small-world effects – Adding random NoC chip to NoC-less 3D ICs

• Summary [1min]

[Matsutani,ASPDAC’13]

[Matsutani,DATE’14]

[Miura,IEEE Micro 13] [Matsutani,NOCS’11]

Summary: 3D WiNoC Architectures

34 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

• Inductive-coupling 3D SiP – A low cost alternative to build low-volume custom

systems by stacking off-the-shelf known-good-dies – No special process technology is required;

inductors are implemented with metal layers • Cube-1: A practical 3D WiNoC system

– Two types: Host CPU chip & Accelerator chips – We can customize number & types of chips in SiP

Summary: 3D WiNoC Architectures

35 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

• Routing & topology exploration – Spanning tree optimization for adhoc 3D WiNoCs – Power reduction by randomly stopping vertical links – Performance/power is improved by spanning tree

optimization • Adding randomness shortens comm. latency

– Adding random NoC chip to NoC-less 3D ICs – Replacing regular 2D NoC with random 2D NoC

References (1/2) • Cube-0: The first real 3D WiNoC

– H. Matsutani, et.al., "A Vertical Bubble Flow Network using Inductive-Coupling for 3-D CMPs", NOCS 2011.

– Y. Take, et.al., "3D NoC with Inductive-Coupling Links for Building-Block SiPs", IEEE Trans on Computers (2014).

• Cube-1: The heterogeneous 3D WiNoC – N. Miura, et.al., "A Scalable 3D Heterogeneous Multicore

with an Inductive ThruChip Interface", IEEE Micro (2013).

• MuCCRA-Cube: Dynamically reconfigurable processor – S. Saito, et.al., "MuCCRA-Cube: a 3D Dynamically

Reconfigurable Processor with Inductive-Coupling Link", FPL 2009.

36 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

References (2/2) • Vertical bubble flow control on Cube-0

– H. Matsutani, et.al., "A Vertical Bubble Flow Network using Inductive-Coupling for 3-D CMPs", NOCS 2011.

– Y. Take, et.al., "3D NoC with Inductive-Coupling Links for Building-Block SiPs", IEEE Trans on Computers (2014).

• Spanning trees optimization for 3D WiNoCs – H. Matsutani, et.al., "A Case for Wireless 3D NoCs for

CMPs", ASP-DAC 2013 (Best Paper Award).

• Small-world effects (randomness) – H. Matsutani, et.al., "Low-Latency Wireless 3D NoCs via

Randomized Shortcut Chips", DATE 2014. – M. Koibuchi, et.al., "A Case for Random Shortcut

Topologies for HPC Interconnects", ISCA 2012. 37 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

Backup slides

38 Hiroki Matsutani, "3D WiNoC Architectures", Special Session at NOCS'14

top related