The Kill Rule for Multicore
Anant Agarwal
MIT and Tilera Corp.
Multicore is Moving Fast

Corollary of Moore’s Law: the number of cores will double every 18 months.

What must change to enable this growth?

[Chart: projected core counts, growing from 1 to 10,000]
Multicore Drivers Suggest Three Directions

• Diminishing returns – smaller structures
• Power efficiency – smaller structures; slower clocks, voltage scaling
• Wire delay – distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve
How We Size Core Resources

[Diagram: three ways to spend roughly the same chip area]

• 3 cores, small cache: core IPC = 1, core area = 1 → chip IPC = 3
• 3 cores, big cache: core IPC = 1.2, core area = 1.3 → chip IPC = 3.6
• 4 cores, small cache: core IPC = 1, core area = 1 → chip IPC = 4

IPC = 1 / (1 + m × latency), where IPC is instructions per cycle and m is the cache miss rate.
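The tradeoff on this slide can be checked with a quick sketch (my own illustration, not from the talk); `chip_ipc` is a hypothetical helper assuming chip IPC is simply the core count times the per-core IPC:

```python
# Sketch of the slide's fixed-area tradeoff: chip IPC = cores x per-core IPC.
# The three configurations (core counts, IPCs, areas) come from the slide.

def chip_ipc(num_cores, core_ipc):
    """Aggregate throughput, assuming every core contributes equally."""
    return num_cores * core_ipc

configs = [
    ("3 cores, small cache", 3, 1.0),   # core area 1.0 each
    ("3 cores, big cache",   3, 1.2),   # core area 1.3 each
    ("4 cores, small cache", 4, 1.0),   # about the same total area as 3 big cores
]
for name, n, ipc in configs:
    print(f"{name}: chip IPC = {chip_ipc(n, ipc):.1f}")
```

Spending the extra area on a fourth small core (chip IPC 4) beats spending it on bigger caches (chip IPC 3.6).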
“KILL Rule” for Multicore: Kill If Less than Linear

A resource in a core must be increased in area only if the core’s performance improvement is at least proportional to the core’s area increase.

Put another way: increase a resource’s size only if every 1% increase in core area brings at least a 1% increase in core performance. This leads to power-efficient multicore design.
Kill Rule for Cache Size Using Video Codec

| Cache size | Core area | Core IPC | Area increase | IPC increase | Cores per chip | Chip IPC |
|------------|-----------|----------|---------------|--------------|----------------|----------|
| 512B       | 1.00      | 0.04     | –             | –            | 100            | 4        |
| 2KB        | 1.03      | 0.17     | 2%            | 325%         | 97             | 17       |
| 4KB        | 1.07      | 0.25     | 4%            | 47%          | 93             | 23       |
| 8KB        | 1.15      | 0.29     | 7%            | 16%          | 87             | 25       |
| 16KB       | 1.31      | 0.31     | 14%           | 7%           | 76             | 24       |
| 32KB       | 1.63      | 0.32     | 24%           | 3%           | 61             | 19       |

Growing the cache beyond 8KB violates the KILL rule (14% more area buys only 7% more IPC), and chip IPC indeed peaks at 8KB.
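The KILL rule can be applied mechanically to the data above. The following sketch (the loop and names are mine, not from the talk) walks the slide’s per-core figures and stops at the first sub-linear step:

```python
# Apply the KILL rule to the slide's cache-size data:
# keep growing the cache only while %IPC gain >= %area increase.
sizes = ["512B", "2KB", "4KB", "8KB", "16KB", "32KB"]
area  = [1.00, 1.03, 1.07, 1.15, 1.31, 1.63]   # core area (relative)
ipc   = [0.04, 0.17, 0.25, 0.29, 0.31, 0.32]   # core IPC

best = 0
for i in range(1, len(sizes)):
    area_gain = area[i] / area[i - 1] - 1
    ipc_gain  = ipc[i] / ipc[i - 1] - 1
    if ipc_gain >= area_gain:   # at least linear: keep the bigger cache
        best = i
    else:                       # sub-linear: KILL further growth
        break

print(sizes[best])  # -> 8KB, matching the chip-IPC peak of 25
```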
Well Beyond Diminishing Returns

[Die photo: Madison Itanium 2 cache system — the L3 cache dominates the die area. Photo courtesy Intel Corp.]
Slower Clocks Suggest Even Smaller Caches

Insight: IPC = 1 / (1 + m × latency in cycles). Maintain constant instructions per cycle (IPC).

At 4 GHz: IPC = 1 / (1 + 0.5% × 200) = 1/2
At 1 GHz: IPC = 1 / (1 + 2.0% × 50) = 1/2

At the slower clock the miss penalty in cycles is 4× smaller, so the miss rate m can be 4× higher. Since miss rate falls roughly with the square root of cache size, this implies the cache can be 16× smaller!
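The slide’s arithmetic is easy to reproduce; the final step assumes the common rule of thumb that miss rate scales roughly as 1/√(cache size), which the slide implies but does not state:

```python
# Reproduce the slide's constant-IPC arithmetic: IPC = 1 / (1 + m * penalty).
def ipc(miss_rate, penalty_cycles):
    return 1.0 / (1.0 + miss_rate * penalty_cycles)

fast = ipc(0.005, 200)   # 4 GHz: 0.5% miss rate, 200-cycle miss penalty
slow = ipc(0.020, 50)    # 1 GHz: 2.0% miss rate,  50-cycle miss penalty
print(fast, slow)        # both 0.5 -> IPC is unchanged

# With m ~ 1/sqrt(cache_size), tolerating a 4x higher miss rate
# means the cache can be 4**2 = 16x smaller.
shrink = (0.020 / 0.005) ** 2
print(shrink)            # -> 16.0
```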
Multicore Drivers Suggest Three Directions

• Diminishing returns – smaller structures
• Power efficiency – smaller structures; slower clocks, voltage scaling
• Wire delay – distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve

The KILL rule suggests smaller caches for multicore: if the clock is slower by x, then for constant IPC the cache can be smaller by x². The KILL rule applies to all multicore resources: issue width (2-way is probably ideal [Simplefit, TPDS 7/2001]), cache sizes, and the number of memory hierarchy levels.
Interconnect Options

[Diagram: three topologies — a bus multicore (processor+cache pairs on a shared bus), a ring multicore (processor+cache pairs joined by switches in a ring), and a mesh multicore (a grid of processor+cache+switch tiles, with packet routing through the switches)]
Bisection Bandwidth is Important

[Diagram: the same bus, ring, and mesh multicores, compared by the bandwidth across a cut that bisects each network]
Concept of Bisection Bandwidth

[Diagram: bus, ring, and mesh multicores each cut in half; the bisection bandwidth is the bandwidth across the cut]

In a mesh, bandwidth increases as we add more cores.
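A toy calculation (my own, assuming one unit-bandwidth link between adjacent switches) makes the contrast concrete:

```python
# Bisection bandwidth, counted in links cut, for the three topologies.
import math

def bisection_links(topology, n):
    """Links crossing a cut that splits an n-core machine in half."""
    if topology == "bus":
        return 1                   # one shared medium, regardless of n
    if topology == "ring":
        return 2                   # bisecting a ring severs exactly two links
    if topology == "mesh":
        return int(math.sqrt(n))   # sqrt(n) x sqrt(n) grid: cut one row of links
    raise ValueError(topology)

for n in (16, 64, 256):
    print(n, [bisection_links(t, n) for t in ("bus", "ring", "mesh")])
```

Only the mesh’s bisection bandwidth grows with the core count, which is why it scales where buses and rings do not.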
Meshes are Power Efficient

[Chart: % energy savings of mesh vs. bus, across benchmarks and numbers of processors]
Meshes Offer Simple Layout
• 16 cores
• Demonstrated in 2002
• 0.18 micron
• 425 MHz
• IBM SA27E standard cell
• 6.8 GOPS
Example: MIT’s Raw Multicore
www.cag.csail.mit.edu/raw
Multicore

• Single chip
• Multiple processing units
• Multiple, independent threads of control, or program counters – MIMD

[Diagram: a bus multicore (processor+cache pairs sharing an L2 cache over a bus) vs. a tiled multicore (a 4×4 grid of tiles, each with a processor, cache, and switch)]

Tiled multicore satisfies one additional property: fully distributed, no centralized resources.
Multicore Drivers Suggest Three Directions

• Diminishing returns – smaller structures
• Power efficiency – smaller structures; slower clocks, voltage scaling
• Wire delay – distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve

Mesh-based tiled multicore
Multicore Programming Challenge

• “Multicore programming is hard.” Why?
  – It is new
  – It is misunderstood: some sequential programs are harder
  – Current tools are where VLSI design tools were in the mid-80s
  – Standards are needed (tools, ecosystems)
• This problem will be solved soon. Why?
  – Multicore is here to stay
  – Intel webinar: “Think parallel or perish”
  – Opportunity to create the API foundations
  – The incentives are there
Old Approaches Fall Short

• Pthreads
  – The Intel webinar likens it to the assembly of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But evolutionary, and OK in the interim
• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to local store
  – Wastes pin bandwidth and energy
  – But evolutionary, simple, modular, and a small core memory footprint
• MPI
  – Province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and large core memory footprint

But there is a big new idea staring us in the face.
Inspiration from ASICs: Streaming

[Diagram: memories connected by streams of data over hardware FIFOs]

• Streaming is energy efficient and fast
• The concept is familiar and well developed in hardware design and simulation languages
Streaming is Familiar – Like Sockets

• Basis of networking and internet software
• Familiar and popular
• Modular and scalable
• Conceptually simple
• Each process can use existing sequential code

[Diagram: a sender process and a receiver process joined by a channel over the interconnect, via Port1 and Port2]
Core-to-Core Data Transfer is Cheaper than Memory Access

• Energy
  – 32b network transfer over a 1mm channel: 3 pJ
  – 32KB cache read: 50 pJ
  – External access: 200 pJ
• Latency
  – Register to register: 5 cycles (Raw)
  – Cache to cache: 50 cycles
  – DRAM access: 200 cycles

Data based on a 90nm process node.
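A back-of-the-envelope sketch using these figures (the round-trip-through-DRAM model is my assumption, not the slide’s) shows why streaming saves energy:

```python
# Energy to move a stream of 32-bit words, using the slide's 90nm figures.
NETWORK_PJ  = 3     # 32b transfer over a 1mm on-chip channel
EXTERNAL_PJ = 200   # one external (off-chip) access

words = 1_000_000                        # a stream of one million words
stream_energy = words * NETWORK_PJ       # producer -> consumer over the network
memory_energy = words * 2 * EXTERNAL_PJ  # write to DRAM, then read back

print(memory_energy / stream_energy)     # roughly 133x more energy via memory
```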
Streaming Supports Many Models

• Pipeline
• Client-server
• Broadcast-reduce

Not great for blackboard-style shared state. But then, there is no one-size-fits-all.
Multicore Streaming Can Be Way Faster than Sockets

• No fundamental overheads for
  – Unreliable communication
  – High-latency buffering
  – Hardware heterogeneity
  – OS heterogeneity
• Infrequent setup
• Common-case operations are fast and power efficient
  – Low memory footprint

connect(<send_proc, Port1>, <receive_proc, Port2>)
Put(Port1, Data) → Get(Port2, Data)

[Diagram: a sender process and a receiver process joined by a channel over the interconnect, via Port1 and Port2]

MCA’s CAPI standard
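The connect/put/get pattern on this slide can be sketched in software with a thread-safe queue. This is a hypothetical illustration of the semantics only, not the actual MCA CAPI API:

```python
# Toy model of a stream channel: connect once, then fast put/get operations.
# Names (connect/put/get) mirror the slide; the queue stands in for a FIFO.
import queue
import threading

def connect():
    """Set up a channel; returns (send_port, receive_port)."""
    q = queue.Queue(maxsize=16)   # small bounded FIFO, like a hardware channel
    return q, q

def put(port, data):
    port.put(data)                # blocks when the FIFO is full (backpressure)

def get(port):
    return port.get()             # blocks until data arrives

send_port, recv_port = connect()  # infrequent setup
received = []

def receiver():
    for _ in range(3):
        received.append(get(recv_port))

t = threading.Thread(target=receiver)
t.start()
for x in (1, 2, 3):               # common-case operations: just put/get
    put(send_port, x)
t.join()
print(received)  # -> [1, 2, 3]
```

The bounded queue gives the backpressure a hardware FIFO provides; each side runs ordinary sequential code, as the sockets analogy suggests.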
CAPI’s Stream Implementation 1

[Diagram: Process A (e.g., FIR1) on Core 1 streaming to Process B (e.g., FIR2) on Core 2 within a multicore chip]

I/O register-mapped hardware FIFOs in SoCs.
CAPI’s Stream Implementation 2

[Diagram: Process A (e.g., FIR) on Core 1 streaming to Process B (e.g., FIR) on Core 2, via their caches and the on-chip interconnect]

On-chip cache-to-cache transfers over the on-chip interconnect in general multicores.
Conclusions
• Multicore is here to stay
• Evolve core and interconnect
• Create multicore programming standards – users are ready
• Multicore success requires
  – Reduction in core cache size
  – Adoption of mesh-based on-chip interconnect
  – Use of a stream-based programming API
• Successful solutions will offer evolutionary transition path