The Kill Rule for Multicore


TRANSCRIPT

Page 1: The Kill Rule for Multicore
Page 2: The Kill Rule for Multicore

The Kill Rule for Multicore

Anant Agarwal

MIT and Tilera Corp.

Page 3: The Kill Rule for Multicore

3

Multicore is Moving Fast

Corollary of Moore’s Law: the number of cores will double every 18 months

What must change to enable this growth?

[Chart: cores per chip, log scale from 1 to 10,000, growing over time]

Page 4: The Kill Rule for Multicore

4

Multicore Drivers Suggest Three Directions

• Diminishing returns
  – Smaller structures

• Power efficiency
  – Smaller structures
  – Slower clocks, voltage scaling

• Wire delay
  – Distributed structures

• Multicore programming

1. How we size core resources

2. How we connect the cores

3. How programming will evolve

Page 5: The Kill Rule for Multicore

5

How We Size Core Resources

[Diagram: three chip organizations compared]

• 3 cores, small cache: core IPC = 1, core area = 1 → chip IPC = 3
• 4 cores, small cache: core IPC = 1, core area = 1 → chip IPC = 4
• 3 cores, big cache: core IPC = 1.2, core area = 1.3 → chip IPC = 3.6

IPC (instructions per cycle) = 1 / (1 + m × latency), where m is the cache miss rate and latency is the miss latency in cycles
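As a rough illustration (not code from the talk), the per-core IPC model and the chip-level comparison above can be checked in a few lines of Python; chip IPC is taken as cores × per-core IPC, which reproduces the slide’s numbers, and the 1% miss rate / 100-cycle latency example in the middle is made up purely to exercise the formula:

```python
def core_ipc(miss_rate, miss_latency_cycles):
    """Per-core IPC model from the slide: IPC = 1 / (1 + m * latency)."""
    return 1.0 / (1.0 + miss_rate * miss_latency_cycles)

def chip_ipc(num_cores, ipc_per_core):
    """Chip IPC taken as cores x per-core IPC."""
    return num_cores * ipc_per_core

# Illustrative use of the IPC model: 1% miss rate, 100-cycle miss latency -> 0.5.
print(core_ipc(0.01, 100))

# The three organizations on the slide (per-core IPC and area taken as given).
for name, cores, ipc in [("3 cores, small cache", 3, 1.0),
                         ("3 cores, big cache",   3, 1.2),
                         ("4 cores, small cache", 4, 1.0)]:
    print(f"{name}: chip IPC = {chip_ipc(cores, ipc)}")
```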

Page 6: The Kill Rule for Multicore

6

“KILL Rule” for Multicore: Kill If Less than Linear

A resource in a core must be increased in area only if the core’s performance improvement is at least proportional to the core’s area increase.

Put another way: increase a resource’s size only if, for every 1% increase in core area, there is at least a 1% increase in core performance.

Leads to power-efficient multicore design
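Stated as a predicate, the rule is just a comparison of relative increases. A minimal Python sketch, with the example percentages taken from the cache-size sweep on the next slide:

```python
def passes_kill_rule(area_increase_pct, perf_increase_pct):
    """KILL rule: grow a core resource only if core performance improves
    at least proportionally to the core area it costs."""
    return perf_increase_pct >= area_increase_pct

# +7% core area for +16% core performance -> keep the bigger resource.
print(passes_kill_rule(7, 16))   # True
# +14% core area for only +7% core performance -> kill the increase.
print(passes_kill_rule(14, 7))   # False
```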

Page 7: The Kill Rule for Multicore

7

Kill Rule for Cache Size Using Video Codec

Cache size   Core area   Core IPC   Cores per chip   Chip IPC
512 B        1.00        0.04       100              4
2 KB         1.03        0.17       97               17
4 KB         1.07        0.25       93               23
8 KB         1.15        0.29       87               25
16 KB        1.31        0.31       76               24
32 KB        1.63        0.32       61               19

Area vs. performance at each step: +2% area → +325% core IPC (512 B → 2 KB); +4% → +47% (2 KB → 4 KB); +7% → +16% (4 KB → 8 KB); +14% → +7% (8 KB → 16 KB); +24% → +3% (16 KB → 32 KB).

The KILL rule stops at 8 KB: beyond that point the per-core IPC gain falls below the area increase, and chip IPC peaks at 25 before declining.
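For a concrete feel of the sweep, here is a small Python sketch (an illustration, not the talk’s methodology) that transcribes the table above and flags the first cache-size step that violates the KILL rule; the chip IPC it prints can differ from the slide by a unit because the per-core IPC values are rounded:

```python
# (cache size, core area, core IPC, cores per chip) transcribed from the slide.
sweep = [
    ("512 B", 1.00, 0.04, 100),
    ("2 KB",  1.03, 0.17,  97),
    ("4 KB",  1.07, 0.25,  93),
    ("8 KB",  1.15, 0.29,  87),
    ("16 KB", 1.31, 0.31,  76),
    ("32 KB", 1.63, 0.32,  61),
]

for prev, curr in zip(sweep, sweep[1:]):
    area_gain = curr[1] / prev[1] - 1.0
    perf_gain = curr[2] / prev[2] - 1.0
    verdict = "keep" if perf_gain >= area_gain else "kill"
    print(f"{prev[0]} -> {curr[0]}: area +{area_gain:.0%}, "
          f"core IPC +{perf_gain:.0%}, chip IPC = {curr[2] * curr[3]:.0f} ({verdict})")
```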

Page 8: The Kill Rule for Multicore

8

Well Beyond Diminishing Returns

[Die photo: Madison Itanium 2 cache system, with the L3 cache occupying much of the die. Photo courtesy Intel Corp.]

Page 9: The Kill Rule for Multicore

9

Slower Clocks Suggest Even Smaller Caches

Insight: maintain constant instructions per cycle (IPC), where IPC = 1 / (1 + m × latency in cycles).

At 4 GHz: IPC = 1 / (1 + 0.5% × 200) = 1/2
At 1 GHz: IPC = 1 / (1 + 2.0% × 50) = 1/2

With a 4x slower clock, the miss penalty in cycles is 4x smaller, so the miss rate m can be 4x higher for the same IPC. Implies that the cache can be 16x smaller!

[Plot: miss rate m as a function of cache size]
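A quick Python check of the arithmetic above; the last two lines assume the common rule of thumb that miss rate scales roughly as 1/√(cache size), which is how a 4x higher tolerable miss rate turns into a roughly 16x smaller cache:

```python
def ipc(miss_rate, miss_latency_cycles):
    # IPC = 1 / (1 + m * latency), as on the slide.
    return 1.0 / (1.0 + miss_rate * miss_latency_cycles)

print(ipc(0.005, 200))  # 4 GHz core, 200-cycle miss penalty -> 0.5
print(ipc(0.020, 50))   # 1 GHz core,  50-cycle miss penalty -> 0.5

# Rule-of-thumb assumption: miss rate ~ 1 / sqrt(cache size).
# Tolerating a 4x higher miss rate then allows a 4**2 = 16x smaller cache.
miss_rate_ratio = 0.020 / 0.005
print(miss_rate_ratio ** 2)  # 16.0
```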

Page 10: The Kill Rule for Multicore

10

Multicore Drivers Suggest Three Directions

• Diminishing returns
  – Smaller structures

• Power efficiency
  – Smaller structures
  – Slower clocks, voltage scaling

• Wire delay
  – Distributed structures

• Multicore programming

1. How we size core resources

2. How we connect the cores

3. How programming will evolve

KILL rule suggests smaller caches for multicore: if the clock is slower by x, then for constant IPC the cache can be smaller by x².

The KILL rule applies to all multicore resources:
  – Issue width: 2-way is probably ideal [Simplefit, TPDS 7/2001]
  – Cache sizes and number of memory hierarchy levels

Page 11: The Kill Rule for Multicore

11

Interconnect Options

[Diagrams: three interconnect options]

• Bus Multicore – processor/cache pairs share a single bus
• Ring Multicore – processor/cache pairs connected through switches arranged in a ring
• Mesh Multicore – a switch per core, with packet routing through the switches

Page 12: The Kill Rule for Multicore

12

Bisection Bandwidth is Important

[Diagrams: the bus, ring, and mesh multicores repeated for comparison]

Page 13: The Kill Rule for Multicore

13

Concept of Bisection Bandwidth

[Diagrams: bus, ring, and mesh multicores, showing the links that cross each topology’s bisection]

In a mesh, bisection bandwidth increases as we add more cores
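A hedged sketch of the counting behind that claim: bisection bandwidth, in links, for the three topologies as a function of core count, using the usual textbook figures (bus = 1, ring = 2, square 2-D mesh = √N). These formulas are standard results assumed here, not numbers taken from the talk.

```python
import math

def bisection_links(topology, num_cores):
    """Links crossing the chip's bisection (textbook counting)."""
    if topology == "bus":
        return 1                          # one shared medium, regardless of N
    if topology == "ring":
        return 2                          # the cut crosses the ring twice
    if topology == "mesh":
        return int(math.sqrt(num_cores))  # one link per row of a square mesh
    raise ValueError(topology)

for n in (16, 64, 256):
    print(n, {t: bisection_links(t, n) for t in ("bus", "ring", "mesh")})
```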

Page 14: The Kill Rule for Multicore

14

Meshes are Power Efficient

[Chart: % energy savings of mesh vs. bus, across benchmarks and numbers of processors]

Page 15: The Kill Rule for Multicore

15

Meshes Offer Simple Layout

• 16 cores

• Demonstrated in 2002

• 0.18 micron

• 425 MHz

• IBM SA27E standard cell

• 6.8 GOPS

Example: MIT’s Raw Multicore

www.cag.csail.mit.edu/raw

Page 16: The Kill Rule for Multicore

16

Multicore

• Single chip

• Multiple processing units

• Multiple, independent threads of control, or program counters – MIMD

[Diagrams: a bus-based multicore, with processor/cache pairs sharing a bus and an L2 cache, vs. a tiled multicore, where each tile pairs a processor and its cache with a switch]

Tiled multicore satisfies one additional property: fully distributed, with no centralized resources.

Page 17: The Kill Rule for Multicore

17

Multicore Drivers Suggest Three Directions

• Diminishing returns
  – Smaller structures

• Power efficiency
  – Smaller structures
  – Slower clocks, voltage scaling

• Wire delay
  – Distributed structures

• Multicore programming

1. How we size core resources

2. How we connect the cores

3. How programming will evolve

Mesh-based tiled multicore

Page 18: The Kill Rule for Multicore

18

Multicore Programming Challenge

• “Multicore programming is hard.” Why?
  – New
  – Misunderstood – some sequential programs are harder
  – Current tools are where VLSI design tools were in the mid 80s
  – Standards are needed (tools, ecosystems)

• This problem will be solved soon. Why?
  – Multicore is here to stay
  – Intel webinar: “Think parallel or perish”
  – Opportunity to create the API foundations
  – The incentives are there

Page 19: The Kill Rule for Multicore

19

Old Approaches Fall Short

• Pthreads
  – Intel webinar likens it to the assembly of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But evolutionary, and OK in the interim

• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to local store
  – Wastes pin bandwidth and energy
  – But evolutionary, simple, modular, and small core memory footprint

• MPI
  – Province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and large core memory footprint

But, there is a big new idea staring us in the face

Page 20: The Kill Rule for Multicore

20

Inspiration from ASICs: Streaming

[Diagram: blocks with local memories (“mem”) connected by a stream of data over a hardware FIFO]

• Streaming is energy efficient and fast

• Concept familiar and well developed in hardware design and simulation languages

Page 21: The Kill Rule for Multicore

21

Streaming is Familiar – Like Sockets

• Basis of networking and internet software

• Familiar & popular

• Modular & scalable

• Conceptually simple

• Each process can use existing sequential code

[Diagram: a sender process and a receiver process connected over the interconnect by a channel between Port1 and Port2]

Page 22: The Kill Rule for Multicore

22

Core-to-Core Data Transfer Cheaper than Memory Access

• Energy
  – 32-bit network transfer over a 1 mm channel: 3 pJ
  – 32 KB cache read: 50 pJ
  – External memory access: 200 pJ

• Latency
  – Register to register: 5 cycles (Raw)
  – Cache to cache: 50 cycles
  – DRAM access: 200 cycles

Data based on 90nm process node

Page 23: The Kill Rule for Multicore

23

Streaming Supports Many Models

• Pipeline
• Client-server
• Broadcast-reduce

Not great for blackboard-style shared state – but then, there is no one size fits all.

Page 24: The Kill Rule for Multicore

24

Multicore Streaming Can be Way Faster than Sockets

• No fundamental overheads for
  – Unreliable communication
  – High-latency buffering
  – Hardware heterogeneity
  – OS heterogeneity

• Infrequent setup: connect(<send_proc, Port1>, <receive_proc, Port2>)

• Common-case operations are fast and power efficient
  – Repeated Put(Port1, Data) on the sender and Get(Port2, Data) on the receiver
  – Low memory footprint

[Diagram: sender and receiver processes joined by a channel between Port1 and Port2 over the interconnect]

MCA’s CAPI standard
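As a rough sketch of the connect/put/get pattern above (illustrative Python, not the actual CAPI/MCAPI API), a bounded queue can stand in for the hardware FIFO channel; all names below are invented for the example:

```python
import threading
import queue

def connect(capacity=4):
    """Create a channel; a bounded queue stands in for the hardware FIFO."""
    return queue.Queue(maxsize=capacity)

def sender(channel, items):
    for item in items:
        channel.put(item)      # Put(Port1, Data): blocks when the FIFO is full
    channel.put(None)          # end-of-stream marker (illustrative convention)

def receiver(channel):
    while (item := channel.get()) is not None:   # Get(Port2, Data)
        print("received", item)

channel = connect()
threads = [
    threading.Thread(target=sender, args=(channel, range(5))),
    threading.Thread(target=receiver, args=(channel,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```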

Page 25: The Kill Rule for Multicore

25

CAPI’s Stream Implementation 1

[Diagram: Process A (e.g., FIR1) on Core 1 streaming to Process B (e.g., FIR2) on Core 2 within a multicore chip]

I/O register-mapped hardware FIFOs in SoCs

Page 26: The Kill Rule for Multicore

26

CAPI’s Stream Implementation 2

[Diagram: Process A (e.g., FIR) on Core 1 and Process B (e.g., FIR) on Core 2, each with its own cache, connected by the on-chip interconnect within a multicore chip]

On-chip cache-to-cache transfers over the on-chip interconnect in general multicores

Page 27: The Kill Rule for Multicore

27

Conclusions

• Multicore is here to stay

• Evolve core and interconnect

• Create multicore programming standards – users are ready

• Multicore success requires
  – Reduction in core cache size
  – Adoption of mesh-based on-chip interconnect
  – Use of a stream-based programming API

• Successful solutions will offer evolutionary transition path