The Kill Rule for Multicore
Anant Agarwal
MIT and Tilera Corp.
Multicore is Moving Fast

Corollary of Moore’s Law: the number of cores will double every 18 months.

What must change to enable this growth?

[Chart: projected core counts, growing from 1 to 10,000]
Multicore Drivers Suggest Three Directions

• Diminishing returns – smaller structures
• Power efficiency – smaller structures; slower clocks, voltage scaling
• Wire delay – distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve
How We Size Core Resources

[Diagram: three ways to spend roughly the same chip area]

• 3 cores, small cache: core IPC = 1, core area = 1 → chip IPC = 3
• 3 cores, big cache: core IPC = 1.2, core area = 1.3 → chip IPC = 3.6
• 4 cores, small cache: core IPC = 1, core area = 1 → chip IPC = 4

IPC = 1 / (1 + m × latency), where IPC is instructions per cycle and m is the cache miss rate.
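The tradeoff on this slide can be checked with a quick sketch (my own illustration, not from the talk); `chip_ipc` is a hypothetical helper assuming chip IPC is simply the core count times the per-core IPC:

```python
# Sketch of the slide's fixed-area tradeoff: chip IPC = cores x per-core IPC.
# The three configurations (core counts, IPCs, areas) come from the slide.

def chip_ipc(num_cores, core_ipc):
    """Aggregate throughput, assuming every core contributes equally."""
    return num_cores * core_ipc

configs = [
    ("3 cores, small cache", 3, 1.0),   # core area 1.0 each
    ("3 cores, big cache",   3, 1.2),   # core area 1.3 each
    ("4 cores, small cache", 4, 1.0),   # about the same total area as 3 big cores
]
for name, n, ipc in configs:
    print(f"{name}: chip IPC = {chip_ipc(n, ipc):.1f}")
```

Spending the extra area on a fourth small core (chip IPC 4) beats spending it on bigger caches (chip IPC 3.6).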
“KILL Rule” for Multicore: Kill If Less than Linear

A resource in a core must be increased in area only if the core’s performance improvement is at least proportional to the core’s area increase.

Put another way: increase a resource’s size only if every 1% increase in core area brings at least a 1% increase in core performance. This leads to power-efficient multicore design.
Kill Rule for Cache Size Using Video Codec

| Cache size | Core area | Core IPC | Area increase | IPC increase | Cores per chip | Chip IPC |
|------------|-----------|----------|---------------|--------------|----------------|----------|
| 512B       | 1.00      | 0.04     | –             | –            | 100            | 4        |
| 2KB        | 1.03      | 0.17     | 2%            | 325%         | 97             | 17       |
| 4KB        | 1.07      | 0.25     | 4%            | 47%          | 93             | 23       |
| 8KB        | 1.15      | 0.29     | 7%            | 16%          | 87             | 25       |
| 16KB       | 1.31      | 0.31     | 14%           | 7%           | 76             | 24       |
| 32KB       | 1.63      | 0.32     | 24%           | 3%           | 61             | 19       |

Growing the cache beyond 8KB violates the KILL rule (14% more area buys only 7% more IPC), and chip IPC indeed peaks at 8KB.
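The KILL rule can be applied mechanically to the data above. The following sketch (the loop and names are mine, not from the talk) walks the slide’s per-core figures and stops at the first sub-linear step:

```python
# Apply the KILL rule to the slide's cache-size data:
# keep growing the cache only while %IPC gain >= %area increase.
sizes = ["512B", "2KB", "4KB", "8KB", "16KB", "32KB"]
area  = [1.00, 1.03, 1.07, 1.15, 1.31, 1.63]   # core area (relative)
ipc   = [0.04, 0.17, 0.25, 0.29, 0.31, 0.32]   # core IPC

best = 0
for i in range(1, len(sizes)):
    area_gain = area[i] / area[i - 1] - 1
    ipc_gain  = ipc[i] / ipc[i - 1] - 1
    if ipc_gain >= area_gain:   # at least linear: keep the bigger cache
        best = i
    else:                       # sub-linear: KILL further growth
        break

print(sizes[best])  # -> 8KB, matching the chip-IPC peak of 25
```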
Well Beyond Diminishing Returns

[Die photo: Madison Itanium 2 cache system — the L3 cache dominates the die area. Photo courtesy Intel Corp.]
Slower Clocks Suggest Even Smaller Caches

Insight: IPC = 1 / (1 + m × latency in cycles). Maintain constant instructions per cycle (IPC).

At 4 GHz: IPC = 1 / (1 + 0.5% × 200) = 1/2
At 1 GHz: IPC = 1 / (1 + 2.0% × 50) = 1/2

At the slower clock the miss penalty in cycles is 4× smaller, so the miss rate m can be 4× higher. Since miss rate falls roughly with the square root of cache size, this implies the cache can be 16× smaller!
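The slide’s arithmetic is easy to reproduce; the final step assumes the common rule of thumb that miss rate scales roughly as 1/√(cache size), which the slide implies but does not state:

```python
# Reproduce the slide's constant-IPC arithmetic: IPC = 1 / (1 + m * penalty).
def ipc(miss_rate, penalty_cycles):
    return 1.0 / (1.0 + miss_rate * penalty_cycles)

fast = ipc(0.005, 200)   # 4 GHz: 0.5% miss rate, 200-cycle miss penalty
slow = ipc(0.020, 50)    # 1 GHz: 2.0% miss rate,  50-cycle miss penalty
print(fast, slow)        # both 0.5 -> IPC is unchanged

# With m ~ 1/sqrt(cache_size), tolerating a 4x higher miss rate
# means the cache can be 4**2 = 16x smaller.
shrink = (0.020 / 0.005) ** 2
print(shrink)            # -> 16.0
```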
Multicore Drivers Suggest Three Directions

• Diminishing returns – smaller structures
• Power efficiency – smaller structures; slower clocks, voltage scaling
• Wire delay – distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve

The KILL rule suggests smaller caches for multicore: if the clock is slower by x, then for constant IPC the cache can be smaller by x². The KILL rule applies to all multicore resources: issue width (2-way is probably ideal [Simplefit, TPDS 7/2001]), cache sizes, and the number of memory hierarchy levels.
Interconnect Options

[Diagram: three topologies — a bus multicore (processor+cache pairs on a shared bus), a ring multicore (processor+cache pairs joined by switches in a ring), and a mesh multicore (a grid of processor+cache+switch tiles, with packet routing through the switches)]
Bisection Bandwidth is Important

[Diagram: the same bus, ring, and mesh multicores, compared by the bandwidth across a cut that bisects each network]
Concept of Bisection Bandwidth

[Diagram: bus, ring, and mesh multicores each cut in half; the bisection bandwidth is the bandwidth across the cut]

In a mesh, bandwidth increases as we add more cores.
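A toy calculation (my own, assuming one unit-bandwidth link between adjacent switches) makes the contrast concrete:

```python
# Bisection bandwidth, counted in links cut, for the three topologies.
import math

def bisection_links(topology, n):
    """Links crossing a cut that splits an n-core machine in half."""
    if topology == "bus":
        return 1                   # one shared medium, regardless of n
    if topology == "ring":
        return 2                   # bisecting a ring severs exactly two links
    if topology == "mesh":
        return int(math.sqrt(n))   # sqrt(n) x sqrt(n) grid: cut one row of links
    raise ValueError(topology)

for n in (16, 64, 256):
    print(n, [bisection_links(t, n) for t in ("bus", "ring", "mesh")])
```

Only the mesh’s bisection bandwidth grows with the core count, which is why it scales where buses and rings do not.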
Meshes are Power Efficient

[Chart: % energy savings of mesh vs. bus, across benchmarks and numbers of processors]
Meshes Offer Simple Layout
• 16 cores
• Demonstrated in 2002
• 0.18 micron
• 425 MHz
• IBM SA27E standard cell
• 6.8 GOPS
Example: MIT’s Raw Multicore
www.cag.csail.mit.edu/raw
Multicore

• Single chip
• Multiple processing units
• Multiple, independent threads of control, or program counters – MIMD

[Diagram: a bus multicore (processor+cache pairs sharing an L2 cache over a bus) vs. a tiled multicore (a 4×4 grid of tiles, each with a processor, cache, and switch)]

Tiled multicore satisfies one additional property: fully distributed, no centralized resources.
Multicore Drivers Suggest Three Directions

• Diminishing returns – smaller structures
• Power efficiency – smaller structures; slower clocks, voltage scaling
• Wire delay – distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve

Mesh-based tiled multicore
Multicore Programming Challenge

• “Multicore programming is hard.” Why?
  – It is new
  – It is misunderstood: some sequential programs are harder
  – Current tools are where VLSI design tools were in the mid-80s
  – Standards are needed (tools, ecosystems)
• This problem will be solved soon. Why?
  – Multicore is here to stay
  – Intel webinar: “Think parallel or perish”
  – Opportunity to create the API foundations
  – The incentives are there
Old Approaches Fall Short

• Pthreads
  – The Intel webinar likens it to the assembly of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But evolutionary, and OK in the interim
• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to local store
  – Wastes pin bandwidth and energy
  – But evolutionary, simple, modular, and a small core memory footprint
• MPI
  – Province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and large core memory footprint

But there is a big new idea staring us in the face.
Inspiration from ASICs: Streaming

[Diagram: memories connected by streams of data over hardware FIFOs]

• Streaming is energy efficient and fast
• The concept is familiar and well developed in hardware design and simulation languages
Streaming is Familiar – Like Sockets

• Basis of networking and internet software
• Familiar and popular
• Modular and scalable
• Conceptually simple
• Each process can use existing sequential code

[Diagram: a sender process and a receiver process joined by a channel over the interconnect, via Port1 and Port2]
Core-to-Core Data Transfer is Cheaper than Memory Access

• Energy
  – 32b network transfer over a 1mm channel: 3 pJ
  – 32KB cache read: 50 pJ
  – External access: 200 pJ
• Latency
  – Register to register: 5 cycles (Raw)
  – Cache to cache: 50 cycles
  – DRAM access: 200 cycles

Data based on a 90nm process node.
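A back-of-the-envelope sketch using these figures (the round-trip-through-DRAM model is my assumption, not the slide’s) shows why streaming saves energy:

```python
# Energy to move a stream of 32-bit words, using the slide's 90nm figures.
NETWORK_PJ  = 3     # 32b transfer over a 1mm on-chip channel
EXTERNAL_PJ = 200   # one external (off-chip) access

words = 1_000_000                        # a stream of one million words
stream_energy = words * NETWORK_PJ       # producer -> consumer over the network
memory_energy = words * 2 * EXTERNAL_PJ  # write to DRAM, then read back

print(memory_energy / stream_energy)     # roughly 133x more energy via memory
```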
Streaming Supports Many Models

• Pipeline
• Client-server
• Broadcast-reduce

Not great for blackboard-style shared state. But then, there is no one-size-fits-all.
Multicore Streaming Can Be Way Faster than Sockets

• No fundamental overheads for
  – Unreliable communication
  – High-latency buffering
  – Hardware heterogeneity
  – OS heterogeneity
• Infrequent setup
• Common-case operations are fast and power efficient
  – Low memory footprint

connect(<send_proc, Port1>, <receive_proc, Port2>)
Put(Port1, Data) → Get(Port2, Data)

[Diagram: a sender process and a receiver process joined by a channel over the interconnect, via Port1 and Port2]

MCA’s CAPI standard
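The connect/put/get pattern on this slide can be sketched in software with a thread-safe queue. This is a hypothetical illustration of the semantics only, not the actual MCA CAPI API:

```python
# Toy model of a stream channel: connect once, then fast put/get operations.
# Names (connect/put/get) mirror the slide; the queue stands in for a FIFO.
import queue
import threading

def connect():
    """Set up a channel; returns (send_port, receive_port)."""
    q = queue.Queue(maxsize=16)   # small bounded FIFO, like a hardware channel
    return q, q

def put(port, data):
    port.put(data)                # blocks when the FIFO is full (backpressure)

def get(port):
    return port.get()             # blocks until data arrives

send_port, recv_port = connect()  # infrequent setup
received = []

def receiver():
    for _ in range(3):
        received.append(get(recv_port))

t = threading.Thread(target=receiver)
t.start()
for x in (1, 2, 3):               # common-case operations: just put/get
    put(send_port, x)
t.join()
print(received)  # -> [1, 2, 3]
```

The bounded queue gives the backpressure a hardware FIFO provides; each side runs ordinary sequential code, as the sockets analogy suggests.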
CAPI’s Stream Implementation 1

[Diagram: Process A (e.g., FIR1) on Core 1 streaming to Process B (e.g., FIR2) on Core 2 within a multicore chip]

I/O register-mapped hardware FIFOs in SoCs.
CAPI’s Stream Implementation 2

[Diagram: Process A (e.g., FIR) on Core 1 streaming to Process B (e.g., FIR) on Core 2, via their caches and the on-chip interconnect]

On-chip cache-to-cache transfers over the on-chip interconnect in general multicores.
Conclusions
• Multicore is here to stay
• Evolve core and interconnect
• Create multicore programming standards – users are ready
• Multicore success requires
  – Reduction in core cache size
  – Adoption of mesh-based on-chip interconnect
  – Use of a stream-based programming API
• Successful solutions will offer evolutionary transition path