energy-efficient architectures for exascale...
Post on 02-Aug-2020
TRANSCRIPT
Dr. Stephen W. Keckler Senior Director of Architecture Research, NVIDIA
ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMS
2
The Goal:
Sustained ExaFLOPS on Problems of Interest
...
at reasonable cost
3
The End of Historic Scaling
Source: C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
4
2013 (You Are Here): 20PF, 18,000 GPUs, 10MW, 2 GFLOPs/W, ~10^7 Threads
2017 (CORAL): 150-300PF (5-10x), 11MW (1.1x), 14-27 GFLOPs/W (7-14x), Lots of Threads
2023: 1,000PF (50x), 72,000 HCNs (4x), 20MW (2x), 50 GFLOPs/W (25x), ~10^10 Threads (1000x)
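The efficiency figures on this roadmap follow from dividing performance by power. A minimal Python sketch of that arithmetic, using only the numbers shown on the slide; the function name `gflops_per_watt` and the dictionary labels are illustrative, not from the talk:

```python
def gflops_per_watt(petaflops: float, megawatts: float) -> float:
    """Convert system performance (PF) and power (MW) into GFLOPs/W."""
    gflops = petaflops * 1e6   # 1 PF = 10^6 GFLOPs
    watts = megawatts * 1e6    # 1 MW = 10^6 W
    return gflops / watts

# (performance in PF, power in MW), as listed on the roadmap slide
roadmap = {
    "2013": (20, 10),
    "2017 CORAL (low)": (150, 11),
    "2017 CORAL (high)": (300, 11),
    "2023": (1000, 20),
}

for label, (pf, mw) in roadmap.items():
    print(f"{label}: {gflops_per_watt(pf, mw):.1f} GFLOPs/W")
```

The 2013 and 2023 entries come out at 2 and 50 GFLOPs/W, which is the 25x efficiency gap the rest of the talk addresses.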
5
HETEROGENEOUS NODE
[Figure: heterogeneous node block diagram. Labeled components: System Interconnect; NIC; NoC; MC; LOC 0 through LOC 7; TOC 0 through TOC 3; Lane 0 through Lane 15; TPC 0 through TPC 127; L2 banks L2_0 through L2_3, 1MB each; NVLink; DRAM Stacks; DRAM DIMMs; LLC; NV RAM.]
6
How do we get to 50GFlops/Watt?
7
Start with an energy-efficient architecture
8
CPU (Haswell, 22 nm): 130 pJ/flop (Vector SP)
  Optimized for Latency
  Deep Cache Hierarchy
GPU (Maxwell, 28 nm): 30 pJ/flop (SP)
  Optimized for Throughput
  Explicit Management of On-chip Memory
9
CPU (Haswell, 22 nm): 2 nJ/flop (Scalar SP)
  Optimized for Latency
  Deep Cache Hierarchy
GPU (Maxwell, 28 nm): 30 pJ/flop (SP)
  Optimized for Throughput
  Explicit Management of On-chip Memory
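Energy per flop and GFLOPs/W are reciprocals of each other, so the per-flop figures above can be compared directly with the 50 GFLOPs/W goal. A small sketch of the conversion; the function names are illustrative, and the result counts only the quoted energy per flop, not memory, interconnect, or other system power:

```python
def gflops_per_watt_from_pj(pj_per_flop: float) -> float:
    """1 W = 1 J/s, so GFLOPs/W = 1e-9 / (pJ/flop * 1e-12) = 1000 / (pJ/flop)."""
    return 1000.0 / pj_per_flop

def pj_budget_for(gflops_per_watt: float) -> float:
    """Energy budget per flop (pJ) implied by an efficiency target."""
    return 1000.0 / gflops_per_watt

# Figures quoted on the slides
for name, pj in [("CPU scalar SP", 2000.0), ("CPU vector SP", 130.0), ("GPU SP", 30.0)]:
    print(f"{name}: {gflops_per_watt_from_pj(pj):.1f} GFLOPs/W")

print(f"50 GFLOPs/W allows {pj_budget_for(50.0):.0f} pJ per flop in total")
```

Under this arithmetic, 50 GFLOPs/W leaves about 20 pJ per flop for everything in the system, which is less than the 30 pJ/flop quoted for the GPU alone.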
10
HOW IS POWER SPENT IN A CPU?
In-order Embedded, Dally [2008] (Embedded in-order CPU):
  Instruction Supply 42%
  Clock + Control Logic 24%
  Data Supply 17%
  Register File 11%
  ALU 6%
OOO Hi-perf, Natarajan [2003] (Alpha 21264):
  Clock + Pins 45%
  RF 14%
  Fetch 11%
  Issue 11%
  Rename 10%
  Data Supply 5%
  ALU 4%
20 pJ FP op -> 1 nJ instruction
11
Latency-Optimized Core (LOC)
[Figure: LOC pipeline. Labeled blocks: PC, Branch Predict, I$, Register Rename, Instruction Window, Register File, ALU 1 through ALU 4, Reorder Buffer.]
Throughput-Optimized Core (TOC)
[Figure: TOC pipeline. Labeled blocks: PCs, Select, I$, Register File, ALU 1 through ALU 4.]
12
How do we continue to scale energy efficiency
...in a world where technology scaling is diminished?
13
Do Less Work
Eliminate waste and redundancy
Move fewer bits
Move data more efficiently
14
DO LESS WORK: Mixed Precision Arithmetic
Format bits (sign / exponent / mantissa):
  double-precision: 1 / 11 / 52
  single-precision: 1 / 8 / 23
  half-precision: 1 / 5 / 10
4x throughput, 4x bandwidth, 4x capacity, < 1/4 energy/op
5x precision bits, 60x range
Only use as much precision as you need
Exploit mix of representations
Scaled arithmetic
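One way to see both the storage benefit and the limits of half precision is to round values through the 1/5/10-bit format and compare accumulation strategies. The sketch below uses Python's standard `struct` module, whose `e` format is IEEE 754 half precision; the helper names are illustrative, and accumulating in a Python float (double precision) stands in for any wider accumulator:

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_in_half(values):
    """Keep both the inputs and the running sum in half precision."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc

def sum_mixed(values):
    """Store inputs in half precision, accumulate in a wider format."""
    acc = 0.0  # Python floats are double precision
    for v in values:
        acc += to_half(v)
    return acc

ones = [1.0] * 4096
print("bytes per value, double vs half:",
      struct.calcsize("<d"), struct.calcsize("<e"))
print("half-only sum:", sum_in_half(ones))
print("mixed sum:    ", sum_mixed(ones))
```

With only 10 mantissa bits, the half-only sum stops growing at 2048 because adding 1 no longer changes the rounded result, while the mixed version reaches 4096. This is the kind of case where a mix of representations keeps the 4x storage saving without losing the answer.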
15
ELIMINATE WASTE: Temporal SIMT
[Figure: Spatial SIMT (current GPUs). 32-wide datapath; 1 warp instruction = 32 threads (thread 0 through thread 31) issued in 1 cycle; each row of the time axis is one instruction from the sequence ld, ml, ad, st.]
[Figure: Pure Temporal SIMT. 1-wide datapath; the threads of a warp (0, 1, 2, ...) execute the same instruction in successive cycles.]
16
ELIMINATE WASTE: Temporal SIMT
  32-wide (41%)
  4-wide (65%)
  1-wide (100%)
Increase efficiency on divergent code
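To illustrate why narrower datapaths help on divergent code, the toy model below counts useful lane-slots for a single two-way branch. It is a sketch under simplifying assumptions (one branch, both paths the same length, each divergent group issuing both paths at full width); the function name and the divergence pattern are hypothetical, and it does not reproduce the percentages on the slide:

```python
def simd_utilization(branch_taken, width):
    """Fraction of issued lane-slots that do useful work for one branch.

    Threads are grouped into chunks of `width` lanes. A chunk whose threads
    disagree on the branch must issue both paths, leaving some lanes idle.
    """
    useful = 0
    issued = 0
    for start in range(0, len(branch_taken), width):
        group = branch_taken[start:start + width]
        paths = len(set(group))   # 1 if converged, 2 if divergent
        useful += len(group)      # each thread executes exactly one path
        issued += paths * width   # every path occupies the full datapath
    return useful / issued

# Hypothetical pattern: runs of 4 neighbouring threads take the same path.
pattern = [(tid // 4) % 2 == 0 for tid in range(32)]
for width in (32, 4, 1):
    print(f"{width}-wide: {simd_utilization(pattern, width):.0%}")
```

In this pattern the 32-wide datapath is half idle, while the 4-wide and 1-wide datapaths stay fully used, since a 1-wide datapath never has idle lanes.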
17
ELIMINATE WASTE: Variable Warp Sizing
Small warps (schedule several warps, different PCs)
  + Improved perf for divergent code
  + Better SIMD utilization
Emulate wide warp HW (gang schedule warps with same PC)
  + Wider converged execution
  + Memory locality/convergence
  + Reduced power (frontend)
Rogers [ISCA 2015]
18
ELIMINATE REDUNDANCY: Scalarization
SIMT Execution (one instruction stream per thread):
  LD R2 <- <A>    LD R3 <- R2, 1    ADD R4 <- R3, 2
  LD R2 <- <A>    LD R3 <- R2, 2    ADD R4 <- R3, 2
  LD R2 <- <A>    LD R3 <- R2, 3    ADD R4 <- R3, 2
  LD R2 <- <A>    LD R3 <- R2, 4    ADD R4 <- R3, 2
Scalarized SIMT Execution:
  LD SR2 <- <A>        (scalar op)
  VLD SR3 <- SR2, 1    (vector load)
  ADD R4 <- SR3, 2     (vector op, one per thread)
Lee [CGO 2013]
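The saving from scalarization can be sketched by counting memory loads in the two schemes. The code below is a toy interpretation, not the method from the cited paper: it assumes `LD R3 <- R2, k` means "load from address R2 + k", models memory as a dictionary, and uses made-up addresses and values:

```python
def simt_execute(memory, base_addr, num_threads):
    """Every thread redundantly loads the same base pointer from <A>."""
    loads = 0
    results = []
    for tid in range(1, num_threads + 1):
        r2 = memory[base_addr]   # LD R2 <- <A> (identical in all threads)
        loads += 1
        r3 = memory[r2 + tid]    # LD R3 <- R2, tid
        loads += 1
        results.append(r3 + 2)   # ADD R4 <- R3, 2
    return results, loads

def scalarized_execute(memory, base_addr, num_threads):
    """The uniform load is executed once; per-thread work stays vectorized."""
    sr2 = memory[base_addr]      # scalar op: one load shared by the warp
    loads = 1
    r3 = [memory[sr2 + tid] for tid in range(1, num_threads + 1)]  # vector load
    loads += num_threads
    results = [value + 2 for value in r3]                          # vector op
    return results, loads

# Hypothetical memory image: address 0 holds a pointer to an array at 10.
memory = {0: 10, 11: 100, 12: 200, 13: 300, 14: 400}
print(simt_execute(memory, 0, 4))
print(scalarized_execute(memory, 0, 4))
```

Both versions produce the same per-thread results, but the scalarized one issues 5 loads instead of 8 for four threads, and the gap grows with warp width.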
19
MOVE FEWER BITS: Register File Cache (RFC)
Small multi-ported register file
Capture locality of commonly used operands
Can reduce RF energy by 50%
[Figure: operand datapath. Main Register File, 4x128-bit Banks (1R1W); RFC, 4x32-bit (3R1W) Banks; Operand Buffering; Operand Routing; ALU, SFU, MEM, TEX.]
Gebhart [ISCA 2011]
20
MOVE DATA MORE EFFICIENTLY: Toggle-aware Compression
128-byte Uncompressed Cache Line (4-byte words): 0x00003A00 0x8001D000 0x00003A01 0x8001D008 ...
  8-byte flits:
    Flit 0: 0x00003A00 0x8001D000
    Flit 1: 0x00003A01 0x8001D008
    XOR = 0000...00100...00100...
    # Toggles = 2
128-byte FPC-compressed Cache Line (with metadata): 0x5 0x3A00 0x7 0x8001D000 0x5 0x3A01 0x7 0x8001D008 0x5 ...
  8-byte flits:
    Flit 0: 5 3A00 7 80001D000 5 1D
    Flit 1: 1 01 7 80001D008 5 3A02 1
    XOR = 001001111...110100011000
    # Toggles = 31
Compression can increase power consumption
Goal: reduce bus toggling
Pekhimenko [HPCA 2016]
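The toggle count in the uncompressed example can be checked by XOR-ing consecutive flits and counting the set bits. A short sketch, assuming two 4-byte words are packed into each 8-byte flit with the first word in the upper half; the helper names are illustrative:

```python
def to_flits(words):
    """Pack consecutive pairs of 32-bit words into 64-bit (8-byte) flits."""
    return [(hi << 32) | lo for hi, lo in zip(words[0::2], words[1::2])]

def count_toggles(flits):
    """Number of wires that change value between back-to-back flits."""
    return sum(bin(a ^ b).count("1") for a, b in zip(flits, flits[1:]))

# First four words of the uncompressed cache line from the slide
uncompressed = [0x00003A00, 0x8001D000, 0x00003A01, 0x8001D008]
print(count_toggles(to_flits(uncompressed)))  # 2, matching the slide
```

Because neighbouring words in the uncompressed line differ in only a few low-order bits, they line up on the same wires and toggle little. Compression shifts fields to different bit positions, which is how it can raise the toggle count even though fewer bytes are sent.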
21
MINIMIZE DATA MOVEMENT: Packaging
High-bandwidth on-package memory
Reduces distance
Increases bandwidth
Offers opportunity to optimize signaling circuits
22
MINIMIZE DATA MOVEMENT: Heterogeneous DRAM Architectures
[Figure: two node configurations, one with TOCs and LOCs and one with TOCs only, each attached to High BW Memory and High Capacity Memory. Link labels: ~2TB/s and ~200 GB/s.]
Challenges
  Exploiting all available bandwidth
  Maximizing locality for frequently accessed data
23
MINIMIZE DATA MOVEMENT: Software-managed Caching with On-Package Memory
[Figure: bar chart of Throughput Relative to No Migration (y-axis 0 to 6) for backprop, bfs, cns, comd, kmeans, minife, mummer, needle, pathfinder, srad_v1, xsbench, and geo-mean. Series: Legacy CUDA; First-Touch + Range Exp + BW Balancing; ORACLE.]
Strategies
  Aggressively migrate pages upon First-Touch to GDDR memory
  Pre-fetch neighbors of touched pages to reduce TLB shootdowns
  Throttle page migrations when nearing peak BW
Competitive with manual memory copy
Close to "perfect" prefetch
Agarwal [HPCA 2015]
24
MINIMIZE DATA MOVEMENT: Hardware Managed DRAM Cache
Tag overhead: hundreds of MB
  Alloy tag and data in same DRAM row (Micro12)
Cache organization: optimize for bandwidth
  Direct mapped, consecutive sets in same row
Results
  Fine-grained transfers good for lower locality apps
  Can eliminate some page migration overheads
[Figure: DRAM cache organization. Address fields: Tag, Set Index (29-bits), Offset (5-bits); 1GB DRAM (512M lines); each entry holds tag* (2-byte) and data (64-byte); ROW BUFFER (only medium of accessing DRAM array); 16-byte DRAM bus.]
25
LOOMING MEMORY POWER CRISIS
[Figure: memory power chart. Recoverable labels: Column Power, 120W, 160W.]
26
SUMMARY
Do Less Work
Eliminate waste and redundancy
Move fewer bits
Move data more efficiently