TRANSCRIPT
Parallel VLSI CAD Algorithms for Energy Efficient Heterogeneous Computing Platforms
Xueqian Zhao, Zhuo Feng
Department of Electrical and Computer Engineering, Michigan Technological University, Houghton, MI
1. INTRODUCTION
2. CAPACITANCE EXTRACTION
a) Problem Formulation
[Figure: a conductor held at 1 V, discretized into panels a, b, c, and d]
Potential induced on panel a by the uniform charge density on panel b:
$$V_{b \to a} = \int_{\text{panel } b} \frac{\varepsilon_b}{R_{ab}}\, dA, \qquad \varepsilon_b = \frac{q_b}{A_b}$$
Summation over all source panels:
$$V_a = \sum_i \int_{\text{panel } i} \frac{q_i}{A_i\, R_{ia}}\, dA = \sum_i \Phi_{ai}\, q_i$$
b) FMM Turns the O(n^2) Complexity into O(n) via Approximations
$$p_k = \sum_{i \in \Omega_k} \Phi(x_k, y_i)\, q_i + \sum_{i \notin \Omega_k} \Phi(x_k, y_i)\, q_i$$
The first sum (near-field cubes in Ω_k) is solved directly; the second (well-separated cubes) is solved approximately.
Multipole expansion: approximates the panel charges within a small sphere.
Local expansion: approximates the panel potentials within a small sphere.
Key steps:
– Direct pass: Q2P
– Upward pass: Q2M, M2M
– Downward pass: M2L, L2L
– Evaluation pass: M2P, L2P
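To make the direct pass concrete, here is a minimal CUDA sketch (our own illustration, not the FMMGpu kernel of [4]): each thread accumulates 1/R potential contributions to one target panel from all source panels of a near-field cube; the `Panel` struct and kernel name are assumptions.

```cuda
// Direct (Q2P) interaction: each thread computes the potential of one target
// panel as the sum of q_i / R over all source panels of a near-field cube.
struct Panel { float x, y, z, q; };

__global__ void q2pDirect(const Panel* __restrict__ targets, int nTargets,
                          const Panel* __restrict__ sources, int nSources,
                          float* potentials)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTargets) return;
    float p = 0.0f;
    for (int s = 0; s < nSources; ++s) {
        float dx = targets[t].x - sources[s].x;
        float dy = targets[t].y - sources[s].y;
        float dz = targets[t].z - sources[s].z;
        float R  = sqrtf(dx * dx + dy * dy + dz * dz);
        p += sources[s].q / R;   // 1/R kernel (constants folded into q)
    }
    potentials[t] = p;
}
```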
c) FMMGpu Data Structure and Algorithm Flow
Setup flow: compute and combine all coefficient matrices → reorder and cluster panel cubes → initialize GPU memory allocation → do workload balancing and SM assignment → run the GMRES solver → end.
GMRES iteration flow (the flowchart splits these steps between CPU and GPU): initialize panel charges → precondition pass → direct pass (Q2P) → upward pass (Q2M, M2M) → downward pass (M2L, L2L) → evaluation pass (M2P, L2P) → compute residual → converged? No: iterate again; Yes: done.
Panel potentials in cube k combine the direct charges of near-field source cubes i and the multipole expansions of far-field source cubes j:
$$p_k = \sum_i \Phi_{Q2P}^{(k,i)}\, q_i + \sum_j \Phi_{M2P}^{(k,j)}\, m_j$$
where
– $\Phi_{Q2P}^{(k,i)} \in \mathbb{R}^{n \times m}$: Q2P coefficient matrix
– $\Phi_{M2P}^{(k,j)} \in \mathbb{R}^{n \times s}$: M2P coefficient matrix
– $p_k \in \mathbb{R}^{n \times 1}$: panel potentials in cube k
– $q_i \in \mathbb{R}^{m \times 1}$: direct charges of source cube i
– $m_j \in \mathbb{R}^{s \times 1}$: multipole expansions of source cube j
Mixed coefficient and index matrices merge the Q2P and M2P blocks of each target cube k so that one uniform data structure drives all interactions:
$$\Phi_{mix}^{(k)} = \left[\, \Phi_{Q2P}^{(k,i)} \;\; \Phi_{M2P}^{(k,j)} \,\right], \qquad I_{mix}^{(k)} = \left[\, I_{Q2P}^{(k,i)} \;\; I_{M2P}^{(k,j)} \,\right]$$
Q2P index matrix: $I_{Q2P}^{(k,i)} \in \mathbb{R}^{n \times m}$; M2P index matrix: $I_{M2P}^{(k,j)} \in \mathbb{R}^{n \times s}$.
Global vector (panel charges, multipole expansions, local expansions, and a zero pad):
$$Vec_{global} = \left[\, q_1^T \cdots q_x^T \;\; m_1^T \cdots m_y^T \;\; l_1^T \cdots l_z^T \;\; 0 \,\right]^T$$
The index matrices address the operand of each coefficient entry inside the global vector.
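A minimal CUDA sketch of how such an index matrix can drive the evaluation (our own illustration under assumed names `phiMix`, `idxMix`, `vecGlobal`; not the authors' data structure): every coefficient gathers its operand from the global vector, so zero-padded rows stay branch-free.

```cuda
// One thread evaluates one panel potential of a target cube by multiplying a
// row of the mixed coefficient matrix with operands gathered from the global
// vector through the mixed index matrix.
__global__ void evalCubePotentials(const float* __restrict__ phiMix,  // nRows x nCols, row-major
                                   const int*   __restrict__ idxMix,  // nRows x nCols gather indices
                                   const float* __restrict__ vecGlobal,
                                   float* p, int nRows, int nCols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float acc = 0.0f;
    for (int c = 0; c < nCols; ++c) {
        // Gather the operand (a charge, multipole term, or the zero pad).
        acc += phiMix[row * nCols + c] * vecGlobal[idxMix[row * nCols + c]];
    }
    p[row] = acc;  // potential of panel 'row' in this cube
}
```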
d) GPU Performance Modeling and Workload Balancing
[Plot: runtime vs. number of SMs (1, 2, 3, 4, …, n) — ideal linear speedups (T0, T0/2, T0/3, T0/4, …, T0/n) vs. realistic speedups, with a super-linear speedup range]
[Plot: number of coefficients vs. cube indices after ordering — coefficient-matrix clusters assigned 8, 3, 2, and 2 SMs]
Small tests on each cluster feed the model to find the optimal SM assignment, normalized by the number of SMs on the GPU:
$$S = \frac{1}{N_S} \sum_i N_i^{sm} \times p_i$$
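A host-side sketch of the small-test idea (illustrative only; the policy and names are our assumptions, not the paper's algorithm): measure each cluster in a small test, then hand out SMs roughly in proportion to the measured work so all clusters finish at about the same time.

```cuda
#include <vector>
#include <algorithm>

// Assign SMs to coefficient-matrix clusters in proportion to their measured
// small-test runtimes (hypothetical greedy policy for illustration).
std::vector<int> assignSMs(const std::vector<float>& testRuntime, int totalSMs)
{
    float total = 0.0f;
    for (float t : testRuntime) total += t;
    std::vector<int> sms(testRuntime.size(), 1);   // every cluster gets >= 1 SM
    int remaining = totalSMs - (int)testRuntime.size();
    for (size_t i = 0; i < testRuntime.size() && remaining > 0; ++i) {
        int extra = std::min(remaining, (int)(testRuntime[i] / total * totalSMs) - 1);
        if (extra > 0) { sms[i] += extra; remaining -= extra; }
    }
    // Hand any leftover SMs to the slowest cluster.
    while (remaining-- > 0)
        ++sms[std::max_element(testRuntime.begin(), testRuntime.end()) - testRuntime.begin()];
    return sms;
}
```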
e) Experimental Results [4]
[Bar chart: runtime of a single FMM iteration on CPU vs. GPU (ms) across benchmarks — 18X, 18X, 22X, 26X, 26X, and 30X speedups]
[Bar chart: runtime of each kernel (Precond, Direct, Upward, Downward, Evaluation) on CPU vs. GPU (ms) — 45X, 45X, 3.5X, 33X, and 24X speedups]
3. PERFORMANCE MODELING & OPTIMIZATION FOR ELECTRICAL & THERMAL SIMULATIONS
– 3D parasitic extraction: capacitance extraction for delay, noise, and power estimation computes the charge distributions of discretized panels.
– Full-chip thermal simulation: thermal simulation of 3-D dense mesh structures requires solving many millions of unknowns.
[Figures: 3D interconnections; power delivery network]
a) GPU Has Evolved into a Very Flexible and Powerful Processor
c) Latest NVIDIA Fermi GPU Architecture
[Diagram: Fermi GPU architecture — 16 streaming multiprocessors (SM1–SM16) attached to global memory; each SM contains 32 SPs, 4 SFUs, an instruction L1 cache, and shared memory & L1 cache]
– All processors do the same job on different data sets (SIMD)
– More transistors are devoted to data processing
– Fewer transistors are used for data caching and control flow
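As a concrete (purely illustrative) example of this SIMD style in CUDA, every thread below executes the same instruction stream on a different array element:

```cuda
#include <cstdio>

// Every thread runs the same code on its own data index (SIMD/SIMT).
__global__ void scaleAdd(const float* x, const float* y, float* z, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index per thread
    if (i < n) z[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *hx = new float[n], *hz = new float[n];
    for (int i = 0; i < n; ++i) hx[i] = 1.0f;
    float *dx, *dy, *dz;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dz, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleAdd<<<(n + 255) / 256, 256>>>(dx, dy, dz, 3.0f, n);
    cudaMemcpy(hz, dz, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("z[0] = %f\n", hz[0]);   // 3*1 + 1 = 4
    cudaFree(dx); cudaFree(dy); cudaFree(dz);
    delete[] hx; delete[] hz;
    return 0;
}
```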
d) Concurrent Kernel Executions
[Diagram: serial kernel execution — kernels 1–6 run one after another over time, each occupying only a fraction of the SMs]
[Diagram: concurrent kernel execution — kernels 1–6 overlap (e.g., kernel 3 continues alongside kernels 4 and 5), raising SM occupancy and shortening the running time]
– GPU concurrent kernel execution can improve the SM occupancy.
– Fermi GPUs allow at most 16 different kernels to execute at the same time.
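A minimal sketch of launching independent kernels into separate CUDA streams so that a Fermi-class (or newer) GPU may overlap them; the kernel body and sizes are placeholders, not from the poster.

```cuda
// Launch small, independent kernels in different streams; on GPUs with
// concurrent-kernel support the hardware may execute them simultaneously.
__global__ void smallKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // placeholder work
}

int main()
{
    const int nKernels = 6, n = 4096;
    cudaStream_t streams[nKernels];
    float* buf[nKernels];
    for (int k = 0; k < nKernels; ++k) {
        cudaStreamCreate(&streams[k]);
        cudaMalloc(&buf[k], n * sizeof(float));
        // Each small grid alone underutilizes the SMs, so overlapping
        // the six launches improves occupancy.
        smallKernel<<<(n + 255) / 256, 256, 0, streams[k]>>>(buf[k], n);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < nKernels; ++k) {
        cudaStreamDestroy(streams[k]);
        cudaFree(buf[k]);
    }
    return 0;
}
```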
[Figure: 3D-ICs with microchannel cooling — substrate layers 1, 2, and 3]
– Large power grid simulation: DC and transient power grid simulations can involve hundreds of millions of nodes.
b) Heterogeneous CPU-GPU Computing Achieves Much Better Energy Efficiency
Examples: Intel Larrabee, AMD APU, NVIDIA Tegra 2
b) Analytical GPU Runtime Performance Model (ISCA'09)
– MWP (Memory Warp Parallelism): the maximum number of warps per SM that can access the memory simultaneously.
– CWP (Computation Warp Parallelism): the number of warps the SM processor can execute during one memory-warp waiting period, plus one.
[Timeline diagrams: warps issuing 8 computation periods + 1 memory period each, shown with an additional delay, with synchronization, and for the case where MWP is greater than CWP]
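Purely as an illustration of these two definitions (the numbers and variable names are ours, and this omits the rest of the ISCA'09 model), a tiny host-side calculator that classifies a kernel as computation- or memory-dominated:

```cuda
#include <algorithm>
#include <cstdio>

// CWP: warps an SM can execute during one memory waiting period, plus one.
// MWP: max warps per SM that can overlap their memory accesses.
int main()
{
    float compCycles  = 8.0f;    // computation cycles per warp between memory ops
    float memWait     = 32.0f;   // cycles one memory access keeps a warp waiting
    int   activeWarps = 6;       // warps resident on the SM
    int   mwpLimit    = 4;       // bandwidth/latency-limited memory parallelism

    int cwp = std::min((int)(memWait / compCycles) + 1, activeWarps);
    int mwp = std::min(mwpLimit, activeWarps);

    if (mwp >= cwp)
        printf("CWP=%d <= MWP=%d: computation dominates; memory latency is hidden\n", cwp, mwp);
    else
        printf("CWP=%d > MWP=%d: memory dominates; warps idle waiting on memory\n", cwp, mwp);
    return 0;
}
```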
c) Adaptive Performance Modeling (PPoPP'10)
Work Flow Graph (WFG): an extension of the control flow graph of a GPU kernel. Nodes represent computation instructions (C) and memory instructions (M); edges carry the instruction issue latency (Lc), the memory loading latency (Lm), and memory-dependency information.
[Diagram: a simple WFG — entry → C1 … Cn → M → exit, with Lc on computation edges and Lm on memory edges]
[Diagram: WFG of the multigrid smoother — computation nodes C1–C7, memory nodes (Mg, Ms, Mt2, Mts), an inner loop, synchronization nodes, and the CPU/GPU boundary]
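A host-side sketch of what a WFG representation for runtime estimation might look like; the structure and the simple straight-line charging rule are our illustrative assumptions, not the PPoPP'10 implementation.

```cuda
#include <vector>

// Minimal WFG node: a run of computation instructions or one memory instruction.
enum class NodeKind { Compute, Memory };

struct WfgNode {
    NodeKind kind;
    int count;        // e.g., n back-to-back computation instructions
};

// Walk a straight-line WFG path, charging issue latency Lc per computation
// instruction and memory latency Lm per (non-overlapped) memory instruction.
float estimateCycles(const std::vector<WfgNode>& path, float Lc, float Lm)
{
    float cycles = 0.0f;
    for (const WfgNode& n : path)
        cycles += (n.kind == NodeKind::Compute) ? Lc * n.count : Lm;
    return cycles;
}
```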
4. REFERENCES
[1] Z. Feng, X. Zhao, and Z. Zeng. Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization. In IEEE TCAD, 2011.
[2] Z. Feng and Z. Zeng. Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis. In Proc. IEEE/ACM DAC, 2010.
[3] Z. Feng and P. Li. Fast thermal analysis on GPU for 3D-ICs with integrated microchannel cooling. In Proc. IEEE/ACM ICCAD, 2010.
[4] X. Zhao and Z. Feng. Fast multipole method on GPU: tackling 3-D capacitance extraction on massively parallel SIMD platforms. In Proc. IEEE/ACM DAC, 2011.
d) Experimental Results
Chip-level electrical [1,2] and thermal modeling & simulation [3]: IC temperature prediction, voltage waveform prediction, and performance modeling & optimization [1].
[Plots: worst-case voltage waveform predictions (V vs. time), Cholmod vs. GPU solver, at full range (1.765–1.8 V over 0–5 ns) and zoomed in (around 1.771 V over 2.3–2.335 ns) — 22X and 25X speedups]
[Surface plot: IC temperature prediction (Celsius) over the chip x-y plane, roughly 60–80 °C]
Automatically identify the optimal simulation configurations.
a) Multigrid Preconditioned Krylov Subspace Iterative Solver
[Figure: power grid with VDD pads; 3D multi-layer irregular electrical/thermal grid — the original grid resides in host (CPU) memory and is mapped to GPU global memory]
DC: $G\,x = b$
TR: $G\,x(t) + C\,\dfrac{dx(t)}{dt} = b(t)$
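For instance, discretizing the TR equation with a backward-Euler step of size h (a standard choice, not stated on the poster) turns each time step into a DC-like linear solve:
$$\left(G + \frac{C}{h}\right) x(t+h) \;=\; b(t+h) + \frac{C}{h}\, x(t)$$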
Jacobi smoother using a sparse matrix, e.g. for an 8-node grid:
$$A = \begin{bmatrix}
a_{1,1} & & & a_{1,4} & a_{1,5} & & & \\
 & a_{2,2} & a_{2,3} & & & a_{2,6} & & \\
 & a_{3,2} & a_{3,3} & & & & & a_{3,8} \\
a_{4,1} & & & a_{4,4} & & & & \\
a_{5,1} & & & & a_{5,5} & & a_{5,7} & \\
 & a_{6,2} & & & & a_{6,6} & & a_{6,8} \\
 & & & & a_{7,5} & & a_{7,7} & \\
 & & a_{8,3} & & & a_{8,6} & & a_{8,8}
\end{bmatrix}$$
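A minimal CUDA sketch of one Jacobi sweep over a CSR-stored matrix like the one above (our own illustration; the papers' smoother kernels are more elaborate):

```cuda
// One Jacobi sweep: xNew[i] = (b[i] - sum_{j != i} a_ij * x[j]) / a_ii.
// CSR arrays: rowPtr (n+1 entries), colIdx and val (one entry per nonzero).
__global__ void jacobiSweep(const int*   __restrict__ rowPtr,
                            const int*   __restrict__ colIdx,
                            const float* __restrict__ val,
                            const float* __restrict__ b,
                            const float* __restrict__ x,
                            float* xNew, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float diag = 0.0f, off = 0.0f;
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
        if (colIdx[k] == i) diag = val[k];                 // a_ii
        else                off += val[k] * x[colIdx[k]];  // off-diagonal terms
    }
    xNew[i] = (b[i] - off) / diag;
}
```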
Geometric multigrid solver (GMD): matrix + geometric representations combine into a GPU-friendly multi-level iterative algorithm.
MGPCG algorithm on GPU:
– Set initial solution
– Get initial residual and search direction
– Loop: update solution and residual → multigrid preconditioning → check convergence
– Not converged: update the search direction and continue the loop
– Converged: return the final solution
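To show the loop structure, here is a self-contained host-side sketch of this MGPCG flow; purely for brevity the multigrid V-cycle is stubbed by a diagonal (Jacobi) preconditioner, and the dense matrix and names are illustrative assumptions, not the papers' GPU implementation.

```cuda
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;

Vec matvec(const Mat& A, const Vec& v) {
    Vec r(v.size(), 0.0f);
    for (size_t i = 0; i < v.size(); ++i)
        for (size_t j = 0; j < v.size(); ++j) r[i] += A[i][j] * v[j];
    return r;
}
float dot(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
// Stand-in for the multigrid V-cycle: z = D^{-1} r.
Vec precond(const Mat& A, const Vec& r) {
    Vec z(r.size());
    for (size_t i = 0; i < r.size(); ++i) z[i] = r[i] / A[i][i];
    return z;
}

void mgpcg(const Mat& A, const Vec& b, Vec& x, float tol, int maxIter) {
    Vec r = b;                       // initial residual (initial solution x = 0)
    Vec z = precond(A, r);           // multigrid preconditioning
    Vec p = z;                       // initial search direction
    float rz = dot(r, z);
    for (int it = 0; it < maxIter; ++it) {
        Vec Ap = matvec(A, p);
        float alpha = rz / dot(p, Ap);
        for (size_t i = 0; i < x.size(); ++i) {  // update solution and residual
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        if (std::sqrt(dot(r, r)) < tol) return;  // converged: return final solution
        z = precond(A, r);                       // precondition the new residual
        float rzNew = dot(r, z);
        for (size_t i = 0; i < p.size(); ++i)    // update search direction
            p[i] = z[i] + (rzNew / rz) * p[i];
        rz = rzNew;
    }
}
```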
– 180 GB/s (GPU) vs. 30 GB/s (CPU) memory bandwidth; 1350 GFLOPS (GPU) vs. 200 GFLOPS (CPU)
– Achieve highly energy-efficient computing and CPU-GPU communications
– Need to reinvent CAD algorithms: circuit simulations, interconnect modeling, performance optimizations, ...