Gate-level Simulation with GP-GPUs
1
Design flow: role of simulation
HDL design:
    if (sel == 1'b1)
        f <= a;
    else
        f <= b;
HDL verification: simulation, formal verification (mostly simulation)
Synthesis: HDL design to netlist
Netlist verification: equivalence checking (netlist = HDL design), functional validation
Netlist simulation is extremely slow
2
Event-driven simulation
Inputs changing value create new events
An event propagates if the output of a gate changes
[Figure: levelized netlist, from primary inputs / register outputs to primary outputs / register inputs, with events propagating over time steps t=0 to t=6]
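The two rules above can be illustrated with a minimal zero-delay sketch. The three-gate netlist, the net names, and the helper functions `settle` and `apply_inputs` are all hypothetical, chosen only for illustration; a production simulator would also model time, which this sketch omits.

```python
from collections import deque

# Hypothetical 3-gate netlist: output net -> (gate function, input nets).
GATES = {
    "n1": (lambda a, b: a & b, ("a", "b")),
    "n2": (lambda a, b: a | b, ("b", "c")),
    "f":  (lambda a, b: a ^ b, ("n1", "n2")),
}

# Fanout map: net -> gates (named by their output net) that read it.
FANOUT = {}
for out, (_, ins) in GATES.items():
    for net in ins:
        FANOUT.setdefault(net, []).append(out)

def settle(values):
    """Evaluate every gate once, in topological order, for a consistent start."""
    for out in ("n1", "n2", "f"):
        fn, ins = GATES[out]
        values[out] = fn(*(values[i] for i in ins))

def apply_inputs(values, changes):
    """Apply input changes, propagate events, return the gates evaluated."""
    queue = deque()
    for net, val in changes.items():
        if values[net] != val:           # an input changing value creates an event
            values[net] = val
            queue.extend(FANOUT.get(net, []))
    evaluated = []
    while queue:
        out = queue.popleft()
        fn, ins = GATES[out]
        new = fn(*(values[i] for i in ins))
        evaluated.append(out)
        if new != values[out]:           # propagate only if the gate output changes
            values[out] = new
            queue.extend(FANOUT.get(out, []))
    return evaluated

values = {"a": 0, "b": 0, "c": 1}
settle(values)
evaluated = apply_inputs(values, {"a": 1})
print(evaluated, values["f"])            # only n1 is re-evaluated; f is untouched
```

Changing `a` re-evaluates only `n1`, whose output stays 0, so the event dies there and the rest of the netlist is never touched.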
3
Event-driven: GP-GPU perspective
[Figure: thread blocks, each a group of threads (T) with its own shared memory]
Scenario: individual threads simulate individual gates
Only a small fraction of gates is active simultaneously
Frequent synchronization across all thread-blocks is needed
Better alternative: each thread-block simulates an independent chunk of logic (a macro-gate)
Internal net values are stored in fast shared memory for performance
[Figure: device memory and thread blocks, each with its own shared memory; one macro-gate is mapped to a thread block]
4
How do we carve macro-gates?
[Figure: netlist between primary inputs / register outputs and primary outputs / register inputs, levelized into levels 0 to 5 and partitioned into layers 0 to 2, with gap = 2 for each layer and lid = 2]
1. Levelize as-late-as-possible (ALAP)
2. Partition in layers
3. Cluster in lids
4. Aggregate ‘lid’ cones of logic in one macro-gate
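The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the netlist is hypothetical, `gap` is taken to be the number of levels per layer, `lid` the number of layer-output nets grouped into one macro-gate, and outputs are grouped in plain sorted order rather than by any overlap heuristic.

```python
# Hypothetical netlist: gate -> fanin nets. Nets that are not keys are primary inputs.
NETLIST = {
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g3": ["g1", "g2"],
    "g4": ["c", "d"],
    "g5": ["g3", "g4"],
    "g6": ["g4", "d"],
}

def alap_levels(netlist):
    """Step 1: place every gate as close to the outputs as its fanout allows."""
    fanout = {g: [] for g in netlist}
    for g, ins in netlist.items():
        for net in ins:
            if net in netlist:
                fanout[net].append(g)
    dist = {}                                # longest path from a gate to an output
    def to_output(g):
        if g not in dist:
            dist[g] = 1 + max((to_output(s) for s in fanout[g]), default=-1)
        return dist[g]
    deepest = max(to_output(g) for g in netlist)
    return {g: deepest - dist[g] for g in netlist}, fanout

def carve(netlist, gap, lid):
    level, fanout = alap_levels(netlist)
    layer = {g: level[g] // gap for g in netlist}          # step 2: layers of `gap` levels
    macro_gates = []
    for k in sorted(set(layer.values())):
        # Nets leaving layer k: design outputs, or gates feeding a later layer.
        tops = sorted(g for g in netlist if layer[g] == k and
                      (not fanout[g] or any(layer[s] > k for s in fanout[g])))
        for start in range(0, len(tops), lid):             # step 3: groups of `lid` nets
            group = tops[start:start + lid]
            cone, stack = set(), list(group)               # step 4: cones within the layer
            while stack:
                g = stack.pop()
                if g in cone:
                    continue
                cone.add(g)
                stack.extend(n for n in netlist[g] if n in netlist and layer[n] == k)
            macro_gates.append((k, group, sorted(cone)))
    return macro_gates

for macro_gate in carve(NETLIST, gap=2, lid=2):
    print(macro_gate)
```

In this sketch a gate that belongs to the cones of two different groups is simply copied into both macro-gates, which keeps the macro-gates of a layer independent of each other.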
5
Parallelism: GP-GPU perspective
Thread-block level parallelism: independent logic blocks (macro-gates)
Thread level parallelism: independent logic gates in the same level
[Figure: thread blocks 0 and 1, each a grid of threads (T)]
6
Event-driven simulation: mechanism
Event-driven simulation proceeds at macro-gate granularity
After each epoch of simulation:
Check monitored nets
Schedule macro-gates of next layer
[Figure: event-driven simulation at t=0, t=T, t=2T and t=3T; legend: monitored nets, monitored nets with a value change event]
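A minimal sketch of this mechanism, assuming each macro-gate can be treated as a black box with monitored input and output nets. The four macro-gates, their functions, and the name `simulate_cycle` are hypothetical; on a GPU, each active macro-gate would be handed to a thread-block rather than evaluated in a loop.

```python
# Hypothetical macro-gates: name -> (layer, monitored input nets, output net, function).
MACRO_GATES = {
    "M0": (0, ("a", "b"), "x", lambda a, b: a & b),
    "M1": (0, ("c", "d"), "y", lambda c, d: c | d),
    "M2": (1, ("x", "y"), "z", lambda x, y: x ^ y),
    "M3": (1, ("y", "d"), "w", lambda y, d: y & d),
}
NUM_LAYERS = 2

def simulate_cycle(nets, input_changes):
    """Simulate one cycle; return the macro-gates scheduled in each layer."""
    changed = set()
    for net, val in input_changes.items():
        if nets[net] != val:
            nets[net] = val
            changed.add(net)
    trace = []
    for layer in range(NUM_LAYERS):
        # Schedule only the macro-gates of this layer that see a changed monitored net.
        active = [name for name, (lyr, ins, _, _) in MACRO_GATES.items()
                  if lyr == layer and changed.intersection(ins)]
        for name in active:
            _, ins, out, fn = MACRO_GATES[name]
            new = fn(*(nets[i] for i in ins))
            if new != nets[out]:             # check the monitored output net
                nets[out] = new
                changed.add(out)
        trace.append(active)
    return trace

# Consistent starting state for inputs a=1, b=1, c=0, d=0.
nets = {"a": 1, "b": 1, "c": 0, "d": 0, "x": 1, "y": 0, "z": 1, "w": 0}
print(simulate_cycle(nets, {"c": 1}))        # M0 is never scheduled
```

Here a change on `c` activates `M1` in layer 0, and the resulting change on `y` schedules `M2` and `M3` in layer 1, while `M0` stays idle for the whole cycle.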
7
Simulation within a thread-block
[Figure: oblivious simulation within a thread-block. Execution threads evaluate the elements labeled M00 to M05 and M11 to M13 level by level (Level 0 to Level 3), with a sync barrier after each level; nets n0 to n3 are marked, with monitored nets at the layer boundary and at the next layer boundary]
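The level-by-level scheme in the figure can be imitated on a CPU with Python threads and `threading.Barrier` standing in for GPU threads and the sync barrier. The gates, net names, and the function `simulate_oblivious` are assumptions made for this sketch; the dictionary `nets` plays the role of shared memory.

```python
import threading

# Hypothetical macro-gate: gates grouped by level, each as (output net, function, inputs).
LEVELS = [
    [("n0", lambda a, b: a & b, ("a", "b")),
     ("n1", lambda b, c: b | c, ("b", "c")),
     ("n2", lambda c, d: c ^ d, ("c", "d"))],
    [("n3", lambda x, y: x | y, ("n0", "n1")),
     ("n4", lambda x, y: x & y, ("n1", "n2"))],
    [("out", lambda x, y: x ^ y, ("n3", "n4"))],
]

def simulate_oblivious(inputs, num_threads=3):
    nets = dict(inputs)                      # stands in for shared memory
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        for gates in LEVELS:
            # Thread tid evaluates gates tid, tid + num_threads, ... of this level,
            # whether or not their inputs changed (oblivious simulation).
            for out, fn, ins in gates[tid::num_threads]:
                nets[out] = fn(*(nets[i] for i in ins))
            barrier.wait()                   # sync barrier before the next level

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return nets

result = simulate_oblivious({"a": 1, "b": 0, "c": 1, "d": 1})
print(result["out"])
```

Gates in the same level never read each other's outputs, so the only ordering that matters is the barrier between levels. Note that in the last level two of the three threads have nothing to do, which is the under-utilization that balancing targets.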
8
Macro-gate balancing
We want ALL threads working at ALL times
Rebalance gate schedule within a macro-gate to increase utilization
Trade off latency for utilization
[Figure: gate schedule over execution threads before and after balancing; labels: under-utilized threads, fewer idle threads, less latency, more latency]
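The trade-off can be made concrete with a small scheduling sketch. It uses plain greedy list scheduling with a fixed number of threads, which is a stand-in chosen for illustration and not necessarily the balancing procedure used by the authors; the dependency graph and the names `schedule` and `utilization` are hypothetical.

```python
# Hypothetical macro-gate: gate -> gates inside the macro-gate that it depends on.
DEPS = {
    "g0": [], "g1": [], "g2": [], "g3": [], "g4": [], "g5": [],
    "g6": ["g0", "g1"],
    "g7": ["g2", "g3"],
    "g8": ["g6", "g7"],
}

def schedule(deps, width):
    """Greedy list scheduling: at most `width` gates per step, dependencies first."""
    done, steps = set(), []
    remaining = list(deps)
    while remaining:
        ready = [g for g in remaining if all(d in done for d in deps[g])]
        step = ready[:width]                 # fill the available threads
        steps.append(step)
        done.update(step)
        remaining = [g for g in remaining if g not in done]
    return steps

def utilization(steps, width):
    """Fraction of thread slots that do useful work."""
    return sum(len(s) for s in steps) / (len(steps) * width)

wide = schedule(DEPS, width=6)               # one thread per gate of the widest level
narrow = schedule(DEPS, width=3)             # rebalanced onto fewer threads
print(len(wide), utilization(wide, 6))       # 3 steps, half of the slots used
print(len(narrow), utilization(narrow, 3))   # 4 steps, three quarters of the slots used
```

With six threads the macro-gate finishes in three steps but half of the thread slots are idle; with three threads it needs four steps and only a quarter of the slots are idle.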
9
Performance vs. commercial simulator
Comparison against event-driven simulator (compile target: 4 processes)
[Figure: bar chart of performance speedup (times), up to 44x; legend: Oblivious, Event-driven. Designs: SPARC x1 (1M), SPARC x2 (1M), SPARC x4 (1M), NoC 3x3 (2M), NoC 4x4 (10M), JPEG (3M), LDPC (1M), LDPC (10M), Alpha-pipe (13M), Alpha-nopipe (12M)]
Gate-level Simulation with GP-GPUs
Debapriya Chatterjee, Andrew DeOrio and Valeria Bertacco
The University of Michigan, Ann Arbor