

TRANSCRIPT

Page 1: Gate-level Simulation with GP-GPUs

[Title-slide figure: a macro-gate simulated level by level (Levels 0-3), with a sync barrier between consecutive levels]

1

Design flow: role of simulation

Synthesis

    if (sel == 1'b1)
        f <= a;
    else
        f <= b;

HDL design

Simulation, Formal verification

Mostly simulation

Netlist

Equivalence checking (HDL design = netlist)

Verification: functional validation

Netlist simulation is extremely slow

2

Event-driven simulation

primary inputs / register outputs

primary outputs / register inputs

Levelized netlist

Inputs changing value create new events

Event propagates if the output of a gate changes

[Figure: events propagating through the levelized netlist at time steps t=0 through t=6]
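The mechanism on this slide can be sketched in Python. This is a toy model, not the authors' GPU implementation; the three-gate netlist and all net names are invented for illustration:

```python
# Minimal event-driven gate simulation over a tiny netlist (a sketch).
# A gate is re-evaluated only when an input net changes value, and the
# event propagates only if the gate's output actually changes.
from collections import defaultdict, deque

# (gate function, input nets, output net) -- hypothetical netlist
GATES = [
    (lambda a, b: a & b, ("in0", "in1"), "n0"),
    (lambda a, b: a | b, ("n0", "in2"), "n1"),
    (lambda a: a ^ 1,    ("n1",),       "out"),
]

fanout = defaultdict(list)            # net -> gates it drives
for gate in GATES:
    for net in gate[1]:
        fanout[net].append(gate)

def propagate(values, changed):
    """Run events until the netlist settles."""
    events = deque(changed)
    while events:
        net = events.popleft()
        for fn, ins, out in fanout[net]:
            new = fn(*(values[i] for i in ins))
            if new != values[out]:    # new event only on a value change
                values[out] = new
                events.append(out)
    return values

# A stable state, then a single primary-input event on in1:
values = {"in0": 1, "in1": 1, "in2": 0, "n0": 1, "n1": 1, "out": 0}
values["in1"] = 0
propagate(values, ["in1"])
```

Gates whose output does not change generate no new events, which is exactly why only a small fraction of gates is active at any time.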

3

[Figure: a thread block of execution threads with per-block shared memory]

Event-driven: GP-GPU perspective

Scenario: individual threads simulate individual gates

Small fraction of gates are simultaneously active

Frequent synchronization across all thread-blocks needed

Better alternative: each thread-block simulates an independent chunk of logic (macro-gate)

Internal net values are stored in fast shared memory for performance

[Figure: a macro-gate mapped to one thread block; internal nets held in per-block shared memory, boundary nets in device memory]

4

How do we carve macro-gates?

[Figure: netlist between primary inputs / register outputs and primary outputs / register inputs, levelized into levels 0-5, partitioned into layers 0-2 with gap = 2 levels per layer, and clustered with lid = 2 cones per macro-gate]

1. Levelize as-late-as-possible (ALAP)

2. Partition in layers

3. Cluster in lids

4. Aggregate ‘lid’ cones of logic into one macro-gate
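The four carving steps can be sketched in Python on a toy chain netlist. This is a simplified model: the netlist format and names are invented, and the round-robin grouping of top-level gates is a stand-in for the paper's cone-clustering heuristic:

```python
# Sketch of macro-gate carving: ALAP levelize, partition into layers
# of `gap` levels, cluster `lid` cones, aggregate each cluster's fanin.

def alap_levelize(gates, fanout):
    """ALAP: push each gate as late as its fanout allows, i.e.
    level(g) = depth - longest_path_from_g_to_an_output."""
    dist = {}
    def longest(g):
        if g not in dist:
            succ = fanout.get(g, [])
            dist[g] = 0 if not succ else 1 + max(longest(s) for s in succ)
        return dist[g]
    depth = max(longest(g) for g in gates)
    return {g: depth - dist[g] for g in gates}

def carve_macro_gates(gates, fanin, fanout, gap, lid):
    levels = alap_levelize(gates, fanout)            # step 1
    layer_of = {g: levels[g] // gap for g in gates}  # step 2: layers
    macro_gates = []
    for layer in sorted(set(layer_of.values())):
        members = [g for g in gates if layer_of[g] == layer]
        top = layer * gap + gap - 1                  # layer's top level
        roots = [g for g in members if levels[g] == top] or members
        for i in range(0, len(roots), lid):          # step 3: lid cones
            cone, work = set(), list(roots[i:i + lid])
            while work:                              # step 4: aggregate
                g = work.pop()                       # the cone of logic,
                if g in cone or layer_of.get(g) != layer:
                    continue                         # clipped at layer
                cone.add(g)
                work.extend(fanin.get(g, []))
            macro_gates.append(sorted(cone))
    return macro_gates

# Toy chain a -> b -> c -> d, gap = 2 levels per layer, lid = 2
fanout = {"a": ["b"], "b": ["c"], "c": ["d"]}
fanin = {"b": ["a"], "c": ["b"], "d": ["c"]}
mgs = carve_macro_gates(["a", "b", "c", "d"], fanin, fanout, gap=2, lid=2)
```

On the chain, ALAP gives levels 0-3, the gap of 2 yields two layers, and each layer's output cone becomes one macro-gate.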

5

Parallelism: GP-GPU perspective

Thread-block level parallelism: independent logic blocks (macro-gates)

Thread level parallelism: independent logic gates in the same level

[Figure: several thread blocks (Thread block 0, Thread block 1, ...), each running many execution threads]

6

Event-driven simulation: mechanism

Event-driven simulation proceeds at macro-gate granularity

After each epoch of simulation:

Check monitored nets

Schedule macro-gates of next layer

[Figure: simulation timeline; at each epoch boundary (t=T, t=2T, t=3T), monitored nets with a value-change event determine which macro-gates to schedule next]
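The epoch loop can be sketched as follows. This is a schematic model: the layer lists, `simulate_mg`, and the net names are hypothetical, not the authors' code:

```python
# Event-driven scheduling at macro-gate granularity: after each epoch,
# only macro-gates of the next layer whose monitored input nets saw a
# value-change event are scheduled.

def run_epoch(layers, monitored_inputs, simulate_mg, changed_nets):
    """layers: macro-gate ids per layer, in schedule order.
    monitored_inputs: mg id -> set of monitored nets it depends on.
    simulate_mg: mg id -> set of its monitored outputs that changed."""
    activated = []
    for layer in layers:
        # schedule only macro-gates sensitized by an event
        active = [mg for mg in layer if monitored_inputs[mg] & changed_nets]
        activated.append(active)
        changed_nets = set()                 # events for the next layer
        for mg in active:
            changed_nets |= simulate_mg(mg)  # oblivious inside the mg
    return activated

layers = [["A", "B"], ["C"]]
monitored = {"A": {"n0"}, "B": {"n1"}, "C": {"n2"}}
outputs = {"A": {"n2"}, "B": {"n3"}}         # changed outputs per mg
activated = run_epoch(layers, monitored,
                      lambda mg: outputs.get(mg, set()), {"n0"})
```

Here an event on n0 activates only macro-gate A in layer 0, whose output event on n2 activates C in the next layer; B is never simulated.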

7

Simulation within a thread-block

Execution threads evaluate the macro-gate level by level: all gates of Level 0 (M00-M03), then a sync barrier, then Level 1, and so on through Level 3. Monitored nets at the layer boundary (n0-n3) are read in; monitored nets at the next layer boundary are written out.

[Figure: gate schedule M00-M13 mapped across execution threads, with a sync barrier between consecutive levels]

OBLIVIOUS SIMULATION: within the thread block, every gate in the schedule is evaluated, regardless of activity
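A serial Python model of this schedule follows; committing results only at the end of each level stands in for the GPU sync barrier, and the gate functions and net names are invented for illustration:

```python
# Serial model of oblivious simulation inside one thread block:
# every gate in the schedule is evaluated, level by level.

def simulate_macro_gate(schedule, nets):
    """schedule: list of levels; each level is a list of
    (function, input_nets, output_net) that are mutually independent."""
    for level in schedule:
        # Within a level, gates read but never write each other's nets,
        # so on the GPU they can run concurrently between two barriers.
        results = [(out, fn(*(nets[i] for i in ins)))
                   for fn, ins, out in level]
        for out, val in results:      # "sync barrier": commit the level
            nets[out] = val
    return nets

nets = {"n0": 1, "n1": 0, "n2": 1}
schedule = [
    [(lambda a, b: a & b, ("n0", "n1"), "x"),      # Level 0
     (lambda a, b: a | b, ("n1", "n2"), "y")],
    [(lambda a, b: a ^ b, ("x", "y"), "out")],     # Level 1
]
simulate_macro_gate(schedule, nets)
```

Unlike the event-driven loop at macro-gate granularity, nothing here checks whether inputs changed: obliviousness trades wasted evaluations for barrier-only synchronization.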

8

Macro-gate balancing

We want ALL threads working at ALL times

Rebalance gate schedule within a macro-gate to increase utilization

Trade off latency for utilization

Balance: trade the unbalanced schedule (less latency, under-utilized threads) for a balanced one (more latency, fewer idle threads)

[Figure: execution-thread schedules before balancing (wide and short, many idle threads) and after balancing (narrower and longer, fewer idle threads)]
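A minimal sketch of the trade-off: splitting wide levels across extra rows is one simple legal rebalancing, since same-level gates are independent. It is a stand-in for the paper's balancing heuristic, not its exact algorithm, and the gate names are invented:

```python
# Cap the schedule at `width` threads by splitting wide levels into
# extra rows: latency (row count) grows, thread utilization rises.

def balance(schedule, width):
    """Return rows of at most `width` gates, preserving level order."""
    rows = []
    for level in schedule:
        for i in range(0, len(level), width):
            rows.append(level[i:i + width])
    return rows

def utilization(rows, width):
    """Fraction of thread slots doing useful work."""
    return sum(len(r) for r in rows) / (len(rows) * width)

# Unbalanced: 3 levels on 6 threads; balanced: 4 rows on 3 threads
sched = [["g0", "g1", "g2", "g3", "g4", "g5"], ["g6", "g7"], ["g8"]]
rows = balance(sched, width=3)
```

On this example, latency grows from 3 rows to 4, while utilization rises from 0.5 (on 6 threads) to 0.75 (on 3 threads).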

9

Performance vs. commercial simulator

[Chart: performance speedup (times) over a commercial simulator (compile target: 4 processes), for both the oblivious and the event-driven GPU simulators, on SPARC x1/x2/x4 (1M), NoC 3x3 (2M), NoC 4x4 (10M), JPEG (3M), LDPC (1M and 10M), Alpha-pipe (13M) and Alpha-nopipe (12M); speedups of up to 44x]

Gate-level Simulation with GP-GPUs

Debapriya Chatterjee, Andrew DeOrio and Valeria Bertacco

The University of Michigan, Ann Arbor