

TRANSCRIPT

Page 1: Gate-level Simulation with GP-GPUs

[Title-slide figure: a macro-gate simulated level by level (Levels 0-3), with a sync barrier between consecutive levels]

1

Design flow: role of simulation

Synthesis

    if (sel == 1'b1)
        f <= a;
    else
        f <= b;

HDL design

Simulation, Formal verification

Mostly simulation

Netlist

Equivalence checking (HDL design = netlist)

Verification: functional validation

Netlist simulation is extremely slow

2

Event-driven simulation

primary inputs / register outputs

primary outputs / register inputs

Levelized netlist

Inputs changing value create new events

Event propagates if the output of a gate changes

[Figure: events propagating through the levelized netlist at time steps t=0 through t=6]
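The mechanism on this slide can be sketched in Python. This is a toy model, not the authors' GPU implementation; the three-gate netlist and all net names are invented for illustration:

```python
# Minimal event-driven gate simulation over a tiny netlist (a sketch).
# A gate is re-evaluated only when an input net changes value, and the
# event propagates only if the gate's output actually changes.
from collections import defaultdict, deque

# (gate function, input nets, output net) -- hypothetical netlist
GATES = [
    (lambda a, b: a & b, ("in0", "in1"), "n0"),
    (lambda a, b: a | b, ("n0", "in2"), "n1"),
    (lambda a: a ^ 1,    ("n1",),       "out"),
]

fanout = defaultdict(list)            # net -> gates it drives
for gate in GATES:
    for net in gate[1]:
        fanout[net].append(gate)

def propagate(values, changed):
    """Run events until the netlist settles."""
    events = deque(changed)
    while events:
        net = events.popleft()
        for fn, ins, out in fanout[net]:
            new = fn(*(values[i] for i in ins))
            if new != values[out]:    # new event only on a value change
                values[out] = new
                events.append(out)
    return values

# A stable state, then a single primary-input event on in1:
values = {"in0": 1, "in1": 1, "in2": 0, "n0": 1, "n1": 1, "out": 0}
values["in1"] = 0
propagate(values, ["in1"])
```

Gates whose output does not change generate no new events, which is exactly why only a small fraction of gates is active at any time.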

3

[Figure: a thread block of execution threads with per-block shared memory]

Event-driven: GP-GPU perspective

Scenario: individual threads simulate individual gates

Small fraction of gates are simultaneously active

Frequent synchronization across all thread-blocks needed

Better alternative: each thread-block simulates an independent chunk of logic (macro-gate)

Internal net values are stored in fast shared memory for performance

[Figure: a macro-gate mapped to one thread block; internal nets held in per-block shared memory, boundary nets in device memory]

4

How do we carve macro-gates?

[Figure: netlist between primary inputs / register outputs and primary outputs / register inputs, levelized into levels 0-5, partitioned into layers 0-2 with gap = 2 levels per layer, and clustered with lid = 2 cones per macro-gate]

1. Levelize as-late-as-possible (ALAP)

2. Partition in layers

3. Cluster in lids

4. Aggregate ‘lid’ cones of logic into one macro-gate
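The four carving steps can be sketched in Python on a toy chain netlist. This is a simplified model: the netlist format and names are invented, and the round-robin grouping of top-level gates is a stand-in for the paper's cone-clustering heuristic:

```python
# Sketch of macro-gate carving: ALAP levelize, partition into layers
# of `gap` levels, cluster `lid` cones, aggregate each cluster's fanin.

def alap_levelize(gates, fanout):
    """ALAP: push each gate as late as its fanout allows, i.e.
    level(g) = depth - longest_path_from_g_to_an_output."""
    dist = {}
    def longest(g):
        if g not in dist:
            succ = fanout.get(g, [])
            dist[g] = 0 if not succ else 1 + max(longest(s) for s in succ)
        return dist[g]
    depth = max(longest(g) for g in gates)
    return {g: depth - dist[g] for g in gates}

def carve_macro_gates(gates, fanin, fanout, gap, lid):
    levels = alap_levelize(gates, fanout)            # step 1
    layer_of = {g: levels[g] // gap for g in gates}  # step 2: layers
    macro_gates = []
    for layer in sorted(set(layer_of.values())):
        members = [g for g in gates if layer_of[g] == layer]
        top = layer * gap + gap - 1                  # layer's top level
        roots = [g for g in members if levels[g] == top] or members
        for i in range(0, len(roots), lid):          # step 3: lid cones
            cone, work = set(), list(roots[i:i + lid])
            while work:                              # step 4: aggregate
                g = work.pop()                       # the cone of logic,
                if g in cone or layer_of.get(g) != layer:
                    continue                         # clipped at layer
                cone.add(g)
                work.extend(fanin.get(g, []))
            macro_gates.append(sorted(cone))
    return macro_gates

# Toy chain a -> b -> c -> d, gap = 2 levels per layer, lid = 2
fanout = {"a": ["b"], "b": ["c"], "c": ["d"]}
fanin = {"b": ["a"], "c": ["b"], "d": ["c"]}
mgs = carve_macro_gates(["a", "b", "c", "d"], fanin, fanout, gap=2, lid=2)
```

On the chain, ALAP gives levels 0-3, the gap of 2 yields two layers, and each layer's output cone becomes one macro-gate.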

5

Parallelism: GP-GPU perspective

Thread-block level parallelism: independent logic blocks (macro-gates)

Thread level parallelism: independent logic gates in the same level

[Figure: several thread blocks (Thread block 0, Thread block 1, ...), each running many execution threads]

6

Event-driven simulation: mechanism

Event-driven simulation proceeds at macro-gate granularity

After each epoch of simulation:

Check monitored nets

Schedule macro-gates of next layer

[Figure: simulation timeline; at each epoch boundary (t=T, t=2T, t=3T), monitored nets with a value-change event determine which macro-gates to schedule next]
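The epoch loop can be sketched as follows. This is a schematic model: the layer lists, `simulate_mg`, and the net names are hypothetical, not the authors' code:

```python
# Event-driven scheduling at macro-gate granularity: after each epoch,
# only macro-gates of the next layer whose monitored input nets saw a
# value-change event are scheduled.

def run_epoch(layers, monitored_inputs, simulate_mg, changed_nets):
    """layers: macro-gate ids per layer, in schedule order.
    monitored_inputs: mg id -> set of monitored nets it depends on.
    simulate_mg: mg id -> set of its monitored outputs that changed."""
    activated = []
    for layer in layers:
        # schedule only macro-gates sensitized by an event
        active = [mg for mg in layer if monitored_inputs[mg] & changed_nets]
        activated.append(active)
        changed_nets = set()                 # events for the next layer
        for mg in active:
            changed_nets |= simulate_mg(mg)  # oblivious inside the mg
    return activated

layers = [["A", "B"], ["C"]]
monitored = {"A": {"n0"}, "B": {"n1"}, "C": {"n2"}}
outputs = {"A": {"n2"}, "B": {"n3"}}         # changed outputs per mg
activated = run_epoch(layers, monitored,
                      lambda mg: outputs.get(mg, set()), {"n0"})
```

Here an event on n0 activates only macro-gate A in layer 0, whose output event on n2 activates C in the next layer; B is never simulated.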

7

Simulation within a thread-block

Execution threads evaluate the macro-gate level by level: all gates of Level 0 (M00-M03), then a sync barrier, then Level 1, and so on through Level 3. Monitored nets at the layer boundary (n0-n3) are read in; monitored nets at the next layer boundary are written out.

[Figure: gate schedule M00-M13 mapped across execution threads, with a sync barrier between consecutive levels]

OBLIVIOUS SIMULATION: within the thread block, every gate in the schedule is evaluated, regardless of activity
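A serial Python model of this schedule follows; committing results only at the end of each level stands in for the GPU sync barrier, and the gate functions and net names are invented for illustration:

```python
# Serial model of oblivious simulation inside one thread block:
# every gate in the schedule is evaluated, level by level.

def simulate_macro_gate(schedule, nets):
    """schedule: list of levels; each level is a list of
    (function, input_nets, output_net) that are mutually independent."""
    for level in schedule:
        # Within a level, gates read but never write each other's nets,
        # so on the GPU they can run concurrently between two barriers.
        results = [(out, fn(*(nets[i] for i in ins)))
                   for fn, ins, out in level]
        for out, val in results:      # "sync barrier": commit the level
            nets[out] = val
    return nets

nets = {"n0": 1, "n1": 0, "n2": 1}
schedule = [
    [(lambda a, b: a & b, ("n0", "n1"), "x"),      # Level 0
     (lambda a, b: a | b, ("n1", "n2"), "y")],
    [(lambda a, b: a ^ b, ("x", "y"), "out")],     # Level 1
]
simulate_macro_gate(schedule, nets)
```

Unlike the event-driven loop at macro-gate granularity, nothing here checks whether inputs changed: obliviousness trades wasted evaluations for barrier-only synchronization.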

8

Macro-gate balancing

We want ALL threads working at ALL times

Rebalance gate schedule within a macro-gate to increase utilization

Trade off latency for utilization

Balance: trade the unbalanced schedule (less latency, under-utilized threads) for a balanced one (more latency, fewer idle threads)

[Figure: execution-thread schedules before balancing (wide and short, many idle threads) and after balancing (narrower and longer, fewer idle threads)]
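A minimal sketch of the trade-off: splitting wide levels across extra rows is one simple legal rebalancing, since same-level gates are independent. It is a stand-in for the paper's balancing heuristic, not its exact algorithm, and the gate names are invented:

```python
# Cap the schedule at `width` threads by splitting wide levels into
# extra rows: latency (row count) grows, thread utilization rises.

def balance(schedule, width):
    """Return rows of at most `width` gates, preserving level order."""
    rows = []
    for level in schedule:
        for i in range(0, len(level), width):
            rows.append(level[i:i + width])
    return rows

def utilization(rows, width):
    """Fraction of thread slots doing useful work."""
    return sum(len(r) for r in rows) / (len(rows) * width)

# Unbalanced: 3 levels on 6 threads; balanced: 4 rows on 3 threads
sched = [["g0", "g1", "g2", "g3", "g4", "g5"], ["g6", "g7"], ["g8"]]
rows = balance(sched, width=3)
```

On this example, latency grows from 3 rows to 4, while utilization rises from 0.5 (on 6 threads) to 0.75 (on 3 threads).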

9

Performance vs. commercial simulator

[Chart: performance speedup (times) over a commercial simulator (compile target: 4 processes), for both the oblivious and the event-driven GPU simulators, on SPARC x1/x2/x4 (1M), NoC 3x3 (2M), NoC 4x4 (10M), JPEG (3M), LDPC (1M and 10M), Alpha-pipe (13M) and Alpha-nopipe (12M); speedups of up to 44x]

Gate-level Simulation with GP-GPUs

Debapriya Chatterjee, Andrew DeOrio and Valeria Bertacco

The University of Michigan, Ann Arbor