
© Synopsys 2012

Compiling a Parallel DSL to GPU

Ramesh Narayanaswamy

Badri Gopalan

Synopsys Inc.

GTC 2012


Agenda

• Overview of Verilog Simulation

• Parallel Verilog Simulation Algorithms

• Parallel Simulation Tradeoffs on GPU

• Challenges


Verilog Hardware Description Language Objectives

• Hardware Models

– Multiple Implementation Levels – Gate, Register Transfer

– Input Output Behavior – Behavioral

• Testbench

– Generate Input Data, Expected Results

– Drive and Compare

– Input Output Behavior Level

– SystemVerilog Testbench Extensions


Model Hardware

• Modern chips run into 100s of millions of gates

• Gates in the real world can be thought of as parallel blocks in simulation

Model Testbench to test Hardware


Verilog – Models

• Gate Level

– Interconnect

− Combinational logic

− State holding gates: Latches, Flip Flops

− Largely Bit level

• Register Transfer Level (RTL)

– Interconnect Guarded Processes

– Word level

– A process has multi-line behavior

– Some processes advance the clock by one

– Clocks and sequences are explicit

• Behavioral Level

– A process encompasses many clock cycles

– A process may be longer


[Figure: circuit with inputs a, b, c and clk and output q, annotated "Combinational Logic" and "Sequential Logic".]

Guarded Process

    reg [31:0] q, a, b;
    reg c, clk;

    always @(posedge clk)
      if (c)
        q <= a;
      else
        q <= b;

Behavioral statements

    reg [31:0] q, a, b;
    reg c, clk;

    always @(posedge clk)
      if (c) begin
        @(posedge clk);   // block until the next rising clock edge
        q <= c;
      end else
        q <= b;


How Verilog Models Parallelism


    reg [31:0] a, c, b;

    always @(a or b)
      c = ^b | a;

    reg [31:0] a, b, y;

    always @(posedge clk)
      if (reset)
        a <= y;

    reg [31:0] a, b, c, y;

    always @(posedge clk)
      if (reset)
        y <= c & b ^ a;

[Figure: the three processes above drawn around a Global State box (Simulation Time, Variable Values, Process State), with annotations "Wakeup", "Guard" and "Read".]

Process State

• Where to continue?

• Blocked on a guard?

• Executing?

• Guarded Processes

– @(A or B): change on A or change on B

– @(posedge C): 0 to {1,x,z} or {x,z} to 1 change on C

– #10: 10 time units have elapsed

• Process body: sequential code with assignments

• Global State

– Simulation Time

– Values of Variables

– Process State

Collection of Executing Processes: Parallel Workload

HDL Simulators can implement Parallel Workload with:

• Serial Simulation Algorithms

• Parallel Simulation Algorithms
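The guard rules listed on this slide can be written down as small predicates. The following is only an illustrative Python sketch, not the simulator's implementation: the function and class names are hypothetical, and four-state values are assumed to be encoded as the strings "0", "1", "x" and "z".

```python
# Illustrative sketch only; all names here are hypothetical.
# Four-state signal values are assumed to be the strings "0", "1", "x", "z".

def is_posedge(old, new):
    """Guard for @(posedge C): 0 -> {1, x, z} or {x, z} -> 1."""
    return (old == "0" and new in ("1", "x", "z")) or \
           (old in ("x", "z") and new == "1")

def changed(old, new):
    """Guard for @(A or B): any value change on a listed signal."""
    return old != new

class GlobalState:
    """The three pieces of global state named on the slide."""
    def __init__(self):
        self.time = 0          # simulation time
        self.values = {}       # variable name -> current value
        self.process = {}      # process name -> "blocked" or "executing"

state = GlobalState()
state.values["clk"] = "0"
state.process["p1"] = "blocked"

# A rising clock edge releases the process blocked on @(posedge clk).
new_clk = "1"
if is_posedge(state.values["clk"], new_clk):
    state.process["p1"] = "executing"
state.values["clk"] = new_clk
```

A real simulator would also record where each process should resume, which this sketch leaves out.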


Serial Simulation Algorithms: Event Driven Simulation with Dynamic Scheduling

[Figure: network of blocks, each with inputs a, b, c, clk and output q; two inputs switch 0->1. Legend: blocks evaluated in current clock cycle; blocks inactive in current clock cycle.]

• Each block guarded by a change-check

• Functional evaluation dynamically scheduled, executes only what is needed

• Many optimizations for serial simulation, works in normal scenarios of low event activity

• But: not easy to expose parallelism
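As a rough illustration of event-driven simulation with dynamic scheduling, the Python sketch below evaluates only the blocks whose inputs changed and schedules downstream readers when an output changes. The data layout (`blocks`, `fanout`, `values`) and the three-gate example are assumptions made for this sketch, and the network is assumed to be acyclic.

```python
from collections import deque

def event_driven_step(blocks, fanout, values, changed_signals):
    """Evaluate only blocks reached by a change; return the evaluation count.

    blocks: block name -> (output signal, input signals, function)
    fanout: signal -> names of blocks that read it
    """
    queue = deque()
    for sig in changed_signals:
        queue.extend(fanout.get(sig, []))
    evaluations = 0
    while queue:
        name = queue.popleft()
        out, ins, fn = blocks[name]
        new = fn(*(values[i] for i in ins))
        evaluations += 1
        if new != values[out]:                  # change-check guard
            values[out] = new
            queue.extend(fanout.get(out, []))   # schedule readers dynamically
    return evaluations

# Hypothetical three-gate network: g3 is unrelated to the change on "a".
blocks = {
    "g1": ("n1", ("a", "b"), lambda a, b: a & b),
    "g2": ("n2", ("n1", "c"), lambda n1, c: n1 | c),
    "g3": ("n3", ("d", "e"), lambda d, e: d ^ e),
}
fanout = {"a": ["g1"], "b": ["g1"], "n1": ["g2"],
          "c": ["g2"], "d": ["g3"], "e": ["g3"]}
values = {"a": 0, "b": 1, "c": 0, "d": 0, "e": 0, "n1": 0, "n2": 0, "n3": 0}

values["a"] = 1                                  # 0->1 transition on a
evaluations = event_driven_step(blocks, fanout, values, ["a"])
```

Here only g1 and g2 run; g3 is never touched. The work queue is also what makes the parallelism hard to expose: the set of blocks to run is discovered one at a time.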


Serial Simulation Algorithms: Oblivious simulation

[Figure: network of blocks, each with inputs a, b, c, clk and output q; two inputs switch 0->1. Legend: blocks evaluated in current clock cycle; blocks inactive in current clock cycle.]

• All parallel blocks evaluated on any change

• No event guards required: simplifies implementation

• On the surface, looks suitable for fine grained parallel implementation

• But: high overhead in the case of low event activity %
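For contrast, oblivious simulation can be sketched as a plain loop over every block with no guards at all. This Python sketch is illustrative only; the data layout and the three-gate network are assumptions, and `order` is assumed to list the blocks in dependency order.

```python
def oblivious_step(blocks, order, values):
    """Evaluate every block once, in a fixed order, with no event guards."""
    for name in order:
        out, ins, fn = blocks[name]
        values[out] = fn(*(values[i] for i in ins))
    return len(order)    # every block is evaluated, active or not

# Hypothetical three-gate network; only "a" changes in this cycle.
blocks = {
    "g1": ("n1", ("a", "b"), lambda a, b: a & b),
    "g2": ("n2", ("n1", "c"), lambda n1, c: n1 | c),
    "g3": ("n3", ("d", "e"), lambda d, e: d ^ e),
}
values = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0, "n1": 0, "n2": 0, "n3": 0}

evaluations = oblivious_step(blocks, ["g1", "g2", "g3"], values)
```

All three blocks are evaluated even though g3 sees no change, which is the overhead the slide refers to when event activity is low.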


Parallel Simulation Algorithms: Task Based Parallelism

[Figure: network of blocks, each with inputs a, b, c, clk and output q, split into Partition 1 and Partition 2; two inputs switch 0->1. Legend: blocks evaluated in current clock cycle; blocks inactive in current clock cycle.]

• Design (and / or testbench) split into partitions

• Each partition mapped into an execution unit

• More suitable for bigger cores

• Speedup depends on relative and balanced activity in partitions
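Why the speedup depends on balanced activity can be seen from a simple bound: if partitions run concurrently, a cycle takes as long as the busiest partition. The Python sketch below is an idealized illustration under that assumption; it ignores communication and synchronization costs between partitions, and the activity counts are made up.

```python
def partition_speedup(activity_per_partition):
    """Idealized speedup: total work divided by the busiest partition's work.

    Assumes partitions run fully in parallel and ignores any cost of
    exchanging data or synchronizing between them.
    """
    total = sum(activity_per_partition)
    slowest = max(activity_per_partition)
    return total / slowest if slowest else 1.0

balanced = partition_speedup([500, 500])   # equal activity in both partitions
skewed = partition_speedup([900, 100])     # most activity in one partition
```

With two partitions the balanced case reaches the ideal factor of 2, while the skewed case stays close to 1 because one execution unit does almost all of the work.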


Parallel Simulation Algorithms: Fine Grain Parallelism + Oblivious simulation

[Figure: network of blocks, each with inputs a, b, c, clk and output q; two inputs switch 0->1. Legend: blocks evaluated in current clock cycle; blocks inactive in current clock cycle.]

• All parallel blocks evaluated on any change

• No event guards required: simplifies implementation

• Lots of Parallel Blocks


Parallel Simulation Algorithms: Mapping Oblivious simulation to GPU

[Figure: the block network (inputs a, b, c, clk, output q per block) arranged into Level 1, Level 2 and Level 3, shown twice; two inputs switch 0->1.]
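The levels in the figure suggest a levelized schedule: blocks in one level depend only on earlier levels, so they can all be evaluated in parallel. The Python sketch below shows one way to compute such levels. It is an assumption-laden illustration, not the authors' compiler: the data layout and the four-block example are hypothetical, and the network is assumed to be acyclic.

```python
def levelize(blocks):
    """Group blocks into levels; a block's level is 1 + its deepest producer.

    blocks: block name -> (output signal, input signals)
    Returns a list of levels, each a list of block names.
    """
    producer = {out: name for name, (out, _) in blocks.items()}
    level = {}

    def depth(name):
        if name not in level:
            _, ins = blocks[name]
            preds = [producer[i] for i in ins if i in producer]
            level[name] = 1 + max((depth(p) for p in preds), default=0)
        return level[name]

    for name in blocks:
        depth(name)
    levels = [[] for _ in range(max(level.values(), default=0))]
    for name, lv in level.items():
        levels[lv - 1].append(name)
    return levels

# Hypothetical four-block network.
blocks = {
    "g1": ("n1", ("a", "b")),
    "g2": ("n2", ("c", "d")),
    "g3": ("n3", ("n1", "n2")),
    "g4": ("q", ("n3", "a")),
}
levels = levelize(blocks)
```

On a GPU, one plausible mapping is one thread per block and one kernel launch per level, with the launch boundary acting as the barrier between levels. Under that assumption the per-cycle cost grows with the number of levels.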


Parallel Simulation Complexity: Hypothetical Medium Sized Chip RTL

Model: 1 Million+ user processes

Workload:

• 10K+ Targeted Tests

• Each test targets a chip feature

Simulation

• Data Dependence between partitions

• Data Dependent Activity

• 0.5 – 3% Activity per Test Phase

• Low Effective Parallelism

• Activity

• Tail Region

[Figure: data flow across partitions, annotated "Data Dependent Activity", "Data dependence between partitions" and "Tail".]
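The figures on this slide can be turned into a quick back-of-the-envelope count. The Python sketch below uses the slide's 1 million processes and 0.5% to 3% activity; treating activity as a simple fraction of processes evaluated per phase is an assumption of this sketch.

```python
PROCESSES = 1_000_000          # "1 Million+ user processes" from the slide

# Activity per test phase, expressed per thousand to keep integer arithmetic.
ACTIVITY_LOW_PER_MILLE = 5     # 0.5 %
ACTIVITY_HIGH_PER_MILLE = 30   # 3 %

active_low = PROCESSES * ACTIVITY_LOW_PER_MILLE // 1000
active_high = PROCESSES * ACTIVITY_HIGH_PER_MILLE // 1000

# An oblivious simulator evaluates every process, so the rest is idle work.
idle_low_activity = PROCESSES - active_low
idle_high_activity = PROCESSES - active_high
```

Under these assumptions only 5,000 to 30,000 processes do useful work in a phase, so an oblivious scheme spends at least 97% of its evaluations on inactive processes.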


Summary: Challenges / Benefits in using GPU for RTL Sim

• What Works?

– A lot of potential parallelism in the model

− Fermi / CUDA thread scheduling

– A lot of memory accesses per clock cycle

− Fermi provides 144 GB/s

– CUDA software ecosystem is robust and improving

• Wishlist

– Global Barrier

– Latency Optimized Core

– Lower launch overhead

– CUDA profiler for large datasets
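To give a feel for "a lot of memory accesses per clock cycle", the Python sketch below derives a rough bandwidth-limited ceiling on simulated clock cycles per second. Only the 144 GB/s figure and the 1 million process count come from the slides; the 16 bytes of traffic per process per cycle is a purely hypothetical assumption, and launch overhead and latency are ignored.

```python
BANDWIDTH_BYTES_PER_S = 144 * 10**9   # Fermi figure quoted on the slide
PROCESSES = 1_000_000                 # model size from the complexity slide

# ASSUMPTION (not from the slides): each process reads three 32-bit words
# and writes one per simulated clock cycle, i.e. 16 bytes of traffic.
BYTES_PER_PROCESS = 16

bytes_per_cycle = PROCESSES * BYTES_PER_PROCESS
max_cycles_per_s = BANDWIDTH_BYTES_PER_S // bytes_per_cycle
```

Under these assumptions an oblivious evaluation of the whole model is limited to a few thousand simulated cycles per second by memory bandwidth alone, before any of the wishlist items are considered.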




Questions ?
