quantifying the cost and benefit of latency insensitive

Post on 12-May-2022

6 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Quantifying the Cost and Benefit of Latency

Insensitive Communication on FPGAs

Kevin E. Murray and Vaughn Betz

1

Motivation

2

Local Communication Speed Improving

3

Local Communication Speed Improving

3

>2.4x

Local Communication Speed Improving

3

>2.4x

Local Communication Speed Improving

3

>2.4x

Global Communication Speed Not Improving

4

Global Communication Speed Not Improving

4

1.3x

Global Communication Speed Not Improving

4

3.6x

1.3x

Global Communication Speed Not Improving

4

3.6x

1.3x

System level timing closure increasingly

difficult!

System Level Timing Closure Issues

5

System Level Timing Closure Issues

5

Identify Critical

Path

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

• RTL changes are error prone

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

• RTL changes are error prone

• Re-compile can take hours or days

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

• RTL changes are error prone

• Re-compile can take hours or days

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

• RTL changes are error prone

• Re-compile can take hours or days

• No guarantee of convergence

Key Problem & Potential Solutions

6

• Limited by Synchronous Assumption:

• Computation and communication occur in a single clock cycle (if not

pipelined)

• Reasonable when local-global speed gap was small, but not when

large

• Many different proposed design schemes:

• Over-pipelining

• Asynchronous

• Globally Asynchronous Locally Synchronous (GALS)

• Latency Insensitive

What is Latency Insensitive Design?

7

Latency Insensitive System Implementation

8

• Key Idea: Make sub-modules insensitive to their communication latency

• Create a LI module by placing a designer’s synchronous module (Pearl) in a

wrapper (Shell), which is also synchronous

• Use Relay Stations (RS) to pipeline interconnect

• Deadlock free and applicable to (nearly) any synchronous module [1]

Logical System LI Implementation[1] Carloni et. al, “Theory of Latency-Insensitive Design”, TCAD, 2001

Latency Insensitive Communication Protocol

9

• Tag each module port with ‘Valid’ and ‘Stop’ bits

• Pearl is paused until all inputs are ‘valid’

• ‘Stop’ signal provides back-pressure to prevent FIFO overflow

ReceiverSender

Valid

FIFOData

Stop

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Latency Insensitive Design

11

Advantages:

• Interconnect pipelining does not affect correctness

• Designers can still reason about system synchronously

• Easy to pipeline late in the design flow

• Enhanced module re-use & composability

• Suitable for automation

• Use existing CAD tools

Interconnect Pipelining Automation

12

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

Interconnect Pipelining Automation

12

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

Interconnect Pipelining Automation

12

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

Identify Critical

Path

Insert Register

Physical CAD

Closed

Timing?

Done

Yes

No

LI Physical CAD

Latency Insensitive Design

13

Advantages:

• Interconnect pipelining does not affect correctness

• Designers can still reason about system synchronously

• Easy to pipeline late in the design flow

• Enhanced module re-use & composability

• Suitable for automation

• Use existing CAD tools

• Fold interconnect pipelining into physical CAD Tools [Future work]

Latency Insensitive Design

13

Advantages:

• Interconnect pipelining does not affect correctness

• Designers can still reason about system synchronously

• Easy to pipeline late in the design flow

• Enhanced module re-use & composability

• Suitable for automation

• Use existing CAD tools

• Fold interconnect pipelining into physical CAD Tools [Future work]

Disadvantages:

• Area/Speed overhead versus hand-tuned design

• Must verify sufficient throughput

Latency Insensitive Design

14

Trade off:

• Implementation efficiency for designer productivity

Key question of this work:

• What are the overheads of LI design on FPGAs?

Latency Insensitive Implementation

15

Baseline Shell Implementation

16

• ASIC LI design stalls modules by clock gating

• Use ‘Clock Enable’ on FPGAs

Baseline Shell Implementation

16

• ASIC LI design stalls modules by clock gating

• Use ‘Clock Enable’ on FPGAs

• ‘Clock Enable’ becomes timing critical

• Fans out to all registers in pearl

• Connected to upstream and downstream modules

High Fan-out

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

Single-bit

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

Single-bit

Multi-bit

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

• Improves Timing

• Adds additional cycle of latency to shell

Single-bit

Multi-bit

Relay Station (RS) Implementation

18

• Analogous to conventional pipeline register

• Additional logic to:

• Handle ‘valid’ and ‘stop’ bits

• Store in-flight data when facing backpressure (avoids stalling)

FIR Design Example

19

Cascaded FIR Case Study

20

• Pearl: FIR filter

• Design: 49 cascaded FIR filters

• Used as a high speed design example

• Investigate the frequency impact of LI

design

• Allow comparison of LI and non-LI

pipelining

Resources EP4sGX230 Utilization

Logic Blocks 51%

DSP Blocks 99%

M9K Blocks <1%

M144K Blocks 0%

Cascaded FIR Case Study - Area

21

1.08 x

1.09 x

1.00 x

Cascaded FIR Case Study - Frequency

22

0.67 x

0.92 x

1.00 x

Cascaded FIR Case Study - Frequency

23

High Speed

Solutions

Cascaded FIR Case Study - Frequency

23

High Speed

Solutions

26%

Cascaded FIR Case Study - Frequency

24

Cascaded FIR Case Study - Frequency

25

Cascaded FIR Case Study - Frequency

25

42%

Cascaded FIR Case Study - Frequency

26

Cascaded FIR Case Study - Frequency

27

Cascaded FIR Case Study - Frequency

28

Pipelining Overhead Cause

29

• Extra control logic adds delay overhead to each Shell or RS

Pipelining Overhead Cause

29

• Extra control logic adds delay overhead to each Shell or RS

Increased effective Tsu

Generalized LI Scaling

30

Generalized LI Shell Scaling

31

• Identify what makes Shells expensive

• Leads to design guidelines to minimize overhead

• Consider impact of scaling three main shell characteristics

• FIFO Depth

• Number of Input Ports

• Port Width

FIFO Depth Scaling

32

• Scaling FIFO Depth costs minimal area

• Block RAMs are underused at shallow depths

FIFO Depth Scaling

33

• Scaling FIFO Depth costs minimal area

• Block RAMs are underused at shallow depths

Use deep FIFOs to minimize stalling

Port Width and Input Port Scaling

34

• Increasing port width or input ports costs significant area

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

2048 input bits

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

2048 input bits 160 input bits

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

2048 input bits 160 input bits

Favour wider ports instead of more ports to

maximize frequency

Granularity

36

LI Design Granularity

37

• How fine or coarse should we make LI Systems?

• Trade-off between:

• Flexibility and productivity benefits

• Area overhead

• Local communication (e.g. 40K LEs) is still fast

• Flexibility most beneficial for slow system-level communication

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

P pins

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

P pins

K pins

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

P pins

K pins N blocks

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

P pins

K pins N blocks

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

P pins

K pins N blocks

R = 0.0

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

P pins

K pins N blocks

R = 0.0 R = 1.0

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

• Typical circuits:

0.50 < R < 0.75

P pins

K pins N blocks

R = 0.0 R = 1.0

Rent’s Rule Overhead Projections

39

• Combine shell area scaling numbers for various design sizes and

ranges of Rent parameters

Rent’s Rule Overhead Projections

39

• Combine shell area scaling numbers for various design sizes and

ranges of Rent parameters

Rent’s Rule Overhead Projections

39

• Combine shell area scaling numbers for various design sizes and

ranges of Rent parameters

Cascaded FIR

(2.4K LE)

Hypothetical Design Example 20% Overhead

40

• Consider a 4M LE FPGA at 20% area overhead

Hypothetical Design Example 20% Overhead

40

• Consider a 4M LE FPGA at 20% area overhead

71 Modules

(56K LEs)

307 Modules

(13K LEs)

5 Modules

(700K LEs)

LI Design Granularity

41

• Area overhead is strongly related to communication locality (Rent Parameter)

• Designs with well localized communication will result in low overhead

• Rent parameter varies within different parts of a design

• Careful choice of module boundaries may further reduce overhead

Conclusion and Future Work

42

• Illustrated the growing gap between local and global communication speed

• 3.6x and growing

• Developed optimized LI building blocks for FPGAs

• Reduced frequency overhead from 33% to 8%

• Quantified the area and frequency overhead of LI communication on FPGAs

• Provided design guidelines to minimize the overheads of LI communication

Conclusion

43

• Explore the benefits of LI design

• LI aware CAD Tools

• Investigate architectural enhancements

• Hardened FIFOs

• Fine-grained clock gating

• Embedded NoC

• Evaluate LI design on a broader range of designs

• Develop lower area/speed overhead LI design techniques

Future Work

44

Thanks! Questions?Email: kmurray@eecg.utoronto.ca

top related