quantifying the cost and benefit of latency insensitive

90
Quantifying the Cost and Benefit of Latency Insensitive Communication on FPGAs Kevin E. Murray and Vaughn Betz 1

Upload: others

Post on 12-May-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Quantifying the Cost and Benefit of Latency Insensitive

Quantifying the Cost and Benefit of Latency

Insensitive Communication on FPGAs

Kevin E. Murray and Vaughn Betz

1

Page 2: Quantifying the Cost and Benefit of Latency Insensitive

Motivation

2

Page 3: Quantifying the Cost and Benefit of Latency Insensitive

Local Communication Speed Improving

3

Page 4: Quantifying the Cost and Benefit of Latency Insensitive

Local Communication Speed Improving

3

>2.4x

Page 5: Quantifying the Cost and Benefit of Latency Insensitive

Local Communication Speed Improving

3

>2.4x

Page 6: Quantifying the Cost and Benefit of Latency Insensitive

Local Communication Speed Improving

3

>2.4x

Page 7: Quantifying the Cost and Benefit of Latency Insensitive

Global Communication Speed Not Improving

4

Page 8: Quantifying the Cost and Benefit of Latency Insensitive

Global Communication Speed Not Improving

4

1.3x

Page 9: Quantifying the Cost and Benefit of Latency Insensitive

Global Communication Speed Not Improving

4

3.6x

1.3x

Page 10: Quantifying the Cost and Benefit of Latency Insensitive

Global Communication Speed Not Improving

4

3.6x

1.3x

System level timing closure increasingly

difficult!

Page 11: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Page 12: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Identify Critical

Path

Page 13: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Page 14: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

• RTL changes are error prone

Page 15: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

• RTL changes are error prone

• Re-compile can take hours or days

Page 16: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

• RTL changes are error prone

• Re-compile can take hours or days

Page 17: Quantifying the Cost and Benefit of Latency Insensitive

System Level Timing Closure Issues

5

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

• RTL changes are error prone

• Re-compile can take hours or days

• No guarantee of convergence

Page 18: Quantifying the Cost and Benefit of Latency Insensitive

Key Problem & Potential Solutions

6

• Limited by Synchronous Assumption:

• Computation and communication occur in a single clock cycle (if not

pipelined)

• Reasonable when local-global speed gap was small, but not when

large

• Many different proposed design schemes:

• Over-pipelining

• Asynchronous

• Globally Asynchronous Locally Synchronous (GALS)

• Latency Insensitive

Page 19: Quantifying the Cost and Benefit of Latency Insensitive

What is Latency Insensitive Design?

7

Page 20: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive System Implementation

8

• Key Idea: Make sub-modules insensitive to their communication latency

• Create a LI module by placing a designer’s synchronous module (Pearl) in a

wrapper (Shell), which is also synchronous

• Use Relay Stations (RS) to pipeline interconnect

• Deadlock free and applicable to (nearly) any synchronous module [1]

Logical System LI Implementation[1] Carloni et. al, “Theory of Latency-Insensitive Design”, TCAD, 2001

Page 21: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

9

• Tag each module port with ‘Valid’ and ‘Stop’ bits

• Pearl is paused until all inputs are ‘valid’

• ‘Stop’ signal provides back-pressure to prevent FIFO overflow

ReceiverSender

Valid

FIFOData

Stop

Page 22: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Page 23: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Page 24: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Page 25: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 26: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 27: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 28: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 29: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 30: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 31: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Backpressure

Page 32: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Communication Protocol

10

Shell C

Pearl

Shell B

Pearl

Shell A

Pearl

Page 33: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Design

11

Advantages:

• Interconnect pipelining does not affect correctness

• Designers can still reason about system synchronously

• Easy to pipeline late in the design flow

• Enhanced module re-use & composability

• Suitable for automation

• Use existing CAD tools

Page 34: Quantifying the Cost and Benefit of Latency Insensitive

Interconnect Pipelining Automation

12

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

Page 35: Quantifying the Cost and Benefit of Latency Insensitive

Interconnect Pipelining Automation

12

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

Page 36: Quantifying the Cost and Benefit of Latency Insensitive

Interconnect Pipelining Automation

12

Identify Critical

Path

Insert Register

Modify & Verify

Control Logic

Physical CAD

Closed

Timing?

Done

Yes

No

Identify Critical

Path

Insert Register

Physical CAD

Closed

Timing?

Done

Yes

No

LI Physical CAD

Page 37: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Design

13

Advantages:

• Interconnect pipelining does not affect correctness

• Designers can still reason about system synchronously

• Easy to pipeline late in the design flow

• Enhanced module re-use & composability

• Suitable for automation

• Use existing CAD tools

• Fold interconnect pipelining into physical CAD Tools [Future work]

Page 38: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Design

13

Advantages:

• Interconnect pipelining does not affect correctness

• Designers can still reason about system synchronously

• Easy to pipeline late in the design flow

• Enhanced module re-use & composability

• Suitable for automation

• Use existing CAD tools

• Fold interconnect pipelining into physical CAD Tools [Future work]

Disadvantages:

• Area/Speed overhead versus hand-tuned design

• Must verify sufficient throughput

Page 39: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Design

14

Trade off:

• Implementation efficiency for designer productivity

Key question of this work:

• What are the overheads of LI design on FPGAs?

Page 40: Quantifying the Cost and Benefit of Latency Insensitive

Latency Insensitive Implementation

15

Page 41: Quantifying the Cost and Benefit of Latency Insensitive

Baseline Shell Implementation

16

• ASIC LI design stalls modules by clock gating

• Use ‘Clock Enable’ on FPGAs

Page 42: Quantifying the Cost and Benefit of Latency Insensitive

Baseline Shell Implementation

16

• ASIC LI design stalls modules by clock gating

• Use ‘Clock Enable’ on FPGAs

• ‘Clock Enable’ becomes timing critical

• Fans out to all registers in pearl

• Connected to upstream and downstream modules

High Fan-out

Page 43: Quantifying the Cost and Benefit of Latency Insensitive

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

Page 44: Quantifying the Cost and Benefit of Latency Insensitive

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

Single-bit

Page 45: Quantifying the Cost and Benefit of Latency Insensitive

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

Single-bit

Multi-bit

Page 46: Quantifying the Cost and Benefit of Latency Insensitive

Optimized Shell Implementation

17

• Break timing path before it becomes high fan-out

• Insert additional registers in Shell

• Improves Timing

• Adds additional cycle of latency to shell

Single-bit

Multi-bit

Page 47: Quantifying the Cost and Benefit of Latency Insensitive

Relay Station (RS) Implementation

18

• Analogous to conventional pipeline register

• Additional logic to:

• Handle ‘valid’ and ‘stop’ bits

• Store in-flight data when facing backpressure (avoids stalling)

Page 48: Quantifying the Cost and Benefit of Latency Insensitive

FIR Design Example

19

Page 49: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study

20

• Pearl: FIR filter

• Design: 49 cascaded FIR filters

• Used as a high speed design example

• Investigate the frequency impact of LI

design

• Allow comparison of LI and non-LI

pipelining

Resources EP4sGX230 Utilization

Logic Blocks 51%

DSP Blocks 99%

M9K Blocks <1%

M144K Blocks 0%

Page 50: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Area

21

1.08 x

1.09 x

1.00 x

Page 51: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

22

0.67 x

0.92 x

1.00 x

Page 52: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

23

High Speed

Solutions

Page 53: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

23

High Speed

Solutions

26%

Page 54: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

24

Page 55: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

25

Page 56: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

25

42%

Page 57: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

26

Page 58: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

27

Page 59: Quantifying the Cost and Benefit of Latency Insensitive

Cascaded FIR Case Study - Frequency

28

Page 60: Quantifying the Cost and Benefit of Latency Insensitive

Pipelining Overhead Cause

29

• Extra control logic adds delay overhead to each Shell or RS

Page 61: Quantifying the Cost and Benefit of Latency Insensitive

Pipelining Overhead Cause

29

• Extra control logic adds delay overhead to each Shell or RS

Increased effective Tsu

Page 62: Quantifying the Cost and Benefit of Latency Insensitive

Generalized LI Scaling

30

Page 63: Quantifying the Cost and Benefit of Latency Insensitive

Generalized LI Shell Scaling

31

• Identify what makes Shells expensive

• Leads to design guidelines to minimize overhead

• Consider impact of scaling three main shell characteristics

• FIFO Depth

• Number of Input Ports

• Port Width

Page 64: Quantifying the Cost and Benefit of Latency Insensitive

FIFO Depth Scaling

32

• Scaling FIFO Depth costs minimal area

• Block RAMs are underused at shallow depths

Page 65: Quantifying the Cost and Benefit of Latency Insensitive

FIFO Depth Scaling

33

• Scaling FIFO Depth costs minimal area

• Block RAMs are underused at shallow depths

Use deep FIFOs to minimize stalling

Page 66: Quantifying the Cost and Benefit of Latency Insensitive

Port Width and Input Port Scaling

34

• Increasing port width or input ports costs significant area

Page 67: Quantifying the Cost and Benefit of Latency Insensitive

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

Page 68: Quantifying the Cost and Benefit of Latency Insensitive

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

2048 input bits

Page 69: Quantifying the Cost and Benefit of Latency Insensitive

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

2048 input bits 160 input bits

Page 70: Quantifying the Cost and Benefit of Latency Insensitive

Port Width and Input Port Scaling

35

• Increasing port width or input ports costs significant area

• Frequency degrades faster as input ports are increased

2048 input bits 160 input bits

Favour wider ports instead of more ports to

maximize frequency

Page 71: Quantifying the Cost and Benefit of Latency Insensitive

Granularity

36

Page 72: Quantifying the Cost and Benefit of Latency Insensitive

LI Design Granularity

37

• How fine or coarse should we make LI Systems?

• Trade-off between:

• Flexibility and productivity benefits

• Area overhead

• Local communication (e.g. 40K LEs) is still fast

• Flexibility most beneficial for slow system-level communication

Page 73: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

Page 74: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

P pins

Page 75: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

P pins

K pins

Page 76: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

P pins

K pins N blocks

Page 77: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

P pins

K pins N blocks

Page 78: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

P pins

K pins N blocks

R = 0.0

Page 79: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

P pins

K pins N blocks

R = 0.0 R = 1.0

Page 80: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule

38

• Use Rent’s Rule to relate design size to pin count:

P = KNR

• R: Rent parameter

• Typical circuits:

0.50 < R < 0.75

P pins

K pins N blocks

R = 0.0 R = 1.0

Page 81: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule Overhead Projections

39

• Combine shell area scaling numbers for various design sizes and

ranges of Rent parameters

Page 82: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule Overhead Projections

39

• Combine shell area scaling numbers for various design sizes and

ranges of Rent parameters

Page 83: Quantifying the Cost and Benefit of Latency Insensitive

Rent’s Rule Overhead Projections

39

• Combine shell area scaling numbers for various design sizes and

ranges of Rent parameters

Cascaded FIR

(2.4K LE)

Page 84: Quantifying the Cost and Benefit of Latency Insensitive

Hypothetical Design Example 20% Overhead

40

• Consider a 4M LE FPGA at 20% area overhead

Page 85: Quantifying the Cost and Benefit of Latency Insensitive

Hypothetical Design Example 20% Overhead

40

• Consider a 4M LE FPGA at 20% area overhead

71 Modules

(56K LEs)

307 Modules

(13K LEs)

5 Modules

(700K LEs)

Page 86: Quantifying the Cost and Benefit of Latency Insensitive

LI Design Granularity

41

• Area overhead is strongly related to communication locality (Rent Parameter)

• Designs with well localized communication will result in low overhead

• Rent parameter varies within different parts of a design

• Careful choice of module boundaries may further reduce overhead

Page 87: Quantifying the Cost and Benefit of Latency Insensitive

Conclusion and Future Work

42

Page 88: Quantifying the Cost and Benefit of Latency Insensitive

• Illustrated the growing gap between local and global communication speed

• 3.6x and growing

• Developed optimized LI building blocks for FPGAs

• Reduced frequency overhead from 33% to 8%

• Quantified the area and frequency overhead of LI communication on FPGAs

• Provided design guidelines to minimize the overheads of LI communication

Conclusion

43

Page 89: Quantifying the Cost and Benefit of Latency Insensitive

• Explore the benefits of LI design

• LI aware CAD Tools

• Investigate architectural enhancements

• Hardened FIFOs

• Fine-grained clock gating

• Embedded NoC

• Evaluate LI design on a broader range of designs

• Develop lower area/speed overhead LI design techniques

Future Work

44

Page 90: Quantifying the Cost and Benefit of Latency Insensitive

Thanks! Questions?Email: [email protected]