How FPGAs Work When They Don’t, and How Feynman Can Help Us Understand
Summary
Clock domain crossing, timing violations, single event effects and accelerated aging in hostile environments, power supply fluctuations, etc. As if the learning curve for HDL programming weren't steep enough already, as soon as we have mastered the archaic trade of writing synthesizable code for FPGAs, we find physical reality intruding, breaking our assumptions, and removing any remaining illusions we might have about the soothing comforts of deterministic programming. Physical reality is a nuisance; one we should deal with, but often do not. And understandably so. The non-ideal behavior of CMOS is difficult to simulate, difficult to grasp, and a hassle to mitigate.
Fortunately, as we shall see in this presentation, the learning effort can be greatly reduced, as long as we apply the right perspective. One such perspective is Richard Feynman's File Clerk Model (FCM), which is both intuitive and instructive when the goal is to understand "how FPGAs work when they don't". Taking the FCM as our starting point, we go through the following topics:
● Basic computer organization in FPGAs
● Error mechanisms relevant in FPGA design
● Applying the FCM to explain
○ Clock domain crossing logic
○ SEE due to radiation
○ Timing violations
○ Voltage and frequency scaling
Resumé
Alex Birklykke, [email protected]
● 2010: MSc.EE in Applied Signal Processing and Implementation
● 2015: PhD - Modeling and Predicting the behavior of computers operating without guardbands (case study of FPGAs)
● 2013-2016: FPGA development at Rohde & Schwarz (WLAN layer-1)
● 2016-2017: FPGA development at GomSpace A/S
● 2017- : Newspace entrepreneur with Space Inventor
Research
● Empirical study of FPGA behavior when subject to voltage and frequency scaling
● Based on 65 nm Spartan 3E
● Objective was to determine the cause of errors, as well as model and predict errors.
● Research confirmed that
○ FPGAs are very noise immune devices
○ Timing violations are the cause of errors in voltage/frequency scaled devices
○ Precise error behavior is hard to predict
Presentation objective
Provide an intuition about how FPGAs work when they don’t
What could go wrong? Timing Closure
● Timing constraints not met
● Multi-seed P&R or refactoring doesn’t always solve the problem, especially for systems with high FPGA utilization
● Sometimes it is necessary to ship systems with timing violations
● How to assess the criticality of timing violations?
What could go wrong? Clock domain crossings
● Clock domain crossings are commonly encountered in FPGA applications
● Metastable behavior must be mitigated
● Error mechanism must be thoroughly understood in order to mitigate the problem
What could go wrong? Temperature effects and ageing
● Ring oscillator frequency in Virtex-5 FPGA vs:
○ Left) Location and temperature.
○ Right) Localized wearout
● Might lead to unforeseen timing violations
S. Zhang, Delay Characterization in FPGA-based Reconfigurable Systems. Master Thesis. 2013
What could go wrong? Radiation-induced ageing
● Microsemi SmartFusion2 SoC FPGA (65nm)
● Irradiated with Cobalt-60 gamma source
● Accelerated ageing observed
● For comparison, 20 krad ~ 5 yrs in low Earth orbit
● 10% timing overhead must be introduced to ensure timing closure after 5 yrs
● Bad news: Other studies have found that the Flash configuration memory cannot be reprogrammed after a few krad
N. Rezzak, J. J. Wang, C. K. Huang, V. Nguyen and G. Bakker, "Total Ionizing Dose Characterization of 65 nm Flash-Based FPGA," 2014 IEEE Radiation Effects Data Workshop (REDW), Paris, 2014, pp. 1-5.
What could go wrong? Chasing better performance
Voltage and/or frequency scaling results in timing errors
A. Birklykke, P. Koch, R. Prasad, L. Alminde and Y. Le Moullec, "Empirical verification of fault models for FPGAs operating in the subcritical voltage region," 2013 23rd International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Karlsruhe, 2013, pp. 16-23.
It’s all about timing
How FPGAs work when they don’t?
Feynman's Lectures on Computation
● Write-up of Feynman's lectures on computation given at Caltech from 1983-1987
● Includes an introductory chapter on computation, as well as five chapters addressing the limitations of computers.
● Introduces the so-called “File Clerk Model” to explain the system-level behavior of sequential computers.
● Feynman is known as one of the great communicators of science
The File Clerk Model
● Computers are data transfer machines first, and arithmetic devices only second
● The file clerk is primarily a data transfer function. Data processing is only secondary
● Feynman: Let’s use the file clerk as a metaphor for understanding basic computer structure
The File Clerk Model
File clerk “total sales for California” procedure
Take out next “sales” card
If “Location” says California, then
    Take out “total” card
    Add sales number to number on card
    Put “total” card back
Put “sales” card back
Repeat

File cabinet
Sales cards: Salesman: “Smith”, Location: “Tahoe”, Salary: 100, Sales: 1000
Total card: xxx.xx
The File Clerk Model
File clerk “total sales for California” procedure
Take out next “sales” card
If “Location” says California, then
    Add sales number to S
Put “sales” card back
Repeat until end
Take out “total” card
Replace total with S
Put “total” card back

File cabinet
Sales cards: Salesman: “Smith”, Location: “Tahoe”, Salary: 100, Sales: 1000
Total card: xxx.xx

Local scratch pad
S : 0
Local scratch pad limits data transfer, thus increasing file clerk performance
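The two procedures can be compared in a minimal Python sketch. The card fields follow the slide. The extra cards, the function names, and the rule that every take-out or put-back counts as one transfer are assumptions made for illustration.

```python
def total_without_scratch_pad(sales_cards, location):
    """First procedure: the total card is fetched and returned for every match."""
    total_card = 0
    transfers = 0
    for card in sales_cards:
        transfers += 1                      # take out next sales card
        if card["location"] == location:
            transfers += 1                  # take out total card
            total_card += card["sales"]     # add sales number to number on card
            transfers += 1                  # put total card back
        transfers += 1                      # put sales card back
    return total_card, transfers


def total_with_scratch_pad(sales_cards, location):
    """Second procedure: accumulate in S, touch the total card only once."""
    s = 0                                   # local scratch pad S
    transfers = 0
    for card in sales_cards:
        transfers += 1                      # take out next sales card
        if card["location"] == location:
            s += card["sales"]              # add sales number to S
        transfers += 1                      # put sales card back
    transfers += 2                          # take out total card, put it back
    return s, transfers


# The first card is from the slide; the other three are hypothetical.
CARDS = [
    {"salesman": "Smith", "location": "Tahoe", "salary": 100, "sales": 1000},
    {"salesman": "Jones", "location": "California", "salary": 100, "sales": 500},
    {"salesman": "Brown", "location": "California", "salary": 100, "sales": 250},
    {"salesman": "Green", "location": "California", "salary": 100, "sales": 250},
]

if __name__ == "__main__":
    print(total_without_scratch_pad(CARDS, "California"))
    print(total_with_scratch_pad(CARDS, "California"))
```

Both procedures produce the same total, but in this toy count the scratch pad version moves fewer cards, and the gap grows with the number of matching cards.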
The File Clerk Model - Stored Program Clerking
Generic file clerk with instruction set
Fetch instruction from address PC
PC <- (PC) + 1
Do instruction

Program/Data Memory (Fibonacci.exe)
1. R2 <- 1
2. R3 <- ADD (R1) (R2)
3. R1 <- (R2)
4. R2 <- (R3)
5. R4 <- SUB 1000 (R3)
6. PC <- 8 IF (CARRY)
7. PC <- 2
8. HALT

User registers
R1 : 0
R2 : 0
R3 : 0
R4 : 0

Control register
PC : 0
CARRY : 0
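The stored-program clerk can be sketched as a tiny interpreter in Python. The program is the one on the slide. The tuple encoding, the opcode names, starting the PC at instruction 1, and reading CARRY as "the subtraction went negative" are assumptions made for illustration, not something the slide specifies.

```python
# Program from the slide, keyed by instruction address (1..8).
FIBONACCI = {
    1: ("SET", "R2", 1),             # R2 <- 1
    2: ("ADD", "R3", "R1", "R2"),    # R3 <- ADD (R1) (R2)
    3: ("MOV", "R1", "R2"),          # R1 <- (R2)
    4: ("MOV", "R2", "R3"),          # R2 <- (R3)
    5: ("SUBI", "R4", 1000, "R3"),   # R4 <- SUB 1000 (R3)
    6: ("JMPC", 8),                  # PC <- 8 IF (CARRY)
    7: ("JMP", 2),                   # PC <- 2
    8: ("HALT",),
}


def run_clerk(program, max_steps=10_000):
    """Fetch, increment PC, do instruction; return the user registers at HALT."""
    regs = {"R1": 0, "R2": 0, "R3": 0, "R4": 0}
    pc = 1           # assumption: execution starts at instruction 1
    carry = False    # assumption: set when a subtraction result is negative
    for _ in range(max_steps):
        op = program[pc]     # fetch instruction from address PC
        pc += 1              # PC <- (PC) + 1
        kind = op[0]         # do instruction
        if kind == "HALT":
            return regs
        if kind == "SET":
            regs[op[1]] = op[2]
        elif kind == "MOV":
            regs[op[1]] = regs[op[2]]
        elif kind == "ADD":
            regs[op[1]] = regs[op[2]] + regs[op[3]]
        elif kind == "SUBI":
            result = op[2] - regs[op[3]]
            regs[op[1]] = result
            carry = result < 0
        elif kind == "JMPC":
            if carry:
                pc = op[1]
        elif kind == "JMP":
            pc = op[1]
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("program did not halt within max_steps")


if __name__ == "__main__":
    print(run_clerk(FIBONACCI))
```

Under these assumptions the loop walks through the Fibonacci numbers and halts once R3 exceeds 1000.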
The File Clerk Model with deadlines
● Same model, but where results must be available at a certain deadline.
● Imagine an angry office manager dictating the pace
● Claim: The time-dependence allows us to intuitively explain how computers work when they don’t
● Trick: Use sympathetic insight/empathy for our file clerk
Intuitive Explanation of Errors using the File Clerk Model
Cause | FCM eqv. | FCM effect | Real-life effect
Under-voltage | Starving clerk | Less effective clerk, more time to do same task. Unmet deadlines | Timing degradation
Overclocking | Tight deadlines | Less room for missteps | Slack reduction
Electrical noise | Office noise | Processing errors more likely, variable execution time | Lower signal integrity, probabilistic propagation delay
Device ageing | Old file clerk | Loss of wit and dexterity. More time to do same job | Timing degradation
High temperature | Uncomfortable clerk | Harder to focus. More time to do same job | Timing degradation
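The "timing degradation" and "slack reduction" entries can be made concrete with the usual setup-slack relation: slack = clock period - (clock-to-out + logic delay + setup time). The Python sketch below uses purely hypothetical numbers. The function name and the single delay_scale factor standing in for under-voltage, ageing, or temperature are assumptions made for illustration.

```python
def setup_slack_ns(f_clk_mhz, t_clk_to_q_ns, t_logic_ns, t_setup_ns, delay_scale=1.0):
    """Setup slack of one register-to-register path, in nanoseconds.

    delay_scale > 1.0 models a slower clerk (under-voltage, ageing, heat).
    A higher f_clk_mhz models a tighter deadline (overclocking).
    Negative slack means the deadline is missed: a timing violation.
    """
    period_ns = 1000.0 / f_clk_mhz
    path_ns = delay_scale * (t_clk_to_q_ns + t_logic_ns + t_setup_ns)
    return period_ns - path_ns


if __name__ == "__main__":
    # Hypothetical path: 0.5 ns clock-to-out, 8.0 ns logic, 0.5 ns setup.
    nominal = setup_slack_ns(100.0, 0.5, 8.0, 0.5)
    overclocked = setup_slack_ns(125.0, 0.5, 8.0, 0.5)
    starved = setup_slack_ns(100.0, 0.5, 8.0, 0.5, delay_scale=1.2)
    print(f"nominal slack:     {nominal:+.2f} ns")
    print(f"overclocked slack: {overclocked:+.2f} ns")
    print(f"under-volted:      {starved:+.2f} ns")
```

In this example the nominal path meets its deadline, while both the faster clock and the slower clerk push the slack below zero.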
Adapting the File Clerk Model for FPGAs
● Timed FCM
● Think of a really simple-minded file clerk
● Vocabulary restricted to “yes”, “no”, and “maybe”
○ Maybe ~ Metastability
● Instructions limited to boolean expressions: file clerk becomes LUTs
● Important differences:
○ Program is unrolled into one long pipeline
○ Registers and file clerks are distributed
Yes, no, maybe?
Adapting the File Clerk Model for FPGAs
● “File clerk production line”
● Information transfer is still the dominating activity
● System-level intuition about the FCM still holds
Figure: file clerk production line. Labels: Input data, Reg (scratch pad), File clerks, Reg (scratch pad), Output data
Mechanics of Timing Errors
Q: Assuming that we have timing violations, what happens?
Q: What conditions must be met before a timing violation results in a logic error?
Q: When do we have to worry?
Sensitization Criteria
Timing violations are a necessary condition for timing errors, but not a sufficient one. The circuit must also be exercised.
FCM analogy: An idle “file clerk production line” does not make errors
Figure: register-to-register path with input sequence ..., X2, X1 and output sequence ..., Y2, Y1
Patience solves all problems
Figure: register-to-register path with repeated input ..., X, X, X, X, X and output ..., Y, Y, #, @, ±
By repeating the input, the output will eventually settle to the correct, error-free value.
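Both the sensitization criterion and the patience principle can be illustrated with a toy model of a single pipeline stage whose logic is too slow for the clock. This is a sketch under stated assumptions, not a circuit simulation: the function names, the settle_cycles parameter, and the use of "?" for an unsettled capture are all hypothetical.

```python
def run_stage(f, inputs, settle_cycles=2):
    """Return what the output register captures on each clock edge.

    Toy model: after the input changes, the logic needs `settle_cycles`
    clock cycles before its output equals f(input). Until then the
    register captures an unknown value, written "?".
    Assumption: before the sequence starts, the stage had already
    settled on the first input value.
    """
    if not inputs:
        return []
    captured = []
    previous = inputs[0]
    held_for = settle_cycles
    for x in inputs:
        # Count how many consecutive cycles the current input has been held.
        held_for = held_for + 1 if x == previous else 1
        previous = x
        captured.append(f(x) if held_for >= settle_cycles else "?")
    return captured


def square(x):
    """Stand-in for the combinational function between the registers."""
    return x * x


if __name__ == "__main__":
    print(run_stage(square, [3, 3, 3, 3]))  # idle line: input never changes
    print(run_stage(square, [3, 4, 5, 6]))  # exercised every cycle
    print(run_stage(square, [3, 4, 4, 4]))  # new input, then repeated
```

The idle sequence never produces an unknown capture even though the path violates timing, the fully exercised sequence does so on every change, and the repeated input settles to the correct value once it has been held long enough.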
Two Primary Error Modes
Transition from X1 to X2
● Dynamic hazard when F(X1) != F(X2) → possible “stuck-at” error
● Static hazard when F(X1) == F(X2) → possible “bit-flip” error
Figure: register, logic function F, register. Output sequence F(X2), F(X1)
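A rough way to picture the two error modes is to sample a waveform at the clock edge. In the Python sketch below, a signal is a list of (time, value) transitions. The waveforms, the 10 ns clock edge, and the function name are hypothetical, and a late transition is used as the simplest stand-in for the F(X1) != F(X2) case.

```python
def value_at(waveform, t):
    """Value of a piecewise-constant signal at time t.

    `waveform` is a list of (time_ns, value) transitions sorted by time;
    the first entry gives the initial value.
    """
    current = waveform[0][1]
    for time_ns, value in waveform:
        if time_ns <= t:
            current = value
        else:
            break
    return current


CLOCK_EDGE_NS = 10  # hypothetical capturing clock edge

# F(X1) = 0, F(X2) = 1: the output must change, but the change arrives late.
late_transition = [(0, 0), (12, 1)]

# F(X1) = F(X2) = 1: the output should stay put, but glitches across the edge.
glitch = [(0, 1), (9, 0), (11, 1)]

if __name__ == "__main__":
    # Captures the old value 0 instead of 1: the output looks stuck.
    print(value_at(late_transition, CLOCK_EDGE_NS))
    # Captures 0 although both old and new values are 1: a bit flip.
    print(value_at(glitch, CLOCK_EDGE_NS))
```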
Generation of “Maybe’s”
● Register inputs must be stable during the setup and hold period (aperture).
● Unstable signals during latching → probability of metastability
● Given sufficient patience, “maybe’s” will settle to a fixed yes or no. However, there is no guarantee that the value is correct (coin flip)
● With some probability, logic hazards can result in “maybe’s”
Clock Domain Crossing
● Ubiquitous in FPGA designs
● Metastable behavior in receiving clock domain
● Critical for control signals
● Data signals are usually less critical (but it depends)
● Constant signals usually not critical (e.g. configuration signals for subsystem)
Clock Domain Crossing
Classical mitigation using synchronizer
● Decreases the probability of “maybe’s”
○ More levels, less probability
● No guarantee of correct signal transfer!
● To ensure signal integrity, the patience principle must be applied
○ Sig1 must be repeated
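"More levels, less probability" is often quantified with the first-order synchronizer model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time available for a "maybe" to resolve. The Python sketch below applies that model. Every number in it is hypothetical rather than taken from a datasheet, and the assumption that each extra register adds one clock period minus a fixed overhead of resolution time is a simplification.

```python
import math


def synchronizer_mtbf_s(stages, f_clk_hz, f_data_hz, tau_s, t_w_s, t_overhead_s):
    """First-order MTBF estimate (seconds) for an N-register synchronizer.

    tau_s        : metastability resolution time constant of the register
    t_w_s        : metastability window (aperture) of the register
    t_overhead_s : per-stage time lost to setup and routing
    All parameter values used below are hypothetical.
    """
    if stages < 1:
        raise ValueError("need at least one register")
    period_s = 1.0 / f_clk_hz
    # Assumption: each register after the first adds one clock period,
    # minus overhead, for the "maybe" to resolve.
    t_resolve_s = (stages - 1) * (period_s - t_overhead_s)
    return math.exp(t_resolve_s / tau_s) / (t_w_s * f_clk_hz * f_data_hz)


if __name__ == "__main__":
    for n in (1, 2, 3):
        mtbf = synchronizer_mtbf_s(
            stages=n, f_clk_hz=100e6, f_data_hz=1e6,
            tau_s=100e-12, t_w_s=100e-12, t_overhead_s=1e-9,
        )
        print(f"{n} stage(s): MTBF ~ {mtbf:.3g} s")
```

The estimate grows exponentially with each added stage but never becomes infinite, which is the slide's point: a synchronizer lowers the probability of a "maybe" without guaranteeing a correct transfer.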
When to worry about timing violations?
Evaluate and accept
● Some data signals
● Debug
● Configuration
● Low-frequency signals relative to fclk
Evaluate and avoid
● Mitigate
○ Switch to level signaling
○ Add synchronizers
● Refactor
That’s all folks