HIGH LEVEL DEBUGGING TECHNIQUES FOR MODERN VERIFICATION FLOWS
by
Zissis Poulos
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering, University of Toronto
Copyright © 2014 by Zissis Poulos
Abstract
High Level Debugging Techniques for Modern Verification Flows
Zissis Poulos
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014
Early closure on functional correctness of the final chip has become a crucial success factor in the
semiconductor industry. In this context, the tedious task of functional debugging poses a significant
bottleneck in modern electronic design processes, where debugging remains predominantly manual and
new debugging-related problems are constantly introduced. This dissertation proposes methodologies
that address two emerging debugging problems in modern design flows.
First, it proposes a novel and automated triage framework for Register-Transfer-Level (RTL) debug-
ging. The proposed framework employs clustering techniques to automate the grouping of a plethora of
failures that occur during regression verification. Experiments demonstrate accuracy improvements of
up to 40% compared to existing triage methodologies.
Next, it introduces new techniques for Field Programmable Gate Array (FPGA) debugging that
leverage reconfigurability to allow debugging to operate without iterative executions of computationally-
intensive design re-synthesis tools. Experiments demonstrate productivity improvements of up to 30×
vs. conventional approaches.
Acknowledgements
First and foremost I would like to sincerely thank my supervisor Professor Andreas Veneris for being
an excellent mentor, guide and teacher throughout my journey into research. You are a constant source
of motivation and passion that drives me to achieve my full potential.
I would like to acknowledge my MASc. committee members, Professors Jason Anderson and
Vaughn Betz for their thorough reviews of my dissertation and their insightful feedback.
I would like to thank my parents who have been a never-ending source of love and support. You
have instilled in me the values and lessons that have guided me through my life. Mother, your spirit
keeps guiding me.
I am indebted to several colleagues at the University of Toronto for all their shared wisdom and
constructive feedback. Special thanks to Hratch Mangassarian, Terry Yang, Bao Le, Mohammad Fawaz,
and Kalin Ovtcharov.
Finally, acknowledgements are due to the University of Toronto for its financial support.
for my Mother
Contents
List of Tables vii
List of Figures viii

1 Introduction 1
1.1 Background and Motivation 1
1.1.1 Failure Triage in Verification 4
1.1.2 Functional Debug and FPGAs 5
1.2 Purpose and Scope 6
1.3 Thesis Outline 6

2 Background 8
2.1 Introduction 8
2.2 Design Verification 8
2.2.1 Verification Tools 9
2.2.2 Notation and Iterative Logic Arrays 10
2.2.3 Multiple Counter-examples in Verification 11
2.3 SAT-based Design Debugging 14
2.3.1 Debugging Multiple Counter-examples 18
2.4 Simulation Metrics in Verification and Debugging 19
2.5 Field Programmable Gate Arrays 23
2.6 Summary 24

3 A Triage Engine for RTL Design Debug 25
3.1 Introduction 25
3.2 Triage in Debugging 27
3.2.1 Motivation 29
3.3 Counter-example Triage Framework 31
3.3.1 Error Behavior Analysis 31
3.3.2 Suspect Ranking 35
3.3.3 Counter-example Proximity 38
3.3.4 Error Count Estimation 40
3.3.5 Counter-example Clustering 44
3.3.6 Overall Flow 49
3.4 Experimental Results 51
3.5 Summary 56

4 Re-configurability in FPGA Functional Debug 58
4.1 Introduction 58
4.2 FPGA Functional Debug 59
4.3 A Reconfigurability-Driven Approach to FPGA Functional Debug 61
4.3.1 An Area-Optimized Multiplexer Implementation 62
4.3.2 Debugging with Limited Output Pins 63
4.4 Experimental Study 64
4.4.1 Area Usage and Timing Analysis 65
4.4.2 Productivity and Stability 68
4.5 Summary 70

5 Conclusions and Future Work 71
5.1 Summary of Contributions 71
5.2 Future Work 72

List of Tables

3.1 Proposed Triage Engine Performance (R=4) 55
4.1 Benchmarks 65
4.2 Effects of area-optimized multiplexer 67
4.3 Effects of area-optimized multiplexers with shift registers 68
4.4 Compilation time of SignalTap 68

List of Figures

1.1 Simplified VLSI design flow 2
2.1 The Iterative Logic Array method 11
2.2 Two distinct counter-examples 13
2.3 Typical automated debugging flow 16
2.4 Debugging result for counter-examples C1 and C2 17
2.5 Prefix and suffix windows of suspects ⟨s^1_1, 5⟩ and ⟨s^2_2, 6⟩ 22
2.6 FPGA architecture 23
2.7 FPGA hardware structures 24
3.1 Incorrect grouping by conventional techniques 28
3.2 Probabilistic behavior of errors 34
3.3 Example of suspect ranking for a counter-example 37
3.4 Counter-example proximity 40
3.5 Counter-example hierarchical clustering 48
3.6 Proposed triage framework 50
3.7 Examples of injected RTL errors 52
3.8 Features of real errors and suspects 53
3.9 Effect of selecting the R highest in rank suspects 54
3.10 Effect of modification on Ward's Method 56
4.1 Area overhead of SignalTap 61
4.2 Multiplexer for signal selection 61
4.3 16-to-1 MUX implementation in 6-input LUTs 63
4.4 Multiplexer with a 4-bit shift register 64
4.5 Area and Fmax of multiplexers 66
4.6 Stability of SignalTap 69
Chapter 1
Introduction
1.1 Background and Motivation
Today, the size and complexity of modern Very Large Scale Integration (VLSI) computer chips grow at
a galloping pace. This exponential trend is the driving force that sustains growth in the semiconductor
industry. The domain is characterized by escalating competitiveness, increasing customer demands,
strict time-to-market deadlines and tight budgetary constraints. Consequently, the semiconductor
industry is on a constant lookout to automate as many of the steps involved in typical VLSI design
flows as possible. Computer Aided Design (CAD) tools are crucial components in this effort, as they
automate many steps that would otherwise be intractable to perform manually.
The realization of a computer chip comprises a multitude of steps, where each step refers to a
different layer of design abstraction. Fig. 1.1 illustrates a simplified view of the VLSI design flow.
The flow begins with a high-level behavioral specification, composed as a natural-language document
or written using refined behavioral models in languages such as C or Matlab. Behavioral synthesis
follows, converting the behavioral specification into a Register Transfer Level (RTL) description in a
language such as Verilog or VHDL. The next step in the design flow is logic synthesis, where the
generated RTL description is translated into a gate-level netlist. The gate-level description is
subsequently synthesized into a transistor-level netlist, which in turn is placed and routed on the
chip fabric. Finally, the resulting physical layout is passed to a chip manufacturing facility for
fabrication.
Figure 1.1: Simplified VLSI design flow. (The figure shows the chain from behavioral specification through behavioral synthesis, logic synthesis, layout synthesis and fabrication, with the verification, testing and debug steps, namely functional verification, equivalence checking, design debugging, testing, fault diagnosis and silicon debug, interposed between the implementation steps.)
Two integral processes of any VLSI design flow are those of verification and test, which are in-
terposed between implementation steps to guarantee that each abstraction layer matches the next one.
More precisely, functional verification is a process that matches the behavioral specification to the RTL
description and the RTL description to the gate-level netlist, ensuring that no functional discrepancies
exist. When functional verification fails, design debugging commences. The purpose of debugging is to
locate the root-cause that led to a functional mismatch exposed by verification. Finally, test is the pro-
cess which ensures that the manufactured chip matches its gate-level netlist. If a mismatch is identified
at that step, then either silicon debug or fault diagnosis commence in order to locate functional errors or
fabric defects respectively.
In modern design flows, most of the verification and debugging steps in Fig. 1.1 are automated.
Equivalence checking [5, 14], property checking [9, 28], automatic test-bench generation, assertion-based verification [13], functional coverage and the employment of powerful formal engines, such as
Binary Decision Diagrams (BDDs), Boolean Satisfiability (SAT) solvers and Quantified Boolean For-
mulae (QBF) [12, 27, 34, 35, 37, 38], constitute an evolution that re-defined the verification approach.
These algorithms and methods brought automation to functional debugging and significantly decreased
the related time and cost.
However, despite the above breakthroughs, the ever-growing complexity and size of electronic de-
signs has intensified the cost of debugging-related issues that were previously easier to manage. For
example, more and more engineers with very specialized responsibilities are involved in design cy-
cles. As a result, the task of determining the appropriate engineer(s) to analyze a verification failure
has become far from trivial. Another negative impact of this massive growth appears in the field of
FPGA functional debug, where, today, the internal logic becomes harder to observe and debug, and thus
requires the use of external equipment and time-consuming iterative approaches.
In the following two Sections, we discuss the current status and issues regarding the above growing
problems in verification.
1.1.1 Failure Triage in Verification
Typical modern designs incorporate a broad collection of interconnected functional blocks that interact
to define overall design operation. Evidently, functional verification of such complex Systems-on-Chip
(SoCs) becomes much more demanding in time, cost and resources. It is commonplace for verification
and design engineering teams to repetitively exercise the system’s behavior in an effort to expose as
many failures as possible; a process known as regression verification.
Despite recent efforts in industry and academia that bring automation to ease functional debugging,
another emerging problem that relates to regression verification has been left unexplored. Specifically,
automation in debugging mainly focuses on single failures, where each failure is treated separately and
in isolation, even if verification exposes dozens of mismatches. This narrow approach becomes an
impediment in regression verification scenarios and complicates the debugging task, since regression
can potentially generate a plethora of failures to be fixed. Neglecting potential relations between these
failures may lead to multiple engineers investing effort to resolve the same design error; a waste of
precious resources. Moreover, since the design’s erroneous behavior can be familiar to some of the
engineers but totally opaque to others, it is hard to identify the most suitable engineer(s) to debug these
failures.
In this context, triage is the high-level debugging task following regression verification that has a
twofold purpose. First, it tries to group together those failures that are closely related with respect to
their root-cause. Second, it aims to identify the most suitable engineer(s) to perform detailed debugging
for each one of the formed groups. The benefit of triage lies in the fact that only those engineers
familiar with the erroneous behavior will pursue detailed debugging for a given group. Moreover, any
fix for a failure can potentially eliminate most failures belonging to the same group, since they are
probably caused by the same design error.
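To make the grouping step concrete, the sketch below clusters failures whose candidate error locations (suspect sets) overlap. The Jaccard similarity measure, the 0.5 threshold and the suspect sets themselves are illustrative assumptions only; they are not the metrics or clustering scheme this thesis introduces in Chapter 3.

```python
# Illustrative sketch: group failures whose suspect sets overlap.
# All names, thresholds and suspect sets below are hypothetical.

def jaccard(a, b):
    """Similarity of two suspect sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def triage(failures, threshold=0.5):
    """Greedily merge failures into groups of likely-common root cause."""
    groups = []  # each group: [merged suspect set, list of failure ids]
    for fid, suspects in failures.items():
        for group in groups:
            if jaccard(group[0], suspects) >= threshold:
                group[0].update(suspects)
                group[1].append(fid)
                break
        else:
            groups.append([set(suspects), [fid]])
    return [sorted(g[1]) for g in groups]

# Hypothetical suspect sets returned by a debugger for four failures
failures = {
    "F1": {"alu.v:88", "ctrl.v:12"},
    "F2": {"alu.v:88", "ctrl.v:12", "alu.v:90"},
    "F3": {"fifo.v:41"},
    "F4": {"fifo.v:41", "fifo.v:42"},
}
print(triage(failures))  # [['F1', 'F2'], ['F3', 'F4']]
```

Under these assumed suspect sets, two groups emerge, one per underlying error, so only one engineer needs to pursue detailed debugging per group.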
Current studies indicate that triage can potentially occupy up to 30% of the debugging effort [18].
Despite these projections, triage in modern flows is predominantly performed in an ad hoc, manual and
time-consuming manner [36].
1.1.2 Functional Debug and FPGAs
As the cost of state-of-the-art ASIC design continues to escalate, field-programmable gate arrays (FP-
GAs) have become widely used platforms for digital circuit implementation. FPGAs carry several
advantages over ASICs, including reconfigurability and lower NRE costs for mid-to-high volume ap-
plications. While there remains a gap between FPGAs and ASICs in terms of circuit speed, power and
logic density [24], innovations in FPGA architecture, circuits and CAD tools have produced steady im-
provements on all of these fronts [7, 8, 39]. Today, FPGAs are a viable target technology for all but the
highest volume or low-power applications.
The reconfigurability property of FPGAs reduces the cost associated with fixing the various func-
tional errors that can occur during the design cycle. Whether used as hardware emulation platforms or
as actual target devices, FPGAs offer a different debugging approach by allowing design iterations to
include actual silicon execution. Designers verify their design in hardware using the same (or a similar)
FPGA they intend to deploy in the field. When design errors are discovered, the design’s RTL is altered,
re-synthesized and executed in hardware. Although widely adopted, the nature of such verification and
debugging approaches is clearly iterative, where each iteration includes the re-execution of time- and
resource-intensive tool flow steps.
The underlying factor that strongly affects the number of debug iterations is the observability of the
design’s internal logic. If the engineer is able to observe a relatively large amount of internal signals
during silicon execution, then less debugging iterations are required. However, in silicon execution the
amount of observable signals is limited by the number of user-dedicated pins that, in most cases, is
prohibitively small to accommodate the needs of functional debugging. As a result, typical techniques
that utilize commercial internal or external logic analyzers have their efficiency highly degraded because
of the above limitation [4, 42]. Engineers are required to observe a predefined set of internal signals,
analyze the respective waveforms, and then re-synthesize the design along with the interconnections to
the logic analyzer, before they observe a new set of signals.
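The cost of this observe-and-resynthesize loop can be sketched with simple arithmetic; all figures below (signal counts, pin counts, per-step run-times) are hypothetical and chosen only to illustrate the shape of the trade-off.

```python
import math

def debug_runs(num_signals, pins_per_run):
    """Observation rounds needed to see every signal of interest,
    when each round can expose at most pins_per_run signals."""
    return math.ceil(num_signals / pins_per_run)

# Hypothetical scenario: 256 internal signals of interest, 8 spare pins.
runs = debug_runs(256, 8)          # 32 observation rounds
conventional = runs * 45           # assumed ~45 min re-synthesis per round
reconfig = 45 + (runs - 1) * 1     # one synthesis, then ~1 min reconfigurations
print(runs, conventional, reconfig)  # 32 1440 76
```

Under these assumed numbers, avoiding a full re-synthesis on every round cuts total tool time by more than an order of magnitude, which motivates techniques that leverage reconfigurability instead of iterative re-synthesis.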
As such, the domain of FPGA functional debug faces an emerging need for robust debugging techniques
that leverage reconfigurability to increase observability and, ultimately, reduce the number of
debugging iterations.
1.2 Purpose and Scope
This thesis aims to address the aforementioned emerging issues in functional debugging. The first
contribution is the introduction of a novel and automated triage framework for RTL design debugging.
The second contribution suggests new hardware structures and software techniques that leverage re-
configurability to reduce FPGA functional debugging time.
In summary, the contributions of this thesis are as follows:
• A novel automated triage methodology is presented, which groups together related failures that
are generated by regression verification flows. The framework is based on newly introduced met-
rics that define relations between failures and make predictions about the number of co-existing
RTL errors in the failing design. Part of the framework is a novel ranking scheme that aids the
engineers to identify those potential error locations that should be prioritized for detailed debug-
ging. The proposed framework formulates triage as a pattern recognition problem, and generates
solutions by employing clustering techniques. The developed framework is tested on four differ-
ent industrial designs with multiple injected errors demonstrating significant gains over existing
triage methodologies. Specifically, among all regression verification scenarios, we observed an
overall accuracy of 94%, which constitutes a 40% improvement over conventional approaches.
• New hardware and software techniques are introduced that speed-up FPGA functional debugging
by allowing the tracing of internal design signals during silicon execution without the need for
time-intensive re-synthesis iterations. The proposed method requires a sole execution of the syn-
thesis flow to trace a large number of signals for an arbitrary number of cycles using a limited
number of output pins. Experimental results demonstrate that our approach improves run-time
by up to 30× vs. a conventional approach for FPGA functional debug. Our approach also offers
stability in the timing characteristics of the circuit being debugged.
1.3 Thesis Outline
This thesis is organized as follows. Chapter 2 presents background information on design verification,
design debugging and FPGAs. Chapter 3 describes a novel automated triage framework for design de-
bugging. Chapter 4 proposes new techniques for FPGA functional debug that exploit the re-configurable
nature of these devices. Finally, Chapter 5 concludes this thesis.
Chapter 2
Background
2.1 Introduction
This chapter presents background material that is relevant to the contributions of this thesis. It is or-
ganized as follows. Section 2.2 provides an overview of functional verification and design debugging.
Next, Sections 2.3 and 2.4 discuss the basic concepts of SAT-based functional debugging and of
simulation metrics in debugging and verification; topics that are related to the first contribution of this
dissertation. Finally, Section 2.5 summarizes the basic architectural features of modern FPGAs that
form the platform on which the second thesis contribution is based.
2.2 Design Verification
The goal of functional verification is to determine whether the implementation of a design conforms to
its specification. Although many different types of errors exist, this thesis addresses functional errors in
the RTL. Errors relating to power consumption and/or timing violations are not considered. For clarity,
the term error refers to any incorrect RTL element(s) that cause an observable discrepancy between
the design’s implementation and its specification under the same input stimuli. Broadly speaking, there
are two main types of errors that are introduced either by bugs in CAD tools or by the human factor.
Those related to incorrect design code (wrong signal(s), operation(s), disconnected port(s), etc) and
those related to erroneous verification code (wrong assertion(s), incorrect stimuli etc). The first are
referred to as design errors, whereas the latter are known as verification environment errors [22]. In the
context of this dissertation, the general term error refers to both of the above error categories.
The observable effect of an error is called a verification failure. A discrepancy related to a verifi-
cation failure takes the form of conflicting signal values (0, 1 or X for unknown) at observation points,
such as primary output pins, observed internal circuit signals or failing assertions.
2.2.1 Verification Tools
Functional verification can be coarsely categorized into simulation-based, emulation-based and formal
verification. Broadly speaking, simulation-based verification [1, 6, 32] explores the design space by
providing stimulus to the design through a testbench. Mainstream simulation-based verification strate-
gies utilize a logic simulator to determine the existence of failures by repeatedly applying input patterns
to exercise corner-case functionality. The major drawback of this approach is its non-exhaustive nature.
As such, it is unable to guarantee functional correctness of the design. Along these lines, several cov-
erage metrics exist that indicate whether simulation can achieve a high degree of confidence in design
correctness. Despite its shortcomings, simulation-based verification remains the predominant verifica-
tion methodology in the industry today as it is used in more than 90% of verification cases [18]. The
success of simulation-based verification is primarily due to its simplicity and its relative scalability.
In contrast to simulation, formal verification explores the design space exhaustively with the use
of formal engines and mathematical models, such as Binary Decision Diagrams (BDDs), Satisfiability
(SAT) and Quantified Boolean Formula (QBF) solvers [12, 27, 34, 35, 37, 38]. Consequently, formal
techniques can prove or disprove the correctness of the design, but they may suffer from limited
practicality, as the underlying mathematical models scale poorly with design size.
Finally, emulation-based verification is predominantly based on FPGA prototyping [16, 25]; the
design is first implemented on an FPGA and exercised for the purpose of verification. Once the
design passes verification, it is implemented on a similar platform that is intended for deployment in
the field. The major advantage of emulation/FPGA prototyping is speed. The design can often run orders
of magnitude faster than simulation and the verification process does not slow down as new functional
blocks are integrated in the design. On the other hand, its major drawback is observability. Accessing
the values of internal signals and state elements requires the use of logic analyzers or scan chains that
can only monitor a limited number of signals [4, 20].
In the simplest of cases, verification tools are run on-the-fly by the engineer to identify a single fail-
ure and then debugging commences so that the failure is eventually rectified. However, this dissertation
addresses a more complex scenario; that of regression verification. In modern verification environments,
regression verification refers to the practice where large sets of tests are run overnight on computer
clusters to intensively exercise the design's behavior and potentially expose a large number of failures. For a
regression run, it is only when all observed failures are collected that the debugging step begins. As
such, the discussion that follows assumes the existence of multiple failures.
2.2.2 Notation and Iterative Logic Arrays
Before moving into details regarding verification and debugging, it is crucial to first introduce appro-
priate notation for sequential circuits and describe the basic circuit representation used in verification
flows.
For a given sequential circuit with n primary inputs, m observation points and l state elements,
let x_1, x_2, ..., x_n, y_1, y_2, ..., y_m and e_1, e_2, ..., e_l denote the primary inputs, observation points and state
elements, respectively. Let also the vectors x^i = {x^i_1, x^i_2, ..., x^i_n}, y^i = {y^i_1, y^i_2, ..., y^i_m} and e^i = {e^i_1, e^i_2, ..., e^i_l}
correspond to the values of the primary inputs, observation points and state elements at cycle i, respectively.
The vector y^i is generally referred to as the observed response for cycle i and corresponds to the values
provided by the implementation of the design. On the other hand, for the specification of the design, the
vector ŷ^i = {ŷ^i_1, ŷ^i_2, ..., ŷ^i_m} represents the values that define the correct design behavior, and is referred
to as the expected response for cycle i. If verification finds at least one cycle i for which y^i ≠ ŷ^i, then the
design is said to fail verification and a failure is observed at cycle i. Finally, let X^j_i = {x^i, x^{i+1}, ..., x^j}
and Y^j_i = {y^i, y^{i+1}, ..., y^j} denote the set of vectors of primary input values and the set of vectors of
observed responses from cycle i to cycle j, respectively.
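Under this notation, failing verification amounts to a per-cycle comparison of observed and expected responses. A minimal sketch follows, with signal values encoded as '0', '1' or 'X'; treating 'X' as compatible with any value is an assumption made for this sketch only.

```python
def failing_cycles(observed, expected):
    """Return the cycles i (1-indexed) at which the observed response
    y^i differs from the expected response yhat^i.  Each response is a
    tuple of '0'/'1'/'X' values; 'X' (unknown) is treated here as
    compatible with anything, an assumption of this sketch."""
    fails = []
    for i, (y, yhat) in enumerate(zip(observed, expected), start=1):
        for a, b in zip(y, yhat):
            if "X" not in (a, b) and a != b:
                fails.append(i)
                break
    return fails

# Hypothetical two-output design simulated for three cycles
observed = [("0", "1"), ("1", "1"), ("0", "X")]
expected = [("0", "1"), ("1", "0"), ("0", "0")]
print(failing_cycles(observed, expected))  # [2]
```

Here a discrepancy at the second output in cycle 2 is flagged as a failure, while the unknown value in cycle 3 is not counted as a mismatch.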
In verification, it is necessary to model the behavior of a sequential circuit. This can be achieved by
using a method called the Iterative Logic Array (ILA), also known as time-frame expansion. Assuming
a single clock domain, time-frame expansion for k cycles models a sequential circuit by extracting its
combinational part and replicating it for k times, such that the next-state of each clock, or time-frame, is
connected to the current-state of the next time-frame. Using the notation presented above, the following
example demonstrates the time-frame expansion process.
Figure 2.1: The Iterative Logic Array method. (The figure shows (a) the original circuit with primary inputs x_1 and x_2, output y_1 and state element e_1; (b) the extracted combinational circuit with the flip-flop removed; and (c) the two time-frame expanded circuit.)
Example 1 Figure 2.1 shows how to generate the ILA of a circuit for two cycles. Figure 2.1(a) shows
the original sequential circuit, with primary inputs x1 and x2, a single primary output y1, and a single
state element e1. First, the combinational component is separated from the sequential components by
extracting the flip-flops, as shown in Figure 2.1(b). The pseudo-inputs and pseudo-outputs are implicitly
shown to be wires coming out of the dotted box. Next, the combinational component is replicated into a
two time-frame expanded circuit shown in Figure 2.1(c).
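The same expansion can be mimicked in software by iterating the extracted combinational function, one replica per time-frame. The actual gate functions of Figure 2.1 are not specified in the text, so the XOR-based gates below are assumed purely for illustration.

```python
def comb(x1, x2, e1):
    """Hypothetical combinational part of Example 1's circuit: returns
    the output y1 and the next-state value feeding flip-flop e1.
    (The real gates are not given; XORs are assumed for illustration.)"""
    y1 = x2 ^ e1        # assumed output gate
    e1_next = x1 ^ x2   # assumed next-state gate
    return y1, e1_next

def ila(comb_fn, inputs, e1_init):
    """Time-frame expansion in software: apply the combinational
    function once per cycle, chaining each frame's next-state into
    the current-state of the following frame."""
    e1, outputs = e1_init, []
    for x1, x2 in inputs:
        y1, e1 = comb_fn(x1, x2, e1)
        outputs.append(y1)
    return outputs, e1

# Unroll for two cycles, matching Example 1's two time-frames
outs, final_state = ila(comb, [(1, 0), (0, 1)], e1_init=0)
print(outs, final_state)  # [0, 0] 1
```

The loop body corresponds to one replica of Figure 2.1(b); chaining `e1` between iterations plays the role of the wires connecting consecutive time-frames in Figure 2.1(c).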
2.2.3 Multiple Counter-examples in Verification
In the vast majority of verification cases, for each failure exposed during regression, both simulation-
based and formal verification tools generate a counter-example, that is, a sequence of primary inputs,
starting from a given initial state that leads to a discrepancy between the actual and expected responses
of a design’s implementation and specification, respectively. As such, each counter-example contains
precisely the information needed to excite and observe the erroneous behavior related to each failure.
It should be clarified that there are cases where the design fails verification because it cannot reach
specific states, and as a result no counter-example is generated [31]. For example, consider the case
where a design is exercised to reach a particular state, but the state is actually unreachable for the input
stimuli and initial states that are applied. Despite the fact that cases like this are not rare, this thesis
does not deal with debugging instances where no counter-example exists. In what follows, we formally
define the concept of a counter-example within a regression verification context.
Suppose that a regression verification process applies a single and long test sequence on the design
under verification and generates a set of distinct counter-examples. Let this set be C = {C_1, C_2, ..., C_|C|}.
Since there is a single counter-example C_i ∈ C for each observed failure, the set of counter-examples
exposes exactly |C| verification failures. Each counter-example C_i has an associated length denoted
as L_i, which refers to the number of cycles the design is simulated until the corresponding failure is
observed. Assuming that all counter-examples start from cycle 1, each counter-example C_i can be
formally defined as follows.

Definition 1 Given an erroneous design, a counter-example C_i that exposes a failure at cycle L_i is a
tuple ⟨I, X^{L_i}_1, Y^{L_i}_1, ŷ^{L_i}⟩, where I is a set of initial states, X^{L_i}_1 = {x^1, x^2, ..., x^{L_i}} is a set of vectors of
primary input values from cycle 1 to cycle L_i, Y^{L_i}_1 = {y^1, y^2, ..., y^{L_i}} is a set of vectors of observed
responses from cycle 1 to cycle L_i, and ŷ^{L_i} is a vector of expected responses for cycle L_i, with ŷ^{L_i} ≠ y^{L_i}.
Note that ŷ^{L_i} differs from y^{L_i} (ŷ^{L_i} ≠ y^{L_i}) in that the first represents the expected response
at cycle L_i, whereas the latter corresponds to the actually observed response at the same cycle. In this
context, a counter-example comes into conflict with the simulation of the buggy design only at the
responses related to the corresponding verification failure. Also note that a counter-example typically
ends at the cycle where the mismatch related to the failure is observed. In other words, the discrepancy
associated with counter-example C_i of length L_i is observed exactly at cycle L_i.
According to the above definition, it should be clarified that each counter-example includes only the
expected response that exposes a single failure at a specific simulation cycle. For example, a counter-
example Ci includes only the expected response at cycle Li where the corresponding failure is observed,
even if more failures are observed in previous cycles. Those failures are captured by different counter-
CHAPTER 2. BACKGROUND 13
Figure 2.2: Two distinct counter-examples. (a) A counter-example of length m; (b) a counter-example of length m−1.
examples and are not associated with counter-example Ci.
Finally, regression verification generates counter-examples in increasing order of finish time. Since
all counter-examples start at cycle 1, their finish time is equal to their length. As such, counter-examples
are considered to appear in sorted order in the counter-example set C , such that Li ≤ Lj whenever i < j.
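For illustration, a counter-example per Definition 1 and the ordering above can be sketched as a simple data structure. This is a hypothetical Python representation for exposition, not part of any tool described in this thesis; field names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CounterExample:
    """A counter-example per Definition 1: initial states I, per-cycle input
    vectors X[1..L], observed responses Y[1..L], and the expected response
    at the final (failing) cycle L only."""
    initial_states: List[int]
    inputs: List[List[int]]        # x_1 .. x_L, one vector per cycle
    observed: List[List[int]]      # y_1 .. y_L, one vector per cycle
    expected_last: List[int]       # expected response at cycle L only

    @property
    def length(self) -> int:
        # All counter-examples start at cycle 1, so length equals finish time.
        return len(self.inputs)

def sort_counter_examples(cexs):
    """Regression reports failures in increasing finish time, so the set C
    is kept sorted by length (Li <= Lj whenever i < j)."""
    return sorted(cexs, key=lambda c: c.length)
```

Note that the expected response is stored only for the failing cycle, mirroring the convention that each counter-example exposes exactly one failure.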
Example 2 Based on the above definitions, Figure 2.2 illustrates a time-frame expanded erroneous
design with two counter-examples C1 and C2 of respective lengths L1 = m and L2 = m− 1. Counter-
examples C1 and C2 expose two distinct failures at cycles m and m− 1, respectively. According to
the notation presented above, C1 = 〈I, {x1, x2, . . . , xm}, {y1, y2, . . . , ym}, ŷm〉. Similarly, we have that
C2 = 〈I, {x1, x2, . . . , xm−1}, {y1, y2, . . . , ym−1}, ŷm−1〉. In Figure 2.2(a), an error (illustrated by a circle)
is excited at cycle m−2 and eventually propagates to cause a mismatch at primary output y3 during
cycle m. This erroneous behavior is captured by counter-example C1. Counter-example C2, on the
other hand, exposes the erroneous behavior related to the second error, which is excited in cycle k and
causes a failure in cycle m−1 at output y2, as illustrated by Figure 2.2(b). It is important to note that
each counter-example contains expected responses only for a single failure that has not already been
exposed by another counter-example. Thus, C1 includes the expected response ŷm for cycle m but does
not include the expected response ŷm−1 for cycle m−1, because the failure at cycle m−1 has already
been exposed by C2.
As we will see in the next Section, a counter-example is the main object of analysis during functional
debugging.
2.3 SAT-based Design Debugging
Functional debugging is the process in which a failure is analyzed to identify its error source(s). When
the error is in the design or the verification code, the term commonly used in literature is design debug-
ging [1, 35]. With respect to design debugging, the term suspect is used to specify a design component
that can potentially be the error source of a failure. A suspect component can be a line of RTL code, a
Verilog always statement, an if or case statement, a module instantiation, etc. More precisely, a suspect
is defined as follows.
Definition 2 Let Ci = 〈I, X^{Li}_1, Y^{Li}_1, ŷLi〉 represent a counter-example exposing a failure at cycle Li. Then
a design component is called a suspect for Ci, if and only if the component can be functionally modified
such that Ci no longer exposes the observed failure at cycle Li.
SAT-based design debugging is a formal method that encodes the design debugging problem into
a SAT instance for a given counter-example [12, 34, 35, 38]. All satisfying assignments for this SAT
instance correspond to suspects which can be functionally altered and rectify the erroneous behavior
exposed by the counter-example.
As an additional benefit, SAT-based mechanics allow debuggers to return error propagation paths in
the circuit that show how a value from a suspect location propagates through consecutive cycles to reach
the output where a mismatch is observed [22]. Modern SAT-based debugging formulations [22], known
as time-based design debugging, further allow the debugging tool to pinpoint the exact cycle where
a possible error can be excited at a suspect location to cause the observed failure. This information
augments the knowledge of the engineer(s) during the effort of identifying the actual cause of a failure
among all returned suspect locations.
Definition 3 The cycle at which an erroneous value appears at the output(s) of a suspect component
and eventually propagates to cause a failure is referred to as the excitation cycle for that suspect.
Definition 4 An error propagation path is a path in the ILA representation of a sequential circuit that
starts from a suspect location at the corresponding excitation cycle and shows how the erroneous value
propagates to the failing output.
It becomes apparent that a SAT-based debugger does not return a suspect solely as a design location,
but it also provides the excitation cycle for each suspect component. For the remainder of this thesis,
we refer to possibly erroneous design locations as suspect locations, irrespective of the excitation cycle,
and we use the more general term suspect to refer to the suspect location along with its excitation cycle.
Using the above definitions we can define the input and output of a SAT-based debugger as follows:
Definition 5 A SAT-based debugger takes as input an erroneous design and a counter-example Ci
demonstrating a failure. The output of the debugger is a set of suspects Si = {〈si1, ti1〉, 〈si2, ti2〉, . . . , 〈si|Si|, ti|Si|〉}
containing all possible suspect locations sij and the respective excitation cycles tij. Set Si is referred to
as the suspect set (or solution set) of counter-example Ci.
Here, we should mention that SAT-based debugging traditionally requires a parameter called error
cardinality, denoted as N. The error cardinality is equal to the number of suspects that can simulta-
neously rectify the counter-example. Specifically, if the debugger is run with error cardinality N = 1,
then each suspect in the solution set corresponds to a single design component. If, on the other hand,
the error cardinality is set to a higher value, then each suspect is a tuple of N design components that
must be simultaneously modified in order to rectify the failure. For example, if N = 2 then the re-
turned solution set contains all possible pairs of design components that, if both modified, can rectify
the counter-example. In any case, the excitation cycles of each design component are also provided in
the corresponding solution set.
Traditionally, error cardinality N is set to 1 and is iteratively incremented until all possible combi-
nations of suspects are discovered. Nevertheless, in practice it has been shown that in the vast majority
Figure 2.3: Typical automated debugging flow
of debugging scenarios it suffices to set N to the value of 1 or 2, that is, to search for single suspects or
pairs of suspects [35]. For simplicity, the methods described in this thesis assume that N = 1. This is not
a limitation, and as we will discuss in Chapter 3, enforcing higher error cardinality can be effectively
managed by the proposed methodologies.
The above definitions imply that a suspect is not necessarily the actual error that was introduced
into the design. It can be shown that there can be multiple equivalent suspects that explain a single
failure [41]. However, the exhaustive nature of SAT-based debuggers guarantees that the actual error
will be included as a suspect in the returned suspect set [35]. Evidently, the returned suspect set provides
vital hints to the engineer(s) as to where the actual error location should be searched for. The engineer(s)
use their intuition and inherent understanding of the design’s behavior in order to identify those suspect
locations that could be actual errors or related to them. The process of rigorously examining the suspect
set is called detailed debugging [36].
Figure 2.4: Debugging result for counter-examples C1 and C2. (a) Counter-example C1 and debugging suspects; (b) counter-example C2 and debugging suspects.
In summary of the above, a typical verification flow including the task of automated debugging is
shown in Figure 2.3. The verification tool takes as input the design under test (DUT) as well as the high
level specification. If verification fails, a counter-example is generated. The debugging process takes
the counter-example, as well as the DUT and returns a set of suspects that can rectify the erroneous
behavior.
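The flow described above can be expressed as a short sketch. Here, `verify` and `sat_debug` are hypothetical stand-ins for the verification tool and the SAT-based debugger of Figure 2.3; this is an illustration of the control flow only, not an actual tool interface.

```python
def debugging_flow(dut, spec, verify, sat_debug):
    """One pass of the typical automated debugging flow: verify the design
    under test (DUT) against the specification; on failure, hand the
    counter-example and the DUT to the debugger and return the suspect set.
    `verify` is assumed to return None on a pass, or a counter-example
    object on a failure."""
    cex = verify(dut, spec)
    if cex is None:
        return "done", None              # verification passed
    suspects = sat_debug(dut, cex)       # set of (location, cycle) suspects
    return "fail", suspects
```

In practice, the engineer(s) would examine the returned suspects during detailed debugging and re-enter this flow after applying a fix.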
The following example illustrates the aforementioned concepts.
Example 3 An example of debugging two counter-examples is depicted in Figure 2.4 by using the It-
erative Logic Array representation of the sequential design. In this example, two errors cause two
distinct failures. One is excited in cycle m− 2 and propagates to cause a failure at the output in cycle
m, as shown in Figure 2.4(a) and the other is excited in cycle k and propagates to an output in cycle
m− 1, as illustrated by Figure 2.4(b). The generated counter-examples C1 and C2 of length L1 = m
and L2 = m− 1 respectively, are then passed to an automated debugger. For counter-example C1, the
result is a solution set S1 = {〈s11 ,k〉,〈s12 ,m−2〉,〈s13 ,m−1〉} of circuit components that can explain the
wrong output at cycle m. On the other hand, for counter-example C2, the debugger returns a solution
set S2 = {〈s21 ,k〉,〈s22 ,m− 1〉} of circuit components that can be responsible for the failure at cycle
m−1. All suspects s11 , s12 , s13 , s21 , and s22 excited at cycles k, m−2, m−1, k and m−1 respectively,
along with their propagation paths are illustrated in Figure 2.4. Note that the above excitation cycles
are also included in the returned suspect sets. In Figure 2.4, each suspect location is denoted by a dotted
circle in the cycle at which it is excited. Suspects that correspond to the actual error locations are
illustrated by a solid circle. Also notice that the actual error locations are returned as suspects in the
respective solution sets. The error responsible for counter-example C1 is returned as suspect s12 , while
the error responsible for counter-example C2 is returned as suspect s21 .
2.3.1 Debugging Multiple Counter-examples
One of the major limitations in traditional SAT-based debugging flows is that each counter-example is
treated in isolation, without considering potential relations to other counter-examples. When the number
of counter-examples is relatively small, this narrow approach rarely affects the debugging cycle. On the
other hand, when regression generates dozens or even hundreds of counter-examples, then ignoring
such correlations may heavily jeopardize and/or prolong the verification cycle. For example, imagine a
scenario where many counter-examples expose failures that are related to the same design error. In this
case, debugging and individually analyzing each counter-example introduces significant redundancy in
the verification cycle. If the engineer(s) had some knowledge available regarding this correlation, then
the combined information from these counter-examples would significantly aid the discovery of the
actual design error.
In this context, a straightforward correlation between two counter-examples can be easily defined
based on the suspect locations they share in common. We refer to these suspects as the mutual suspect
set and we define it as follows.
Definition 6 For two distinct counter-examples Ci,C j ∈ C , the set containing all common suspects and
their respective excitation cycles is referred to as the mutual suspect set of Ci and C j, denoted as Mi j
and formally defined as:
Mij = { {〈siu, tiu〉, 〈sjv, tjv〉} : siu = sjv }     (2.1)
Obviously, in the above definition, Mij = Mji. Note that in our notation, the equality sign “=”
between two suspect locations corresponds to a match between their unique names in the netlist
representation. This implies that the excitation time is excluded for the purposes of this comparison;
two suspect locations are considered identical when they correspond to the same circuit component, even
if their excitation cycles differ. Also, for uniformity, Mii is defined under Eq. (2.1), where all pairs
contain each suspect in Si twice. The example that follows demonstrates the concept of mutual suspect
sets.
Example 4 For the example in Fig. 2.4, presented before, the mutual suspect set of counter-examples
C1 and C2 is M12 ≡ M21 = {{〈s11, k〉, 〈s21, k〉}}. Note that, trivially, we have M11 = {{〈s11, k〉, 〈s11, k〉},
{〈s12, m−2〉, 〈s12, m−2〉}, {〈s13, m−1〉, 〈s13, m−1〉}} and similarly M22 = {{〈s21, k〉, 〈s21, k〉},
{〈s22, m−1〉, 〈s22, m−1〉}}.
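Computing Eq. (2.1) reduces to pairing suspects whose locations match by name, ignoring excitation cycles. The following is a minimal sketch; the suspect names and cycles below are hypothetical, but the shape of the suspect sets mirrors Example 4.

```python
def mutual_suspect_set(Si, Sj):
    """M_ij per Eq. (2.1): all pairs of suspects from S_i and S_j whose
    suspect *locations* match by name. Excitation cycles may differ and
    are excluded from the comparison."""
    return [(a, b) for a in Si for b in Sj if a[0] == b[0]]

# Suspect sets shaped like Example 4: each suspect is (location, excitation cycle).
S1 = [("u1", 5), ("u2", 8), ("u3", 9)]   # suspect set of counter-example C1
S2 = [("u1", 3), ("u4", 6)]              # suspect set of counter-example C2
```

Here `mutual_suspect_set(S1, S2)` yields the single shared location `"u1"` together with its two different excitation cycles, analogous to M12 in Example 4, while `mutual_suspect_set(S1, S1)` pairs each suspect in S1 with itself, as Mii requires.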
The above definition forms the basis which this dissertation builds upon in order to extract useful
information from the massive data available after regression verification.
2.4 Simulation Metrics in Verification and Debugging
As previously discussed, simulation-based verification is widely adopted in the semiconductor industry.
Apart from the benefits of speed and scalability, simulation tools also offer a wide range of metrics
that can be exploited to extract useful information for the debugging process. Simulation coverage
is one such metric. Generally, the term simulation coverage refers to various coverage metrics, such
as functional coverage, branch coverage or code coverage. For the purposes of this thesis the term
simulation coverage (or simply coverage) refers to the number of times each code line, block, expression
or branch in the RTL description of a sequential circuit is exercised.
The basic goal of simulation coverage is to provide a degree of confidence that a design is indeed
correct whenever it passes verification. For example, if an RTL block is never covered by simulation,
then this RTL block is not guaranteed to be bug-free even if the design passes verification. On the other
hand, coverage can also provide useful information when verification fails, such as the number of times
a component is simulated before the failure is observed.
Broadly speaking, knowing whether a design component is rigorously exercised or not, provides
a measure of certainty when one attempts to estimate how reliably that component can be considered
bug-free. One question that arises is whether coverage can still provide useful information given that a
design component is potentially erroneous, that is, given that the component is a suspect location for a
particular counter-example.
Intuitively, if a suspect component is rigorously exercised for many cycles before an error at that
location is excited and propagates to an observation point, then one can speculate that this component
may actually correspond to an error that is hard to excite and/or has a small effect on the rest of the
circuit; otherwise, it would only take a small number of cycles to excite an error at that suspect
location. Conversely, a suspect location that needs to be exercised only for a few cycles until an erro-
neous value appears and causes a failure, can be considered as more “severe”, that is, easier to excite
and easier to propagate to an observation point. Simulation coverage can provide useful information,
which, combined with SAT-based debugging results, may enrich our knowledge around the nature of
each suspect component.
To this end, we need to extract coverage information with respect to each suspect that appears in the
solution set of a counter-example. However, each suspect has its own excitation cycle within a counter-
example, and, as a result, different parts of the counter-example need to be examined for each suspect
component. In this context, for each suspect there are two parts of the counter-example that should be
examined separately, namely the prefix window and the suffix window of the suspect. In what follows,
we define these two concepts.
If a suspect location is a potential fix for more than one counter-example, then it must appear in
more than one suspect set, having possibly different excitation cycles. In that case, the prefix window of
a suspect location with respect to a given counter-example is defined as the part of the counter-example
that lies between the following cycles:
• the excitation cycle of the suspect location that is related to the given counter-example
• the excitation cycle of the same suspect location that chronologically precedes the excitation that
relates to the given counter-example.
Note that the chronologically preceding excitation does not relate to the given counter-example,
but it relates to a different one for which the suspect location is also a possible fix. More precisely,
consider that the uth suspect for counter-example Ci, denoted as 〈siu , tiu〉, includes the same suspect
location as the vth suspect for counter-example C j, denoted as 〈s jv , t jv〉. That is, siu = s jv , but tiu and
t jv are potentially different. Also assume that the excitation cycle t jv precedes the excitation cycle
tiu in time and that between cycles tiu and t jv there is no other excitation cycle for the same suspect
location. Then the prefix window of suspect 〈siu , tiu〉 with respect to counter-example Ci is defined as
pre(〈siu, tiu〉) = 〈X^{tiu}_{tjv}, Y^{tiu}_{tjv}〉, which effectively defines that part of counter-example Ci that starts at cycle
tjv and ends at cycle tiu. If Ci is the only counter-example for which suspect location siu is returned in the
solution set, then pre(〈siu, tiu〉) = 〈X^{tiu}_1, Y^{tiu}_1〉; that is, the prefix window starts from cycle 1, since there
is no other excitation for the same suspect location in any cycle before cycle tiu. The prefix window of
a suspect is examined separately, because it contains coverage information for a given suspect between
two consecutive excitations. As previously discussed, this information may prove to be significant
when one attempts to analyze how easy or difficult it is to excite an erroneous value at a specific suspect
location.
On the other hand, the suffix window of a suspect location is the part of the corresponding counter-
example that begins one cycle after the suspect excitation cycle and ends at the cycle where the failure
is observed. Evidently, the suffix window of a suspect is the one that contains its error propagation path
and demonstrates how easily an erroneous value at a suspect location can propagate to the observation
point. We denote the suffix window of suspect 〈siu, tiu〉 as suf(〈siu, tiu〉) = 〈X^{Li}_{tiu+1}, Y^{Li}_{tiu+1}〉.
We also refer to the length of a prefix or suffix window of a suspect component as the number of
cycles included in the window. With respect to the above definitions, we denote the lengths of the prefix
and suffix windows of suspect 〈siu, tiu〉 as ||pre(〈siu, tiu〉)|| and ||suf(〈siu, tiu〉)||, respectively.
Example 5 Consider the same counter-examples C1 and C2 that were previously shown in Figure 2.4,
along with their suspect components and the respective excitation cycles. Now, Figure 2.5 illustrates
the prefix and suffix windows of suspects 〈s11, 5〉 and 〈s22, 6〉. The prefix window of suspect 〈s11, 5〉 is
given as pre(〈s11, 5〉) = 〈X^{t11}_{t21}, Y^{t11}_{t21}〉 = 〈X^5_3, Y^5_3〉, since the chronologically preceding excitation for the
same suspect location happens at cycle 3 and is captured by counter-example C2. The suffix window of
suspect 〈s11, 5〉 is given as suf(〈s11, 5〉) = 〈X^{L1}_{t11+1}, Y^{L1}_{t11+1}〉 = 〈X^{10}_6, Y^{10}_6〉. Similarly, the prefix window
Figure 2.5: Prefix and suffix windows of suspects 〈s11, 5〉 and 〈s22, 6〉
of suspect 〈s22, 6〉 is given as pre(〈s22, 6〉) = 〈X^{t22}_1, Y^{t22}_1〉 = 〈X^6_1, Y^6_1〉, since there is no excitation of the
same suspect location happening before the excitation at cycle 6. The suffix window of suspect 〈s22, 6〉
is given as suf(〈s22, 6〉) = 〈X^{L2}_{t22+1}, Y^{L2}_{t22+1}〉 = 〈X^7_7, Y^7_7〉.
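The window arithmetic above is easy to mechanize. The sketch below uses hypothetical helper functions (windows are represented as inclusive (start, end) cycle pairs) and reproduces the numbers of Example 5.

```python
def prefix_window(t_exc, prior_excitations):
    """Prefix window of a suspect excited at cycle t_exc.
    prior_excitations lists excitation cycles of the same suspect location
    in other counter-examples; the window starts at the closest preceding
    excitation, or at cycle 1 if there is none, and ends at t_exc."""
    preceding = [t for t in prior_excitations if t < t_exc]
    start = max(preceding) if preceding else 1
    return (start, t_exc)

def suffix_window(t_exc, L):
    """Suffix window: one cycle after the excitation cycle up to the cycle L
    where the failure is observed."""
    return (t_exc + 1, L)

def window_length(window):
    """Number of cycles included in an inclusive (start, end) window."""
    start, end = window
    return end - start + 1
```

For suspect 〈s11, 5〉 with a preceding excitation at cycle 3 and L1 = 10, this gives the prefix window (3, 5) and suffix window (6, 10), matching X^5_3 and X^{10}_6 above.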
Now that the prefix window of a suspect is defined, we can also define the number of times a suspect
location is exercised for the duration of its associated prefix window, as follows.
Definition 7 The number of cycles for which the input(s) of a suspect component siu toggle(s) during its
prefix window is referred to as the frequency of the suspect, and is denoted as fiu .
Since all suspects siu in a counter-example suspect set Si correspond to circuit components, a map-
ping between siu and its corresponding frequency fiu is always feasible.
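Given a simulation trace of a suspect component's input over its prefix window, the frequency of Definition 7 can be counted directly. This sketch assumes a single-bit input and a simple cycle-to-value mapping for the trace; both are illustrative simplifications.

```python
def suspect_frequency(input_trace, window):
    """f_iu per Definition 7: the number of cycles within the prefix window
    at which the suspect component's input toggles, i.e. differs from its
    value in the previous cycle. input_trace maps cycle -> input value."""
    start, end = window
    return sum(1 for c in range(start + 1, end + 1)
               if input_trace[c] != input_trace[c - 1])
```

For example, for a trace that toggles at cycles 2 and 4 within a prefix window spanning cycles 1 to 5, the frequency is 2.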
Figure 2.6: FPGA architecture
2.5 Field Programmable Gate Arrays
In this Section we offer a brief description of the basic concepts behind a typical Field Programmable
Gate Array architecture. It is exactly these architectural features that our FPGA
debugging approach exploits in order to speed up productivity in FPGA functional debug.
An FPGA is a two-dimensional array of programmable logic blocks (LBs) and a configurable rout-
ing network, as shown in Figure 2.6. Combinational logic functions within logic blocks in FPGAs are
implemented using K-input look-up-tables (LUTs), which are small memories capable of implementing
any logic function of up to K variables. As shown in Figure 2.7(a), each LUT in an FPGA logic block
is normally coupled with a flip-flop, which can optionally be bypassed. SRAM configuration cells are
programmed to specify the truth table of the logic function implemented by the LUT, as well as control
the flip-flop bypass MUX.
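Conceptually, a K-input LUT is just a 2^K-entry truth table addressed by the LUT's inputs. The following sketch illustrates this idea; the bit ordering of the configuration cells is an assumption, and real devices vary in how the truth table is stored.

```python
def lut_eval(config_bits, inputs):
    """Evaluate a K-input LUT: the SRAM configuration bits store the truth
    table of the implemented function, and the K input values form the
    index into it (MSB-first ordering assumed here)."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return config_bits[index]

# A 2-input LUT configured as XOR: truth-table entries for inputs 00, 01, 10, 11.
xor_lut = [0, 1, 1, 0]
```

Reprogramming `config_bits` changes the implemented function without touching any wiring, which is precisely the reconfigurability our debugging approach builds on.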
Fig. 2.7(b) shows a simplified view of a programmable routing structure. The inputs to the MUX
attach to logic block output pins or routing conductors in the FPGA device (metal wire segments). The
output of the buffer can drive a routing conductor or a logic block input. Again, SRAM configuration
cells drive the select inputs on the MUX, and the SRAM values specify a particular input whose signal
is driven through the buffer.
Figure 2.7: FPGA hardware structures. (a) FPGA logic structures; (b) routing structures.
While all of the proposed approaches leverage reconfigurability toreduce loops through the design process, we present a number of designvariants that are desirable in different scenarios, e.g. with differentnumbers of external pins being available for debugging, and withdifferent availabilities of internal FPGA resources, such as block RAMs.A further contribution of this work is a new multiplexer (MUX) designscheme for FPGAs that uses significantly less area than a traditionalMUX design. The new MUX is suitable for use in cases wherein theMUX select inputs are changed using the FPGA bitstream, instead ofusing normally routed logic signals.
[Figure: (a) FPGA logic structures — a logic block with a 4-LUT (inputs f1–f4), a D flip-flop, and MUXes whose select inputs s are driven by SRAM configuration cells; (b) Routing structures — a routing MUX and buffer with inputs i1, i2, i3, …, in, selected by SRAM configuration cells.]
Figure 2.7: FPGA hardware structures.
cells drive the select inputs on the MUX, and the SRAM values specify a particular input whose signal
is driven through the buffer.
Fig. 2.7 is intended to illustrate that the logic functionality and routing connectivity of an FPGA depend entirely on values in the programming bitstream that is shifted into the FPGA's SRAM configuration cells (which are connected in a scan chain). The programming bitstream also specifies the initial value (logic-0 or logic-1) for each flip-flop in the device.
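To make the role of the configuration bits concrete, the following toy Python model (our own illustrative sketch, not from the thesis; real bitstream formats are vendor-specific) treats a 4-LUT's sixteen SRAM cells as a truth table that the logic inputs merely index:

```python
def lut4(config_bits, f1, f2, f3, f4):
    """Evaluate a 4-input LUT whose function is the 16 SRAM config bits."""
    assert len(config_bits) == 16
    index = (f4 << 3) | (f3 << 2) | (f2 << 1) | f1  # inputs address the SRAM cells
    return config_bits[index]

# Configure the LUT as a 4-input AND: only the all-ones address stores a 1.
and_config = [0] * 16
and_config[0b1111] = 1
print(lut4(and_config, 1, 1, 1, 1))  # -> 1
print(lut4(and_config, 1, 0, 1, 1))  # -> 0

# Rewriting the config bits alone turns the same physical LUT into a 4-input
# OR -- no logic or routing change is needed, only reconfiguration.
or_config = [0] + [1] * 15
print(lut4(or_config, 0, 0, 0, 1))  # -> 1
```

This is the property exploited later in this thesis: changing SRAM contents changes functionality without re-running synthesis, placement or routing.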
2.6 Summary
This Chapter presented background material necessary for understanding the contributions of this thesis. First, definitions and concepts related to verification and debugging were presented. Next, SAT-based debugging and simulation-based metrics for verification were briefly discussed, followed by a short presentation of regression verification. Finally, a description of the basic architecture and concepts of Field Programmable Gate Arrays was provided.
Chapter 3
A Triage Engine for RTL Design Debug
3.1 Introduction
Debugging commences once a discrepancy between the specification and the implementation of a design
is discovered. As discussed in the previous Chapter, cutting-edge automated debuggers employ formal methodologies to streamline the debugging process [19, 35]. This is accomplished by utilizing
counter-examples to generate a set of suspect locations in the design that can explain the erroneous
behavior. These locations provide vital suggestions to the engineer as to where the actual error lies in
the design.
However, regression verification flows complicate and prolong the debugging task, since they can
potentially generate hundreds of counter-examples that must be fixed. At the end of regression tests, knowledge of the relation between counter-examples and their culprits is limited. Normally, this causes confusion in
the design and/or verification engineering team, as each failure is constantly assigned and re-assigned
to various engineers until the most suitable one is found to fix the responsible design error.
Recall that triage is the high-level debugging task following regression verification that aims to
group together counter-examples with respect to their root cause errors and provide information that
helps determine the most suitable engineer to perform detailed debugging for each one of the formed
groups. Usually, this information comes into the form of a list of suspicious design components across
counter-examples that belong to a specific group.
Despite its growing complexity, triage in modern flows is predominantly performed in an ad-hoc
manual and time-consuming manner. In the majority of cases, triage is based on scripts that parse
verification error messages to group the observed failures and determine the responsible engineers. In
less common cases, a single engineer is assigned to monitor and analyze verification error logs on a
daily basis to determine the best suited engineer for detailed debugging. The scripting approach suffers
from frequent inaccuracy in counter-example classification, whereas the manual nature of binding an
engineer to the triage task incurs significant cost in terms of time and relies on the engineer’s intuition
and inherent understanding of the design’s behavior. In this work we present a novel automated counter-
example triage framework. More precisely, our contributions are as follows.
1. First, we devise a ranking system for possibly erroneous design locations (suspects) to quantify
their likelihood of being an actual error. This is achieved by performing a probabilistic analysis to
show that errors i) usually have a suffix window of small length, and ii) usually have a relatively
small frequency during their prefix window. The ranking scheme forms the core of the proposed
triage metrics that correlate counter-examples.
2. Second, we introduce the concept of counter-example proximity, a novel speculative metric that
expresses similarity or dissimilarity between counter-examples based on the likelihood of orig-
inating from the same error source or from distinct ones. The suggested metric is constructed
by exploiting simulation coverage, satisfiability, and the proposed ranking system to determine
counter-example correlation.
3. Triage is then formulated as a pattern recognition problem and solved via hierarchical cluster-
ing methodologies. Our approach allows us to employ machine learning algorithms to build an
automated debugging triage framework.
The proposed triage engine can be seamlessly integrated in a regression verification flow and can be
viewed as a vital preprocessing step before detailed debugging commences. The framework is tested on
four different designs with multiple injected errors and achieves significant gains in accuracy, of up to
40%, compared to existing triage methodologies.
This Chapter is organized as follows. Section 3.2 defines the problem of triage in design debug-
ging. Section 3.3 introduces the proposed failure triage framework along with suggested metrics and
heuristics. Finally, Section 3.4 provides experimental results and Section 3.5 summarizes the chapter.
3.2 Triage in Debugging
As presented in Chapter 2, debugging a single counter-example is a straightforward procedure; the automated debugger will return a solution set that can justify the erroneous behavior, and from that set, all
suspects will be examined by the engineer to track down the actual error. Moreover, a quick overview of the suspect locations is usually adequate to identify the rightful owner that should proceed with fixing the counter-example. However, the existence of multiple counter-examples at the end of regression verification necessitates a preprocessing step where one needs to identify the relation between counter-examples, perform some coarse-grain analysis and group them before they are delivered to the appropriate design/verification engineer. In this context, counter-example triage is defined as follows.
Definition 1: Given an erroneous design and a set of counter-examples C = {C1, C2, …, C|C|}, counter-example triage is a complete disjoint partition of C into N clusters/groups g1, g2, …, gN, with gi ⊆ C (1 ≤ i ≤ N), such that the following properties hold:
• jointly exhaustive property: There is no counter-example that does not belong to some cluster, that is, g1 ∪ g2 ∪ ··· ∪ gN = C.
• mutually exclusive property: Each counter-example belongs to exactly one cluster, that is, gi ∩ gj = ∅ if i ≠ j.
• relation property: Each cluster contains related counter-examples that have a high probability
of originating from the same design error.
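The first two properties of Definition 1 are purely set-theoretic and can be checked mechanically. A minimal sketch (the function and variable names are our own, for illustration only):

```python
def is_valid_partition(counter_examples, clusters):
    """Check the jointly exhaustive and mutually exclusive properties."""
    all_ces = set(counter_examples)
    covered = set()
    for g in clusters:
        if covered & set(g):      # mutually exclusive: no CE in two clusters
            return False
        covered |= set(g)
    return covered == all_ces     # jointly exhaustive: every CE is covered

C = ["C1", "C2", "C3", "C4"]
print(is_valid_partition(C, [["C1", "C3"], ["C2", "C4"]]))          # -> True
print(is_valid_partition(C, [["C1"], ["C2", "C4"]]))                # -> False (C3 uncovered)
print(is_valid_partition(C, [["C1", "C2"], ["C2", "C4"], ["C3"]]))  # -> False (C2 in two groups)
```

The relation property, by contrast, is probabilistic and is what the rest of this Chapter develops.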
From the above definitions, two central points arise and have to be carefully addressed:
1. First, the relation between counter-examples belonging to the same group has to be clearly de-
fined. Ideally, counter-examples that belong to the same group are all caused by the same design
error. However, since state-of-the-art debugging tools only approximate the actual error locations, it is practically impossible to develop a method that guarantees the above. Instead, each
counter-example is assigned to a specific group only if there is a high probability that it is caused
by the same design error as the rest of the counter-examples in that group. Conversely, counter-
examples belonging to different groups should have a low probability of originating from the
same error source.
[Figure: simulated traces of counter-examples C1, C2 and C3 over inputs x and observation points y.
(a) Different outputs failing because of same error (C1 and C2 placed in separate groups gw and gk)
(b) Same output failing because of different errors (C1 and C3 placed in the same group gk)]
Figure 3.1: Incorrect grouping by conventional techniques
2. Second, a decision has to be made on the number of groups to be eventually formed by the
engine. Ideally, the number of formed groups should be equal to the number of co-existing errors
responsible for the whole set of generated counter-examples. However, in a real verification
environment there is no prior knowledge on what this number is. As such, triage needs to make a
guess on the number of co-existing errors. The quality of this process depends on how close this
guess is to the actual number of errors in the design.
It becomes clear that triage is characterized by uncertainty; a feature that makes the overall accuracy
of the engine sensitive to the way counter-example similarity is defined. Being able to identify accu-
rate correlations and making a good guess on the number of actual errors drastically increases triage
accuracy. In practice, though, producing such accurate estimations is anything but trivial.
3.2.1 Motivation
Conventional approaches, such as script-based grouping of error logs or manual analysis frequently
fall short when it comes to identifying counter-example relationships and are usually devoid of any
estimation on the actual number of design errors [36]. Fig. 3.1 illustrates two common cases where
traditional methods tend to fail. In Fig. 3.1(a) an error propagates due to different stimulus through
different circuit elements and eventually causes two failures at distinct observation points, y2 and y3,
and at different cycles, 7 and 10 respectively. The counter-examples exposing those two failures will be
wrongly grouped into two separate groups gk and gw, biased by the fact that the observation points -and
hence the error messages- are different. The opposite scenario can also happen. Fig. 3.1(b) illustrates
two distinct errors causing a discrepancy at the same observation point, y3 (although in different cycles 8
and 10) by following different propagation paths. Traditionally, the counter-examples will be placed into
the same group, which is not the desired result. Finally, there is no clear way for the existing automated
methods to determine the rightful owner for each formed group of counter-examples. Frequently, a
verification engineer will be assigned with this task, which involves a lot of manual effort and essentially
defeats the purpose of automation.
Here, it should be noted that for traditional script-based methods, it is hard to base the error message
comparison on the timing information. Particularly, if the same output(s) fail, the error messages will
be considered identical in most cases even if the error messages are triggered in different cycles. Still,
there are cases where scripting techniques can identify patterns in the cycles where error messages
are triggered and hence are able to distinguish between those that are likely to correspond to different
design errors. For example, if two error messages refer to the same observation point but one is triggered
every 500ns and the other one every 2000ns, then these messages will be considered different and the
corresponding counter-examples will be grouped separately. The scripts used for comparison in this
dissertation abide by these empirical rules. Nevertheless, it becomes clear that triage strategies that rely
solely on error messages or timing information will often suffer from poor accuracy.
Where the above strategies fail, the proposed automated triage engine aims to extract deeper infor-
mation from each counter-example, then automatically group those that are similar and finally pass them
to the suitable engineer(s) for further detailed debugging. In order for the grouping to be acceptably ac-
curate, we propose the following:
1. First, a well-defined similarity metric is devised that quantifies the relationship between two given
counter-examples based on their likelihood of sharing the same error source. The proposed simi-
larity metric, called counter-example proximity, is based on a probabilistic error behavior analysis.
This analysis generates a ranking scheme for possible error locations (suspects). The way ranked
suspects are distributed among counter-examples defines how proximity is constructed.
2. Second, a “guess” is made on the number of errors causing the generated set of counter-examples,
because this determines the number of groups to be formed. The above guess is referred to as
the error count estimation and is also based on the suspect ranking scheme presented in this
dissertation.
3. Furthermore, an efficient technique is applied to form closely related groups by using the similar-
ity metric between all possible pairs of counter-examples. This is achieved by formulating triage
as a clustering machine learning problem.
4. Finally, the triage engine provides suggestions that help assign each group to the most appropriate
engineer(s). The suspect ranking scheme presented in the next Sections essentially shows which
suspects should be prioritized (i.e., examined first). The engineers responsible for high priority
suspects are the ones to whom the counter-examples are assigned.
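As an illustrative sketch of step 3, the following minimal single-linkage agglomerative clustering merges the two closest groups of counter-examples (by a given proximity matrix) until a target group count k, the error count estimate, is reached. This is a stand-in for exposition only; the clustering procedure actually used is developed in the following Sections.

```python
def cluster(prox, k):
    """prox[i][j]: dissimilarity between counter-examples i and j; k: target group count."""
    clusters = [[i] for i in range(len(prox))]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(prox[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    return clusters

# Four counter-examples: 0 and 1 are mutually close, as are 2 and 3.
prox = [[0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.7, 0.9],
        [0.9, 0.7, 0.0, 0.2],
        [0.8, 0.9, 0.2, 0.0]]
print(cluster(prox, 2))  # -> [[0, 1], [2, 3]]
```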
The following Section describes detailed work on the aforementioned issues.
3.3 Counter-example Triage Framework
3.3.1 Error Behavior Analysis
In this dissertation we make the assumption that the majority of design errors in typical regression veri-
fication flows are introduced by the human factor. In our discussion, we often refer to human introduced
errors as actual errors. The exhaustive nature of SAT-based debugging guarantees that the location of
an actual error will be returned as a single suspect in the solution set of the resulting counter-example.
Some of the remaining suspects will often be related to the error as well, such as locations in the fan-out
of the error, as shown in [26]. However, a significant portion of suspects are not related to the actual
design bug, even if they fix the same failure. As such, before constructing any triage metrics, it is crucial
to identify those suspects that present similar characteristics to actual design errors.
We address the above by speculating on the way an actual error is excited and eventually propagates
to the failing responses. Suspects that follow our assumptions on how an error behaves are promoted
against suspects that violate our expectations. Such a filtering approach is expected to lead towards
more accurate triage metrics by identifying important suspects and separating them from suspects that
are likely to bear noisy information.
Generally, we expect human introduced errors to be excited in temporal proximity to the failing
observation points, an intuitive argument that is central to Bounded Model Debugging [33]. Moreover, we expect that a human introduced error needs to be exercised only a relatively small number of times in simulation before it is excited. Recall that this number is referred to as the frequency of the respective design location. In the remainder of this Section, we present a thorough probabilistic analysis
that supports the observations above.
Assuming that a single error exists in the design and that simulation starts at cycle 1, let exi be the
probability that the error is excited at cycle i. Also, let pri be the probability of the error propagating
from cycle i to cycle i+ 1, and obi be the probability of observing a failure at an observation point at
cycle i given that the error has propagated to that cycle. Also assume that the input vector sequences are
temporally independent and stationary random sequences.
Proposition 1: The probability pm of observing the first failure at cycle m given the probability that
the error is excited for the first time at cycle n, where m > n, is:
pm = ∏_{i=1}^{n−1} (1 − exi) × exn × ∏_{i=n}^{m−1} pri × ∏_{i=n}^{m−1} (1 − obi) × obm    (3.1)
Proof: Let events:
Ei = {an error is excited at cycle i},
Xi = {an error propagates from cycle i to cycle i+1 given that it propagated to cycle i},
Oi = {a failure is observed in cycle i given that an error has propagated to cycle i},
and let Ēi and Ōi denote the complements of Ei and Oi, respectively.
The probability pm can be expressed in terms of these events as follows:

pm = P( ⋂_{i=1}^{n−1} Ēi ∩ En ∩ ( ⋂_{i=n}^{m−1} Xi ∩ ⋂_{i=n}^{m−1} Ōi ∩ Om | En ) ).

But the events ⋂_{i=1}^{n−1} Ēi are conditionally independent of ⋂_{i=n}^{m−1} Xi and ⋂_{i=n}^{m−1} Ōi ∩ Om. Thus:

pm = P( ⋂_{i=1}^{n−1} Ēi ∩ En ) × P( ⋂_{i=n}^{m−1} Xi ∩ ⋂_{i=n}^{m−1} Ōi ∩ Om | En ).

By Bayes' law and the chain rule we have P(A ∩ B | C) = P(A | C) × P(B | A ∩ C). Hence:

pm = P( ⋂_{i=1}^{n−1} Ēi ∩ En ) × P( ⋂_{i=n}^{m−1} Xi | En ) × P( ⋂_{i=n}^{m−1} Ōi | ⋂_{i=n}^{m−1} Xi ∩ En ) × P( Om | ⋂_{i=n}^{m−1} Ōi ∩ ⋂_{i=n}^{m−1} Xi ∩ En ).

But the events Om and ⋂_{i=n}^{m−1} Ōi are conditionally independent given En, thus pm can be re-written as:

pm = P( ⋂_{i=1}^{n−1} Ēi ∩ En ) × P( ⋂_{i=n}^{m−1} Xi | En ) × P( ⋂_{i=n}^{m−1} Ōi | ⋂_{i=n}^{m−1} Xi ∩ En ) × P( Om | ⋂_{i=n}^{m−1} Xi ∩ En ).

By assumption, inputs of consecutive cycles are temporally independent. As a result, Xi is independent of Xj, and Ei is independent of Ej, for all i ≠ j, meaning that:

P(Xi ∩ Xj | En) = P(Xi | En) × P(Xj | En), and P(Ei ∩ Ej) = P(Ei) × P(Ej).

Consequently:

P( ⋂_{i=n}^{m−1} Xi | En ) = ∏_{i=n}^{m−1} P(Xi | En), and P( ⋂_{i=1}^{n−1} Ēi ∩ En ) = ∏_{i=1}^{n−1} P(Ēi) × P(En).

Similarly, conditional independence between Oi and Oj yields:

P( ⋂_{i=n}^{m−1} Ōi | ⋂_{i=n}^{m−1} Xi ∩ En ) = ∏_{i=n}^{m−1} P( Ōi | ⋂_{i=n}^{m−1} Xi ∩ En ).

Hence, pm can be simplified to:

pm = ∏_{i=1}^{n−1} P(Ēi) × P(En) × ∏_{i=n}^{m−1} P(Xi | En) × ∏_{i=n}^{m−1} P( Ōi | ⋂_{i=n}^{m−1} Xi ∩ En ) × P( Om | ⋂_{i=n}^{m−1} Xi ∩ En ).

Based on the assumptions, 1 − exi = P(Ēi), exn = P(En), pri = P(Xi | En), and obi = P( Oi | ⋂_{i=n}^{m−1} Xi ∩ En ), therefore pm can be written as:

pm = ∏_{i=1}^{n−1} (1 − exi) × exn × ∏_{i=n}^{m−1} pri × ∏_{i=n}^{m−1} (1 − obi) × obm. □
Since we simply aim to construct a generalized view of error behavior, we can assume that the probabilities pri = pr, obi = ob and exi = ex remain fixed for all cycles i. Then Proposition 1 implies:

pm = (1 − ex)^(n−1) × ex × pr^(m−n) × (1 − ob)^(m−n) × ob    (3.2)
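Under the stated independence assumptions, Eq. 3.2 can be sanity-checked by simulating the excitation/propagation/observation model directly. The sketch below is our own (with illustrative parameter values, not taken from the thesis) and compares the closed form against a Monte Carlo estimate:

```python
import random

def pm_formula(ex, pr, ob, n, m):
    """Eq. 3.2: probability of first excitation at cycle n, first failure at cycle m."""
    return (1 - ex) ** (n - 1) * ex * pr ** (m - n) * (1 - ob) ** (m - n) * ob

def pm_simulated(ex, pr, ob, n, m, trials=200_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # cycles 1..n-1: error must not be excited; cycle n: excited
        if any(rng.random() < ex for _ in range(n - 1)):
            continue
        if not rng.random() < ex:
            continue
        ok = True
        # cycles n..m-1: error propagates each cycle, with no failure observed yet
        for _ in range(m - n):
            if rng.random() >= pr or rng.random() < ob:
                ok = False
                break
        # cycle m: the failure is finally observed
        if ok and rng.random() < ob:
            hits += 1
    return hits / trials

ex, pr, ob, n, m = 0.5, 0.8, 0.8, 2, 4
print(pm_formula(ex, pr, ob, n, m))    # 0.5 * 0.5 * 0.8^2 * 0.2^2 * 0.8 = 0.00512
print(pm_simulated(ex, pr, ob, n, m))  # close to the closed-form value
```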
Fig. 3.2 illustrates the results of plotting Eq. 3.2. To show our findings, pm is plotted under two
different settings. In the first, depicted in Fig. 3.2(a), the cycle where the error is first excited is kept con-
stant such that n = 1, whereas cycle m where the failure is observed is selected from the set [2,3,4,5,6].
Recall, that for a single error and a single failure the prefix window of the error corresponds to the se-
quence of cycles from the initial states to the cycle where the error is excited, and that the suffix window
refers to the sequence of cycles immediately after the excitation cycle until the cycle where the failure is
observed. In the above setting, the prefix window is set to a constant length of 1 and the suffix window
[Figure: (a) pm vs. suffix window length m−n, for pr = ob ∈ {0.2, 0.5, 0.8} — effect of suffix window length; (b) pm vs. excitation probability ex, for n ∈ {2, 3, 4, 5, 6} — relation between excitation and failure observation probability.]
Figure 3.2: Probabilistic behavior of errors
length varies. Additionally, the propagation, observation and excitation probabilities are set constant,
such that pr = ob = [0.2,0.5,0.8] and ex = 0.5. Probability pm is plotted as a function of the failure
observation cycle m.
In the second setting, depicted in Fig. 3.2(b), the number of cycles between the first excitation cycle
and the cycle where the error is observed is kept constant so that m− n = 2, while n (the cycle of
first excitation) takes values from the set [2,3,4,5,6]. Essentially, the prefix window length now varies
whereas the suffix window length is fixed to 2 cycles. Probabilities pr and ob are set constant to 0.8. In
the second setting, pm is plotted as a function of ex.
In Fig. 3.2(a), the negative exponential nature of the probability curves confirms the expectations
that an error usually causes a failure only a small number of cycles after it has been excited. Hence,
the error’s suffix is expected to be relatively short. Fig. 3.2(b) leads us to an additional observation.
We observe that as the prefix length increases, the highest value for pm is achieved when the excitation
probability becomes smaller. The above behavior confirms our intuition that even when an error is
first excited close to the failure point, the longer the prefix is, the harder the error should be to excite (the excitation probability that maximizes pm drops). The excitation probability is proportional to
the likelihood of the error location being covered by simulation; an error that is covered by simulation
a small number of times has a small likelihood of being excited (harder to excite), and vice versa.
Therefore, any conclusions related to error excitation can be applied to error coverage (frequency) as
well.
It has to be noted that the description above serves not as a theoretical proof of the behavior of
CHAPTER 3. A TRIAGE ENGINE FOR RTL DESIGN DEBUG 35
errors but only as an experimental intuition of the most typical cases. This is because it is based on
certain assumptions. Specifically, we assume conditional independence between error excitation and
error propagation events for cycles that precede the error excitation cycle, which does not hold in general. However, this probabilistic analysis offers a generalized model of error behavior and
leads to the following general observations. An error is expected:
1. to have a relatively short suffix window
2. to exhibit low coverage (low frequency) during its prefix window
The above observations form the basis of the suspect ranking scheme described in the next Section.
The proposed ranking guides triage metrics to more accurate outcomes, as will be demonstrated by
experimental results.
3.3.2 Suspect Ranking
For a counter-example Ci ∈ C , the returned solution set Si contains all possible suspects for the observed
mismatch. However, there are many cases where the solution set is large. To complicate matters further, some of
the returned suspects are incidental and not typical of common error locations, such as reset signals,
input suspects, or stuck-at-faults at the gate-level. Along these lines, the goal of suspect ranking is to
generate a ranked version of the solution that serves two purposes. First, it segregates suspects that
are not typical of human introduced errors from suspects that are, or are closely related to, actual errors. As
such, counter-example relation is defined based on suspects that are closely related to actual errors and
not by treating all suspects evenly. Second, it aids engineers to prioritize detailed debugging by first
examining those suspects high in rank. Moreover, these high-ranked suspects can offer better guidance
when deciding the most suitable engineer for detailed debugging. For example, if one or more high-
ranked suspects are located in the same design module then the engineer(s) responsible for this particular
module should be the ones to investigate the counter-examples in detail. Conventional debugging does
not offer any such guidance, since all suspects are generated without any sense of priority.
In order to generate an appropriate ranking, we need to quantify the observations of Section 3.3.1.
Recall that 〈siu , tiu〉 ∈ Si refers to the uth suspect location and its excitation cycle in the suspect set of
counter-example Ci. Also, as previously defined, the suffix window of suspect 〈siu , tiu〉 is denoted as
suf(⟨siu, tiu⟩). Finally, fiu denotes the frequency of suspect location siu with respect to counter-example Ci. Recall that the frequency equals the number of cycles for which the input(s) of suspect component siu toggle(s) during its prefix window, denoted as pre(⟨siu, tiu⟩). Let score(⟨siu, tiu⟩) be a scoring function
quantifying the likelihood of 〈siu , tiu〉 being an actual error, defined as follows:
score(⟨siu, tiu⟩) = (Li − ||suf(⟨siu, tiu⟩)||)/Li × (1 − (fiu − γ)/max{fiv : ⟨siv, tiv⟩ ∈ Si})    (3.3)
Based on the probabilistic analysis in the previous Section, the higher score(⟨siu, tiu⟩) is, the more typical of an actual error suspect siu is considered. The first factor (Li − ||suf(⟨siu, tiu⟩)||)/Li in Eq. 3.3 is in the
range of [0…1] and quantifies the expectation that a real error is excited in temporal proximity to the observed mismatch. The shorter the suffix window suf(⟨siu, tiu⟩) is, the higher score(⟨siu, tiu⟩) becomes, as desired. The second factor increases as the frequency fiu of suspect ⟨siu, tiu⟩ decreases, in turn resulting in an increase to the score function, again as desired. Similar to the first factor, the second also
falls within the range of [0 . . .1] for homogeneity. The denominator is set to the maximum frequency
observed for all suspects in the corresponding solution, as a measure of comparison. A relatively small
offset γ is subtracted from the numerator to avoid zeroing out the contribution of suspects that have
maximum frequency for the counter-example.
Based on the scoring function above we can construct a ranking for all suspects. Let rank be
a relation, such that for two distinct suspects 〈siu , tiu〉 and 〈siv , tiv〉, if rank(〈siu , tiu〉) < rank(〈siv , tiv〉),
then 〈siu , tiu〉 is more likely to be the actual error compared to 〈siv , tiv〉, respectively score(〈siu , tiu〉) >
score(〈siv , tiv〉). Given that score(〈siu , tiu〉) has been computed for all suspects 〈siu , tiu〉 ∈ Si, then the rank
of suspect 〈siu , tiu〉 is formally defined as:
rank(⟨siu, tiu⟩) = r, where |{⟨siv, tiv⟩ ∈ Si : score(⟨siv, tiv⟩) > score(⟨siu, tiu⟩)}| = r − 1    (3.4)
It can be easily confirmed that a suspect with the highest score for a given counter-example will be
assigned a rank of 1, and a suspect with the lowest score will be assigned a rank of |Si|, which is the
lowest possible for that particular suspect set. In our implementation, ties between suspect scores are
broken randomly. Based on the above equations, real errors and suspects related to them are more likely
to be placed high in rank, exactly as desired.
[Figure: counter-example C1 with suspects s11, s12 and s13 marked along its trace.
||suf(⟨s11, 5⟩)|| = 5, ||suf(⟨s12, 8⟩)|| = 2, ||suf(⟨s13, 9⟩)|| = 1;
f11 = 2, f12 = 2, f13 = 5;
score(⟨s11, 5⟩) = 0.31, score(⟨s12, 8⟩) = 0.49, score(⟨s13, 9⟩) = 0.018;
rank(⟨s11, 5⟩) = 2, rank(⟨s12, 8⟩) = 1, rank(⟨s13, 9⟩) = 3.]
Figure 3.3: Example of suspect ranking for a counter-example
Example 6 Consider a counter-example C1 of length L1 = 10 (from cycle 1 to cycle 10) that exposes
a conflicting value at output y3 in cycle 10. The output of a SAT-based debugger is a solution set
S1 = {〈s11 ,5〉,〈s12 ,8〉,〈s13 ,9〉}, as shown in Figure 3.3, where the suspect location s12 corresponds to
the actual error location. Suppose that the frequencies of the suspects are f11 = 2, f12 = 2 and f13 = 5.
The suffix length of each suspect is computed directly from the counter-example. In this example, we
have ||suf(⟨s11, 5⟩)|| = 5, ||suf(⟨s12, 8⟩)|| = 2 and ||suf(⟨s13, 9⟩)|| = 1. Based on Eq. 3.3, and by setting
γ = 0.1, the resulting suspect scores are:
score(⟨s11, 5⟩) = 5/10 × (1 − 1.9/5) = 0.31

score(⟨s12, 8⟩) = 8/10 × (1 − 1.9/5) ≈ 0.49

score(⟨s13, 9⟩) = 9/10 × (1 − 4.9/5) = 0.018
Consequently, the ranking scheme yields rank(〈s11 ,5〉) = 2, rank(〈s12 ,8〉) = 1, rank(〈s13 ,9〉) = 3.
In the above example we observe that the actual error location eventually becomes the top-ranked
suspect. Of course, this is not always the case, but, in general, the proposed scoring will push actual
errors higher in the rank, as shown by experimental results.
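The scoring and ranking equations can be exercised directly on the data of Example 6. The short sketch below (the helper names are our own) reproduces the scores and ranks computed above:

```python
def scores(L, suspects, gamma=0.1):
    """Eq. 3.3. suspects: {name: (suffix_len, frequency)}; returns {name: score}."""
    fmax = max(f for _, f in suspects.values())
    return {name: (L - suf) / L * (1 - (f - gamma) / fmax)
            for name, (suf, f) in suspects.items()}

def ranks(score_map):
    """Eq. 3.4: rank 1 = highest score; ties broken arbitrarily by sort order."""
    ordered = sorted(score_map, key=score_map.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

# Example 6: L1 = 10, suffix lengths 5, 2, 1 and frequencies 2, 2, 5.
S1 = {"s11": (5, 2), "s12": (2, 2), "s13": (1, 5)}
sc = scores(10, S1)
print({k: round(v, 3) for k, v in sc.items()})  # s11: 0.31, s12: 0.496, s13: 0.018
print(ranks(sc))  # -> {'s12': 1, 's11': 2, 's13': 3}
```

As in the example, the actual error location s12 surfaces as the top-ranked suspect.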
3.3.3 Counter-example Proximity
The cornerstone of triage is a well-defined metric to express a relation between any two given counter-
examples. In order to develop such a metric we exploit information from the suspect ranking scheme
along with the number of suspects that two counter-examples share in common.
As defined in Chapter 2, the set of mutual suspects between two counter-examples Ci and Cj is
denoted as Mij. Intuitively, when Mij is large relative to the number of total suspects in both solutions,
Ci and Cj are considered to be strongly related, and thus possibly originate from the same error
source. However, if mutual suspects are low in ranking in both or at least one of the solutions, this
correlation becomes weaker. For example, if a mutual suspect is high-ranked in Si but low-ranked
in Sj, then it is more likely to be a real error for counter-example Ci and not for Cj, even if it can fix
both; counter-example Cj is expected to be caused by a suspect higher in rank. We combine the above
expectations into a speculative metric called counter-example proximity between any pair Ci and Cj,
which is denoted as prox(Ci,Cj) and defined as:
prox(Ci,Cj) = 1 − |Mij| / (|Si| + |Sj| − |Mij|) × ∏_{{〈siu,tiu〉,〈sjv,tjv〉} ∈ Mij} (1 − |rank(〈siu,tiu〉) − rank(〈sjv,tjv〉)| / max{|Si|, |Sj|})    (3.5)
In the literature, proximity can take various forms and express similarity or dissimilarity between the
objects that are analyzed [10]. In the context of this dissertation, the proximity metric between counter-
examples expresses dissimilarity; thus, a small proximity indicates a strong relation (small dissimi-
larity).
According to Eq. 3.5, when Ci and Cj are strongly related prox(Ci,Cj) tends to 0, whereas a weak
correlation sets prox(Ci,Cj) closer to 1. In the extreme, prox(Ci,Cj) = 0 implies that the counter-examples
are guaranteed to be caused by the same error, which should only happen when the counter-examples
expose identical behavior. At the other extreme, prox(Ci,Cj) = 1 means that the counter-examples are
definitely caused by different errors, and this should only happen when they do not share any suspects.
Observe that the ratio of mutual suspects to total suspects is encoded in the factor |Mij|/(|Si|+|Sj|−|Mij|).
As desired, a large mutual suspect set Mij forces prox(Ci,Cj) closer to 0.
In the case where all suspects in solutions Si and Sj are mutual, |Mij| = |Si| = |Sj|, thus
|Mij|/(|Si|+|Sj|−|Mij|) = 1, and the first factor maximizes its contribution. In this context, the second factor
quantifies the contribution of mutual suspects based on their ranking. Ideally, counter-examples caused
by the same error will exhibit similar behavior. Therefore, their mutual suspects are expected to have
similar ranks in their respective solution sets. Based on Eq. 3.5, proximity increases as the difference
|rank(〈siu,tiu〉) − rank(〈sjv,tjv〉)| in the ranking of mutual suspects increases, which models the above
expectation, since a larger rank discrepancy weakens the relation. Remark that prox(Ci,Ci) = 0 as desired,
since all suspects are mutual (|Mii|/(|Si|+|Si|−|Mii|) = 1) and have the same rank (the rank difference is
always 0). On the other hand, if Ci and Cj share no mutual suspects, then they are definitely unrelated
and caused by different errors, which is successfully captured by Eq. 3.5, since in that case |Mij| = 0 and
thus prox(Ci,Cj) = 1.
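A direct transcription of Eq. 3.5 can make the metric concrete. The sketch below uses the solution sets of Example 7; the rank dictionaries and mutual pairings mirror Figure 3.4, while the function name and data layout are illustrative choices:

```python
# Sketch of counter-example proximity (Eq. 3.5).
# n_i, n_j: |S_i| and |S_j|; mutual: list of suspect pairs in M_ij;
# ranks_i, ranks_j: rank of each suspect within its own solution set.
def proximity(n_i, n_j, mutual, ranks_i, ranks_j):
    if not mutual:
        return 1.0                       # no shared suspects: unrelated
    overlap = len(mutual) / (n_i + n_j - len(mutual))
    product = 1.0
    for s_u, s_v in mutual:
        product *= 1 - abs(ranks_i[s_u] - ranks_j[s_v]) / max(n_i, n_j)
    return 1 - overlap * product

ranks1 = {"s11": 2, "s12": 1, "s13": 3}
ranks2 = {"s21": 1, "s22": 2, "s23": 3}
ranks3 = {"s31": 1, "s32": 2}

p12 = proximity(3, 3, [("s11", "s21"), ("s12", "s22")], ranks1, ranks2)
p13 = proximity(3, 2, [("s13", "s32")], ranks1, ranks3)
p23 = proximity(3, 2, [], ranks2, ranks3)
```

Rounded to two decimals, this gives 0.78, 0.83 and 1, matching the values computed in Example 7.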
Example 7 Consider three counter-examples C1, C2 and C3 that expose three distinct failures at outputs
y3 at cycle 10, y2 at cycle 7 and y3 at cycle 8, respectively. The corresponding solution sets are
S1 = {〈s11,5〉,〈s12,8〉,〈s13,9〉}, S2 = {〈s21,3〉,〈s22,6〉,〈s23,6〉} and S3 = {〈s31,4〉,〈s32,8〉}, as shown in
Figure 3.4, where s12, s22, and s31 correspond to the actual error locations. Note that counter-examples
C1 and C2 expose failures that originate from the same error location. Suppose that the proposed suspect
ranking scheme produces the ranking that is shown in Figure 3.4. In this example, the mutual
suspect sets between each pair of counter-examples are M12 = {(〈s11,5〉,〈s21,3〉),(〈s12,8〉,〈s22,6〉)},
M13 = {(〈s13,9〉,〈s32,8〉)} and M23 = ∅, since C2 and C3 do not share any suspects in common. Based
on Eq. 3.5, the resulting proximity for each pair of counter-examples is:
prox(C1,C2) = 1 − 2/(3+3−2) × (1 − |2−1|/3) × (1 − |1−2|/3) = 1 − 0.22 = 0.78

prox(C1,C3) = 1 − 1/(3+2−1) × (1 − |3−2|/3) = 1 − 0.17 = 0.83

prox(C2,C3) = 1
We observe that the relation between C1 and C2 is estimated to be stronger compared to the relation
between C1 and C3, since the proximity for the first pair is smaller, even though different outputs fail for
the first pair and the same output fails for the second one. Thus, leveraging information from the suspect
sets and the ranking scheme allows us to define similarity more accurately compared to traditional
methodologies. Recall that existing automated triage techniques would erroneously decide that C1 and
C3 are related and, on the contrary, that C1 and C2 are not, because such scripts base the decision
solely on the failing outputs. For the same example, we observe that the computed ranks of the mutual
suspects between C1 and C2 differ. This incurs some uncertainty that does not allow us to be
“too confident” that C1 and C2 are caused by the same error. If, on the other hand, the ranks were
identical, the corresponding proximity would be lower (specifically prox(C1,C2) = 0.5) and our
confidence would be correspondingly stronger.

[Figure 3.4: Counter-example proximity. The figure depicts counter-examples C1, C2 and C3 with their suspects (s11, s12, s13; s21, s22, s23; s31, s32), the computed ranks rank(〈s12,8〉) = 1, rank(〈s11,5〉) = 2, rank(〈s13,9〉) = 3; rank(〈s21,3〉) = 1, rank(〈s22,6〉) = 2, rank(〈s23,6〉) = 3; rank(〈s31,4〉) = 1, rank(〈s32,8〉) = 2, the mutual sets M12 = {(〈s11,5〉,〈s21,3〉),(〈s12,8〉,〈s22,6〉)}, M13 = {(〈s13,9〉,〈s32,8〉)}, M23 = ∅ with |M12| = 2, |M13| = 1, |M23| = 0, and the proximities prox(C1,C2) = 0.78, prox(C1,C3) = 0.83, prox(C2,C3) = 1.]
3.3.4 Error Count Estimation
For a grouping of the generated counter-examples to be meaningful, it is necessary to define the num-
ber of groups expected to be formed. Ideally, this number should equal the number of design errors
responsible for the set of counter-examples C . However, in the vast majority of regression scenarios the
number of co-existing errors is not known a priori. Therefore an initial guess on the number of groups
has to be made that will reflect an acceptable grouping scheme.
For that purpose, we construct a heuristic called error count estimation that leverages information
from the suspect ranking scheme. For each mutual suspect set Mij a subset M^R_ij ⊆ Mij is created, which
is referred to as the reduced set. This set contains only those mutual suspect pairs whose suspects both
have a rank of at most R ≤ min{|Si|, |Sj|} in their respective suspect sets Si and Sj. Formally:

M^R_ij = Mij − {{〈siu,tiu〉,〈sjv,tjv〉} ∈ Mij : (rank(〈siu,tiu〉) > R) ∨ (rank(〈sjv,tjv〉) > R)}    (3.6)
As described in Section 3.3.2, high-ranked suspects, that is, suspects with small rank values, generally have a
stronger relation to actual errors. Intuitively, a large number of high-ranked mutual suspects indicates
that the counter-examples are caused by a small number of errors, and vice versa. Along these lines, after
computing all possible sets Mij and the reduced sets M^R_ij, for each suspect location siu we count all its
appearances in the reduced mutual sets. To do so, for each suspect location siu we construct the set
CNT_siu, which contains all suspects found in reduced mutual sets that correspond to the same suspect
location as siu, including siu itself:

CNT_siu = { sjv : {〈siu,tiu〉,〈sjv,tjv〉} ∈ M^R_ij }    (3.7)
Effectively, |CNT_siu| provides the number of times that suspect location siu appears in the reduced
mutual sets. To extrapolate over all suspects, let CNT be a set that contains all computed CNT_siu sets,
without including duplicate sets:

CNT = ⋃_{i=1}^{|C|} ⋃_{u=1}^{|Si|} CNT_siu    (3.8)
Note that |CNT| corresponds to the total number of distinct high-ranked suspect locations that are
returned among all debugging sessions for the set of generated counter-examples. Now, the average
number of times such high-ranked suspects participate in a mutual set, and hence in a solution set,
estimates how many counter-examples we expect those suspects to be responsible for. This number is
denoted as CNTavg, and given as:

CNTavg = ( ∑_{CNT_siu ∈ CNT} |CNT_siu| ) / |CNT|    (3.9)
Then, our error count estimation, denoted as e, is given by:

e = ⌈ |C| / CNTavg ⌉    (3.10)
Eq. 3.10 essentially says that the expected number of co-existing errors responsible for all counter-
examples is calculated by dividing the number of counter-examples |C| by the average number of
counter-examples we expect each high-ranked suspect to be responsible for. Observe that if no mutual
suspects of high rank exist, then |CNT_siu| = |{siu}| = 1 for all high-ranked suspects, and CNTavg = 1
according to Eq. 3.9. Then the error count estimation is e = |C|, acceptably predicting that each counter-
example is most likely caused by a unique error, and thus the number of errors equals the number of
counter-examples. On the other hand, the existence of high-ranked suspects shared among various solutions
incurs a decrease in e, since CNTavg increases. Eq. 3.10 is a loose approximation of the number of
co-existing errors, but in practice it provides a good estimate of the number of groups to be formed, as
demonstrated by experimental results.
Example 8 Consider the previously presented counter-examples C1, C2, and C3, as shown in Figure 3.4.
Suppose that we decide to select only those mutual suspects that have rank 1 or 2 for each counter-
example. Then the reduced mutual suspect sets are:

M^2_11 = {{〈s11,5〉,〈s11,5〉},{〈s12,8〉,〈s12,8〉},{〈s13,9〉,〈s13,9〉}},
M^2_12 = {{〈s11,5〉,〈s21,3〉},{〈s12,8〉,〈s22,6〉}},
M^2_22 = {{〈s21,3〉,〈s21,3〉},{〈s22,6〉,〈s22,6〉},{〈s23,6〉,〈s23,6〉}},
M^2_33 = {{〈s31,4〉,〈s31,4〉},{〈s32,8〉,〈s32,8〉}},
M^2_13 = {{〈s13,9〉,〈s32,8〉}} and
M^2_23 = ∅.
Notice that for the error count estimation we also need the set M^R_ii for each counter-example Ci.
Next, we compute the number of times each suspect location appears in each of these sets, which gives
CNT_s12 = CNT_s22 = {s12, s22}, CNT_s11 = CNT_s21 = {s11, s21}, CNT_s31 = {s31}, and CNT_s32 = {s32}.
Then we compute the set that contains all the above counting sets, CNT = {{s12,s22},{s11,s21},{s31},{s32}}.
Finally, the average number of times these suspects appear in the reduced mutual suspect sets is computed
as follows:
CNTavg = (|{s12,s22}| + |{s11,s21}| + |{s31}| + |{s32}|) / |CNT| = (2+2+1+1)/4 = 3/2

Finally, the error count estimation e is calculated by Eq. 3.10:

e = ⌈ |C| / CNTavg ⌉ = ⌈ 3/(3/2) ⌉ = 2
For the previous example, the error count estimation successfully predicts that there are two actual
errors responsible for the three generated counter-examples. As expected, the existence of high-ranked
mutual suspects supports a prediction that indicates a relatively small number of actual errors. On the
other hand, the existence of high-ranked suspects that are not shared among other counter-examples
prohibits the above estimation from being too low on the number of predicted errors. Finally, observe
that if only the top-ranked suspects (rank 1) were selected for the estimate, useful information
would be discarded, resulting in a “bad guess” (e = 3), which wrongly predicts that every counter-
example is caused by a distinct error.
On the contrary, if all suspects were selected for the computations in the above equations, then
noisy suspects that do not relate to actual errors would have an unpredictable contribution towards the
method’s estimate. If most of them appear as common suspects then the estimation would be even lower,
whereas if most of them solely appear in their respective counter-example then the estimate would be
higher. The above scenario cannot be demonstrated in the previous example due to its simplicity, since
there are only a few noisy suspects; the error count estimation would still be e = 2. Obviously,
selecting extreme values for the suspect rank would either discard useful knowledge or include noisy
information. As such, it is generally more reasonable to select ranks that fall between the extreme
values.
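The chain of Eqs. 3.7–3.10 can be sketched compactly. The data below reproduces Example 8 under a strict application of Eq. 3.6 with R = 2, so every pair involving a rank-3 suspect (such as the s13/s32 pair of M13) is dropped; the function name, set names and data layout are illustrative:

```python
from math import ceil

# Sketch of the error count estimation (Eqs. 3.7-3.10).
# reduced[(i, j)]: pairs of suspect locations in the reduced set M^R_ij
# (including the self sets M^R_ii).
def error_count(num_cex, reduced):
    cnt = {}                             # CNT_s for each suspect location s
    for pairs in reduced.values():
        for s_u, s_v in pairs:
            cnt.setdefault(s_u, set()).update({s_u, s_v})
            cnt.setdefault(s_v, set()).update({s_u, s_v})
    distinct = {frozenset(s) for s in cnt.values()}  # CNT without duplicates
    cnt_avg = sum(len(s) for s in distinct) / len(distinct)  # Eq. 3.9
    return ceil(num_cex / cnt_avg)                           # Eq. 3.10

reduced = {
    (1, 1): [("s11", "s11"), ("s12", "s12")],
    (2, 2): [("s21", "s21"), ("s22", "s22")],
    (3, 3): [("s31", "s31"), ("s32", "s32")],
    (1, 2): [("s11", "s21"), ("s12", "s22")],
    (1, 3): [],              # the s13/s32 pair is dropped (s13 has rank 3)
    (2, 3): [],
}
e = error_count(3, reduced)
```

With CNT = {{s11,s21},{s12,s22},{s31},{s32}} this yields CNTavg = 3/2 and e = 2, agreeing with Example 8.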
The error count estimation is mainly based on how various suspects are shared among counter-
examples. As a result, the lower bound on the method’s estimate is implicit and might be affected by
noise as described previously. Nonetheless, the triage engine should ideally enforce a strict lower bound.
This is indeed possible since there are counter-examples that we know to be definitely
unrelated: those that share no common suspects. The number of counter-examples that are guaranteed
to be unrelated essentially equals the minimum number of distinct errors that should be considered by
the triage engine. For example, if three counter-examples are definitely unrelated, then we know that
there exist at least three distinct errors responsible for all the observed failures. Their number might be
higher, and the proposed method will attempt to approximate this number, but it is definitely no less,
and the method should never predict a smaller number. It is possible to explicitly enforce a minimum
error count estimation, so that, in the worst case, the proposed estimation does not fall below this lower
bound. The lower bound on error count estimation, denoted as emin, is given by:
e ≥ emin = |{ Ci : ∑_{j=1, j≠i}^{|C|} |Mij| = 0 }|    (3.11)
In essence, Eq. 3.11 says that emin is equal to the number of counter-examples that share no mutual
suspects with any other counter-example.
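Eq. 3.11 reduces to counting counter-examples whose mutual sets with every other counter-example are empty. A minimal sketch, using illustrative |Mij| tables:

```python
# Sketch of the lower bound e_min (Eq. 3.11).
# sizes[(i, j)] holds |M_ij| for i != j; missing entries default to 0.
def e_min(sizes, num_cex):
    return sum(
        1 for i in range(num_cex)
        if all(sizes.get((i, j), 0) == 0 and sizes.get((j, i), 0) == 0
               for j in range(num_cex) if j != i))

# Example 7 data: |M12| = 2, |M13| = 1, |M23| = 0 -> nobody is isolated
bound = e_min({(0, 1): 2, (0, 2): 1, (1, 2): 0}, 3)
# If only C1 and C2 shared suspects, C3 would be isolated
isolated = e_min({(0, 1): 2}, 3)
```

The first call returns 0 (every counter-example shares suspects with some other one); the second returns 1.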
3.3.5 Counter-example Clustering
The information embedded in the metrics described above is applied in the last step of the proposed
triage framework, which is the formation of groups of similar counter-examples. For that purpose, we
formulate triage as a clustering problem and employ a hierarchical clustering algorithm [10, 40] to solve
it.
Hierarchical clustering aims to group together elements based on their relationship, which is quan-
tified by a metric called distance [10, 40]. A distance metric takes positive real values, and is assigned
per pair of elements. A small distance between a pair of elements indicates a strong relationship, and
vice versa. In our framework, the elements to be grouped are essentially counter-examples. Since
counter-example proximity expresses relationship in the same manner as the required distance metric, it
is natural to generate the clustering distance metric from the computed counter-example proximity.
In the context of our work, hierarchical clustering takes as input the set of all counter-examples C,
and the proximity between all pairs of counter-examples in the form of a |C|×|C| matrix, referred to as
the proximity matrix P|C|,|C|:
P|C|,|C| = [ prox(C1,C1)     prox(C1,C2)     ···   prox(C1,C|C|)
            prox(C2,C1)     prox(C2,C2)     ···   prox(C2,C|C|)
                 ⋮               ⋮           ⋱         ⋮
            prox(C|C|,C1)   prox(C|C|,C2)   ···   prox(C|C|,C|C|) ]    (3.12)
Each row i of the proximity matrix P|C|,|C| contains exactly the information that correlates counter-
example Ci with the rest of the generated counter-examples. As such, we can define the proximity vector
of counter-example Ci as ~Ci = [prox(Ci,C1), prox(Ci,C2), ..., prox(Ci,C|C|)] ∈ R^|C|. If we consider ~Ci
as a coefficient vector, then we can map counter-example Ci to a |C|-dimensional Euclidean space.
Then the desired clustering distance metric can be defined as the Euclidean distance between each pair
of counter-examples in that space. We denote this distance as dij. Formally:

dij = ||~Ci − ~Cj|| = ( ∑_{k=1}^{|C|} |prox(Ci,Ck) − prox(Cj,Ck)|² )^{1/2}    (3.13)
The corresponding distance matrix, denoted as D|C|,|C|, is given below:

D|C|,|C| = [ d11     d12     ···   d1|C|
            d21     d22     ···   d2|C|
             ⋮       ⋮       ⋱      ⋮
            d|C|1   d|C|2   ···   d|C||C| ]    (3.14)
The reason for using the distance matrix D|C|,|C| to express counter-example relations, rather than
the proximity matrix P|C|,|C|, is that counter-example proximity does not necessarily respect the
properties of a Euclidean metric. As we will see in the next Section, mapping counter-examples as data
points in a high-dimensional Euclidean space allows us to use more flexible metrics for cluster merging,
which is not possible when the proximity matrix is used in the form presented here. During the
development of the proposed triage engine, we observed that for well-separable data sets the
effect of transitioning from the proximity matrix to the distance matrix is negligible.
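The transformation from P to D takes only a few lines. The matrix below is the P3,3 of Example 9; the function name is an illustrative choice:

```python
# Sketch: treat each row of the proximity matrix as the vector ~Ci and
# compute pairwise Euclidean distances (Eq. 3.13).
def distance_matrix(P):
    n = len(P)
    return [[sum((P[i][k] - P[j][k]) ** 2 for k in range(n)) ** 0.5
             for j in range(n)]
            for i in range(n)]

P = [[0.0, 0.78, 0.83],
     [0.78, 0.0, 1.0],
     [0.83, 1.0, 0.0]]

D = distance_matrix(P)
# d12 ~ 1.116, d13 ~ 1.194, d23 ~ 1.415, as in the matrix of Eq. 3.18
```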
The output of hierarchical clustering is not a single partition of the counter-example set C . In the
general case, the algorithm produces a nested sequence of partitions, with a single, all-inclusive cluster
at the top and singleton clusters of individual data points at the bottom. Each intermediate iteration
can be viewed as combining (splitting) two clusters from the next lower (next higher) iteration. In the
proposed implementation, clusters of counter-examples are formed in a bottom-up fashion (agglomera-
tive) by merging clusters that are likely to contain similar counter-examples (that have small Euclidean
distance). The process stops when emin clusters are formed, since any partition with fewer than emin
clusters would imply the existence of fewer errors than the ones guaranteed to exist, and thus should
be discarded. Of course, in the case where emin = 1, an all-inclusive cluster that contains all counter-
examples is produced. At each iteration of the algorithm we merge only two clusters, and therefore the
total number of formed clusters always decreases by one.
The algorithm generates all possible partitions of the counter-example set C that involve emin clusters
or more. However, the error count estimation e, presented in this Section, suggests the most reasonable
partition based on our pre-processing and the assumptions made. This is why the algorithm
does not stop when the number of currently formed clusters equals the error count estimation, but
proceeds until the number of clusters is emin. We wish to have all partitions available at the end for
examination, in case the suggested partition does not satisfy the verification engineer(s).
So far it is clear that the proximity metric defines the distance between counter-examples.
However, all clustering algorithms, including hierarchical clustering, also require a definition of
the distance (relation) between two clusters. The decision to merge two clusters is determined
by a linkage criterion. Various linkage criteria exist that confer different behavior to the algorithm
and usually produce different partitions of the data set. The most popular are the Single-Linkage,
Complete-Linkage and Group Average criteria, and Ward's Method [10]. The first three criteria do not re-
quire that the clustered objects reside in a Euclidean space and can be applied to any proximity matrix
as long as it expresses some similarity/dissimilarity between objects. Ward's Method, on the other hand,
requires that the objects are represented by a feature vector [40], which is the case in the proposed formu-
lation. As such, the proposed distance matrix and the vector-based representation of counter-examples
allow the application of a wider range of linkage criteria. Nevertheless, all the aforementioned criteria
make greedy choices when clusters are merged, which is why hierarchical clustering
is a greedy algorithm rather than an optimization of a given cost function.
In this work, we use Ward’s Method [40], where at each step we merge the pair of clusters that leads
to minimum increase in total within-cluster variance after merging. More precisely, Ward’s Method says
that the distance between two clusters A and B, denoted as ∆AB, is the amount of increase in variance of
the data points that belong to these clusters, if we decide to merge them into a larger one. Formally:
∆AB = ∑_{Ci∈A∪B} ||~Ci − ~m_{A∪B}||² − ∑_{Ci∈A} ||~Ci − ~mA||² − ∑_{Ci∈B} ||~Ci − ~mB||²
    = (nA × nB)/(nA + nB) × ||~mA − ~mB||²    (3.15)
where ~mk is the center of cluster k, and nk is the number of counter-examples in it.
With hierarchical clustering, the sum of squares starts out at zero, because every point is in its own
singleton cluster, and then grows as we merge clusters. Ward’s method keeps this growth as small as
possible. This property tends to create compact clusters, which proved to perform well, as shown by
experiments in the next Section.
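Eq. 3.15 states an identity: the increase in total within-cluster sum of squares caused by a merge equals nA·nB/(nA+nB) times the squared distance between the cluster centers. The sketch below checks this numerically on arbitrary 2-D points (the points themselves are illustrative):

```python
# Numerical check of Ward's identity (Eq. 3.15) on toy 2-D points.
def sse(points):
    # Sum of squared distances of the points to their centroid
    n, dim = len(points), len(points[0])
    m = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(sum((p[d] - m[d]) ** 2 for d in range(dim)) for p in points)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 3.0), (6.0, 3.0), (5.0, 4.0)]

increase = sse(A + B) - sse(A) - sse(B)   # left-hand side of Eq. 3.15

mA = (0.5, 0.0)                           # centroid of A
mB = (5.0, 10.0 / 3.0)                    # centroid of B
shortcut = len(A) * len(B) / (len(A) + len(B)) * (
    (mA[0] - mB[0]) ** 2 + (mA[1] - mB[1]) ** 2)  # right-hand side
# increase and shortcut agree up to floating-point error
```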
Although the linkage criterion presented here tries to merge clusters of related counter-examples,
the counter-example proximity is derived in a way that lets us know with full certainty that
some counter-examples should never belong to the same cluster. Ward's Method, and in fact any other
linkage criterion, does not have such information available. The distance/proximity between definitely
unrelated counter-examples is the maximum that can be generated, but this alone cannot guarantee that
a cluster spanning these counter-examples will never be formed. One possible solution is to explicitly force the
distance between such counter-examples to a relatively high number, so that it causes a massive
increase in variance when the respective clusters attempt to merge. However, there is no straightforward
way of performing such a modification without violating the relative distances between these counter-
examples and the ones that are indeed related to them. Instead, it is much simpler to modify the linkage
criterion so that it prohibits the merging of clusters that contain definitely unrelated counter-examples,
making no changes to the distances themselves. More specifically, we modify Ward's
Method such that the distance between clusters that contain definitely unrelated counter-examples is set
to infinity. We denote the modified cluster distance as ∆′AB:
∆′AB = { +∞     if ∃ Ci ∈ A, Cj ∈ B : |Mij| = 0
       { ∆AB    otherwise                           (3.16)

[Figure 3.5: Counter-example hierarchical clustering. Counter-examples C1, C2 and C3 are depicted as data points on a plane; the 1st iteration starts from singleton clusters, the 2nd iteration merges C1 and C2 while C3 remains a singleton, and the 3rd iteration forms the all-inclusive cluster.]
This way, the merging of two clusters that contain counter-examples guaranteed to be caused by
different errors is always blocked, since the distance between these clusters will never be minimum.
Also note that the above issue cannot be merely resolved by forcing a lower bound of emin clusters,
because the lower bound emin constrains the minimum number of clusters to be formed, and not the way
clusters are merged.
To summarize, the hierarchical clustering algorithm is performed in three major steps:
1. Place each counter-example into its own singleton cluster
2. Iteratively merge the two closest clusters
3. Stop when all counter-examples are merged into emin clusters
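The three steps above, together with the blocked merges of Eq. 3.16, can be sketched as follows. The proximity vectors are the rows of the P3,3 of Example 9, and passing the definitely unrelated pair (C2, C3) demonstrates the constraint; names and data layout are illustrative:

```python
# Sketch of constrained agglomerative clustering with Ward's criterion.
# vectors: one proximity vector per counter-example (rows of P);
# unrelated: pairs (i, j) with |M_ij| = 0, whose clusters must never merge.
def cluster(vectors, unrelated, e_min):
    clusters = [[i] for i in range(len(vectors))]
    history = [sorted(map(tuple, clusters))]

    def centroid(c):
        return [sum(vectors[i][d] for i in c) / len(c)
                for d in range(len(vectors[0]))]

    def ward(a, b):
        # Eq. 3.16: infinite distance for definitely unrelated members
        if any((i, j) in unrelated or (j, i) in unrelated
               for i in a for j in b):
            return float("inf")
        ma, mb = centroid(a), centroid(b)
        gap = sum((x - y) ** 2 for x, y in zip(ma, mb))
        return len(a) * len(b) / (len(a) + len(b)) * gap  # Eq. 3.15

    while len(clusters) > e_min:
        cost, x, y = min((ward(a, b), i, j)
                         for i, a in enumerate(clusters)
                         for j, b in enumerate(clusters) if i < j)
        if cost == float("inf"):
            break                        # only blocked merges remain
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (x, y)] + [merged]
        history.append(sorted(map(tuple, clusters)))
    return history

P = [[0.0, 0.78, 0.83], [0.78, 0.0, 1.0], [0.83, 1.0, 0.0]]
partitions = cluster(P, unrelated={(1, 2)}, e_min=1)
# First merge joins C1 and C2; the merge with C3 is blocked by Eq. 3.16
```

With an empty unrelated set the run would instead proceed to the all-inclusive cluster, as in Example 9.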
Example 9 Consider again the same three counter-examples for which the proximity and the error count
estimation were previously computed in Examples 7 and 8, respectively. The proximity matrix for counter-
examples C1, C2 and C3 is:
P3,3 = [ 0      0.78   0.83
         0.78   0      1
         0.83   1      0    ]    (3.17)
The corresponding Euclidean distance matrix is computed as:
D3,3 = [ 0       1.116   1.194
         1.116   0       1.415
         1.194   1.415   0     ]    (3.18)
Note that the relation between the counter-examples is preserved after this transformation. For
illustration purposes only, assume that the data points corresponding to the three counter-examples
are mapped onto a 2-D Euclidean plane as shown in Figure 3.5. The hierarchical clustering algorithm
initially considers each counter-example as a singleton cluster and in two iterations produces the final
all-inclusive cluster. At the second iteration, counter-examples C1 and C2 are merged into a single cluster
and C3 remains a singleton. Recall that the error count estimation for this example was computed as
e = 2, and thus the above partition is the suggested one. Notice that these counter-examples are now
correctly grouped, as opposed to the unfortunate outcome of conventional triage techniques for the same
counter-examples, shown at the beginning of this Chapter.
3.3.6 Overall Flow
The overall flow that combines the steps described in this Chapter is illustrated in Figure 3.6. The input
to the flow is a set of counter-examples generated by regression verification. The debugger is invoked
and provides a solution set for each counter-example. Based on Eq. 3.3, a ranked version of the suspects
is constructed. The ranking scheme is subsequently utilized for the computation of counter-example
proximity and the error count estimation based on Eq. 3.5 and Eq. 3.10, respectively. These metrics are
then passed to the clustering algorithm, which forms all possible clusters of related counter-examples. The
output of the triage engine is the unique partitioning that comprises the e related clusters suggested by
the error count estimation. The grouping is then examined by engineers, along with the already computed
suspect ranking scheme. Note that the triage process is initially executed with the error count
estimation, but it is up to the engineer to accept the formation of e groups or to examine an alternative
number of groups already generated by the hierarchical clustering algorithm.
[Figure 3.6: Proposed triage framework. The counter-examples {C1,C2,...,CN} are passed to the automated debugger, which produces the solution sets {S1,S2,...,SN}. Suspect ranking feeds the counter-example proximity and the error count estimation (e), which drive counter-example clustering into groups g1, g2, ..., ge. If the clustering is satisfying, detailed root cause analysis follows; otherwise an alternative clustering is accepted.]
It should be clarified that the debugging step preceding the clustering process requires manual effort
from the engineer(s). However, this does not add any overhead to the whole debugging flow. Rather, this
manual task is simply moved earlier in the flow in order to generate the necessary metrics for clustering.
As such, it will not have to be repeated once the groups of counter-examples are passed to the appropriate
engineer(s).
The complexity of the overall flow is dominated by the complexity of hierarchical clustering and is
upper-bounded by O(|C |3), where |C | is the number of counter-examples to be grouped. Since the size
of the counter-example set is not expected to be in the thousands in the majority of regression cases, the
triage engine is expected to scale well within typical regression verification flows.
3.4 Experimental Results
This Section presents preliminary experimental results for the proposed triage framework. All experiments
are conducted on a single core of an Intel Core i5 3.1 GHz workstation with 8 GB of RAM. Four
OpenCores [29] designs are used for the evaluation (vga, fpu, spi and mem_ctrl). The underlying
automated debugging tool used for extracting the suspect locations is implemented based on [35]. A
platform coded in Python is developed to parse the returned results of the debugger, calculate the rele-
vant metrics and perform hierarchical clustering on the resulting counter-examples. For each design, a
set of different “typical” errors is injected each time by modifying the RTL description. In total, sixteen
regression simulations are run, generating a different number of counter-examples each time, caused by
a different set of errors.
For each design, a pre-generated set of test sequences is used that is stored in vector files. Each
regression run involves hundreds to thousands of input vectors. For the purpose of capturing failures
caused by the injected design errors we use end-to-end checkers that compare the expected value for
various operations, exception checkers and various assertions throughout the designs. It should be
noted that the injected RTL errors are not generated randomly, as we observed that the majority of
randomly introduced bugs are either not captured by the pre-defined test suites or create trivial cases for
counter-example clustering (i.e., counter-examples that are “shifted-in-time” versions of other counter-
examples). Instead, we meticulously inject errors that resemble typical human-introduced ones and that
are captured by the pre-defined test suites.
"14540: ERROR: output mismatch. Expected f292e945, Got
f309efe9 (3ff759808cd7826af292e945) in vector: 4"
"27540: ERROR: output mismatch. Expected f00007b2, Got
efcda8a0 (cd7fa2441cff92e8f00007b2) in vector: 17"
"33540: ERROR: output mismatch. Expected 795a1f75, Got
79804398 (7b9e426741b9bdbf795a1f75) in vector: 23"
"34540: ERROR: output mismatch. Expected 35804398, Got
35dae339 (b3e7a98fbde72f7a35804398) in vector: 24"
"** Error: Assertion error.
Time: 1150 ns Started: 950 ns Scope:
test.dut.chk_fpu.a_div File: ../sva/fpu.sv Line: 233"
"ERROR: Underflow Exception Expected: 0, Got 1
45540: ERROR: output mismatch. Expected 00000000, Got
00000000 (8a314ad1997a7e9b00000000) in vector: 35"
"24540: ERROR: output mismatch. Expected ceac709c, Got
cf2c709c (4ef3129a4f4fc19bceac709c) in vector: 14"
"49540: ERROR: output mismatch. Expected 45aad895, Got
462ad895 (c17e453045ab57b845aad895) in vector: 39"
"** Error: Assertion error.
Time: 1350 ns Started: 1250 ns Scope:
test.dut.u1.chk_pre_norm.a_check_pos_sign File:
../sva/pre_norm.sv Line: 70"
"** Error: Assertion error.
Time: 2650 ns Started: 2550 ns Scope:
test.dut.u1.chk_pre_norm.a_check_neg_sign File:
../sva/pre_norm.sv Line: 75"
"43540: ERROR: output mismatch. Expected 6fcfb17a, Got
efcfb179 (6fcfb17a1bb73bd36fcfb17a) in vector: 33"
"48540: ERROR: output mismatch. Expected aebaa9dd, Got
2ebaa9dd (aebaa9de996ed347aebaa9dd) in vector: 38"
"ERROR: DIV_BY_ZERO Exception: Expected: 1, Got 0
28540: ERROR: output mismatch. Expected 00000000, Got 00000000 (92bf785f9b6e56a400000000) in vector: 18"
another. If the similarity score is high, the two failures should be grouped together; if it is low, they
should be separated. Other information can serve as additional metrics, either to bias the weights when
comparing separate error paths or as tie-breakers when two failures are borderline similar. For example,
recently changed code could act as a simple filter to disregard, or change the weight of, the different
components in the path. Another example is using different operation modes as a tie-breaking score for
borderline similar failures. These metrics are much more dependent on the environment and can be
tuned for optimal use.
Once a similarity score is generated from two signatures, many clustering algorithms can be applied
to group them. These algorithms typically involve a threshold parameter that decides how easily two
similar failures can be grouped together. This threshold value need not be static, as it can change based
on the environment information as well. In fact, it may be best to experiment with many such variables
before settling on an appropriate set of thresholds and metrics.
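As a concrete illustration of such a thresholded grouping (the scores, threshold, and greedy binning function are entirely hypothetical and are not the commercial tool's algorithm):

```python
# Hypothetical sketch: greedy threshold-based binning of failures.
# similarity[(i, j)]: signature similarity in [0, 1] for failure pair i < j.
def bin_failures(similarity, num_failures, threshold):
    bins = []
    for f in range(num_failures):
        for b in bins:
            # Join a bin only if f is similar enough to all its members
            if all(similarity.get((min(f, g), max(f, g)), 0.0) > threshold
                   for g in b):
                b.append(f)
                break
        else:
            bins.append([f])            # no bin accepted f: open a new one
    return bins

scores = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.1}
bins = bin_failures(scores, 3, threshold=0.5)  # failures 0 and 1 pair up
```

Raising or lowering the threshold directly controls how eagerly failures are merged, which is the tuning knob described above.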
Finally, when the bins are created, the engineer best suited to fix each problem must be identified.
The source control database can tag engineers based on the owner of the most common modules/files
or the author of the last change committed for each bin.
4. CASE STUDY
The failure triage infrastructure described in this paper has been developed and is available for
commercial use. It has been applied industrially to commercial designs in communication applications.
Due to the confidential nature of these designs, we cannot disclose that level of data in this paper.
However, to provide a detailed level of information and illustrate the effectiveness of the triage
infrastructure, we have collaborated with graduate students from the University of Toronto, where the
triage tool was applied to two sample designs. To create a realistic verification environment, dozens of
“typical” bugs were created in the RTL and testbench components of the designs. A sample of the bugs
is shown in the remainder of the paper.
In this section we describe the two case studies in detail. For each, we present an overview of the
design, provide a sample of the failures found during simulation, and show the results of the triage
engine.
4.1 Design 1: FPU module
The FPU design used in the case study is a single-precision
IEEE 754 compliant Floating Point Unit (FPU) from Opencores [5]
with some modifications. The design is written in Verilog and is
composed of eight modules totalling 1,415 lines of code. It can
perform six operations and supports four rounding modes. The
architecture consists of a floating point exception unit, a
floating point pre-normalization unit that adjusts the numbers to
equal exponents, primitive operation modules, and a floating point
post-normalization unit that denormalizes and rounds the result.
The test suite contains a test set for each FPU operation with
different rounding modes. The test sequences are pre-generated and
stored in vector files. Depending on the operation and rounding
mode, the corresponding test sequence is loaded into memory,
along with the expected values. There is an end-to-end checker that
compares the expected value for each operation, as well as exception
checkers and some assertions throughout the design.
Simulating the design with the different test sequences
produces many failures. Due to space constraints we only show
some of the firings below.
Notice that some errors are due to assertion failures and
others are due to golden value mismatches and exception
catching. We run the OnPoint root cause analysis engine and use
the result to generate the signatures described. The clustering
algorithm groups all the failures into five bins. After performing
further manual debugging on each bin, the root cause of each error
is identified. In these experiments, the grouping is performed
correctly as the same bugs are grouped together and distinct bugs
are grouped separately. Note that if binning were done purely
based on the failure messages, there would have been four bins
where at least two bug sources would have gone unidentified, and
another bin would have presented a duplicate error.
Next, each of the bins containing the root causes is briefly
described. The first bin groups four checker failures and one
assertion failure together, which is typically hard to do manually.
Bin 1: 4 checkers, 1 assertion
Bug Location: primitives.v : 90 & 98
-> // Bug: missing one clock delay
-> always @(posedge clk)
-> quo <= #1 opa / opb;
-> always @(posedge clk)
-> rem <= #1 opa % opb;
-> // Fix:
-> always @(posedge clk) begin
-> quo1 <= #1 opa / opb;
-> quo <= #1 quo1;
-> end
-> always @(posedge clk) begin
-> rem1 <= #1 opa % opb;
-> rem <= #1 rem1;
-> end
(a) Missing pipeline stage
"# ** Error: Assertion error.
# Time: 603 ns Started: 603 ns Scope:
test.dut.chk_top_i1.assertion_a_fifo_rreq File:
../sva/vga_top.sv Line: 92"
"# ** Error: Assertion error.
# Time: 609 ns Started: 603 ns Scope:
test.dut.wbm.clut_sw_fifo.chk_fifo_i1._a_read_pointer File:
../sva/vga_fifo.sv Line: 32"
"# ** Error: Assertion error.
# Time: 609 ns Started: 603 ns Scope:
test.dut.wbm.data_fifo.chk_fifo_i1._a_word_down_counter File:
../sva/vga_fifo.sv Line: 72"
"# ** Error: Assertion error.
# Time: 897 ns Started: 897 ns Scope:
test.dut.sigMap_i1.assertion_wbs_dat_o File:
../sva/VennsaChecker.sv Line: 45
# 897.0 ns: expected aaaaaaaa, got ffffff9f"
"# ** Error: Assertion error.
# Time: 633 ns Started: 627 ns Scope:
test.dut.wbm.data_fifo.chk_fifo_i1._a_read_pointer File:
../sva/vga_fifo.sv Line: 32"
"# ** Error: Assertion error.
# Time: 651 ns Started: 645 ns Scope:
test.dut.pixel_generator.rgb_fifo.chk_fifo_i1._a_read_pointer
File: ../sva/vga_fifo.sv Line: 32"
"# ** Error: Assertion error.
# Time: 831 ns Started: 831 ns Scope:
test.dut.sigMap_i1.assertion_wbs_dat_o File:
../sva/VennsaChecker.sv Line: 45
# 831.0 ns: expected ffffff9f, got ffffffXf"
"# At time 273.0 ns: ERROR in wishbone:
golden=0000000000000000, actual=0000000100000000"
"# ** Error: Assertion error.
# Time: 273 ns Started: 273 ns Scope:
test.dut.sigMap_i1.assertion_wbs_dat_o File:
../sva/VennsaChecker.sv Line: 45
# 273.0 ns: expected 00000000, got 00000001"
"# At time 111.0 ns: ERROR in sync: golden=0,
actual=1"
The correct and buggy RTL are shown above. In this case the bug
is a missing pipeline stage.
The second bin contains a single exception error. The bug in
this case resides in the Verilog testbench where bad stimulus is
generated. Interestingly this failure is distinguished from the
others and is binned on its own.
The third bin groups two checker failures. In this case a
basic grouping algorithm would have produced a similar result.
This bug is in the RTL and is due to setting the top-most bit to
zero instead of one.
Bin four groups two assertion failures and two checker
failures, which is typically hard to identify manually. The bug is
in the RTL and is a result of incorrectly decoded signals inside a
case statement.
Bin five captures a single exception on its own. In this case,
root cause analysis finds that the bug is in the expected
model used for exception handling. This strengthens the notion
that the triage approach is also valid for bugs outside of the DUT.
4.2 Design 2: VGA controller
The VGA controller is from Opencores [5] with some
modification, it is written in Verilog, composed of 17 modules
totalling 4,076 lines of code and approximately 90,000
synthesized gates. The controller provides VGA capabilities for
embedded systems. The architecture consists of a Color
Processing module and a Color Lookup Table (CLUT), a Cursor
Processing module, a Line FIFO that controls the data stream to
the display, a Video Timing Generator, and Wishbone master and
slave interfaces to communicate with all external memory and the
host, respectively.
The operation of the core is as follows. Image data is
fetched automatically via the Wishbone Master interface from the
video memory located outside the primary core. The Color
Processor then decodes the image data and passes it to the Line
FIFO to transmit to the display. The Cursor Processor controls the
location and image of the cursor on the display. The
Video Timing Generator module generates synchronization pulses
and interrupt signals for the host.
The test suite for the VGA core is constructed using UVM.
Four main tests are used for verifying this design. These include
register, timing, pixel data, and FIFO tests. The transactions
consist of randomly generated control-data packet pairs under certain
constraints. These transactions are expected to cover all the VGA
operation modes in the tests (and they may be reused to test other
video cores such as DVI, etc). The sequencer exercises different
combinations of these transactions through a given testing scheme
so that most corner cases and/or mode switching are covered. The
monitors are connected to the DUT and the reference model
Bin 5: 1 exception
Bug Location: test_top.v : 302 - 306
-> // Bug: incorrect reference model
-> if(div_by_zero != exc4[2])
-> begin
-> exc_err=1;
-> $display("\nERROR: DIV_BY_ZERO Exception:
Expected: %h, Got %h\n",exc4[2],div_by_zero);
-> end
Bin 4: 2 assertions, 1 checker, 1 exception
Bug Location: pre_norm.v : 213 - 216
-> always @(signa or signb or add ...
-> ...
-> // Bug: switched assignments
-> 3'b0_0_0: sign_d = 1;
-> 3'b0_1_0: sign_d = !fractb_lt_fracta;
-> 3'b1_0_0: sign_d = fractb_lt_fracta;
-> 3'b1_1_0: sign_d = 0;
-> // Fix:
-> 3'b0_0_0: sign_d = fractb_lt_fracta;
-> 3'b0_1_0: sign_d = 0;
-> 3'b1_0_0: sign_d = 1;
-> 3'b1_1_0: sign_d = !fractb_lt_fracta;
Bin 3: 2 checkers
Bug Location: post_norm.v : 354
-> // Bug: Incorrect padding bit
-> assign {exp_rnd_adj0, fract_out_rnd0} = round ?
fract_out_pl1 : {1'b1, fract_out};
-> // Fix:
-> assign {exp_rnd_adj0, fract_out_rnd0} = round ?
fract_out_pl1 : {1'b0, fract_out};
Bin 2: 1 exception
Bug Location: test_top.v : 322 - 326
-> // Bug: incorrect stimulus
-> ...
-> @(posedge clk);
-> #1;
-> ...
-> oper = tmp[103:96];
-> ...
-> case(oper)
-> 8'b00000001: fpu_op=3'b000; // Add
-> ...
-> 8'b01000000: fpu_op=3'b110; // rem
-> default: fpu_op=3'bx;
-> endcase
-> ...
(b) Bad stimulus generated
respectively. They check the protocols of the responses, and make
sure that the data being sent to scoreboard has correct timing. The
scoreboard and checkers contain all the field checkers which
compare the data from the DUT, and reports the mismatches.
The golden reference model is implemented using C++. It
receives the same set of stimulus from the driver (uvm_driver
class) and produces the expected value of the outputs. Along with
the reference model, 50 SystemVerilog Assertions (SVA) are used
to do some instant checks. While running simulation, SVA can
catch unexpected behaviours of the design and prevent corrupted
data going through the flow.
A sample of the failures that occurred during a suite of
simulation tests is shown. Notice that there are a set of assertions
and correctness checkers that fire. As in the FPU case, OnPoint is
run on the test to generate the suspects which are used as
signatures during the triage process.
If triage were performed based purely on the error message,
the result would be six bins. However, there are only four errors
in this case, thus time would have been wasted analyzing
redundant failures. Furthermore, two of the errors could also have
been missed if only one failure is analyzed within each bin. In
contrast, the proposed triage infrastructure correctly generates
four bins, one for each error. The resulting triage bins and the root
cause of the failures are shown below.
Bin one groups four assertion failures based on three
different assertions, thus eliminating the wasted time of analyzing
each one separately. The single bug source is an incorrect
assignment based on the state of the VGA color processor.
Bin two catches three distinct assertion failures once again.
In this case, the RTL bug is due to picking the wrong bit of a read
pointer inside a FIFO.
Bin three contains both a checker and an assertion failure.
Interestingly, this bug resides in the testbench, where some
stimulus signals are instantiated using the wrong models. As a
result both the checker and an assertion fail. Debugging such cases
would typically involve multiple designers and verification
engineers.
Bin four contains a single checker failure. This failure is
caused by a bug in the testbench reference model.
In all these cases, we have confirmed that the bins generated
by the proposed triage approach correctly group the failures based
on the same root cause. We confirmed the finding by verifying that
fixing the bugs removes all the failures for a given bin. It should
be noted that the proposed triage approach may not always be
correct: if distinct bugs are close in proximity, they may end up in
the same bin.
5. Conclusion
In this work we presented a novel failure triage approach
that is both automated and generates better results than previous
script-based and manual techniques. The triage engine relies on
information from root cause analysis tools that provide visibility
into the propagation paths of the bug. These paths, along with their
activation times, provide unique insight that is used to group
similar failures together. To illustrate the effectiveness of the
approach we provide two small case studies where distinct bugs
are correctly binned separately. Further research in this area will
focus on improving the resolution and quality of the binning
algorithms and generating custom heuristics for testbench and
environment originating bugs.
6. REFERENCES
[1] H. Foster, “Assertion-based verification: Industry myths to
realities (invited tutorial),” in Computer Aided Verification,
2008, pp. 5–10.
[2] S. Huang and K. Cheng, Formal Equivalence Checking and
Design Debugging. Kluwer Academic Publishers, 1998.
[3] A. Smith, A. Veneris, M. F. Ali, and A. Viglas, “Fault
Diagnosis and Logic Debugging Using Boolean
Satisfiability,” IEEE Trans. on CAD, vol. 24, no. 10, pp.
1606–1621, 2005.
[4] Vennsa Technologies Inc.,
http://www.vennsa.com/product_simulation.html
[5] OpenCores, http://www.opencores.org
Bin 4: 1 checker
Bug Location : self_checking.v : 110 - 121
-> // Bug: incorrect "blanc_golden" generated from
erroneous reference model
-> if(^{hsync_golden, vsync_golden, csync_golden,
blanc_golden} !==1'bx)
-> if({hsync_golden, vsync_golden, csync_golden,
blanc_golden} != {hsync, vsync, csync, blanc})
-> begin
-> $display("At time %t: ERROR in sync:
golden=%h, actual=%h", $time,
-> {hsync_golden, vsync_golden,
csync_golden, blanc_golden},
-> {hsync, vsync, csync, blanc}
-> );
-> ->ERROR;
->
-> end
Bin 3: 1 checker, 1 assertion
Bug Location: test_bench_top.v : 684
-> // Bug: incorrect stimulus "wb_err_i" generated from
wb_slv model
-> wb_slv #(24) s0(.clk( clk ),
-> .rst( rst ),
-> .adr( {1'b0, wb_addr_o[30:0]} ),
-> ...
-> .err( wb_err_i ),
-> .rty( )
-> );
Bin 2: 3 assertions (3 different)
Bug Location : vga_fifo.v : 191
-> always @(posedge clk or negedge aclr)
-> ...
-> // Bug: missing use of function
-> else if (frreq) rp <= #1 {rp[aw-1:1], rp};
-> // Fix:
-> else if (frreq) rp <= #1 {rp[aw-1:1], lsb(rp)};
Bin 1: 4 assertions (3 different)
Bug Location: vga_colproc.v : 263
-> always @(c_state or vdat_buffer_empty or colcnt or
DataBuffer or rgb_fifo_full or clut_ack or clut_q or Ba or Ga
or Ra)
-> begin : output_decoder
->
-> // initial values
-> // Bug: incorrect initial value
-> ivdat_buf_rreq = 1'b1;
->
-> // Fix:
-> ivdat_buf_rreq = 1'b0;
(c) Incorrect assignment based on state of vga color processor
(d) Picking the wrong bit of read pointer inside fifo
Figure 3.7: Examples of injected RTL errors
also generate interesting non-trivial scenarios for triage. Figure 3.7 shows some examples of injected
RTL errors for the vga and fpu designs. Note that the error in Figure 3.7(b) resides in the testbench; we
also introduce and group errors in the verification environment and not only in the design that undergoes
debugging.
A first set of experiments is conducted to confirm the claims made based on our probabilistic analysis
in Section 3.3.1.
CHAPTER 3. A TRIAGE ENGINE FOR RTL DESIGN DEBUG 53
(a) Suspect allocation based on suffix across all testcases: histogram of suffix window length in cycles (1, 2, 3, >=4) vs. % over total number, for actual errors and suspects.
(b) Suspect allocation based on frequency across all testcases: histogram of times covered by simulation (1, 2, 3, 4, >=5) vs. % over total number, for actual errors and suspects.
Figure 3.8: Features of real errors and suspects
After regression simulation, 285 counter-examples are collected for all designs, and since the actual
error is known among the returned suspects, we observe their first excitation cycles along with the
frequencies of the corresponding RTL components. Results are illustrated in Fig. 3.8, where we see that
both real errors and suspects generally follow our expectations: the suffix window length is generally
short, and these locations are exercised only a small number of times during their prefix window.
However, actual human-introduced errors tend to follow the above pattern more accurately than
the rest of the suspects; a feature that enables these errors to rank generally high in the suspect
ranking scheme, in compliance with our expectations.
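The two features plotted in Fig. 3.8 can be sketched for a single suspect as follows; the per-suspect excitation trace and cycle numbering are assumptions made for illustration:

```python
# Illustrative sketch of the two features in Fig. 3.8 for one suspect:
# its suffix window length (cycles between first excitation and the
# failure) and how many times it is exercised up to the failure.
# The excitation-cycle trace is an assumed input format.

def suspect_features(excitation_cycles, failure_cycle):
    in_window = [c for c in excitation_cycles if c <= failure_cycle]
    if not in_window:
        return None  # suspect never excited before the failure
    suffix_length = failure_cycle - min(in_window)
    frequency = len(in_window)
    return suffix_length, frequency

# A suspect first excited at cycle 98, exercised three times before
# a failure observed at cycle 100:
print(suspect_features([98, 99, 100], 100))  # (2, 3)
```

Real errors would then show small values for both features, matching the histograms.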
A second set of experimental results is depicted in Figure 3.9, where we explore the effect of the
ranking scheme on the framework’s average accuracy across all sixteen testcases. For the experiments
presented in this Section the γ offset (Eq. 3.3) is set to 0.1. Furthermore, for this set of experiments
we run the triage engine for each regression testcase once with the standard and once with the modified
Ward’s Method criterion (WM and WM-Mod, respectively). The ratio (1 − (# misclassified C_i’s)/|C|)
determines accuracy. A counter-example belonging to the wrong group is considered misclassified. In
Figure 3.9, R = 0 denotes the absence of a ranking scheme (M^R_ij = M_ij always). More precisely,
Figure 3.9(a) demonstrates how much bigger the method’s estimate e is compared to the actual number
of errors. Computations devoid of any suspect ranking lead to the inclusion of 2.8 extra clusters on
average. This translates to a 77% average accuracy for the triage engine, shown in Figure 3.9(b).
Remark that selecting only the top-ranked suspect (R = 1) discards useful knowledge by excluding the
rest of the suspects and results in 3.1 extra clusters on average, decreasing overall accuracy to 74%.
Also observe that, when
(a) Average error on e across all testcases: plot of avg. error on e (0 to 4) vs. R (0 to 10).
(b) Effect of R on triage accuracy across all testcases: plot of avg. accuracy (%) vs. R (0 to 10), for proposed triage WM, proposed triage WM-Mod, and script.
Figure 3.9: Effect of selecting the R highest in rank suspects
R is set too high, low-rank suspects are included in the computations and introduce noise to the error
count estimation. This also incurs a decrease in overall accuracy, shown in Figure 3.9(b). However, any
reasonable selection for R, between 2 and 10, results in better accuracy overall, with the best outcome of
94% achieved when R = 4. Note that in the latter case, e is off only by 0.7 on average. Generally, even
for extreme values of R (1 or 10) the triage engine always outperforms conventional scripting-based
triage, which achieves a 67% average accuracy, shown by the red line in Fig. 3.9(b). Finally, we observe
that by modifying Ward’s Method we achieve higher accuracy overall. With the standard criterion
employed, the triage engine exhibits an average accuracy of 83% across all selections of the parameter R.
On the other hand, with the modified criterion the triage engine achieves an average accuracy of 86%
across all values of R.
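The accuracy figure used throughout these experiments can be sketched as below. Here a counter-example counts as misclassified when its injected error is not the dominant error of its bin, which is one plausible reading of the ratio defined above, not necessarily the exact matching used in the thesis.

```python
# Hypothetical sketch of the accuracy ratio 1 - (#misclassified / |C|).
# A counter-example is treated as misclassified if its injected error
# is not the dominant error of the bin it landed in; the thesis may
# use a different matching between bins and errors.

from collections import Counter

def triage_accuracy(bins, true_error):
    """bins: lists of counter-example ids; true_error: id -> error."""
    total = sum(len(b) for b in bins)
    misclassified = 0
    for b in bins:
        _, dominant_count = Counter(true_error[c] for c in b).most_common(1)[0]
        misclassified += len(b) - dominant_count
    return 1 - misclassified / total

bins = [[0, 1, 2], [3, 4]]  # counter-example 2 landed in the wrong bin
true_error = {0: "e1", 1: "e1", 2: "e2", 3: "e2", 4: "e2"}
print(triage_accuracy(bins, true_error))  # 0.8
```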
Table 3.1 demonstrates detailed results for all sixteen testcases and four designs with R = 4, so that
the top 4 suspects in rank are selected for the error count estimation. This selection is indicative; we
choose this value of R because it reflects good behavior for the algorithm. Again, though, any
reasonable R ∈ [2 . . . 10] generates similar results, as shown in Fig. 3.9. The first, second and third
columns refer to the design name, the testcase number and the size in gates of the design, respectively.
Columns 4 and 5 contain the actual number of errors that are injected into the design and the number of
test vectors that are used for each of the sixteen regression runs. Column 6 indicates the total number of
counter-examples that are generated by each regression run. Column 7 shows the error count estimation
(e) that the proposed method generates for each testcase. Columns 8 to 10 include a comparison in
accuracy between the proposed triage flow and a typical binning strategy based on a script that exploits
Table 3.1: Proposed Triage Engine Performance (R=4)

circuit  | No. | # gates | # errors | # test vectors | |C| | e | triage acc. (e) | triage acc. (# errors) | script acc. | # suspects (avg) | error rank (high-low) | time (sec)
fpu      |  1  |  83303  |    4     |      7550      | 15  | 4 |      100%       |          100%          |     80%     |       12.5       |          1-6          |    12.8
fpu      |  2  |  83303  |    5     |      7550      | 20  | 6 |       83%       |          100%          |     65%     |       14.1       |          2-5          |    13.8
fpu      |  3  |  83303  |    6     |      9802      | 24  | 6 |      100%       |          100%          |     68%     |       12.7       |          1-4          |    16.7
fpu      |  4  |  83303  |    7     |      9802      | 31  | 7 |       95%       |           95%          |     65%     |       14.3       |          1-7          |    19.8
vga      |  5  |  72292  |    3     |     21203      | 14  | 4 |       88%       |          100%          |     71%     |       14.0       |          2-6          |    14.3
vga      |  6  |  72292  |    4     |     21203      | 15  | 6 |       78%       |           89%          |     67%     |       13.9       |          2-9          |    15.5
vga      |  7  |  72292  |    5     |     21203      | 22  | 5 |      100%       |          100%          |     55%     |       15.1       |          1-4          |    17.9
vga      |  8  |  72292  |    6     |     21203      | 29  | 7 |       89%       |           95%          |     69%     |       12.9       |          1-5          |    19.3
spi      |  9  |   1724  |    2     |      3370      |  8  | 2 |      100%       |          100%          |     75%     |       10.8       |          1-3          |    12.4
spi      | 10  |   1724  |    3     |      3370      | 13  | 3 |      100%       |          100%          |     62%     |       11.4       |          1-4          |    12.7
spi      | 11  |   1724  |    4     |      5019      | 16  | 5 |       90%       |          100%          |     63%     |       12.2       |          1-5          |    13.4
spi      | 12  |   1724  |    5     |      5019      | 16  | 6 |       94%       |          100%          |     69%     |       10.3       |          1-5          |    15.7
mem ctrl | 13  |  46767  |    4     |     10834      |  9  | 4 |      100%       |          100%          |     56%     |       15.8       |          1-6          |    12.2
mem ctrl | 14  |  46767  |    5     |     10834      | 15  | 5 |      100%       |          100%          |     67%     |       14.6       |          1-4          |    12.7
mem ctrl | 15  |  46767  |    6     |     19507      | 18  | 8 |       89%       |           94%          |     72%     |       15.0       |          2-4          |    13.8
mem ctrl | 16  |  46767  |    7     |     17006      | 20  | 9 |       90%       |           95%          |     70%     |       15.2       |          1-5          |    15.7
AVG      |     |         |          |                |     |   |       94%       |           98%          |     67%     |                  |                       |    14.9
error message information. Specifically, columns 8 and 9 refer to the accuracy of the triage flow when
performed with our error count estimation (e) or with the actual number of errors (# errors) respectively,
assuming prior knowledge of that number. The eleventh column presents the average number of suspects
across all counter-examples in each regression session. Column 12 presents the lowest and highest rank
assigned to the actual error in the ranking list. Finally, the last column indicates the total time consumed
by the calculation of the two metrics and the clustering process.
The engine’s average accuracy reaches 94% when the algorithm is executed with our initial guess
(column 7) and reaches 98% for those groupings where the number of clusters equals the number of
design errors. Generally, a perfect initial guess that reflects the actual number of errors is observed
in seven out of sixteen testcases, achieving a 99% accuracy on average. On the other hand, in cases
where the error count estimation is off by one or two clusters, accuracy drops to 88%. A conventional
approach similar to the one in Section 3.2, consisting of scripts to perform the grouping, achieves an
overall accuracy of 67%. As such, the proposed method improves accuracy by up to 40% when the initial
Plot of classification accuracy (%) vs. testcase No. (1-16) for WM and WM-Mod, with R = 4.
Figure 3.10: Effect of modification on Ward’s Method
guess is utilized; a solid improvement that indicates the potential of the proposed framework. Moreover,
the actual error is assigned a high rank in the suspect ranking list, as shown in column 12. Finally,
computation of the two metrics and clustering consume an average of 14.9 seconds in total, which is
acceptable for the purposes of triage.
Detailed results demonstrating the benefit of constraining Ward’s Method to prohibit the merging of
clusters that contain definitely unrelated counter-examples are shown in Fig. 3.10. Results are generated
per testcase, using the same configuration (R = 4, γ = 0.1) as in Table 3.1. Fig. 3.10 illustrates accuracy
results for the triage engine when standard Ward’s Method (WM) is used as a linkage criterion and when
the modified version (WM-Mod) is applied. We observe that the proposed modified criterion achieves
higher accuracy in 9 out of 16 testcases, overall improving the average accuracy by 4.5%. Recall that the
detailed results in Table 3.1 are generated when the modified version of Ward’s Method is applied and
thus agree with the results presented in Fig. 3.10.
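A small sketch of the constrained linkage follows, assuming simple feature vectors per counter-example and an `unrelated` predicate (e.g. zero signature overlap); both are illustrative stand-ins, not the thesis implementation:

```python
# Hypothetical sketch of the modified Ward's Method: standard Ward's
# minimum-variance merge cost, plus a constraint that two clusters are
# never merged if any pair of their counter-examples is known to be
# unrelated. Feature vectors and the `unrelated` test are illustrative.

def ward_cost(a, b):
    """Increase in within-cluster variance if clusters a and b merge."""
    ca = [sum(x) / len(a) for x in zip(*a)]  # centroid of a
    cb = [sum(x) / len(b) for x in zip(*b)]  # centroid of b
    d2 = sum((p - q) ** 2 for p, q in zip(ca, cb))
    return (len(a) * len(b)) / (len(a) + len(b)) * d2

def cluster(points, unrelated, n_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(unrelated(p, q) for p in clusters[i]
                                       for q in clusters[j]):
                    continue  # WM-Mod: forbid this merge outright
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        if best is None:
            break  # no admissible merge remains
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
never = lambda p, q: (p[0] < 1) != (q[0] < 1)  # cross-group pairs unrelated
print(len(cluster(pts, never, 1)))  # 2: the final merge is forbidden
```

With an always-false `unrelated` predicate the loop degenerates to standard Ward's Method.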
3.5 Summary
In this Chapter, a novel automated debugging triage framework is proposed. The algorithm extracts
information from simulation and debugging results to define relationships between various counter-
examples. Strongly related counter-examples are then grouped together to guide detailed debugging.
In order to quantify counter-example relation we introduce the concept of counter-example proximity
and propose a suspect ranking scheme for its computation. Furthermore, we devise a speculative metric
to estimate the number of co-existing errors. The applicability and efficacy of the triage engine are
demonstrated by experimental results within typical regression verification flows, indicating a signifi-
cant increase in grouping accuracy compared to traditional triage techniques.
Chapter 4
Leveraging Re-configurability To Raise
Productivity In FPGA Functional Debug
4.1 Introduction
Compared to high-cost, state-of-the-art ASIC design, field-programmable gate arrays (FPGAs) offer a
wide gamut of benefits when employed as platforms for digital circuit implementation. What mainly
distinguishes FPGAs is that they carry the advantage of re-configurability and relatively low NRE costs
for mid-to-high volume applications.
The reconfigurability property is one of the outstanding assets of FPGAs when it comes to functional
verification. With ASICs, designers spend considerable time in simulation/verification before tape out,
including, for example, simulation with post-layout extracted capacitances and cross-talk noise analysis.
Conversely, with FPGAs, designers rarely do post-routing full-delay simulations. Instead, reconfigurability
allows design iterations to include actual silicon execution. Designers verify their design in hardware
using the same (or a similar) FPGA they intend to deploy in the field. When design errors are discovered,
the design’s RTL is altered, re-synthesized and executed in hardware.
The time needed for design cycles in FPGAs is dominated by re-synthesis (logic synthesis, technology
mapping, placement and routing) tool run-times. FPGA placement and routing can take hours or
days for the largest designs [15], and such run-times are an impediment to designer productivity. With
this observation in mind, in this dissertation, we present new techniques for FPGA functional debug that
CHAPTER 4. RE-CONFIGURABILITY IN FPGA FUNCTIONAL DEBUG 59
exploit the reconfigurability concept to raise productivity by reducing the number of compute-intensive
design re-synthesis runs that are needed.
At a high-level, our approaches work as follows: Say, for example, an engineer wishes to trace
a large number, N, of a design’s internal signals during functional debug, using a small number of
available external pins, m (N ≫ m). We augment the design with additional circuitry that allows the
N signals to be traced with ⌈N/m⌉ FPGA device re-configurations and hardware executions. The key
value of our approach is that the design is only synthesized, placed and routed once, rather than ⌈N/m⌉
times. This is achieved by selecting the different sets of m trace signals through modifications to the
FPGA’s configuration bitstream (i.e. the post-routed design).
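The schedule implied above is simply a partition of the N signals into groups of at most m; a brief sketch (signal names hypothetical):

```python
# Illustrative sketch: tracing N internal signals over m external pins
# takes ceil(N/m) bitstream re-configurations and hardware runs, while
# synthesis, placement and routing are performed only once.

from math import ceil

def trace_schedule(signals, m):
    """Partition the signals into ceil(N/m) groups of at most m."""
    return [signals[i:i + m] for i in range(0, len(signals), m)]

signals = [f"net_{i}" for i in range(10)]  # N = 10 (hypothetical nets)
groups = trace_schedule(signals, m=4)
print(len(groups))   # 3, i.e. ceil(10/4) re-configurations
print(groups[-1])    # the last group holds the remaining 2 signals
```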
While the proposed approach leverages reconfigurability to reduce loops through the design process,
a further contribution of this work is a new multiplexer (MUX) design scheme for FPGAs that uses
significantly less area than a traditional MUX design. The new MUX is suitable for use in cases wherein
the MUX select inputs are changed using the FPGA bitstream, instead of using normally routed logic
signals. We also present a design variant to handle the scenario where limited external pins are available
for debugging.
As compared with design re-synthesis for each group of m signals, experimental results demonstrate
that our approach improves run-time by up to 30×. Our approach also offers stability in the timing
characteristics of the circuit being debugged.
This Chapter is organized as follows. Section 4.2 discusses the role of FPGAs in functional de-
bug and outlines previous work in the field. Section 4.3 introduces the proposed approach to FPGA
functional debug. Finally, Section 4.4 provides experimental results and Section 4.5 summarizes the
Chapter.
4.2 FPGA Functional Debug
There are two major approaches to perform functional debug with an FPGA. The first approach is to
implement the complete design in an FPGA device. This is suitable for small designs that do not need
to be executed at a high frequency. Because of reconfigurability, debugging modules can be easily
added or modified at no cost. A set of circuit modifications that enhance debug capability is presented
in [25]. It provides software-like debug features, such as watchpoints and breakpoints. However, any
modification to watchpoints or breakpoints requires recompilation of designs – a run-time intensive
task. In a somewhat similar manner to what is proposed in this work, Graham et al. improve debugging
productivity by instrumenting FPGA bitstreams [16]. An embedded logic analyzer is inserted into
the design without connecting to any signals. After place-and-route, the signals targeted for tracing
are routed to the logic analyzer by modifying bitstreams using vendor tools. Although the approach
provides flexibility in choosing the desired internal signals for tracing, it remains a very complicated
procedure. Furthermore, when different sets of signals are selected for tracing, re-routing needs to be
performed, which can significantly affect the timing closure of the design.
Xilinx’s ChipScope tool provides features to trace different signals without re-executing place and
route [42]. Special logic analyzer hardware is inserted into a design to trace internal signals during
design execution. The captured signals can then be displayed using Xilinx’s ChipScope Pro Analyzer
Tool, running on a connected host computer. While the approach bears similarity to our own, the
techniques and tools associated with ChipScope are proprietary and not disclosed publicly.
The second approach to using FPGAs for functional debug is that of embedding reconfigurable logic
into SoCs to enhance debug capability [2, 30]. The programmability of reconfigurable logic can be
applied to implement various debug paradigms, such as assertions, signal capture and what-if analysis.
Those paradigms help engineers to understand the internal behavior of the chip and provide at-speed
in-system debug. Engineers can instrument the reconfigurable logic on-the-fly, as needed. However,
each change to the debug circuitry incurs significant cost and overhead.
A recent work presented in [21] proposes a methodology for post-mapping incremental trace-
insertion that only utilizes unoccupied logic in order to preserve the original FPGA mapping. The
method allows for small incremental re-compilations for signal tracing, while also having a negligible
effect on critical path delay. The approach presented in our work can potentially be integrated into such
robust methodologies to further reduce the time between debugging sessions and preserve timing closure
in a sign-off circuit.
Finally, several works on selecting the signals that one may wish to trace for debugging have been
proposed [20, 23, 43]. While most works target ASIC designs, the work in [20] is designed specifically
for FPGAs. It predicts which signals may be useful for debugging and automatically instruments the
design. Any prior work on signal selection could be used in conjunction with our approach.

Figure 4.1: Area overhead of SignalTap (# ALMs + # registers vs. # traced nodes).

Figure 4.2: Multiplexer for signal selection (inputs i0 through im grouped in widths of w, select inputs s0 through sn, output out).
4.3 A Reconfigurability-Driven Approach to FPGA Functional Debug
This Section presents a new approach to enhance the observability of FPGA designs for functional
debug. To debug functional errors in an FPGA design, the design is first synthesized, placed and routed
on the target FPGA device. The programming bitstream is generated, programmed into the FPGA, and
execution commences. If unexpected behavior is observed, a set of internal signals is selected to be
traced by a logic analyzer to provide more information. In the conventional debug process, the design
needs to be recompiled and the FPGA needs to be reprogrammed. Fig. 4.1 shows the area overhead of
Altera’s SignalTap II [4] logic analyzer vs. the number of signals being tapped. One can see that the
overhead grows significantly as the number of monitored signals increases. Due to the area overhead of
the logic analyzer, usually only a small set of signals are traced at any one time. The process is repeated
until the values for all signals of interest are acquired. The main issue with this process is that it can
take hours to compile large designs [3]. As such, repeated compilation can introduce significant time
overhead and prolong the overall debug process.
To alleviate the issue, a new design process that avoids recompilation is presented in this work. The
idea is to modify the bitstream directly when different signals need to be traced. This is achieved by
inserting a multiplexer into the design implemented on the FPGA, with the MUX inputs being all sig-
nals that one potentially wants to trace. Fig. 4.2 depicts a multiplexer that can select one of m groups of
w signals. The select signals of the multiplexer are preset to logic-0 or to logic-1. Then, one can trace
different signals by manipulating the bitstream to set the select signals to different constants. Since there
is no re-routing required, the bitstream modifications can be done easily. As a result, the time overhead
of this process is reduced to a bitstream modification followed by a bitstream downloading. Bitstream
downloading normally requires only seconds – significantly less overhead than the re-compilation ap-
proach.
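As a back-of-envelope model of that saving, the two flows can be compared as follows (the time constants are illustrative assumptions, not measured values):

```python
def conventional_debug_time(sessions, t_compile_s):
    """Conventional flow: one full synthesis/place/route run per
    group of traced signals."""
    return sessions * t_compile_s

def proposed_debug_time(sessions, t_compile_s, t_bitstream_s):
    """Proposed flow: compile once, then each session costs only a
    bitstream modification plus a bitstream download."""
    return t_compile_s + sessions * t_bitstream_s

# Assumed numbers: a 1-hour compile, ~10 s per bitstream edit and
# download, and 16 trace sessions.
conv = conventional_debug_time(16, 3600)
prop = proposed_debug_time(16, 3600, 10)
print(round(conv / prop))  # 15, i.e. roughly a 15x gain here
```

Under these assumptions the gain grows with the number of sessions, since the single compile is amortized while each extra session adds only seconds.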
Another advantage of the proposed process is its negligible effect on the stability of the design. In
the conventional debug process, the design is re-routed each time different signals are selected. As
a result, designers often need to readjust the design to meet the various timing constraints. Even
though recent FPGA tools provide incremental compilation to preserve the engineering efforts from
previous place/route steps, experiments show that the speed performance of designs after incremental
compilation can vary. In the proposed process, because all signals one potentially wants to trace are
connected to the selection module from the beginning, only the original compilation is necessary. As
a result, selecting different signals through bitstream modifications minimizes the overall impact on the
performance of the design. Note that although our study targets Altera FPGAs, the proposed debugging
flow is not limited to Altera, and applies equally to FPGAs from other vendors.
4.3.1 An Area-Optimized Multiplexer Implementation
It is well-known that FPGAs are inefficient at implementing multiplexers. Therefore, in this Section,
a novel multiplexer implementation, optimized in the number of LUTs, is presented. The proposed
construction also takes advantage of the bitstream changes (described above).
Fig. 4.3(a) shows a traditional 16-to-1 MUX implementation in a Stratix III FPGA (the image is a
screen capture from Altera’s technology map viewer tool). Observe that five 6-input LUTs are required.
In a traditional MUX, the values of signals on the MUX select inputs can change at any time while
the circuit operates. However, in the proposed design process, the selected trace signals do not need
to change as the circuit operates. Rather, the set of selected signals is determined by the FPGA bit-
stream, and as such, may only change between device configurations. This makes an alternative MUX
implementation possible – one that consumes only three 6-LUTs in the 16-to-1 case.
The new MUX design is based on recognizing that a LUT’s internal hardware contains a MUX,
coupled with SRAM configuration cells. In our design, the LUT’s internal MUX forms a portion of
the MUX we wish to implement (made possible owing to the MUX select lines being held constant
during device operation). Fig. 4.3(b) shows the proposed 16-to-1 MUX, where the 16 inputs are labeled
(i0-i15). In this case, the LUT configuration SRAM cells (i.e., the truth table) determine which MUX
input signal is passed to the output. For the purposes of illustration, in Fig. 4.3(b), each LUT is labeled
with the logic function needed to select the 6th MUX input (i5) to the output. Only three LUTs are
required: the LUT labeled f1 passes input i5 to its output. LUT f2 can implement any logic function
since its output is not observable (however, to save power, f2 should be programmed to constant logic-0
or logic-1). LUT f3 is programmed to pass f1 to its output. The proposed design offers significant area
savings relative to the traditional design, and allows signal selection via bitstream changes.

Figure 4.3: 16-to-1 MUX implementation in 6-input LUTs. (a) Traditional, with data and select inputs feeding five LUTs; (b) Proposed, built from three LUTs f1, f2 and f3 (configured here as f1 = i5, f2 = X, f3 = f1).
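The LUT-count arithmetic behind Fig. 4.3 can be sketched as follows; the two formulas are our reading of the constructions in this Section (a 6-LUT fits a 4-to-1 MUX when the selects are live, while a 6-LUT can route any one of its 6 inputs when the selects are baked into the bitstream):

```python
import math

def traditional_mux_luts(n):
    """6-input LUTs for an n-to-1 MUX with free-running selects:
    each 6-LUT implements a 4-to-1 MUX (4 data + 2 select inputs),
    so a tree of 4-to-1 stages is built."""
    luts = 0
    while n > 1:
        stage = math.ceil(n / 4)  # 4-to-1 MUXes in this tree stage
        luts += stage
        n = stage
    return luts

def bitstream_mux_luts(n):
    """With selects fixed by the configuration bitstream, a 6-LUT can
    pass any one of its 6 inputs. Chaining L LUTs exposes 5L + 1 data
    slots, since each non-root LUT consumes one slot of its successor."""
    return max(1, math.ceil((n - 1) / 5))

print(traditional_mux_luts(16))  # 5 LUTs, as in Fig. 4.3(a)
print(bitstream_mux_luts(16))    # 3 LUTs, as in Fig. 4.3(b)
```

For 16 inputs this reproduces the five-versus-three LUT comparison above, and the gap widens for wider multiplexers.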
4.3.2 Debugging with Limited Output Pins
The debugging architecture described above requires multiple output pins if a group of signals is traced
in one silicon execution. This approach may not be feasible in cases where the output pins are limited.
Therefore, an alternative architecture that utilizes a parallel-in serial-out shift register is presented in
Fig. 4.4.
In Fig. 4.4, only one output pin is used. Values of the target group are loaded into the shift register in
parallel in each clock cycle. Then, the system clock is stopped, and a second debugging clock is used to
shift out the stored value. There is a trade-off between the number of output pins and the test execution
time. If more output pins are available, the data can be distributed into multiple shift registers which
feed different output pins. This results in fewer clock cycles for retrieving data from the shift registers.

Figure 4.4: Multiplexer with a 4-bit Shift Register (4-bit inputs A, B, C, D, selects s1, s2, debug clock clkdebug).
This architecture can be improved to obtain all values stored in the shift registers within one system
clock cycle (without stopping the system clock). Instead of shifting the data with a debug clock supplied
from off-chip, one can use the on-FPGA PLL to synthesize the debug clock from the system clock, with
the debug clock being n times faster than the system clock, where n is the width of the shift registers.
The advantage of this implementation is that the design does not need to be halted after each cycle in
order to empty the shift registers. However, this approach is only feasible if the system can be operated
at a low frequency.
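The pin/time trade-off described above reduces to simple arithmetic; the sketch below (with hypothetical helper names) covers both the multi-pin shift-register variant and the PLL-driven debug clock requirement:

```python
import math

def debug_cycles_per_sample(w, pins):
    """Clock cycles to shift a w-bit sample off-chip when it is
    split across `pins` parallel-in serial-out shift registers."""
    return math.ceil(w / pins)

def required_debug_clock_mhz(f_system_mhz, w, pins):
    """PLL variant: to empty the registers within one system cycle,
    the debug clock must run ceil(w/pins) times faster than the
    system clock."""
    return f_system_mhz * debug_cycles_per_sample(w, pins)

print(debug_cycles_per_sample(8, 1))       # 8 shifts over a single pin
print(required_debug_clock_mhz(50, 8, 2))  # 200 MHz for a 50 MHz system
```

The last line illustrates why this variant only suits low-frequency operation: the required debug clock scales with both the trace width and the system frequency.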
4.4 Experimental Study
This Section presents the area overhead and timing impact of the proposed structures. The structures
are integrated into benchmarks selected from the OpenCores and CHStone benchmark suites [17]. The
CHStone benchmarks are synthesized from the C language to Verilog RTL using a high-level synthesis
tool [11]. All RTL benchmarks are then compiled using Altera’s Quartus II 11.0, targeting the 65 nm
Stratix III FPGA (EP3SL70), with a maximum frequency constraint of 1 GHz. Table 4.1 summarizes
the ALM and register utilization of each original benchmark (i.e., without any debugging structures
integrated). The table also shows the post-routing maximum frequency (Fmax) of the benchmarks.
Table 4.1: Benchmarks.
Ckt.        #ALM   #Reg   Fmax (MHz)    Ckt.     #ALM    #Reg    Fmax (MHz)
ethernet    1323   1256   321.85        main     24483   20046   37.47
mem ctrl    1024   1051   266.95        dfsin    13946   16367   118.29
tmu         2336   3425   168.63        aes      8224    9090    129.10
rsdecoder   658    539    730.46        adpcm    11330   9852    101.58
In our experiment setup, registers in each module of each benchmark are randomly selected as
tracing candidates. Benchmarks are modified such that traced signals are wired to the top-level of the
benchmark and connected to the proposed structures. Altera’s synthesis attributes, keep and noprune,
are used to ensure that all signals exist after optimization. In the following discussion, the notation
m-w represents the tracing setting where m signals are candidates for tracing and w signals are traced
concurrently in one silicon execution.
Experimental results of the structures described in Section 4.3 are presented in the next subsection,
followed by an analysis of the productivity and the stability of the proposed design process.
4.4.1 Area Usage and Timing Analysis
The area overhead and Fmax of the proposed architectures with various sizes are depicted in Figure 4.5.
Four implementations are investigated: a traditional MUX implementation, a 6-LUT-based MUX implementation
(as proposed in Section 4.3.1), a 4-LUT-based MUX implementation (same as proposed
in Section 4.3.1, except using 4-LUTs instead of 6-LUTs), and a shift-register-based implementation
(as proposed in Section 4.3.2). As shown in Fig. 4.5(a), the 6-input LUT implementation uses, on av-
erage, 35% fewer ALMs than the traditional MUX implementation. The 4-input LUT implementation
can further reduce the usage of ALMs. This is because each ALM in a Stratix III device can contain
two 4-input LUTs, and Quartus II may merge two 4-input LUTs into one ALM. However, there is no
user control to force such an optimization to happen, and therefore, in the remaining experiments, all
multiplexers in the proposed structures are implemented with the 6-input LUT approach. In this exper-
iment, the shift-register implementation only uses one output pin and is driven with an external debug
clock. Due to the shift register, the area cost is slightly greater than the cost with the full multiplexer
implementation.
Figure 4.5: Area and Fmax of multiplexers. (a) # ALMs vs. tracing configuration (128-2 through 256-8); (b) Fmax (MHz) vs. tracing configuration. Each plot compares the Traditional MUX, 6-LUT MUX, 4-LUT MUX and Shift Register implementations.
Fig. 4.5(b) shows the Fmax of each MUX implemented in isolation. Since the area-optimized im-
plementation requires fewer ALMs to construct a multiplexer, less parasitic capacitance is introduced
on the critical path. Consequently, multiplexers with the 4-input LUT implementation have the highest
frequency in most cases.
Table 4.2(a) reports the percentage increase in ALMs and registers of benchmarks when the area-
optimized multiplexer is integrated. Two groups of tracing settings are considered. The worst-case in
each group is shown in bold. Results show that in most cases the area overhead is less than 10%. The
area overhead stems not only from the additional structure, but also from wiring signals from
sub-modules up to the top-level module. The maximum frequency Fmax of the benchmarks with the
CHAPTER 4. RE-CONFIGURABILITY IN FPGA FUNCTIONAL DEBUG 67
Table 4.2: Effects of area-optimized multiplexer.
(a) Area Increase Percentage (ALMs + registers) (%)
Ckt. 128-2 128-4 128-8 256-2 256-4 256-8
ethernet 6.91 7.10 7.34 10.87 11.26 11.53
mem ctrl 6.12 6.48 6.90 10.57 10.66 11.69
tmu 5.95 6.02 6.11 10.95 10.99 11.12
rsdecoder 11.36 10.52 9.86 13.03 19.05 17.79
main 0.27 0.29 0.65 0.77 0.75 1.15
dfsin 0.46 0.52 0.39 1.08 1.06 1.05
aes 1.14 1.31 1.67 2.54 2.48 2.94
adpcm 1.61 1.52 1.66 1.76 1.59 1.64
(b) Fmax Change Percentage (%)
Ckt. 128-2 128-4 128-8 256-2 256-4 256-8
ethernet -0.28 -0.02 -0.07 -0.31 -0.06 -0.11
mem ctrl -3.2 -10.1 -5.2 -8.19 -12.15 -7.23
tmu 1.99 2.12 2.2 1.06 0.98 0.92
rsdecoder -35.06 -32 -17.51 -33.99 -29.87 -28.51
main -1.81 -1.36 -4.06 -0.43 2.86 2.16
dfsin 3.53 -1.5 3.29 1.57 -0.06 -3.51
aes -0.74 -0.33 0.77 0.17 -0.6 -1.14
adpcm 3.62 2.07 -0.27 1.79 -0.5 -0.36
same tracing settings is reported in Table 4.2(b). Overall, Fmax is not affected greatly – changes are
mainly due to algorithmic noise. The only exception is rsdecoder, the reason being that the
critical path for this benchmark is altered to pass through the multiplexer.
Similar to Table 4.2(a) and Table 4.2(b), the effect of the shift-register-based structure on the area
and Fmax of benchmarks is summarized in Table 4.3(a) and Table 4.3(b), respectively. Here, instead
of using an external debug clock, a faster debug clock is generated from the system clock using the
Stratix III PLL. The faster clock allows us to shift out the content of the shift-register within one system
clock cycle. As expected, because of the additional shift registers, the overall area overhead can be a
bit higher than the area overhead of the full multiplexer discussed previously. Furthermore, Fmax drops
significantly in all cases – the system clock speed is limited by the debug clock speed. For three of the
eight benchmarks, Fmax drops more than 50%.
CHAPTER 4. RE-CONFIGURABILITY IN FPGA FUNCTIONAL DEBUG 68
Table 4.3: Effects of area-optimized multiplexers with shift registers.
(a) Area Increase Percentage (ALMs + registers) (%)
Ckt. 128-2 128-4 128-8 256-2 256-4 256-8
ethernet 6.81 7.43 7.16 9.71 10.53 11.42
mem ctrl 6.72 6.57 6.92 10.56 11.11 11.93
tmu 6.56 6.20 6.70 9.81 10.29 10.76
rsdecoder 12.36 13.37 12.53 19.21 20.21 19.80
main 0.21 0.25 0.25 0.60 0.64 0.63
dfsin 0.29 0.27 0.39 0.73 0.76 0.85
aes 1.09 1.02 1.13 2.14 2.24 2.17
adpcm 1.22 1.20 1.49 1.70 1.77 1.88
(b) Fmax Change Percentage (%)
Ckt. 128-2 128-4 128-8 256-2 256-4 256-8
ethernet -42.76 -44.2 -44.89 -51.1 -54.26 -53.32
mem ctrl -29.25 -28.05 -28.67 -39.87 -35.37 -31.33
tmu -5.24 -6.08 -6.03 -0.43 -9.68 -7.14
rsdecoder -75.55 -71.96 -69.02 -75 -76.25 -76.41
main 0.77 0.61 -0.72 -2.86 -1.49 0.43
dfsin -10.74 -9.38 -8.57 -10.24 -7.75 -6.02
aes -4.93 -2.98 -1.9 -11.11 -10.22 -11.96
adpcm -3.63 1.1 1.55 4.1 3.36 2.63
4.4.2 Productivity and Stability
In the last set of experiments, we compare the productivity and stability of the conventional design
process against those of the proposed one. Altera’s SignalTap II is used as the embedded logic analyzer. As mentioned in Section 4.3, due
Table 4.4: Compilation time of SignalTap.
                   128-8 (sec)                    256-8 (sec)
Ckt.         Prop.   First   Incr.   Total   Prop.   First   Incr.   Total
                     (SignalTap)                     (SignalTap)
ethernet 139 134 117 2006 141 134 119 3946
mem ctrl 150 143 124 2129 156 143 123 4073
tmu 169 161 137 2354 179 161 140 4639
rsdecoder 106 103 99 1685 109 103 98 3233
main 1449 1448 293 6141 1453 1448 290 10737
dfsin 706 696 216 4150 711 696 217 7648
aes 465 453 186 3428 466 453 184 5901
adpcm 634 615 226 4234 639 615 225 7815
Figure 4.6: Stability of SignalTap. (a) Normalized Fmax vs. the number of traced nodes on the critical path (ethernet, mem ctrl, tmu, main, dfsin, aes, adpcm); (b) Normalized Fmax vs. debug session when tracing random nodes (rsdecoder).
to the size of SignalTap, acquiring trace data for a large number of signals is often achieved by succes-
sively tracing multiple smaller groups. Recompilation is required when a different group of signals is
selected.
The experiment is carried out as follows. Two tracing settings are studied: 128-8 and 256-8. In
order to use the incremental compilation feature in Quartus II, only post-fitting signals are considered.
First, the design is compiled without the SignalTap module. 128 (256) post-fitting nodes are randomly
selected after the first compilation. Next, eight signals from the set are monitored. The procedure is
repeated until all 128 (256) signals are traced.
The compilation time results are summarized in Table 4.4. The first column lists the benchmarks.
The next four columns report the results for the first tracing setting: the compilation time of the proposed
process, the first compilation of the SignalTap process, the average compilation time of each data acquisition
session, and the total cumulative compilation time of the SignalTap-based debugging process.
The results for the second tracing setting are reported in the final four columns. As shown in the table,
since the proposed bitstream-modifications-only process requires only one compilation, its compilation
time roughly equals that of the first compilation of the SignalTap process. Although incremental compilation
reduces the compilation time by 4%-80%, each additional compilation adds time overhead. Overall, the
proposed process can save up to 93% (i.e., 139/2006 for ethernet) in the case of the 128-8 scenario,
and 97% (i.e., 109/3233 for rsdecoder) in the case of 256-8.
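The quoted savings follow directly from Table 4.4; for instance, using the ethernet 128-8 entry:

```python
def compile_time_saving(t_proposed_s, t_signaltap_total_s):
    """Fraction of cumulative compilation time saved by the
    bitstream-modifications-only flow relative to SignalTap."""
    return 1 - t_proposed_s / t_signaltap_total_s

# ethernet, 128-8: 139 s (proposed) vs. 2006 s (SignalTap total).
print(round(100 * compile_time_saving(139, 2006)))  # 93 (%)
```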
Incremental compilation tries to preserve the engineering effort from a previous compilation to
minimize the impact to design performance. While it does well in many cases, experiments show that
Fmax can still vary when the monitored signals are on the critical path. The result is plotted in Fig. 4.6(a).
In each case, a total of 32 signals are traced. The x-axis of the plot is the number of traced signals that
are on the critical path. The y-axis is the normalized Fmax, where the base is the Fmax of the original
benchmark. One can see that Fmax drops to various degrees, by as much as 10%, depending on which
signals are monitored. However, the reader should note that cases where the majority of the traced
signals reside on the critical path correspond to worst-case scenarios. Still, even for cases where
a small number of traced signals are on the critical path, one can observe unpredictable behavior in
the maximum frequency.
For designs that can be operated at a very high frequency, the SignalTap module can in fact be
where the critical path resides. In this case, monitoring any set of signals can change Fmax, as shown
in Fig. 4.6(b). The x-axis of the plot is the data acquisition session, where 8 signals are traced in each
session with 32 sessions in total. The plot shows that Fmax is unstable from one session to another.
4.5 Summary
Functional debugging using FPGA devices provides several advantages over the traditional software
simulation approach. This chapter presents a set of hardware structures that take advantage of FPGA
reconfigurability to enhance observability for debugging. Furthermore, experimental results
demonstrate that the new techniques can improve the productivity of the debugging process by up
to 30×.
Chapter 5
Conclusions and Future Work
5.1 Summary of Contributions
Debugging today remains a complicated and resource-intensive process, with its multiple aspects, re-
quirements and limitations spanning various hardware design practices and design flows. New prob-
lems related to debugging are constantly introduced in modern design flows, requiring the development
of novel debugging methodologies to keep up with this growing complexity and heterogeneity of the
debugging task. Two of those challenging debugging problems that constantly grow in importance and
difficulty are those of triage in RTL design debug and FPGA functional debug.
The purpose of this thesis is to present practical techniques to address these problems. The first
contribution is a novel automated triage framework for RTL design debug, which is developed in an
effort to offer a viable alternative to traditional script-based or manual triage. The second contribution
introduces new hardware and software techniques that leverage the re-configurability property of FPGAs
in order to increase productivity for FPGA functional debug.
• In Chapter 3, a novel automated triage framework is presented, which groups together related
counter-examples that are generated by regression verification flows. The framework is based on
newly introduced metrics that define relations between counter-examples and make predictions on
the number of co-existing RTL errors in the failing design. The proposed framework formulates
triage as a clustering problem and generates partitions of the counter-example set by employing
hierarchical clustering techniques.
• In Chapter 4, novel hardware and software techniques are introduced that accelerate FPGA func-
tional debug by allowing the tracing of internal design signals during silicon execution without
the need for time-intensive re-synthesis iterations. The proposed method requires a single execution
of the synthesis flow to trace a large number of signals for an arbitrary number of cycles using a
limited number of output pins.
5.2 Future Work
The following provides a summary of extensions and future directions relating to the contributions of
Chapters 3 and 4.
• There are several areas for extensions and future directions with respect to the contributions of
Chapter 3. To the best of our knowledge, the proposed framework is the first published work
on automated counter-example triage for RTL design debug. As a result, there is a lot of fertile
ground for extensions and improvements for this approach to eventually reach maturity. One of
the limitations of the proposed work in Chapter 3 is that it only performs well when the human-
introduced errors responsible for the counter-example set are relatively coarse and affect the
circuit more than gate-level errors or stuck-at faults. This implies that in the case where the actual
error is, for example, a wrong bit inversion or a wrong gate, then it will most likely not be identified
as an important suspect. In the same context, errors introduced by CAD tools are not modeled
and their typical behavior is not investigated. Apart from the above, the way counter-example
proximity is defined only allows hierarchical clustering to be applied. One idea is to formulate
triage using more flexible models that will allow the application of various clustering methods,
such as K-means, K-medoids, and Gaussian Mixture Models [10].
• One of the extensions to the work presented in Chapter 4 can be the integration of debug features,
such as trigger events, into the proposed structures to enhance the debugging ability. Another
interesting extension is developing a debugging algorithm that utilizes the proposed structures
and provides an efficient and effective FPGA debugging environment.
Bibliography
[1] M. Abramovici, M. Breuer, and A. Friedman, Digital Systems Testing and Testable Design. Com-
puter Science Press, 1990.
[2] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, “A reconfig-
urable design-for-debug infrastructure for SoCs,” in Design Automation Conference, 2006, pp.
7–12.
[3] Increasing Productivity With Quartus II Incremental Compilation, Altera Corp., San Jose, CA,
2008.
[4] Design Debugging Using the SignalTap II Logic Analyzer, Altera Corp., San Jose, CA, 2011.
[5] J. Baumgartner, H. Mony, V. Paruthi, R. Kanzelman, and G. Janssen, “Scalable sequential equiva-
lence checking across arbitrary design transformations,” in International Conference on Computer
Design, 2006.
[6] J. Bergeron, Writing Testbenches: Functional Verification of HDL Models, Second Edition.
Kluwer Academic Publishers, 2003.
[7] V. Betz and J. Rose, “Cluster-based logic blocks for FPGAs: Area-efficiency vs. input sharing and
size,” in IEEE Custom Integrated Circuits Conf., Santa Clara, CA, 1997, pp. 551–554.
[8] ——, “FPGA routing architecture: Segmentation and buffering to optimize speed and density,” in
ACM/SIGDA Int’l Symposium on FPGAs, Monterey, CA, 1999, pp. 140–149.
[9] A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y. Zhu, “Bounded model checking,” Ad-
vances in Computers, vol. 58, pp. 118–149, 2003.
[10] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics).
Springer, 2007.
[11] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Cza-
jkowski, “LegUp: high-level synthesis for FPGA-based processor/accelerator systems,” in Inter-
national Symposium on Field-Programmable Gate Arrays, 2011, pp. 33–36.
[12] F. M. De Paula, M. Gort, A. J. Hu, S. Wilton, and J. Yang, “Backspace: Formal analysis for
post-silicon debug,” in International Conference on Formal Methods in CAD, 2008, pp. 1–10.
[13] H. Foster, A. Krolnik, and D. Lacey, Assertion-Based Design. Kluwer Academic Publishers,
2003.
[14] E. Goldberg, M. Prasad, and R. Brayton, “Using SAT for combinational equivalence checking,” in
Design, Automation and Test in Europe, 2001, pp. 114–121.
[15] M. Gort and J. Anderson, “Deterministic multi-core parallel routing for FPGAs,” in International
Conference on Field Programmable Logic and Applications, 2010, pp. 78 –86.
[16] P. Graham, B. Nelson, and B. Hutchings, “Instrumenting bitstreams for debugging FPGA circuits,”
in International Symposium on Field-Programmable Custom Computing Machines, 2001, pp. 41–
50.
[17] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Proposal and quantitative analysis of the CH-
Stone benchmark program suite for practical C-based high-level synthesis,” Journal of Information
Processing, vol. 17, pp. 242–254, 2009.
[18] H. Foster, “From volume to velocity: The transforming landscape in functional verification,” in Design
Verification Conference, 2011.
[19] S. Huang and K. Cheng, Formal Equivalence Checking and Design Debugging. Kluwer Aca-
demic Publisher, 1998.
[20] E. Hung and S. Wilton, “Speculative debug insertion for FPGAs,” in International Conference on
Field Programmable Logic and Applications, 2011, pp. 524–531.
[21] E. Hung and S. J. Wilton, “Incremental trace-buffer insertion for FPGA debug,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. PP, no. 99, pp. 1–1, 2013.
[22] B. Keng and A. Veneris, “Path directed abstraction and refinement in SAT-based design debugging,”
in Design Automation Conference, 2012.
[23] H. F. Ko and N. Nicolici, “Algorithms for state restoration and trace-signal selection for data
acquisition in silicon debug,” IEEE Transactions on CAD, vol. 28, no. 2, pp. 285 – 297, 2009.
[24] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Transactions on
CAD, vol. 26, no. 2, pp. 203–215, 2007.
[25] L. Lagadec and D. Picard, “Software-like debugging methodology for reconfigurable platforms,”
in IEEE International Symposium on Parallel and Distributed Processing, 2009, pp. 1–4.
[26] H. Mangassarian, L. Bao, A. Goultiaeva, A. Veneris, and F. Bacchus, “Leveraging dominators
for preprocessing QBF,” in Design, Automation and Test in Europe Conference and Exhibition, 2010, pp.
1695–1700.
[27] H. Mangassarian, A. Veneris, S. Safarpour, M. Benedetti, and D. Smith, “A performance-driven
QBF-based iterative logic array representation with applications to verification, debug and test,”
in International Conference on Computer Aided Design, 2007.
[28] K. McMillan, “Interpolation and SAT-based model checking,” in Computer Aided Verification,
2003.
[29] OpenCores.org, “http://www.opencores.org,” 2007.
[30] B. Quinton and S. Wilton, “Programmable logic core based post-silicon debug for SoCs,” in IEEE
International Silicon Debug and Diagnosis Workshop, 2007.
[31] R. K. Ranjan, C. C., and S. S., “Beyond verification: Leveraging formal for debugging,” in Design
Automation Conference, 2009, pp. 648–651.
[32] P. Rashinkar, P. Paterson, and L. Singh, System-on-a-chip Verification: Methodology and Tech-
niques. Kluwer Academic Publisher, 2000.
[33] S. Safarpour, A. Veneris, and F. Najm, “Managing verification error traces with bounded model
debugging,” in ASP Design Automation Conference, 2010.
[34] O. Sarbishei, M. Tabandeh, B. Alizadeh, and M. Fujita, “A formal approach for debugging arith-
metic circuits,” in IEEE Transactions on CAD, vol. 28, no. 5, May 2009, pp. 742–754.
[35] A. Smith, A. Veneris, M. F. Ali, and A. Viglas, “Fault diagnosis and logic debugging using Boolean
satisfiability,” IEEE Transactions on CAD, vol. 24, no. 10, pp. 1606–1621, 2005.
[36] S. Safarpour, B. Keng, Y.-S. Yang, and E. Qin, “Failure triage: The neglected debugging problem,” in
Design and Verification Conference, 2012.
[37] S. Safarpour, M. Liffiton, H. Mangassarian, A. Veneris, and K. A. Sakallah, “Improved design debugging
using maximum satisfiability,” in International Conference on Formal Methods in CAD, 2007.
[38] A. Suelflow, G. Fey, R. Bloem, and R. Drechsler, “Using unsatisfiable cores to debug multiple
design errors,” in Great Lakes Symposium on VLSI, 2008.
[39] J. Swartz, V. Betz, and J. Rose, “A fast routability-driven router for FPGAs,” in ACM/SIGDA Int’l
Symposium on FPGAs, Monterey, CA, 1998, pp. 140–149.
[40] G. J. Szekely and M. L. Rizzo, “Hierarchical clustering via joint between-within distances: Extending
Ward’s minimum variance method,” Journal of Classification, vol. 22, no. 2, pp. 151–183,
2005.
[41] A. Veneris and I. N. Hajj, “Design error diagnosis and correction via test vector simulation,” IEEE
Transactions on CAD, vol. 18, no. 12, pp. 1803–1816, 1999.
[42] ChipScope ILA Tools Tutorial, Xilinx Inc., San Jose, CA, 2003.
[43] Y.-S. Yang, N. Nicolici, and A. Veneris, “Automating data analysis and acquisition setup in a
silicon debug environment,” IEEE Transactions on VLSI, 2011.