
Journal of Electronic Testing (2019) 35:173–189 https://doi.org/10.1007/s10836-019-05786-z

An Integrated on-Silicon Verification Method for FPGA Overlays

Alexandra Kourfali1 · Florian Fricke2 · Michael Huebner3 · Dirk Stroobandt1

Received: 9 July 2018 / Accepted: 6 March 2019 / Published online: 27 March 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract

Field Programmable Gate Arrays (FPGAs) gain popularity as higher-level tools evolve to deliver the benefits of re-programmable silicon to engineers and scientists at all levels of expertise. In order to use FPGAs efficiently, new CAD tools and modern architectures are needed for the growing demands of heterogeneous computing paradigms. Overlay architectures have become a popular option to support a variety of high-performance computing applications implemented on heterogeneous computing platforms. However, most of these architectures cannot offer an efficient way to dynamically debug and repair them. In this paper, we propose a superimposed virtual coarse-grained reconfigurable architecture, embedded with on-demand debug and self-healing capabilities. The proposed method automatically creates flexible techniques for in-circuit error detection and correction of generic Processing Elements and Virtual Channels. The debugging infrastructure is integrated in the design with tailor-made CAD tools, making it feasible to rapidly debug and repair virtual architectures with minimal use of additional FPGA resources.

Keywords FPGA · CGRA · In-circuit debugging · Repair · Parameterized configuration · On-silicon debug · Verification · FPGA overlay

1 Introduction

Field-programmable gate arrays (FPGAs) continue to gain momentum with the rise of complex system designs that require optimum flexibility at the cost of performance. Since their invention, FPGAs have evolved from being simple glue logic chips and systems used for chip prototyping to actually replacing custom application-specific integrated circuits (ASICs) and processors in various applications, such as digital signal processing, big data processing, mission-critical systems and security applications. Moreover, the popularity of FPGAs has increased in recent years. This is evident from Intel’s acquisition of Altera and various efforts to integrate FPGA technologies in the cloud, such as Amazon Web Services F1 and Microsoft Azure [1, 32, 36]. In order to be efficiently used in this variety of applications, new CAD tools are needed for the growing demands on design, verification and debugging efficiency.

Responsible Editor: L. M. Bolzani Pohls

Alexandra Kourfali, https://www.ugent.be

Extended author information available on the last page of the article.

So far, in order to debug and verify a design, designers have had to go through long verification cycles. Initially, this was done via simulation. Simulation provides full internal observability and software-like debugging capabilities. However, it is slow and cannot always demonstrate realistic scenarios [11, 35]. FPGA emulation can provide higher verification coverage compared to simulation, allowing designers to test their design using realistic scenarios. The main drawback of using FPGA emulation is the limited internal signal observability. While simulation gives full visibility, FPGA emulation allows the designer to observe only the signals that are driven through the scarce output pins by employing on-chip RAM blocks known as trace-buffers [5, 14, 24]. This limits the productivity of debugging via FPGA emulation, as the designer has to recompile the entire FPGA in order to observe different signals through the output pins. Moreover, emulators are expensive.

In-circuit debuggers provide ways to observe more internal signals and therefore offer in-circuit debugging infrastructure that enables the designer to trace and observe a set of signals [23, 39, 45]. This is done by storing data in the trace-buffers and then debugging it off-line. In-circuit debugging is an increasingly difficult problem, as the FPGA capacity becomes larger and FPGA architectures become more complex. Thus, it is no longer efficient for engineers seeking to accelerate custom applications. In that case, flexibility and efficient debugging and verification are needed, without extensive knowledge of the underlying hardware.

A new methodology, aligned with that trend, is to use virtual architectures and overlay fabrics in order to offer versatility and complexity to custom applications. These architectures have reconfigurable pre-synthesized circuits implemented on FPGAs. With the use of virtual architectures and overlays, faster compilation times are achieved, due to their optimizations. Overlays have been used in various areas of research [3, 6, 12, 25, 30, 31]. However, overlays are limited by their area and performance overhead. The Parameterized Configurations technique has been presented as a technique that can help with the resource overheads of a type of overlay called Virtual Coarse Grained Reconfigurable Array (VCGRA) [17].

Despite the increasing applicability of FPGAs, the current debugging techniques cannot efficiently debug the new designs. The use of current in-circuit debuggers will have two possible effects on debugging FPGAs (and VCGRAs). First, a large area is needed for trace-buffers in order to store all the debugging information. Also, the designer will need to sit through long recompilation cycles to debug a design with the conventional tools. Hence, for a high-performance application the design will need at least a day to recompile in order to observe a new set of signals. The second effect is that, if a conventional debugger is used, the high-performance application designers will have to do a hands-on debug.

By using overlays for in-circuit debugging, we can reduce both the area and recompilation times. An overlay can be compiled to an FPGA after the circuit is compiled and then rapidly configured to implement the instrumentation at debug time. A debugging overlay can be designed efficiently by matching the overlay to the underlying fabric and adapting the overlay to use parameterized logic and routing resources [27, 28], or by adapting the overlay to use only those logic and routing resources left unused by the user circuit [8, 10, 20]. As a result, the area overhead of the debugging infrastructure due to these overlays can be minimal.

In this paper, we propose a semi-automated in-circuit debugging method for theoretical and commercial FPGA architectures that integrates FPGA overlays with debugging properties, with minimal user intervention. We propose a custom two-level overlay architecture where the debugging instrumentation is virtual and automated, and is integrated in the design during the original compilation. The debugging infrastructure is added incrementally and optimized alongside the target FPGA. The first overlay level is the VCGRA, whereas the second overlay implements the proposed debugging functionality.

2 Related Work

2.1 In-Circuit Debugging

One of the most important requirements for (in-circuit) FPGA debugging is to achieve software-like on-chip signal observability. In the past few years, functional debug has made significant progress [43]. Design for debug has given debug engineers increased observability of an embedded system’s internal operation. There are severe constraints, however, on the amount of debug observability that it can provide for error localization. The designer needs to detect the signals that change during the hardware execution. For large designs, it is impossible to observe all the internal signals while the design is running at-speed. In order to overcome that, the values of the signals can be stored in on-chip memories. Moreover, a technique that efficiently traces signals and records possible changes for offline analysis is highly needed.

Most FPGA designers use simulation and analysis tools before downloading a design into the physical device. However, it is not possible to reproduce realistic environmental conditions that a design will encounter, using simulation [11]. In order to overcome the limitations of software simulation, circuit designers have to perform in-circuit debugging during FPGA emulation. The FPGA designs operate several orders of magnitude faster than software simulation and therefore allow early access to hardware tests and faster debugging in real environments. There are four basic methodologies for in-circuit FPGA debugging, namely Embedded Logic Analyzers (ELAs), embedded scan chains, external test equipment and readback.

Debugging with External Test equipment, such as oscilloscopes and external logic analyzers, is convenient if there is access to the necessary equipment [18]. External Test equipment operates by monitoring the external output pins of the integrated circuit. It analyzes how the states of the external output pins respond to certain signals placed on the external input pins, in order to determine if an error has occurred within the integrated circuit. However, since the external logic analyzers are capable of accessing the integrated circuit only through the external pins, internal conditions of interest for debugging cannot be monitored directly. With these models, it may be impossible to accurately determine the precise source of a bug, as there is no direct access to the internal inputs and outputs of the components under test.

Debugging with Embedded Logic Analyzers (ELAs) is one of the basic methodologies for in-circuit FPGA debugging [23, 39, 45]. These techniques are provided by commercial vendors, as debug IPs. Both Xilinx’s Vivado and Intel’s Quartus use general-purpose incremental compilation to avoid full recompilation if the debug core is updated during debug iterations, whereas Intel’s SignalTap II compiles the debug core with the user circuit as a separate partition (and recompiles this partition separately if the property of the debug core is updated). Synopsys’ Identify requires a recompilation if new logic is introduced to the debug core. Debugging with ELAs offers limited observability as their area overhead increases with the number of signals to be observed.

Embedded scan chains, on the other hand, offer very high observability and operate in virtual-time, but they have reduced performance [26, 44]. Most commercial FPGAs offer a way to read out the contents of their flip-flops and block-RAMs through a suitable interface, without destroying the machine state. This is done without any overhead on the FPGA fabric and hence without perturbing the application. These FPGAs can support readback-based techniques. Readback stands between ELAs and scan chains, as it introduces less area overhead and does not need to restore the initial state. Readback-based solutions have an advantage over their counterparts, as they offer full observability without requiring resynthesis when changing or adding debug signals [2, 41, 46]. However, they cannot provide the same level of flexibility as overlay architectures. Moreover, they need to halt the circuit before scan-out. This can greatly slow down their use for real-time debugging.

All the above-mentioned techniques are significantly slow with large area overheads. Also, they offer limited observability, as their area overhead increases with the number of signals to be observed. On-chip observability can be enhanced using trace-based techniques. Trace-based techniques operate by allocating a significant amount of the FPGA’s memory resources to record a small subset of internal signals while the FPGA is operating in real time. For the trace IP solutions that are provided by the commercial vendors [23, 39, 45], the subset of signals that are connected to on-chip memories, called trace-buffers, must be determined by the designer before the circuit is implemented. There are tools that avoid recompilation, such as Certus by Mentor Graphics [34]. However, these tools still require the designer to predetermine a few signals that they can observe. Therefore, complete on-chip visibility is still not provided and a full recompilation is not always avoided.

Academic works have used overlay-based incremental techniques for signal tracing while avoiding full recompilation during debug [8, 15, 20, 21]. In [15, 21] the researchers used incremental routing techniques to connect signals from the user circuit to trace buffers, whereas in [8, 20] the researchers extended these techniques into a virtual overlay routing network and an accompanying routing algorithm to route trace signals to trace-buffers, by connecting all on-chip signals to the available trace buffers. Then, during debug, the designer can choose any set of the design signals for observation. If this set has to be changed, then only the control signals of the overlay’s routing multiplexers have to be updated. In that way, the need for a full recompilation is eliminated and the original mapping remains unaffected. In [9, 19, 22] academic works demonstrate the post-implementation insertion of debugging infrastructure. All these techniques distribute the debugging logic over unused resources. However, these works demand adequate empty regions to install their circuitry. These techniques can prove problematic in highly utilized FPGAs. Additionally, they can create a new, longer critical path and cause routing congestion.
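The property that only multiplexer control signals change between debug iterations can be illustrated with a minimal Python sketch. It is not the routing algorithm of [8, 20]; the class `TraceOverlay`, its methods and the signal names are hypothetical. The model only assumes that every signal is wired to the multiplexer inputs once and that selecting a new observation set rewrites select values, leaving the wiring untouched.

```python
from collections import deque

class TraceOverlay:
    """Toy model: each trace buffer sits behind a multiplexer over all signals."""

    def __init__(self, signal_names, num_buffers, depth):
        self.signal_names = list(signal_names)   # wiring, fixed at "compile time"
        self.selects = [0] * num_buffers         # mux select values (control bits)
        self.buffers = [deque(maxlen=depth) for _ in range(num_buffers)]

    def observe(self, names):
        """Choose the observed signals by updating select values only."""
        if len(names) > len(self.selects):
            raise ValueError("more signals requested than trace buffers")
        for i, name in enumerate(names):
            self.selects[i] = self.signal_names.index(name)

    def clock(self, values):
        """Record one cycle; `values` maps every signal name to its value."""
        for sel, buf in zip(self.selects, self.buffers):
            buf.append(values[self.signal_names[sel]])

    def dump(self, buffer_index):
        return list(self.buffers[buffer_index])


overlay = TraceOverlay(["a", "b", "c", "d"], num_buffers=2, depth=4)
overlay.observe(["a", "c"])
for t in range(3):
    overlay.clock({"a": t, "b": 10 + t, "c": 20 + t, "d": 30 + t})
overlay.observe(["d", "b"])          # new signal set, same wiring
overlay.clock({"a": 3, "b": 13, "c": 23, "d": 33})
```

In this sketch the second call to `observe` plays the role of updating the overlay’s control signals: the list of wired signals is never modified, which is the software analogue of leaving the original mapping unaffected.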

In order to handle the resource overhead, researchers use the properties of compiler-optimized HLS circuits and adaptable trace-buffers to enable dynamic tracing [13, 14] with limited resource overhead. They also adapt the trace-buffer architecture for even more efficiency. Dynamic signal tracing techniques increase observability. However, on a highly utilized FPGA, it may be hard or even impossible to find a region large enough to implement the additional logic. In that case as well, a trigger mechanism can potentially introduce a new long critical path. Most importantly, this technique is only available for compiler-optimized HLS designs.

The virtual and overlay architectures provide more flexibility. The designers should be able to debug their virtual architecture designs in-system, without detailed knowledge of the hardware’s limitations. However, none of the mentioned techniques can currently support efficient debugging of virtual architectures without extensive resource overhead or time-consuming recompilation. Thus, in order to efficiently debug FPGA overlays and virtual designs in-system, the current limitations of the in-circuit debuggers need to be addressed.

2.2 Parameterized Configurations

Parameterized Configurations (PConf) is a methodology used for implementing an application whose input values (called parameters) change infrequently, on an FPGA [4]. Instead of implementing the FPGA’s inputs as regular inputs, with PConf these inputs are implemented as constants and they form a Boolean function. Hence, a PConf contains bits that are not only static binary (0’s and 1’s) but also multi-valued Boolean functions of the constants (infrequently changing parameters). The FPGA is then optimized for these constants. Then, for a set of specific parameter values, specialized configurations can be instantly derived by evaluating these Boolean functions.

Generally, the FPGA’s SRAM cells contain only binary bits (0’s or 1’s). Therefore, the Boolean functions have to be evaluated before reconfiguring the FPGA. Instead of reconfiguring the entire FPGA, it is enough to evaluate the Boolean function. For every change in the parameter input values of the parameterized application, the functions are evaluated, resulting in specialized bitstreams. In that way, multiple specialized configurations can be generated by evaluating the Boolean functions instead of compiling the bitstreams from scratch using the conventional FPGA tool flow. In more detail, the FPGA’s intra-connects are mapped onto (parameterized) physical Switch Blocks and Connection Blocks that are implemented using the PConf [42]. This results in lower compilation costs per configuration and reduces the amount of storage needed in the configuration database. When generating a specialized configuration, there is no need to undergo time-consuming steps of the FPGA flow or to solve computationally hard problems such as placement and routing, as is the case in the conventional tool flow. These problems are already solved when the parameterized configuration is generated. However, this technique cannot be applied in commercial FPGAs, because it is not possible to directly access specific reconfiguration infrastructure for reconfiguring the interconnects in commercial FPGAs.
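As an illustration of this evaluation step, the following Python sketch represents a parameterized configuration as a list whose entries are either static bits or Boolean functions of the parameters. The representation and the names (`specialize`, `pconf`, the parameters `p0` and `p1`) are assumptions chosen for illustration; they do not reflect the data format used by the PConf tool flow.

```python
from typing import Callable, Dict, List, Union

Params = Dict[str, bool]
ConfigEntry = Union[int, Callable[[Params], bool]]

def specialize(pconf: List[ConfigEntry], params: Params) -> List[int]:
    """Evaluate every Boolean function to obtain a plain 0/1 configuration."""
    return [int(e(params)) if callable(e) else e for e in pconf]

# Hypothetical parameterized configuration: two static bits and
# three bits that depend on the parameters p0 and p1.
pconf: List[ConfigEntry] = [
    1,
    0,
    lambda p: p["p0"],
    lambda p: p["p0"] and not p["p1"],
    lambda p: p["p0"] != p["p1"],      # XOR of the two parameters
]

config_a = specialize(pconf, {"p0": True, "p1": False})
config_b = specialize(pconf, {"p0": True, "p1": True})
```

Both specialized configurations are derived from the same parameterized description by evaluation alone; no placement or routing step is repeated, which is the point made in the text above.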

2.3 FPGA Overlays

An FPGA overlay is a virtual reconfigurable architecture that overlays on top of the physical FPGA configurable fabric and can carry out certain computations. Virtual FPGAs are built virtually or physically on top of the commercial FPGA fabrics. The virtual FPGA overlays have a different set of configurations and features than the commercial FPGAs. Therefore, having a virtual FPGA layer over an FPGA fabric improves the application portability and compatibility. The virtual architectures proposed in [7, 12, 16, 25] as well as the PConf are all examples of virtual FPGAs.

An FPGA Overlay may be designed as a virtual FPGA, a processor, a GPU, or as a (virtual) coarse-grained reconfigurable array (CGRA). Virtual CGRAs (VCGRAs) are FPGA overlays that can bridge the gap between FPGA implementations and high-level application descriptions, as with VCGRAs the time-consuming design cycle of the FPGA (synthesis, mapping, place and route) can be moved forward to pre-compilation times. Hence, the entire development cycle is reduced, as (re)compilation is avoided for the (re)construction of the VCGRA. VCGRAs have been proposed before, either as optimized architectures, as a solution for long compilation times, or as a facilitator for high-level synthesis [6, 37]. This is achieved mainly because VCGRAs allow the designer to write the code in a higher abstraction level language, without requiring knowledge of the underlying hardware.

VCGRAs consist of a large number of processing elements (PEs), laid out in a grid pattern, and Virtual Channels (VCs), which form a communication network connecting the PEs. Each PE is a coarse-grained element (realized mainly using LUTs) and is capable of processing incoming data and passing it on to the next PE, via an adjacent VC. The VCGRA is implemented using reconfigurable connections, with the assistance of the Parameterized Configurations (PConf) flow. PConf for VCGRAs is a methodology used for implementing an application whose input values (called parameters) change infrequently, on an FPGA. The VCGRA needs to be changed/adapted infrequently. Hence, instead of implementing the VCGRA’s inputs as regular inputs, with PConf these inputs are implemented as constants and the FPGA overlay is optimized for these constants [30].

Academic works have leveraged FPGA overlays to improve their debugging methods. In debugging with the use of FPGA overlays, the authors have used mainly incremental compilation to insert trace-buffers after the design is mapped and finalized on the FPGA [8, 9, 20]. These tools support only academic FPGAs and they use the remaining FPGA logic to construct virtual routing networks for signal tracing and to add debugging circuitry. However, the size of the overlay and the amount of its resources and routing needed can make its construction challenging and can affect the original design’s timing and critical path. Moreover, in the above-mentioned methods, the researchers create overlays to assist the debugging process, without the possibility to debug the FPGA overlay itself.

3 Contributions of this Work

As the designs become more complex, there is a growing need for software-like debugging. The designer should be able to debug hands-off, without a detailed knowledge of the underlying hardware. Moreover, the designer should have early and easy access to all signals with an automated tool. Last but not least, as the designs scale, the available FPGA resources can become scarce. In this work we elaborate on and significantly improve an earlier version of the in-circuit debugger [28].

Our method combines the advantages of enhanced internal signal observability and fast FPGA reconfiguration in one tool flow. It has four main parts: a parsing step to create the VCGRA, a design step to add the debugging infrastructure incrementally, a parameterization step that is used in the case of virtual FPGA architectures in the Design Under Test (DUT) and an online step that adapts the debugging methodology to a target design. This approach has some benefits over other debugging methods:

– Extended internal observability: Through integration with the Parameterized Configurations approach, the designer can trace a new set of signals fast, completing the debugging process within time constraints.

– Less area resources: By adding the debugging infrastructure on a higher abstraction level (overlay architecture) and dynamically adapting the added on-chip memories, the extra resources are reduced.

– An automated tool: The designer has just to select the DUT and to run the flow to create a VCGRA and perform hands-off signal tracing. Semi-automated interventions ensure that the DUT’s debugging hardware remains minimal and fully parameterized.

In order to increase the adaptability of our technique, we extend the functionality of our previous work, beyond the theoretical FPGAs, by adding methodologies that support commercial FPGAs. The main contribution in the debugging integration is the possibility of adding debugging hardware on a higher abstraction level. This is done when the application is transformed to a VCGRA. In this way, the signal tracing hardware is co-designed alongside the VCGRA. We generate a netlist that has integrated debugging functionality and can be adapted without affecting the VCGRA under test.

The added functionality enables the user to debug large designs mapped on FPGAs without significant area overhead. We use a realistic application to validate the feasibility of the debugging architecture. In the experiments section, we perform an exploration of the area and run-time overhead to validate the scalability of the method and of the signal integration algorithms, while increasing the design size and decreasing the size of the available FPGA resources. In addition to the added functionality, we explore the efficiency of the proposed tool for both academic and commercial FPGA architectures. Moreover, we propose an on-the-fly adaptation of the debugging infrastructure that achieves a balanced trade-off between area overhead, signal observability and compilation time.

The previous paper [28] showed the theory of how a second FPGA overlay can be applied on a specific overlay (VCGRA) architecture, in such a way that it will provide in-circuit debugging functionality without altering the original design. In our previous work we proposed parameterization of the internal signals on a VCGRA level, of a given design under test. Additionally, signal compression of mutually exclusive signals (during runtime) reduced the area overhead of the debugging circuitry, while increasing its flexibility. In this work, the entire concept of debugging with FPGA overlays is revisited. Thus, we extend the previous work with three additional contributions:

1. This paper goes beyond theoretical FPGAs and solves problems that existed while directly applying these methods in commercial FPGAs. The target in this work is state-of-the-art commercial FPGAs. Integration with the Vivado platform is also presented, which allows researchers to map their designs in state-of-the-art (Xilinx) FPGAs.

2. We leverage the unique architecture of the VCGRA and its layer-by-layer data path to selectively add debugging functionality in such a way that on-silicon debug can be performed with minimal monitoring, where the designer can observe the results and adapt the VCGRA. The parameterized configurations concept is replaced by dynamic reconfigurations and an AXI-wrapper that packages the VCGRA-SDA system. This is now introduced as a third-party IP. The creation of an IP will allow real applications of the proposed system. This method can be used by integrating the IP in a design.

3. A tool flow that applies the above-mentioned techniques to a given design automatically. A more complete tool that co-creates the VCGRA and the debugging circuitry is provided, where the same automations that can create a VCGRA can generate an SDA as well. Here, the concept of SDA is completely different from that of earlier work. Instead of creating a second layer that provides dynamic signal tracing in the theoretical architecture, the tool co-designs the VCGRA and the SDA in such a way that they are interconnected. The tool is not limited to the basic steps of the FPGA-PConf flow, but it is also deeply integrated in the Vivado tool flow and it can rapidly generate (offline) the files needed for the tool to create the IP and install both the VCGRA and the SDA.

4 In-Circuit Debugging for FPGA Overlays

This section presents our approach to enhance the observability of VCGRA-based designs for functional debugging. The Superimposed Debugging Architecture (SDA) approach is a custom-made in-circuit debugger that is used to rapidly trace functional errors in high-performance computing applications that can be implemented as VCGRAs. In order to use the SDA, we first need to construct a VCGRA from a dataflow graph representation (or a VHDL design) of an application. Hence, this section describes first how the VCGRA is constructed and then how the SDA can be integrated in a VCGRA, so that the VCGRA can be debugged.

4.1 VCGRA Overview

In order to build a VCGRA custom application, the first step is to design the PE and VC, according to the target application. Different functionalities form different PEs. Multiple PEs are connected with VCs, forming one level of VCGRA. Then, multiple levels of VCGRA form the VCGRA grid. The grid’s structure is described by the number of PEs in each level of the architecture and the elements’ input and output bandwidths. The VC configuration controls the data flow among the VCGRA’s levels, and the PE configuration controls the operation of a PE. The size of the configuration bitstreams is defined by the number of PE operations and the number of PEs within a VCGRA level and is determined during design time. The VCGRA is depicted in the middle layer of Fig. 1 and in the left part of Fig. 2.

All components of the VCGRA are described in VHDL and they contain parameters that later will be used by the PConf backend, in order to reduce the utilization of LUTs in the target FPGA. The PEs are created by parsing and converting a textual description of a synthesized application into a netlist of PEs. PEs are designed as state machines. An operation is performed on their two inputs, and the single output is a buffer that holds the result for one cycle. The inputs and outputs of the PE can be multi-bit and they are defined during the implementation stage, based on the target application. VCs have one multiplexer at each output in order to connect one specified input with its configured output. In order to select the inputs, a multiplexer with multiple select inputs is added (Fig. 1). Consequently, this architecture requires a lot of routing resources. For theoretical FPGAs, in order to reduce the routing resources, all select lines of the multiplexer are considered slowly changing constants, and subsequently parameterized. Thus, the amount of resources needed is reduced, due to the resource sharing of the PConf technique.

Fig. 2 The architecture that integrates the SDA in a VCGRA
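The behaviour of PEs and VCs described above can be sketched in a few lines of Python. This is a simplified functional model, not the VHDL implementation: the names `PE`, `virtual_channel` and `run_vcgra`, the three operations, and the rule that PE k consumes VC outputs 2k and 2k+1 are assumptions made for the example, and each level is evaluated as one step rather than being clocked cycle-accurately.

```python
import operator

class PE:
    """Processing element: two inputs, one operation, one registered output."""
    OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, op):
        self.op = self.OPS[op]      # PE configuration selects the operation
        self.out = 0                # output buffer holding the last result

    def clock(self, a, b):
        self.out = self.op(a, b)
        return self.out

def virtual_channel(inputs, selects):
    """One multiplexer per output: output i forwards inputs[selects[i]]."""
    return [inputs[s] for s in selects]

def run_vcgra(data, layers):
    """layers: list of (vc_selects, pe_ops); returns the PE outputs per level."""
    trace = []
    for vc_selects, pe_ops in layers:
        routed = virtual_channel(data, vc_selects)
        pes = [PE(op) for op in pe_ops]
        # Assumption: PE k consumes the VC outputs 2k and 2k+1.
        data = [pe.clock(routed[2 * k], routed[2 * k + 1])
                for k, pe in enumerate(pes)]
        trace.append(data)
    return trace

# Hypothetical two-level grid computing (x0 + x1) * (x2 - x3).
layers = [
    ([0, 1, 2, 3], ["add", "sub"]),
    ([0, 1], ["mul"]),
]
trace = run_vcgra([5, 3, 9, 4], layers)
```

In this model the `vc_selects` lists correspond to the VC configuration and the `pe_ops` lists to the PE configuration; changing either changes the implemented function without changing the grid structure.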

4.2 SDA Overview

The SDA is a network of multiplexers that connect to trace-buffers that store the inputs and outputs of a VCGRA layer. This circuitry is co-designed during the initial VCGRA implementation. The SDA technique aims to tackle the drawbacks of the current in-circuit debuggers in the following way:

– The use of on-chip resources (which can also impact the design’s timing performance and create additional debugging requirements) is reduced, by co-designing the SDA alongside the VCGRA and by using a small amount of multiplexers and trace-buffers, tailored to the requirements of the VCGRA application.

– The need to recompile and reprogram the design to observe different sets of signals (which can add hours or even days to the debug schedule) is eliminated with the use of the PConf approach for academic FPGAs.

– By creating a layer-by-layer debugging approach and reconfiguring to debug each VCGRA layer, the area overhead is controlled in commercial FPGAs.

– The observability of the internal signals is enhanced with the debugging meta-layer (SDA).

Fig. 1 Overview of the multiple-level architecture showing the FPGA and the two virtual levels (the VCGRA application and its adjacent superimposed debugger)

The size of the trace-buffers needed for debugging is equal to the number of PEs per layer. The tracing cycles can vary based on the demands of the application. Some of the debugging infrastructure’s elements, such as multiplexers, are incrementally added at the outputs of the PEs, in such a way that they use a minimal number of LUTs. This is achieved by tracing only specific signals of each layer (based on the application) and not all possible signals. Additionally, for academic FPGAs the debugging circuitry is implemented in the FPGA’s reconfiguration resources. This network of multiplexers will subsequently be connected to trace-buffers to perform functional debugging. The SDA is shown in the upper layer of Fig. 1 and in the right part of Fig. 2.

The SDA differs from state-of-the-art debuggers in two basic respects. First, the basic reconfigurable elements that are observed are layers of PEs and not solely signals. Hence it operates on a higher abstraction level, accelerating the debugging process. Second, the SDA is installed on a minimal specific part of the design (a layer) and rapidly checks for possible bugs, compared to the long recompilation cycles needed to check different sets of signals at a lower abstraction level in the whole design, as in conventional in-circuit debugging.

The SDA is structured in layers, similar to the VCGRA. Each layer has a number of multiplexers that is in accordance with the number of PEs per layer. Each multiplexer is connected with one PE. All layers are then interconnected with trace-buffers using parameterized connections. These are statically determined at design time. The SDA’s components are described in VHDL and contain annotations that can also be used by the PConf backend to control the reconfiguration, determining which VCGRA layer is being traced at a certain time.

The VCGRA-SDA system mitigates the use of on-chip resources in two different ways. For academic FPGAs, this is done by creating a PConf, which compresses signals that are not used at the same time. Multiple signals can be described in one (tunable) LUT, as these LUTs contain Boolean functions that upon evaluation can describe different signal sets. This is more thoroughly explained in previous work [27, 28].

For commercial FPGAs, we leveraged the VCGRA architecture to build an efficient SDA. We detect the architecture of the VCGRA and install signal tracing infrastructure on specific signals (outputs of the PEs of each layer). In that way, not all signals need to be observed, but only a subset that will show, after tracing, if an error has occurred. Due to the architecture of the VCGRA, the debugging can be done layer-by-layer, allowing massive savings in area resources as the installation of debugging cores is avoided. Additionally, the internal observability is increased. This is an added benefit, as the limited internal observability of third-party IPs is an obstacle during conventional debugging of a design.
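A minimal Python sketch of how layer-by-layer tracing can narrow down an error is given below. It assumes that a golden reference of the PE outputs is available (for example from simulation), which is an assumption of this example rather than a statement about the SDA flow; the function names and the modelling of PEs as Python functions are likewise hypothetical.

```python
def trace_layers(layers, inputs):
    """Run the design layer by layer, returning the traced PE outputs per layer."""
    trace, data = [], list(inputs)
    for layer in layers:
        data = [pe(data) for pe in layer]   # each PE reads the previous layer
        trace.append(data)
    return trace

def locate_first_mismatch(observed, expected):
    """Return (layer, pe) of the first differing PE output, or None."""
    for layer_idx, (obs, exp) in enumerate(zip(observed, expected)):
        for pe_idx, (o, e) in enumerate(zip(obs, exp)):
            if o != e:
                return layer_idx, pe_idx
    return None

golden = [
    [lambda d: d[0] + d[1], lambda d: d[2] - d[3]],
    [lambda d: d[0] * d[1]],
]
# Hypothetical faulty design: PE 1 of layer 0 adds instead of subtracting.
faulty = [
    [lambda d: d[0] + d[1], lambda d: d[2] + d[3]],
    [lambda d: d[0] * d[1]],
]

stimulus = [5, 3, 9, 4]
expected = trace_layers(golden, stimulus)
observed = trace_layers(faulty, stimulus)
location = locate_first_mismatch(observed, expected)
```

Because only the PE outputs of one layer are compared at a time, the trace storage needed per step is one value per PE in that layer, in line with the trace-buffer sizing stated earlier in this section.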

We extend the in-circuit debugger with a (minor) repair functionality. In that case, we do not aim to locate the source of the bug in order to correct the design error; instead, we detect the location of the bug and reload the bitstream, without correcting its function. The elements of one PE can be reconfigured to be repaired, instead of repairing the bug by redesigning. This applies when the fault is a soft error and not a design error. The technique can serve as a fail-safe mechanism that is enabled in safety-critical applications.

This functionality can be applied with microreconfiguration [29]. Microreconfiguration is a three-step technique that involves reading, modifying and writing back specific frames. First, using the frame address, a set of four consecutive frames containing the truth table entries of a PE or VC is read from the configuration memory. Then, the current truth table entries are replaced by the requested bits (saved in an FPGA memory). Finally, using the same frame address, the four modified frames are written back to the configuration memory, thus accomplishing the microreconfiguration and efficiently repairing the erroneous subgrid. Hence, as long as the bug is located in four (or fewer) consecutive frames, we reload these frames of the erroneous PE and create the self-healing overlay. However, this functionality is recommended for soft errors (which can appear in safety-critical applications in radiation environments), which are a subset of functional errors.
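The read, modify and write-back steps can be illustrated with a minimal software model. This is only a sketch: the configuration memory is modeled as a Python dictionary, and the names `microreconfigure`, `config_memory` and `golden_frames` are hypothetical and do not correspond to the interface of the controller in [29].

```python
FRAMES_PER_ELEMENT = 4  # the text states that a PE/VC truth table spans four consecutive frames


def microreconfigure(config_memory, frame_address, golden_frames):
    """Read-modify-write of four consecutive frames starting at frame_address.

    config_memory : dict mapping frame address -> bytes (hypothetical model)
    golden_frames : list of four bytes objects holding the requested bits
    Returns the frames that were read before the repair.
    """
    if len(golden_frames) != FRAMES_PER_ELEMENT:
        raise ValueError("expected exactly four frames")
    addresses = range(frame_address, frame_address + FRAMES_PER_ELEMENT)
    # Step 1: read the current frames from the configuration memory.
    current = [config_memory[a] for a in addresses]
    # Step 2: replace the truth-table entries with the requested bits.
    # Simplification: the whole frame is replaced; on a real device only
    # the truth-table bits inside each frame would change.
    modified = [bytes(g) for g in golden_frames]
    # Step 3: write the modified frames back using the same frame address.
    for a, frame in zip(addresses, modified):
        config_memory[a] = frame
    return current


# Usage: frame 5 holds a single upset bit and is restored together with
# its three neighbouring frames.
memory = {addr: bytes([0xAA] * 4) for addr in range(8)}
memory[5] = bytes([0xAA, 0xAB, 0xAA, 0xAA])
golden = [bytes([0xAA] * 4)] * 4
old = microreconfigure(memory, 4, golden)
```

The returned frames can be kept for offline analysis of the upset, while the memory model now holds the requested contents again.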

4.3 VCGRA-SDA Generator

In order to create the VCGRA-SDA architecture (or similar architectures) from a target application, it is necessary to use a toolset that allows the creation of such architectures, their configurations and debugging functionalities, their hardware implementations and integration with the vendor tools, with minimal user intervention. Therefore, we have extended the CGRA-Generator [12] and the PConf tool [30] into a novel tool flow that can integrate the SDA in the target application automatically. Figure 3 is a high-level representation of the tool flow that can analyze and instrument systems with VCGRA and SDA. Here, a complete flow that co-creates the VCGRA and the SDA is provided. It is not limited to the basic steps of the FPGA-PConf flow, but it rapidly generates the files needed to create the IP and install both the VCGRA and the SDA in the application, providing on-silicon debugging functionality.

Fig. 3 The tool flow that enables the SDA-integrated VCGRA implementation

4.3.1 Application Mapping Toolset

In order to map an application onto a target (pre-defined) VCGRA architecture, an algorithm (represented as a graph or as a VHDL design), the properties of the overlay and the library of operations are given as inputs. In more detail, the library of operations is an HDL representation of all possible operations needed to create the functionality of the PEs and the debugging infrastructure. The properties of the VCGRA are the shape and the number of inputs of the VCGRA. From this file, the properties of the SDA are automatically extracted (the shape and the number of inputs and outputs). The input graph is a dataflow graph and can be automatically obtained using a small tool, which has been developed with the help of third-party libraries, such as the graph-tool [40], in the case that the application is described in a language such as C/C++. However, it is also possible to provide a VHDL design as an input. The output is a VCGRA architecture that can be mapped either with the PConf tool (for less resource utilization) or with the vendor tools (for better performance).
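To make the relation between the input dataflow graph and the shape of the overlay concrete, the sketch below assigns every operation to a layer according to its depth in the graph and counts the PEs per layer. The levelization strategy and the names `assign_layers` and `vcgra_shape` are assumptions made for illustration; the text does not specify the mapping algorithm used by the toolset.

```python
def assign_layers(dfg):
    """Map each node of a dataflow graph to a layer index.

    dfg : dict mapping node name -> list of predecessor node names.
          Nodes without predecessors are primary inputs (layer 0).
    """
    layers = {}

    def depth(node):
        if node not in layers:
            preds = dfg[node]
            layers[node] = 0 if not preds else 1 + max(depth(p) for p in preds)
        return layers[node]

    for node in dfg:
        depth(node)
    return layers


def vcgra_shape(dfg):
    """Number of PEs per layer; layer 0 only holds inputs and is skipped."""
    layers = assign_layers(dfg)
    n_layers = max(layers.values(), default=0)
    shape = [0] * n_layers
    for level in layers.values():
        if level > 0:
            shape[level - 1] += 1
    return shape


# Usage: two multiplications (pixel x coefficient) feeding one addition.
example = {
    "p0": [], "c0": [], "p1": [], "c1": [],
    "m0": ["p0", "c0"],
    "m1": ["p1", "c1"],
    "s": ["m0", "m1"],
}
shape = vcgra_shape(example)
```

For this small graph the first layer needs two PEs and the second layer one PE, which is the kind of shape information the text lists among the properties of the VCGRA.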

4.3.2 VCGRA-SDA Toolset

While processing the HDL templates of the architecture and its components, the tool generates VHDL and other associated files that facilitate the mapping of a target application to a target architecture, the creation of the SDA and the configuration vectors. The SDA Generator creates the modules and the architectural components for the SDA integration in the VCGRA. The proposed tool requires the description of the applications and the description of the VCGRA as inputs.


First, the tool analyzes the inputs required by the applications and whether they can fit in the VCGRA. If they fit, the tool adds multiplexers in the appropriate positions to interconnect PE outputs and trace-buffers. The output of the SDA tool consists of the description of the connections between PEs and trace-buffers (from PE outputs to trace-buffer inputs), the necessary multiplexing and the VCGRA configurations of the target applications. This is depicted in Fig. 3. Additional inputs required in order to construct the SDA are the shape of the target overlay (VCGRA), the number of the VCGRA’s inputs and the levels of PEs/VCs. Moreover, a library of operations is also needed that describes the functionality of the PEs and the debugging elements. The output is a configuration that can later be translated into bitstreams with the use of the CGRA-Configuration Generator [12].

At the end of the SDA-generation step, the SDA tool has created a meta-layer with the debugging infrastructure on-the-fly, which is subsequently added to the VCGRA incrementally, before synthesis. In order to create the debugging infrastructure, we install as many trace-buffers as there are PEs in one layer. This is adequate because only one layer of the VCGRA is traced at any given moment, and changing between VCGRA layers is very efficient due to the rapid reconfigurations. Instead of pre-reserving an area for trace-buffers (as in state-of-the-art ELAs) or using all possible leftover resources after place and route (as in state-of-the-art academic tools), we use the minimum possible amount of trace-buffers and additional logic needed to debug just one VCGRA layer. Therefore, this architecture can reuse the same trace-buffers among different layers. In fact, the reconfiguration is used to construct the routing between the SDA, the trace-buffers and the target VCGRA layers. Therefore, the number of trace-buffers scales based only on the size of one layer and not of the entire VCGRA. Similarly, the debugging requirements scale linearly for larger applications, as they are based on the number of PEs per layer and not on the application. In that way, the architecture can scale well.
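The scaling argument can be expressed in a few lines. The sketch below compares the trace-buffer count of layer-by-layer tracing with the count that would be needed if every PE output were traced at the same time. The function names and the example shape are hypothetical, and the widest layer is taken as the reference layer.

```python
def trace_buffers_needed(shape):
    """Trace-buffers for layer-by-layer tracing: one per PE of the widest layer."""
    return max(shape)


def trace_buffers_if_all_traced(shape):
    """Trace-buffers needed if all PE outputs were traced simultaneously."""
    return sum(shape)


# Hypothetical VCGRA with four layers of 8, 4, 2 and 1 PEs.
shape = [8, 4, 2, 1]
shared = trace_buffers_needed(shape)
unshared = trace_buffers_if_all_traced(shape)
```

Adding further layers to this example changes `unshared` but leaves `shared` untouched as long as no layer is wider than the first one.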

4.3.3 PConf Adaptations and Limitations

In the case of theoretical architectures, we use the PConf toolflow instead of the vendor tools. This option can be used when the FPGA resources are scarce and more flexibility (more integration and faster reconfiguration) is needed for the SDA. Here, the user determines the VCGRA’s settings that will reconfigure the VCs and PEs to realize a target application, as described above. Then, during synthesis and mapping, a textual description of the target application is converted into PEs. Then, the PEs are mapped onto virtual PEs of the VCGRA. Next, with a custom router, optimal connections are created between the PEs and VCs. The multiplexers connecting the signals of the PEs and VCs to the trace-buffers (whose values change infrequently compared to the rest) are mapped onto virtual LUTs and virtual connections with the PConf [42]. In that way, the utilization of the LUTs is reduced, as the debugging infrastructure is located in the reconfiguration resources (which are usually not available to the user).

As soon as the architecture of the VCGRA is constructed, textual settings are also extracted from the VCGRA in order to construct the SDA. The settings needed are the data bandwidth and the basic PE operations. Then, the multiplexers are inserted into the VCGRA’s reconfiguration resources. At this point the granularity and the functionality of the VCGRA are defined, alongside all the possible ways the PEs and VCs can be interconnected with the debugging network.

The PConf technique for parameterized LUTs is a concept introduced in [4], and has been adapted for a series of Xilinx FPGAs, such as Virtex 2, V, 7 and Kintex. However, the more recent version of the PConf with parameterized interconnects is only applied in theoretical FPGAs. In terms of debugging, PConf is very efficient for theoretical architectures because the reconfiguration resources are manipulated. This cannot be done in a commercial FPGA [27]. Consequently, PConf can be used for commercial FPGAs only with some restrictions. Additionally, there are limitations due to memory requirements. The academic FPGA can support a design of a few hundred LUTs. Therefore, it is able to synthesize one PE (of approximately 200 LUTs) but not an entire VCGRA (a few thousand LUTs). This limitation has been resolved with the contributions of this paper. Hence, due to these two limitations (memory shortage and inefficient handling of debugging resources), we have implemented the commercial version of the approach without the PConf, leveraging instead the architectural benefits of the VCGRA, the Vivado toolflow and dynamic reconfiguration rather than parameterized reconfiguration.

4.3.4 Online SDA Generator

In order to be able to reconfigure a VCGRA architecture in vendor tools, certain steps need to be followed. After the generation of the VCGRA and the SDA, an AXI-Lite template is adapted specifically for the VCGRA and the SDA. Then, the entire application, which is now mapped on the VCGRA and has the adjacent SDA, is added as a user IP in vendor tools. Hence, all associated IP files, alongside the hardware description of the application, are packaged in the form of a user IP and an AXI wrapper is created. This wrapper is needed because the VCGRA itself has been implemented in pure VHDL and is not limited to a specific FPGA architecture. It is used to enable communication between the VCGRA and the host system and provides access to the inputs needed to configure the VCGRA and to the data inputs and outputs. The interface templates provided can be used to create interfaces which are usable on Xilinx Zynq SoCs; further interfaces for FPGAs from other vendors are possible. A small software library that we provide eases access to the AXI interfaces used for transferring configuration and data. Additionally, some binary signals are also provided by the interface and are used to synchronize data transfers and to report the hardware state. The online part of the flow is shown in Fig. 4.

Multiple tracing cycles may be needed in order to successfully trace a bug on the VCGRA. During this loop, the SDA is continuously adapted to trace different VCGRA layers. The processor examines each new layer (set of signals). In case a bug is detected, the process can be interrupted in order to microreconfigure the VCGRA (if the repair functionality is enabled, in the case of soft errors). In that way, the debugging has integrated self-correction capabilities that can be used for safety-critical applications.
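The tracing loop described above can be sketched in software as follows. This is an illustrative model only: `trace_layer` stands for re-targeting the SDA to a layer and reading the trace-buffers, `repair` stands for a microreconfiguration, and the comparison against reference values is an assumption about how a bug is recognized. None of these names come from the accompanying tools.

```python
def debug_vcgra(layers, trace_layer, expected, repair=None):
    """Trace one layer at a time and report (and optionally repair) mismatches.

    layers      : iterable of layer indices to visit
    trace_layer : callable(layer) -> list of traced PE output values
    expected    : dict layer -> list of reference values
    repair      : optional callable(layer, pe_index), e.g. a microreconfiguration
    Returns a list of (layer, pe_index) pairs where a mismatch was seen.
    """
    faults = []
    for layer in layers:
        traced = trace_layer(layer)  # the SDA is adapted to trace this layer
        for pe, (got, want) in enumerate(zip(traced, expected[layer])):
            if got != want:
                faults.append((layer, pe))
                if repair is not None:
                    repair(layer, pe)  # tracing is interrupted to repair the PE
    return faults


# Usage with a stand-in for the hardware state.
golden = {0: [3, 5], 1: [8]}
state = {0: [3, 5], 1: [9]}  # layer 1, PE 0 holds a corrupted value


def fix(layer, pe):
    state[layer][pe] = golden[layer][pe]


faults = debug_vcgra([0, 1], lambda layer: list(state[layer]), golden, repair=fix)
```

When `repair` is left as `None`, the same loop only collects the fault locations, which corresponds to debugging without the self-healing option.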

Fig. 4 IP design flow for the SDA-integrated VCGRA

5 Experimental Study

For the experiments we have used as DUT an image processing application that implements a Sobel Edge Detection filter [38]. Edge detection is applied in image processing applications with the use of convolution filters, such as the Sobel filter. A pixel is calculated by the following equation:

$$G_x(x, y) = \sum_{i=1}^{3} \sum_{j=1}^{3} I(x + i - a,\; y + j - a) \times S_x(n + 1 - i,\; n + 1 - j) \tag{1}$$

where the filter’s set-point is in the middle of the filter-mask, and:

– Gx(x, y) is a pixel in the result image of a convolution in vertical or horizontal direction.

– I is the input image. The corresponding pixels underneath the mask are addressed relatively, depending on the filter set-point.

– Sx is a filter in either vertical or horizontal direction.
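Equation (1) can be evaluated directly for a single pixel. The sketch below is an illustration under stated assumptions: the constants are taken as a = 2 and n = 3 (a 3 × 3 mask with its set-point in the middle), the mask `SOBEL_X` is the standard horizontal-gradient Sobel mask, and the first image index is x. None of these values are stated explicitly in the text.

```python
# Assumed mask: the standard 3x3 horizontal-gradient Sobel kernel.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def sobel_pixel(image, x, y, kernel=SOBEL_X, a=2, n=3):
    """Evaluate Eq. (1) for one interior pixel.

    image[x][y] is indexed as in the equation. Kernel indices in the
    equation are 1-based, hence the '- 1' when indexing the Python list.
    """
    total = 0
    for i in range(1, 4):
        for j in range(1, 4):
            pixel = image[x + i - a][y + j - a]
            coeff = kernel[(n + 1 - i) - 1][(n + 1 - j) - 1]
            total += pixel * coeff  # one multiplication feeding the additions
    return total


# Usage: a step edge along the second index and a flat image.
edge = [[0, 0, 10],
        [0, 0, 10],
        [0, 0, 10]]
flat = [[7, 7, 7],
        [7, 7, 7],
        [7, 7, 7]]
g_edge = sobel_pixel(edge, 1, 1)
g_flat = sobel_pixel(flat, 1, 1)
```

Each product in the inner loop corresponds to a multiplication and each accumulation to an addition, which are the two operation types that appear in the dataflow graph of the filter.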

For the experiments, the application’s architecture is first converted into a VCGRA and then we add the debugging infrastructure incrementally. We show how it can be implemented with a smaller area using the proposed techniques.

Starting from a graph representation of a Sobel edge algorithm, we can create a VCGRA grid. The implementation depends on the requirements of the design, such as the graph’s depth, the PE operations, the data paths’ bitwidth and whether there is a need for (parameterized) reconfiguration. If the target FPGA is large enough, there might not be a need for reconfiguration. However, when the target FPGA is not large enough, PConf can be used.

The DUT has two different operations (add, mul) between two neighboring pixels and their corresponding filter coefficients. Hence, the operations are modeled as PEs with 2 inputs: a pixel and a filter coefficient. The edges of the graph can be modeled as VCs. Therefore, we can construct a VCGRA. The architecture has been implemented with the proposed tool flow. Hence, the PEs describe the different mathematical operators to be used. The tool can use the different operators to create the VCGRA grid. In this case the SDA needs only a trace-buffer and some additional logic. The tracing infrastructure can debug one layer. Hence, the added circuitry is equivalent to the maximum number of PEs per layer.

Here, the minimum possible SDA consists of multiplexers that connect the output signals of each layer to trace-buffers. With a conventional ELA we would have needed a fixed amount of pre-installed trace-buffers that trace a large, fixed number of signals, before the nature of the VCGRA application is known. Here, after a trigger, the SDA is reconfigured and the multiplexer network realises point-to-point connections between the trace-buffers and the PEs performing the operations.

5.1 Offline Comparison of Area Utilization

For the comparison of the area utilization, we synthesized the DUT with Vivado 2016.1 and the PConf tool flow. In order to compare the two tools we used the PEs, VCs and the whole VCGRA target application of the DUT. The FPGA architectures used were a Virtex-7 (in Vivado) and a theoretical architecture that is supported by the PConf tool flow. All architectures contain 6-input LUTs for fair comparison.

5.1.1 Commercial FPGA Architectures

In order to debug with the SDA on commercial FPGA architectures (Xilinx Virtex-7), instead of tracing single PEs per reconfiguration (as in academic FPGAs in previous work), we trace a VCGRA layer per reconfiguration. This design choice has been made in order to reduce the number of reconfigurations and to increase the performance, as the technique works better when the VCGRA is infrequently reconfigured. The results are depicted in Table 1.

Different variations of the application have been designed to evaluate the area of the proposed technique:

– A conventional implementation (without a VCGRA) of the target application, for fair comparison (static).

– A VCGRA wrapper with an AXI-Lite module without any debugging functionality.

– A VCGRA-SDA wrapper with an AXI-Lite module that has integrated debugging functionality (proposed approach).

– A VCGRA wrapper with an AXI-Lite module and an ILA connected that can trace the VCGRA’s internal signals (conventional approach).

Table 1 Comparison of the VCGRA designs implemented on a Virtex-7 with Vivado 2016.1

| Design | Area (LUTs) | Area (FF) | Power (W) |
|---|---|---|---|
| Static | 1230 | 1420 | 15.775 |
| VCGRA | 768 | 1542 | 3.604 |
| VCGRA-SDA | 1006 | 1542 | 1.324 |
| VCGRA-ILA | 5303 | 8093 | 3.844 |

The proposed (VCGRA-SDA) technique is smaller than the static implementation (with no debugging infrastructure)

We observe that, with Vivado 2016.1, the VCGRA implementation on a Virtex-7 is 39% smaller than the static implementation. The proposed architecture with the integrated debugging functionality (VCGRA-SDA) is 5.2× smaller than the VCGRA design integrated with Vivado’s ILA and 18% smaller than the static application (without any debugging functionality). In the VCGRA, there is an area penalty of 25% to add the debugging functionality. However, this is considered small, since the area penalty to add debugging functionality with the ILA core is up to 860% for a small design (PE). These results are listed in Table 1 and represented schematically in Fig. 5. Here, we can observe that even though the area penalty of the proposed technique (VCGRA-SDA) is 23%, it remains 22% smaller than the original application. Therefore, by transforming an application into a VCGRA, the debugging can be included in the application without any additional overhead relative to the original implementation. As a result, the on-silicon debugging circuitry can be integrated in an application, even post-development, and can be re-activated to find bugs that escape into deployment, which is the case in most designs [11].

5.1.2 Power

The power results are depicted in Fig. 5 and in Table 1. The power consumption of using a VCGRA instead of the conventional implementation of the Sobel filter is decreased by 4.3×. There is no power increase for the VCGRA-SDA architecture. In fact, it needs 2.9× less power compared to integrating a logic analyzer (ILA core) with Vivado.
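The area and power ratios can be recomputed from the entries of Table 1. The short sketch below does so; the helper names `ratio` and `reduction_percent` are introduced here for illustration only, and small rounding differences with respect to the figures quoted in the text are possible.

```python
# Values copied from Table 1 (Virtex-7, Vivado 2016.1).
TABLE_1 = {
    "Static":    {"luts": 1230, "ffs": 1420, "power_w": 15.775},
    "VCGRA":     {"luts": 768,  "ffs": 1542, "power_w": 3.604},
    "VCGRA-SDA": {"luts": 1006, "ffs": 1542, "power_w": 1.324},
    "VCGRA-ILA": {"luts": 5303, "ffs": 8093, "power_w": 3.844},
}


def ratio(metric, numerator, denominator):
    """How many times larger `numerator` is than `denominator` for a metric."""
    return TABLE_1[numerator][metric] / TABLE_1[denominator][metric]


def reduction_percent(metric, design, baseline):
    """Relative saving of `design` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - ratio(metric, design, baseline))


ila_vs_sda_luts = ratio("luts", "VCGRA-ILA", "VCGRA-SDA")
sda_vs_static_luts = reduction_percent("luts", "VCGRA-SDA", "Static")
static_vs_vcgra_power = ratio("power_w", "Static", "VCGRA")
ila_vs_sda_power = ratio("power_w", "VCGRA-ILA", "VCGRA-SDA")
```

The same two helpers can be applied to the flip-flop column to see where the ILA-based variant differs most from the other designs.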

5.1.3 Theoretical FPGA Architectures

For this work, we assume that the application requirements exceed the available FPGA’s resources, hence the (parameterized) configurations approach is selected over a fully application-specific design. The minimum SDA architecture (the one attached to a VC) needs 16 parameterized connections and no LUTs. If it is implemented with the conventional tools, it needs 17 LUTs [28]. The proposed architecture has 0% impact on physical LUTs for PEs, VCs and the grid. It introduces 18% routing overhead in a VC, 66% in fixed-point PEs, and 5.6% in floating-point PEs. The total routing impact in a VCGRA grid is 10.2%. The total wirelength increases by 4% for a VC, by 1% for a fixed-point PE, by 0.6% for a floating-point PE and by 0.08% for the total grid. Here, we can observe a LUT reduction after the VCGRA parameterisation. There is no increase in the number of physical FPGA resources (LUTs) after the installation of the SDA. There is an increase in the virtual FPGA resources (TLUTs and TCONs), with no impact on the design’s performance and a minimal increase in the wirelength. These results are further analyzed in [28]. The results indicated as PConf are compared with the results obtained from ABC, which is able to synthesize the academic FPGA used for the PConf. The proposed technique shows a decrease of 25% in FPGA resources compared to the implementation with the conventional flow. This is achieved by using the reconfiguration interface for the debugging infrastructure, which is possible with the PConf but not with ABC.

Fig. 5 Schematic representation of the results after the implementation of the target application

In general, the implementation of a PE uses fewer LUTs in commercial tools compared to the PConf tool. This is probably because the commercial tools contain better optimizations for numerical operations and are tuned explicitly for their specific target architectures, whereas academic tools target generic theoretical FPGA architectures. However, designs with more multiplexers are better optimized using the PConf tool. Tunable connections in particular are heavily used for the implementation of the VCs and the debugging architecture, because multiplexers are well suited for optimizations of shared resources. The results are shown in Table 2.

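Table 2 Logic utilization for VCGRA-SDA designs implemented with different tools

| Design | Component | Commercial (LUT) | Commercial (FF) | Academic (ABC) | Academic (PConf) |
|---|---|---|---|---|---|
| VCGRA | PE | 61 | 27 | 85 | 76 |
| VCGRA | VC | 10 | 34 | 34 | 16 |
| VCGRA | VCGRA | 768 | 1542 | – | – |
| VCGRA-SDA | PE | 62 | 35 | 102 | 75 |
| VCGRA-SDA | VC | 10 | 50 | 66 | 16 |
| VCGRA-SDA | VCGRA | 1006 | 1542 | – | – |
| VCGRA-ILA | PE | 3318 | 4468 | n/a | n/a |
| VCGRA-ILA | VC | 3267 | 4475 | n/a | n/a |
| VCGRA-ILA | VCGRA | 5303 | 8093 | n/a | n/a |

The area results are shown in terms of Look-Up Tables (LUTs) and flip-flops (FF). For academic FPGAs the area is compared in LUTs with the ABC tool [33]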

The two tools are not directly comparable, as they are implemented on different architectures. At this point we have to mention that, due to some limitations of the PConf tool, it is currently impossible to implement large designs. Therefore, we are unable to show results for the complete VCGRA using this tool.

5.2 Online Execution Times

5.2.1 Commercial FPGA Architectures

We implemented the design on a Zedboard with Xilinx Vivado 2016.1, as this technique is based on dynamically reconfigurable SRAM-based FPGAs. For this work we targeted Xilinx FPGAs. However, other commercial tools that support SRAM-based FPGAs can be used, such as Quartus II. With some engineering effort it will be possible to support this technique for any SRAM-based FPGAs, such as Intel’s, but not with tools such as Libero, which support flash-based FPGAs. In that case, we would need to revisit the entire concept of debugging, VCGRAs and SDA.

The four different design variations (static, VCGRA, VCGRA-SDA and VCGRA-ILA) of the application are used for the comparison of execution times. In more detail, an AXI module is added to read and write GPIOs of the PS for the synchronization between the PS, the VCGRA and the SDA. Additionally, a VCGRA wrapper and a VCGRA-SDA wrapper were created with four AXI-Lite interfaces to send and receive data and to parameterize the implemented VCGRA. Using this interface, the configuration data can be transferred from the PS to the programmable logic, where it is saved locally within configuration registers. Currently, the obtainable performance is limited by the AXI-Lite interface, because AXI-Lite does not support bursts or data streaming. Thus, for each data package of four bytes, the overhead of a full AXI communication sequence is needed.
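The consequence of the missing burst support can be quantified with a small model: every four-byte package costs one complete AXI-Lite sequence, so the number of sequences grows linearly with the payload. In the sketch below, `seconds_per_transaction` is a hypothetical parameter that would have to be measured on the target platform; no such value is given in the text.

```python
import math

AXI_LITE_WORD_BYTES = 4  # the text states one full AXI sequence per four-byte package


def axi_lite_transactions(payload_bytes):
    """Number of AXI-Lite sequences needed to move a payload without bursts."""
    return math.ceil(payload_bytes / AXI_LITE_WORD_BYTES)


def transfer_time_s(payload_bytes, seconds_per_transaction):
    """Total transfer time for a given (measured) cost per AXI-Lite sequence."""
    return axi_lite_transactions(payload_bytes) * seconds_per_transaction


# Usage: a 10-byte configuration vector needs three sequences.
n_transactions = axi_lite_transactions(10)
```

An interface with burst support would amortize the per-sequence overhead over several words, which is why the text identifies AXI-Lite as the current performance limit.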

5.2.2 Theoretical FPGA Architectures

In order for a new PE to be instantiated for debug, the tool needs 280 ms to place the SDA architecture and 27 ms to route it. Hence, the SDA can be in place to start tracing the erroneous behaviour of a new virtual channel in 307 ms. Then, after we confirm whether there is a bug, we need 11.28 ms to microreconfigure one frame. Therefore, we need 318.28 ms to install a new SDA element in a new VCGRA location and self-heal the erroneous element, in case the self-healing functionality is enabled. The timing information was measured with the PConf tool.

In theoretical FPGAs the debugging infrastructure is on a separate layer. In that way it has no effect on the interconnects or on any other element that can influence the timing of the design itself. In commercial FPGAs, the SDA is connected with the VCGRA. They are co-implemented and co-exist. That means that a specific VCGRA is paired with an SDA and that the pair may have slower (or faster) timing compared to another VCGRA-SDA pair. This will not affect the system in any other way, since the entire VCGRA-SDA will be faster or slower as a whole. Therefore, the on-silicon debugging infrastructure will not impact the timing of the original implementation.

5.2.3 PConf Approximation

With a PConf adaptation of a VCGRA, we do not need to reconfigure the entire FPGA in order to create a new configuration. We only need to perform an evaluation of a parameterized configuration. Then, the design needs to perform only microreconfigurations in the FPGA and not a full reconfiguration. One microreconfiguration of 1 TLUT takes 64.1 μs. Therefore, in order to reconfigure the entire VCGRA we need approximately 82 ms. This is calculated based on an approximation of the total number of TLUTs in a VCGRA-SDA design. This metric is closer to the reconfiguration time of the theoretical FPGA architectures and is 3 orders of magnitude faster than the reconfiguration time without parameterized configurations and with the conventional AXI-HWICAP reconfiguration controller. Since we have some limitations with the PConf tool for large designs, the complete VCGRA results with the PConf adaptations are based on approximation.
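The approximation can be reproduced with simple arithmetic. The sketch below assumes that the TLUTs are microreconfigured one after another with no additional per-TLUT overhead, which is a simplification introduced here; the two constants are the values quoted in the text.

```python
TLUT_RECONFIG_S = 64.1e-6  # one microreconfiguration of one TLUT (from the text)
VCGRA_RECONFIG_S = 82e-3   # approximate reconfiguration of the whole VCGRA (from the text)


def implied_tlut_count(total_s=VCGRA_RECONFIG_S, per_tlut_s=TLUT_RECONFIG_S):
    """Number of TLUTs implied by the two quoted times (sequential model)."""
    return round(total_s / per_tlut_s)


def reconfig_time_s(n_tluts, per_tlut_s=TLUT_RECONFIG_S):
    """Reconfiguration time for n_tluts sequential microreconfigurations."""
    return n_tluts * per_tlut_s


n_tluts = implied_tlut_count()
estimated_s = reconfig_time_s(n_tluts)
```

Under this sequential model the two quoted times correspond to a design with roughly 1300 TLUTs, and the estimate scales linearly with the TLUT count of a different VCGRA-SDA design.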

5.2.4 Different Reconfiguration Controllers

The speed of configuration is directly related to the size of the BIT file and the bandwidth of the configuration port. Here, we compare the reconfiguration times of the target application for different reconfiguration controllers. Since the bitstreams that are generated from the application have similar sizes, the impact on the reconfiguration time is negligible. This is depicted in Fig. 5. Therefore, if we integrate the SDA in the VCGRA there is no impact on the configuration time for a Virtex-7 FPGA. However, there is a significant acceleration if we use the MiCap-Pro controller [29]. This is shown in Table 3.
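Both observations can be checked against the entries of Table 3. In the sketch below the dictionary holds the values of Table 3, while the helper names `speedup` and `sda_overhead_s` are introduced here for illustration.

```python
# Configuration times in seconds, copied from Table 3.
CONFIG_TIME_S = {
    "Static":    {"AXI-HWICAP": 8.49,  "MiCap": 7.34, "MiCap-Pro": 0.59,
                  "JTAG": 2.4,   "SelectMAP": 8.49},
    "VCGRA":     {"AXI-HWICAP": 8.493, "MiCap": 7.4,  "MiCap-Pro": 0.596,
                  "JTAG": 2.457, "SelectMAP": 8.493},
    "VCGRA-SDA": {"AXI-HWICAP": 8.493, "MiCap": 7.33, "MiCap-Pro": 0.593,
                  "JTAG": 2.445, "SelectMAP": 8.493},
}


def speedup(design, baseline, controller):
    """Speed-up of `controller` over `baseline` for one design."""
    times = CONFIG_TIME_S[design]
    return times[baseline] / times[controller]


def sda_overhead_s(controller):
    """Extra configuration time caused by adding the SDA to the VCGRA."""
    return (CONFIG_TIME_S["VCGRA-SDA"][controller]
            - CONFIG_TIME_S["VCGRA"][controller])


micap_pro_gain = speedup("VCGRA-SDA", "AXI-HWICAP", "MiCap-Pro")
hwicap_overhead = sda_overhead_s("AXI-HWICAP")
```

The overhead helper returns values at or below the millisecond range for every controller in the table, which matches the statement that integrating the SDA does not affect the configuration time.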

5.2.5 Internal Signal Analysis

Comparing the signals selected by Vivado’s ILA and by the VCGRA-SDA leads to the following conclusions. With Vivado’s ILA, 60% of the internal signals were able to fit in the ILA cores. However, only 10% could be fully integrated in the design. Moreover, with the ILA, the designer needs to manually select all the signals of the synthesized design. The tool then incurs an additional compilation overhead (in the order of minutes) to install the extra infrastructure, and the overhead on the FPGA in terms of LUTs increases dramatically (the design becomes 7.7× larger). With the VCGRA-SDA method, in contrast, the integration of the debugging functionality is done without any compilation overhead. The SDA is automatically integrated in the VCGRA before synthesis. Additionally, the designer does not pre-select signals. All internal signals are selected automatically. Then the signals can be transferred for analysis. Therefore, we are able to trace more signals with our proposed technique (up to 100%), in comparison with the ILA, which can trace up to 10% of the target application’s signals.

Debugging with the SDA is on-silicon debug. Therefore, it can also be introduced during the implementation stage to assist in debugging the data path since, according to [11], bugs always escape after debugging, post-development, even for safety-critical applications. The structure of the co-designed VCGRA-SDA is developed in such a way that it can address bugs in both the control path and the data path. This depends on which signals (control or data) are selected to be traced. In this work, the signals that were selected to be traced were part of the data path. However, for a different VCGRA study, control path signals could be of (debugging) interest. In that case, the VCGRA-SDA design will adapt accordingly, as it is relatively easy to add tracing infrastructure to the VCGRA with the accompanying tools. The acquisition can start either manually or via triggering. This work involved on-silicon debugging that is realised while the VCGRA is being developed. In that way it can collect raw debug data for offline analysis, in order to finalise the design. In that case, no specific events are needed to start the acquisition, as raw data can be used.

Table 3 Configuration time (in s) with different reconfiguration controllers

| Design | AXI-HWICAP | MiCap | MiCap-Pro | JTAG | SelectMAP |
|---|---|---|---|---|---|
| Static | 8.49 | 7.34 | 0.59 | 2.4 | 8.49 |
| VCGRA | 8.493 | 7.4 | 0.596 | 2.457 | 8.493 |
| VCGRA-SDA | 8.493 | 7.33 | 0.593 | 2.445 | 8.493 |


6 Conclusion

In this paper, we described a low-power, low-overhead superimposed debugging architecture for heterogeneous computing platforms. This technique is used to integrate debugging infrastructure to efficiently debug FPGAs. In order to create this novel architecture, a supporting flow and a new two-level virtual architecture have been constructed. This technique transforms a design into a VCGRA and integrates in-circuit debugging functionality in an FPGA overlay. The tailor-made tools introduce resource sharing on the two-level virtual architecture, reducing the use of the actual FPGA’s resources. Hence, we are able to integrate a debugging mechanism in FPGA architectures without any area, power or reconfiguration overhead.

7 Future Work

In future work, the proposed technique can be expanded to detect or predict timing errors. For commercial FPGAs the technique can be extended for on-silicon, real-time monitoring of data path signals. This can be done either by monitoring the transitions arriving after the clock edge to detect a timing error, or by monitoring the outputs (for a specified time period) before the clock edge, to predict a potential timing error. Additionally, we will investigate how these techniques can be applied to a wider variety of FPGAs (Intel, flash-based) and vendor tools (Quartus, Libero).

Acknowledgments This work has been supported by the European Commission in the context of the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2.) under grant agreement number 671653.

References

1. Amazon (2018) Amazon EC2 F1 instances: enable faster FPGA accelerator development and deployment in the cloud. Retrieved from: https://aws.amazon.com/ec2/instance-types/f1/. Accessed: 2019-01-07

2. Angepat H, Eads G, Craik C, Chiou D (2010) NIFD: non-intrusive FPGA debugger – debugging FPGA ’threads’ for rapid hw/sw systems prototyping. In: 2010 International Conference on Field Programmable Logic and Applications, pp 356–359

3. Brant A, Lemieux GGF (2012) ZUMA: an open FPGA overlay architecture. In: 2012 IEEE 20th international symposium on field-programmable custom computing machines, pp 93–96

4. Bruneel K, Heirman W, Stroobandt D (2011) Dynamic datafolding with parameterizable FPGA configurations. ACM TransDes Autom Electron Syst (TODAES) 16(4):43:1–43:29

5. Calagar N, Brown SD, Anderson JH (2014) Source-leveldebugging for FPGA high-level synthesis. In: 2014 24thinternational conference on field programmable logic andapplications (FPL), pp 1–8

6. Coole J, Stitt G (2010) Intermediate fabrics: virtual architecturesfor circuit portability and fast placement and routing. In: Pro-ceedings of the 8th IEEE/ACM/IFIP international conference onhardware/software codesign and system synthesis, CODES/ISSS’10. ACM, New York, pp 13–22

7. Coole J, Stitt G (2010) Intermediate fabrics: virtual architecturesfor circuit portability and fast placement and routing. In: 2010IEEE/ACM/IFIP international conference on hardware/softwarecodesign and system synthesis (CODES+ISSS), pp 13–22

8. Eslami F, Hung E, Wilton SJE (2016) Enabling effectiveFPGA debug using overlays: opportunities and challenges. corr,arXiv:1606.06457

9. Eslami F, Wilton SJE (2015) An adaptive virtual overlay forfast trigger insertion for FPGA debug. In: 2015 internationalconference on field programmable technology (FPT), pp 32–39

10. Eslami F, Wilton SJE (2017) An improved overlay and mappingalgorithm supporting rapid triggering for FPGA debug. SIGARCHComput Archit News 44(4):20–25

11. Foster HD (2015) Trends in functional verification: a 2014 industrystudy. In: Proceedings of the 52nd annual design automationconference, DAC ’15, vol 48. ACM, New York, pp 1–48:6

12. Fricke F, Werner A, Shahin K, Huebner M (2018) CGRA toolflow for fast run-time reconfiguration. In: Voros N, Huebner M,Keramidas G, Goehringer D, Antonopoulos C, Diniz PC (eds)Applied reconfigurable computing. Architectures, tools, andapplications. Springer International Publishing, Cham, pp 661–672

13. Goeders J, Wilton SJE (2015) Using dynamic signal-tracing todebug compiler-optimized HLS circuits on FPGAs. In: 2015 IEEE23rd annual international symposium on field-programmablecustom computing machines, pp 127–134

14. Goeders J, Wilton SJE (2017) Signal-tracing techniques for in-system FPGA, debugging of high-level synthesis circuits. IEEETrans Comput Aided Des Integr Circuits Syst 36(1):83–96

15. Graham P, Nelson B, Hutchings B (2001) Instrumenting bitstreams for debugging FPGA circuits. In: The 9th annual IEEE symposium on field-programmable custom computing machines (FCCM’01), pp 41–50

16. Grant D, Wang C, Lemieux GGF (2011) A CAD framework for Malibu: an FPGA with time-multiplexed coarse-grained elements. In: Proceedings of the 19th ACM/SIGDA international symposium on field programmable gate arrays, FPGA ’11. ACM, New York, pp 123–132

17. Heyse K, Davidson T, Vansteenkiste E, Bruneel K, Stroobandt D (2013) Efficient implementation of virtual coarse grained reconfigurable arrays on FPGAs. In: 2013 23rd international conference on field programmable logic and applications, pp 1–8

18. Hopkins ABT, McDonald-Maier KD (2006) Debug support for complex systems on-chip: A review. IEEE Proc Comput Digital Techn 153(4):197–207

19. Hung E, Todman T, Luk W (2014) Transparent insertion of latency-oblivious logic onto FPGAs. In: 2014 24th international conference on field programmable logic and applications (FPL), pp 1–8

20. Hung E, Wilton SJE (2013) Towards simulator-like observability for FPGAs: A virtual overlay network for trace-buffers. In: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’13. ACM, New York, pp 19–28

21. Hung E, Wilton SJE (2014) Incremental trace-buffer insertion for FPGA debug. IEEE Trans Very Large Scale Integr VLSI Syst 22(4):850–863

22. Hutchings BL, Keeley J (2014) Rapid post-map insertion of embedded logic analyzers for Xilinx FPGAs. In: 2014 IEEE 22nd annual international symposium on field-programmable custom computing machines, pp 72–79


23. Intel Corporation (2019) Quartus prime standard edition handbook, volume 3: Verification: Design and debugging with the SignalTap II logic analyzer. retrieved from: https://www.mouser.com/pdfdocs/qts-qps-5v3.pdf, 2019. Accessed: 2019-01-02

24. Ko HF, Nicolici N (2009) Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug. IEEE Trans Comput Aided Des Integr Circuits Syst 28(2):285–297

25. Koch D, Beckhoff C, Lemieux GGF (2013) An efficient FPGA overlay for portable custom instruction set extensions. In: 2013 23rd international conference on field programmable logic and applications, pp 1–8

26. Koch D, Haubelt C, Teich J (2007) Efficient hardware checkpointing: concepts, overhead analysis, and implementation. In: Proceedings of the 2007 ACM/SIGDA 15th international symposium on field programmable gate arrays, FPGA ’07. ACM, New York, pp 188–196

27. Kourfali A, Stroobandt D (2016) Efficient hardware debugging using parameterized FPGA reconfiguration. In: 2016 IEEE international parallel and distributed processing symposium workshops (IPDPSW), pp 277–282

28. Kourfali A, Stroobandt D (2018) Superimposed in-circuit debugging for self-healing FPGA overlays. In: 2018 IEEE 19th Latin-American test symposium (LATS), pp 1–6

29. Kulkarni A, Stroobandt D (2016) Micap-pro: a high speed custom reconfiguration controller for dynamic circuit specialization. Des Autom Embed Syst 20(4):341–359

30. Kulkarni A, Stroobandt D, Werner A, Fricke F, Huebner M (2017) Pixie: A heterogeneous virtual coarse-grained reconfigurable array for high performance image processing applications. In: 3rd international workshop on overlay architectures for FPGAs (OLAF 2017), pp 1–6

31. Landy A, Stitt G (2012) A low-overhead interconnect architecture for virtual reconfigurable fabrics. In: Proceedings of the 2012 international conference on compilers, architectures and synthesis for embedded systems, CASES ’12. ACM, New York, pp 111–120

32. Lindtjorn O, Clapp R, Pell O, Fu H, Flynn M, Mencer O (2011) Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro 31(2):41–49

33. Luu J, Goeders J, Wainberg M, Somerville A, Yu T, Nasartschuk K, Nasr M, Wang S, Liu T, Ahmed N, Kent KB, Anderson J, Rose J, Betz V (2014) VTR 7.0: next generation architecture and CAD system for FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 7(2):6:1–6:30

34. Mentor Graphics (2016) Certus silicon debug

35. Mitra S, Seshia SA, Nicolici N (2010) Post-silicon validation opportunities, challenges and recent advances. In: Proceedings of the 47th design automation conference, DAC ’10. ACM, New York, pp 12–17

36. Putnam A, Caulfield AM, Chung ES, Chiou D, Constantinides K, Demme J, Esmaeilzadeh H, Fowers J, Gopal GP, Gray J, Haselman M, Hauck S, Heil S, Hormati A, Kim J-Y, Lanka S, Larus J, Peterson E, Pope S, Smith A, Thong J, Xiao PY, Burger D (2014) A reconfigurable fabric for accelerating large-scale datacenter services. SIGARCH Comput Architect News 42(3):13–24

37. Sekanina L (2003) Virtual reconfigurable circuits for real-world applications of evolvable hardware. In: Proceedings of the 5th international conference on evolvable systems: from biology to hardware, ICES ’03. Springer, Berlin, pp 186–197

38. Sharifi M, Fathy M, Mahmoudi MT (2002) A classified and comparative study of edge detection algorithms. In: Proceedings. International conference on information technology: Coding and computing, pp 117–120

39. Synopsys (2019) Identify: simulator-like visibility into hardware debug. retrieved from: https://www.synopsys.com/implementation-and-signoff/fpga-based-design/i%dentify-rtl-debugger.html, 2019. Accessed: 2019-01-10

40. Peixoto TP (2014) The graph-tool python library

41. Tzimpragos G, Cheng D, Tapp S, Jayadev B, Majumdar A (2016) Application debug in FPGAs in the presence of multiple asynchronous clocks. In: 2016 IEEE international conference on field-programmable technology (FPT). Proceedings

42. Vansteenkiste E, Farisi BA, Bruneel K, Stroobandt D (2014) Tpar: Place and route tools for the dynamic reconfiguration of the FPGA’s interconnect network. IEEE Trans Comput Aided Des Integr Circuits Syst 33(3):370–383

43. Vermeulen B (2008) Functional debug techniques for embedded systems. IEEE Des Test Comput 25(3):208–215

44. Wheeler T, Graham P, Nelson B, Hutchings B (2001) Using design-level scan to improve FPGA design observability and controllability for functional verification. Springer, Berlin

45. Xilinx (2019) Programming and debugging: Vivado design suite user guide, ug973 (v2018.1). retrieved from: https://www.xilinx.com/products/design-tools/vivado.html, 2018

46. Xilinx Inc (2015) Configuration readback capture in UltraScale FPGAs. Application note, XAPP1230

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Alexandra Kourfali received her Diploma in Computer and Communications Engineering from the University of Thessaly, Volos, Greece, in 2012. She is currently pursuing her Ph.D. degree at Ghent University, in Ghent, Belgium. She is affiliated with the Department of Electronics and Information Systems, Computer Systems Laboratory, Hardware and Embedded Systems Group within Ghent University. Her current research interests include improving reliability, fault-tolerance and in-circuit debugging in FPGAs, with novel tool flows that facilitate dynamic reconfiguration. She targets the back-end of the tool flows and FPGA architectures with multiple overlays. Her main interests are novel in-circuit debugging, fault-tolerance and fault injection methods.

Florian Fricke received his Bachelor's degree in electrical engineering in 2009 at the South Westphalia University of Applied Sciences and his Master's degree in electrical engineering and information technology in 2012 at the Ruhr-University Bochum. Since 2012 he has been a PhD student at the Ruhr-University Bochum. His main research interests include adaptive processor platforms, performance analysis and computer architecture.


Michael Huebner (IEEE Senior Member) is full professor at the Brandenburg University of Technology - Cottbus - Senftenberg. He has been leading the Computer Engineering Group since October 2018. From 2012 to 2018 he led the Chair for Embedded Systems for Information Technology (ESIT) at the Ruhr-University of Bochum (RUB). He received his diploma degree in electrical engineering and information technology in 2003 and his PhD degree in 2007 from the University of Karlsruhe (TH). Prof. Huebner did his habilitation in 2011 at the Karlsruhe Institute of Technology (KIT) in the domain of reconfigurable computing systems. His research interests are in reliable and dependable reconfigurable computing and particularly new technologies for adaptive FPGA run-time reconfiguration and on-chip network structures with application in automotive systems, including the integration into high-level design and programming environments. Prof. Huebner is main author or co-author of over 200 international publications. He is on the steering committee of the IEEE Computer Society Annual Symposium on VLSI, has organized more than 25 events such as workshops and symposia, and is active as guest editor for many journals, e.g. IEEE TECS, IEEE VCAL and IEEE TNANO.

Dirk Stroobandt received the Ph.D. degree from Ghent University, Ghent, Belgium, in 1998. He is currently a Professor at the same university, affiliated with the Department of Electronics and Information Systems and Computer Systems Laboratory. He currently leads the Hardware and Embedded Systems Research Group, which focuses on semiautomatic hardware design methodologies and tools, runtime reconfiguration, and reconfigurable multiprocessor networks. Dr. Stroobandt was the inaugural winner of the ACM/SIGDA Outstanding Doctoral Thesis Award in Design Automation in 1999. He also initiated the International Workshop on System-Level Interconnect Prediction and has co-organized it since 1999. He has also been Associate Editor and Special Issue Guest Editor of international journals.

Affiliations

Alexandra Kourfali1 · Florian Fricke2 · Michael Huebner3 · Dirk Stroobandt1

Florian Fricke, http://www.esit.rub.de

Michael Huebner, http://www.b-tu.de

Dirk Stroobandt, https://www.ugent.be

1 ELIS Department, Computer Systems Lab, Ghent University, Ghent, B-9052, Belgium

2 Chair for Embedded Systems for Information Technology, Ruhr University of Bochum, Bochum, 44780, Germany

3 Chair for Computer Engineering, Technical University of Cottbus, Senftenberg, 01968, Germany

http://orcid.org/0000-0003-3430-1007
