CAD FLOW FOR HOTSPOT MANAGEMENT IN FPGAS
by
Rohit Dnyaneshwar Somwanshi
APPROVED BY SUPERVISORY COMMITTEE:
Dinesh K. Bhatia, Chair
Poras T. Balsara
Benjamin Carrion-Schaefer
CAD FLOW FOR HOTSPOT MANAGEMENT IN FPGAS
by
ROHIT DNYANESHWAR SOMWANSHI, B.Tech., MS
THESIS
Presented to the Faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN
ELECTRICAL ENGINEERING
THE UNIVERSITY OF TEXAS AT DALLAS
August 2017
ACKNOWLEDGMENTS
I would like to express my deepest gratitude to my thesis adviser, Dr. Dinesh Bhatia, for his
direction and inspiration for research. I am thankful to him for encouraging me to write a
thesis and steering me in the right direction. He never turned me away whenever I found
myself in a trouble spot and needed his advice. Dr. Bhatia is not only a successful researcher
but also a very kind and big-hearted person. Thank you, Professor!
I would like to thank Dr. Poras Balsara and Dr. Benjamin Carrion-Schaefer for being on
my committee and sharing their innovative ideas. I learned simplicity and comprehensibility
from Dr. Balsara. I am thankful to Girish Deshpande, who allowed me to work with him on
a small part of his Ph.D. topic. His expertise in using CAD tools and prior work in this area
allowed me to pick up the pace. Girish pushed me toward research and provided moral support.
Next, I would like to thank all my IDEA lab members. I enjoyed brainy discussions with
Sneha, Masoud, Devang, Abhishek, and Athul. They helped me in many stages during this
research. I am thankful to Pingakshya Goswami, who inspires me to learn more. I would
like to thank Deepika Pai for putting up with me during my thesis and group project. Big
thanks to Uma Maheshwari Subramanian and Pragya Shukla for encouraging my idea to do
a thesis. I would like to thank my friends and roommates. Lots of thanks to Mrs. Tulika,
Manu, and Manasi; I enjoyed their hospitality and the time spent on Thanksgiving. I am most
thankful to Geetanjali Sakore, who never stopped believing in me and stood as a constant
source of energy.
I do not have words to thank my family for their sacrifice and encouragement. They always
prayed to God for my success. Dad, Mom, Sam, Vaibhav and Pradnyesh: I am deeply
grateful for your motivation and love.
May 2017
CAD FLOW FOR HOTSPOT MANAGEMENT IN FPGAS
Rohit Dnyaneshwar Somwanshi, MSEE
The University of Texas at Dallas, 2017
Supervising Professor: Dinesh K. Bhatia, Chair
Field Programmable Gate Arrays (FPGAs) are used not only for rapid prototyping but also
for implementing very high-performance computing engines. Advancements in big data,
machine learning, and neural networks are increasing the popularity of FPGA devices.
Statistics show that many integrated circuits and FPGAs are damaged by overheating or
high thermal gradients. Many companies have declared breaks to Moore's law due to the
very high power densities in ICs. GPUs can deliver a few thousand GFLOPS of processing
power, but a single GPU device consumes hundreds of watts. FPGAs are cheap and power
efficient. However, the power consumption and dissipation trend in FPGAs is increasing
for many applications.
This thesis focuses on thermal-aware placement for FPGA-mapped designs. In order to
accurately model the effect of the physical location of a logic block on the overall thermal
profile, we need accurate power and temperature estimation methods. We present the
results of a modified placement algorithm that takes accurate logic-block-level power
estimates into account. We present a modified design flow built on top of the Versatile
Placement and Routing framework from the University of Toronto. Our results demonstrate
a markedly improved thermal profile for FPGA-mapped designs.
TABLE OF CONTENTS
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Field Programmable Gate Arrays (FPGAs) . . . . . . . . . . . . . . . . . . . 1
1.1.1 Logic Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Routing architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 FPGA Physical Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Design Entry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Technology Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Physical Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.5 Configuration bit-stream generation . . . . . . . . . . . . . . . . . . . 6
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Research Contributions and Thesis Organization . . . . . . . . . . . . . . . . 7
CHAPTER 2 POWER ESTIMATION IN FPGAS . . . . . . . . . . . . . . . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Sources of power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Static (Leakage) power . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Dynamic power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Short circuit power . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 ACE 2.0 Activity estimation tool . . . . . . . . . . . . . . . . . . . . 18
2.2.5 FPGA SPICE for power measurement . . . . . . . . . . . . . . . . . 21
2.3 Power and Thermal modeling with CAD . . . . . . . . . . . . . . . . . . . . 24
2.3.1 3D-ICE thermal simulation tool . . . . . . . . . . . . . . . . . . . . . 26
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
CHAPTER 3 CAD FOR PLACEMENT AND ROUTING ALGORITHMS . . . . . 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Packing: VPack for logic block packing . . . . . . . . . . . . . . . . . . . . . 29
3.3 Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Versatile Placement and Routing (VPR) . . . . . . . . . . . . . . . . 35
3.4 Linear congestion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 CAD for Routing algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Delay measurement techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Motivation for optimization in floor planning for thermo-power aware CAD . 46
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
CHAPTER 4 THERMAL-AWARE FPGA PLACEMENT . . . . . . . . . . . . . . 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Various thermal aware design methods for FPGAs . . . . . . . . . . . . . . . 48
4.2.1 Heat-sink based cooling . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Xilinx Power Analyzer for heat sink performance . . . . . . . . . . . 52
4.3 Thermal gradient in integrated circuits . . . . . . . . . . . . . . . . . . . . . 54
4.4 Modified Simulated Annealing cost function . . . . . . . . . . . . . . . . . . 56
4.4.1 Temperature Flattening . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.2 Thermal benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Custom Thermal-Aware Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Result analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
CHAPTER 5 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . 65
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
APPENDIX A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
CURRICULUM VITAE
LIST OF FIGURES
1.1 Basic Logic Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Schematic representation of a generic island style FPGA . . . . . . . . . . . . . 3
1.3 Modern FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 CMOS Circuit for dynamic power consumption analysis [1] . . . . . . . . . . 11
2.2 Monte Carlo simulation flow [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 ACE Modified Algorithm [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Simplification phases for reducing time complexity in ACE algorithm [3] . . . . 20
2.5 FPGA-SPICE CAD tool flow [4] . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Grid level and Component level simulation testbench blocks in FPGA SPICE [4] 23
2.7 Effect of temperature on leakage current in CMOS devices [5] . . . . . . . . . . 24
2.8 Variation in power consumption and temperature with activities at input [3] . . 25
3.1 Combining two BLEs for reducing total inputs to a cluster. . . . . . . . . . . . . 30
3.2 VPack algorithm pseudo-code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Pseudo-code for simulated annealing based placer . . . . . . . . . . . . . . . . . 34
3.4 VPR CAD tool flow [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 VPR assumed FPGA model [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Position of router in VPR CAD . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Traditional router methodologies of expansion and sinking wires . . . . . . . . . 41
3.8 VPR router methodologies of expansion and sinking wires . . . . . . . . . . . . 42
3.9 Circuit representation of routing switches and routing network . . . . . . . . . . 43
3.10 Intuitive example to solve for Elmore delay for RC routing network . . . . . . . 44
3.11 Complete flow diagram of Versatile Placement and Routing evaluation . . . . . 45
4.1 Heat-sink structure for stacked or 2D FPGAs . . . . . . . . . . . . . . . . . . . 50
4.2 Variation of Tj with various VDD for airflow rates in Linear Feet per Minute . . . 54
4.3 Variation of Tj with different size and different quality heat-sinks . . . . . . . . 55
4.4 Illustration of the temperature flattening flow during placement. . . . . . . . . . 58
4.5 Placement of high power blocks from the golden run (a) and the location of the same blocks in the custom run (b) with the new power numbers. The change in the power values is due to the change in dynamic and leakage temperature values. Note that the scale for figure b is much smaller compared to a. . . . . . . . . . 60
4.6 Comparison of the heatmaps for alu4 showing the reference VPR design (a) against the one with the worst performance but the best thermal profile for γ = 0.5 (b). One of the several clusters of hot spots is circled in (a) for clarity. All temperatures shown are in Kelvin. . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 The best and worst delay designs generated for each benchmark. The best (min delay) design corresponds to the reference design using the VPR default flow and the worst (max delay) design is the one at the power threshold that includes around 20% of high power blocks for each benchmark. . . . . . . . . . . . . . . 62
4.8 Normalized critical path delays versus the high power block spreading factor γ. The best design is obtained for γ = 0.2, 0.875 and the worst is at γ = 0.5 for the custom cost function. The ratio of delays of the custom run to the golden run is represented by the upper plot. . . . . . . . . . . . . . . . . . . . . . . . . 63
LIST OF TABLES
2.1 Performance comparison of activity estimators [3] . . . . . . . . . . . . . . . . . 18
2.2 Performance comparison of activity estimators [8] . . . . . . . . . . . . . . . . . 21
2.3 Performance comparison of activity estimators [3] [8] . . . . . . . . . . . . . . . 23
4.1 Vendor Recommendations for FPGA Cooling Solutions . . . . . . . . . . . . . . 52
4.2 Heatsink selection for benchmarks tested. . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Thermal Resistances of commercial heatsinks . . . . . . . . . . . . . . . . . . . 53
4.4 Vendor Recommendations for heatsinks . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Thermal flow simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.1 Benchmark Characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
CHAPTER 1
INTRODUCTION
Gordon Moore made an observation in 1965 about the density of transistors in integrated
circuits (ICs). Moore's law states that the number of transistors in an IC doubles roughly
every 18 months. Moore's observation proved accurate for many decades; the NVIDIA GP100
Pascal GPU contains more than 15 billion transistors. One way to invest this many transistors
is in multi-core processing units. Multi-core processing increases application performance.
However, if we look at the trend of cores added, it is clear that the performance gain
from adding more cores diminishes after a point. Additional cores require adjustments to
the existing software and OS, so simply adding more cores is at times not very useful and
sometimes wastes power. A Field Programmable Gate Array (FPGA) is a reconfigurable
device that contains uncommitted silicon. FPGAs were traditionally developed for prototyping
or functionally testing ASIC designs before sending them out to foundries. Advancements in
FPGA architectures and CAD tools over the years have made FPGAs a preferred architecture
for many search engines as well as primary machines to run machine learning algorithms.
1.1 Field Programmable Gate Arrays (FPGAs)
The island-style FPGA is an uncommitted, pre-fabricated silicon semiconductor device formed
of a uniform array of island-style Logic Blocks (LBs), also called Configurable Logic Blocks
(CLBs) [9]. Each CLB consists of a set of Basic Logic Elements (BLEs) (see figure 1.1). Each
BLE has a Look-Up Table (LUT), a D flip-flop, and a multiplexer to select the output of
the BLE. The most common architecture of an FPGA is shown in figure 1.2. This common
island-style FPGA architecture is composed of three fundamental blocks,
namely:
Figure 1.1. Basic Logic Element
• Reconfigurable programmable logic blocks
• Internal wire routing network, programmable switch and connection boxes
• Input and Output pads that connect internal routing network to external connections
1.1.1 Logic Blocks
A logic block, or Configurable Logic Block, comprises a number of BLEs (see figure 1.1).
If we zoom into each BLE, it has a k-LUT, where k is the number of inputs to the LUT.
A k-input LUT requires 2^k SRAM bits for configuration and can implement any k-input
logic function. The latest Xilinx series 7 FPGAs consist of memory slices (SLICEM) and
logic slices (SLICEL). The latest FPGAs typically include 4 to 10 Basic Logic Elements
in one cluster. Fracturable BLE architectures have been proposed that support various
combinations and mixtures of input counts. The logic block category also includes memory,
Arithmetic and Logic Units, DSPs, etc.
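To make the 2^k configuration cost concrete, the following Python sketch models a k-input LUT as a table of 2^k SRAM bits indexed by the input pattern. The helper names are illustrative only, not taken from any vendor tool:

```python
# A k-input LUT stores one SRAM configuration bit for each of the 2**k
# possible input patterns, so it can realize any k-input Boolean function.

def make_lut(k, sram_bits):
    """Build a k-LUT; sram_bits[i] is the output for input pattern i."""
    assert len(sram_bits) == 2 ** k, "a k-LUT needs exactly 2**k SRAM bits"

    def lut(*inputs):
        assert len(inputs) == k
        index = 0
        for bit in inputs:          # pack the k input bits into an index
            index = (index << 1) | (bit & 1)
        return sram_bits[index]

    return lut

# Example: configure a 2-LUT as XOR (outputs for input patterns 00, 01, 10, 11).
xor2 = make_lut(2, [0, 1, 1, 0])
print(xor2(1, 0))  # -> 1
```

Doubling k doubles the LUT's logical capacity per input but doubles the SRAM cost for every extra input, which is why commercial architectures settle around 4- to 6-input LUTs.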
Figure 1.2. Schematic representation of a generic island style FPGA
1.1.2 Routing architecture
An island-style FPGA architecture, as shown in figure 1.2, depicts islands of configurable
logic blocks in a sea of routing interconnect. A uniform 2D grid of CLBs is connected by
configurable routing metal wires. The network also connects primary inputs and outputs
to logic blocks. Re-programmability in pre-fabricated wiring segments comes from switch
and connection box connections. A typical routing network occupies 80-90% of the FPGA
area. Connection boxes (CBs) route the outputs of LUTs to a switch box (SB). The limited
number of wiring tracks and SB connections poses a network congestion problem. The term
Fc denotes the connection box flexibility: the number of routing tracks to which each logic
block pin can connect. Similarly, Fs is the flexibility of the switch box: the number of possible
connections available inside the SB for an incoming wire. A detailed discussion of routing
algorithms and architecture is provided in chapter 3.
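As a rough illustration of how Fc and Fs drive routing-fabric cost, the Python sketch below counts programmable switches for the generic island-style model described above. The closed-form counts are illustrative approximations of this generic model, not figures for any specific device:

```python
# Programmable-switch counts for a generic island-style routing fabric.
# w:  routing tracks per channel
# fc: tracks each logic-block pin can connect to (connection box flexibility)
# fs: connections available inside the switch box per incoming wire
# These formulas are illustrative approximations only.

def connection_box_switches(num_pins, fc):
    """Each of the block's pins can reach fc of the adjacent routing tracks."""
    return num_pins * fc

def switch_box_switches(w, fs):
    """Each of the w wires entering from each of the 4 sides can reach fs tracks."""
    return 4 * w * fs

# Example: channel width of 100 tracks, a CLB with 6 pins, Fc = 10, Fs = 3.
print(connection_box_switches(6, 10))   # 60 switches per connection box
print(switch_box_switches(100, 3))      # 1200 switches per switch box
```

The example shows why the routing fabric dominates FPGA area: switch counts grow with channel width and flexibility, so architects trade routability (higher Fc, Fs) against area and capacitance.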
1.2 FPGA Physical Design Flow
In this section we briefly describe the mapping process for FPGAs. The following five
stages are involved in the FPGA physical design flow.
• Design entry
• Technology mapping
• Physical placement
• Routing
• Configuration bit-stream generation
1.2.1 Design Entry
This is the first step in the process of mapping a design onto the FPGA device (figure 1.3
describes a heterogeneous modern FPGA architecture). A hardware description language
captures the circuit description and design specification. There are various synthesis tools
available for entering the design. Most CAD tools support Verilog or VHDL programs for
describing the Register-Transfer Level (RTL). There are other ways to enter the design, such
as adding schematics (.sch), netlists (.ngc, .edif), and microprocessors (.xmp) directly to the
project. Various custom or pre-defined IPs can be included in this method. For the synthesis
of the entered design, a synthesis tool (for example XST) converts the entered RTL or IPs
to a netlist that will be used for implementation. Technology and RTL schematics provide
information to the user about the synthesized design. The synthesis tool parses the HDL
code and checks the syntax. Various synthesis directives are available for use in an HDL
program, such as keep, case, and OPT_LEVEL, for constraining synthesis. The synthesis
tool can detect finite state machines in the program and apply an encoding style in order to
optimize area and boost speed. Macro generation, flip-flop re-timing, partitioning,
incremental synthesis, re-synthesis, etc. are schemes used by the synthesis tool for FPGA
optimization.
1.2.2 Technology Mapping
An FPGA architecture includes a uniform array of logic blocks. After the design is translated
into a gate-level netlist, technology mapping packs the design into K-input Look-Up Tables
(K-LUTs) and combines LUTs to form Configurable Logic Blocks (CLBs). Generally, the
technology mapper optimizes for area utilization and minimum delay. The translate phase
takes the design constraint file (.ucf or .xdc) and relates it to the netlist. This translation
produces a native generic database (.ngd). Map deals with extracting the input-output
buffers (IOBs) and CLBs defined in the .ngd file, and produces a native circuit description
file (.ncd). There are different types of algorithms for various optimization goals. Network
flow and bin packing are among the popular algorithms for technology mapping.
1.2.3 Physical Placement
Each block in the Configurable Logic Block (CLB) level netlist is assigned a physical location
on the FPGA device. The process of assigning each CLB to a unique (x, y) location is called
physical placement. Placement algorithms place correlated blocks closer together in order to
reduce delay and interconnection complexity.
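A common objective that captures "correlated blocks placed closer together" is the half-perimeter wirelength (HPWL) bounding-box cost used by annealing-based placers such as VPR. A minimal Python sketch, with a hypothetical netlist and placement layout:

```python
# Half-perimeter wirelength (HPWL): each net costs the half-perimeter of the
# bounding box enclosing the (x, y) locations of all blocks on that net.
# Placing correlated blocks closer shrinks these boxes, reducing wire and delay.

def hpwl(nets, placement):
    """nets: list of block-name lists; placement: block name -> (x, y)."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

placement = {"a": (0, 0), "b": (3, 1), "c": (1, 4)}
nets = [["a", "b"], ["a", "b", "c"]]
print(hpwl(nets, placement))  # (3 + 1) + (3 + 4) = 11
```

An annealing placer repeatedly swaps block locations and accepts or rejects each swap based on the change in a cost of this form, which is the mechanism the thermal-aware cost function in chapter 4 extends.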
1.2.4 Routing
Physical placement assigns an (x, y) location to each logic block in the netlist. Figure 1.2
depicts the interconnection network connecting logic blocks through connection boxes and
switch boxes. Detailed routing architectures and algorithms are explained in chapter 3. The
routing algorithm needs to consider network congestion, the critical path, and advanced
issues including false paths, crosstalk, and power optimization. Multiple parameters, such as
the architecture of the routing network, the geometric orientation of logic blocks, and the
flexibility of the switch and connection boxes, affect routing resource utilization and
performance.
1.2.5 Configuration bit-stream generation
Bitgen generates a bitstream (.bit) which configures the SRAM bits in the FPGA. After
configuring the FPGA device, there are ways to test and debug the design. ChipScope
software can be used, which allows adding debug cores to the design. Integrated Logic
Analyzer (ILA) and Virtual Input/Output (VIO) software cores make it possible to add
probes at nets which are to be considered for on-chip testing. ILA allows real-time probing
and observation of the nets inside the logic.
1.3 Motivation
The growing size of FPGAs and performance-hungry applications are draining a lot of power.
CAD tools consider wire length and the chip area utilized in the optimization problem.
However, many research groups in industry as well as academia are now considering the
power consumption and temperature of the chip as major metrics for optimization. The
problems of power consumption and a tremendous increase in temperature are major
concerns in proposed 3D FPGA designs. Consequently, making CAD tools power and
thermal aware is an open and very demanding field of research.
The key motivation for this research is to reduce the density of hotspots, measured as a
reduction in junction temperature. Current research on power and thermal aware designs is
moving towards post-implementation analysis and employing different cooling mechanisms.
However, this research focuses on the early stage of the design flow with a thermal and
power-aware placement algorithm. To achieve early awareness in CAD tools for thermal and
power-aware designs, it is essential to obtain an accurate simulation model for power
calculations. FPGA-SPICE [4] creates an HSPICE model of each of the blocks in the FPGA
architecture. The uniformity in logic blocks and interconnection networks simplifies the
process of determining the power profile of an FPGA. Xilinx Power Estimator (XPE) is a
good way to obtain vector-based or vector-less power numbers through simulating benchmarks.
This research addresses a demand in industrial as well as military applications of FPGAs.
Figure 1.3. Modern FPGA Architecture
1.4 Research Contributions and Thesis Organization
In order to overcome device failures due to hotspots and high junction temperature, we
came up with a few methods at different design stages.
• We have developed a power and thermal aware placement method. The placement
algorithm implemented in VPR [7] and T-VPack [10] (Versatile Packing, Placement
and Routing) uses simulated annealing. We have modified the placement algorithm
(simulated annealing) to consider power as one of the optimization constraints: we
modified the cost function associated with this algorithm by adding an extra term, and
we studied the initial placement of the simulated annealing. For accurate results, we
then obtained CLB-level power for the benchmarks. Activity estimation of the nets in
a design is carried out with the help of the ACE 2.0 activity estimator [3].
• We studied heat sink performance with different board sizes and airflows. A sweep of
airflow rate for different board sizes and qualities of heat sink against maximum
junction temperature was carried out. Simulation results show the ineffectiveness of
active and passive heat sinks in high-performance computational FPGA devices.
The thesis is organized into the following chapters:
In chapter 2 we present a brief study of different FPGA architectures and a study of the
power and temperature associated with the components of an FPGA.
In chapter 3 we describe different placement algorithms and form a foundation for
understanding delay and power measurement techniques.
In chapter 4 we outline key research and prior work in the FPGA thermal management
field, formulate and explain the proposed placement algorithm, and discuss the benchmarks
used for capturing results. This chapter also includes a detailed study of accurate power
measurement tools, the CAD tools used for this research, and the CAD flow.
In chapter 5 we present conclusions, remarks, and a discussion of the future scope of this
work in different dimensions.
CHAPTER 2
POWER ESTIMATION IN FPGAS
2.1 Introduction
This chapter describes power consumption and the associated power and thermal measurements
in FPGA devices. We present literature and prior work on power analysis in VLSI design as
well as FPGAs. In high-performance computing applications, there are two popular choices
when it comes to implementing algorithms on hardware. GPUs (graphical processing units)
are extremely common as they offer many small cores that enable extreme parallelism for
a large number of applications. However, GPUs are marred by their excessive power
consumption. The alternative solution that seems to be emerging is the use of FPGAs.
FPGAs are popular due to their low power consumption, low cost, and ability to prototype
a large class of applications. Excessive computation demand is becoming more and more
responsible for the rise of on-chip temperature and thermal failures, and is forcing the
design and use of aggressive cooling methods.
In order to attain accurate thermal characterization, it is important to study and understand
the mechanism of power dissipation in FPGAs and the tools that are used for early
estimation and accurate evaluation of the power associated with FPGA-mapped applications.
We first present the general sources of power in CMOS technology and then present a survey
of the literature on power estimation.
2.2 Sources of power
2.2.1 Static (Leakage) power
• PN junction reverse bias leakage: This leakage is caused by the leakage current
through the reverse-biased p-n diode junctions of the transistors. It is the current
between the drain (or source) and the substrate. At room temperature this type of
leakage is minimal, but at higher temperatures this leakage current can be
non-negligible.
• Subthreshold leakage: Subthreshold leakage is caused by the flow of current between
the source and drain terminals of a weakly inverted transistor. The direction of current
is from drain to source, hence it is also referred to as D-S leakage. The leakage current
depends exponentially on Vth, so a very small increase in Vth effects a large reduction
in subthreshold current:

Isub = µ Cox (W/L) VT^2 e^((VGS − Vth)/(n VT))   (2.1)

where µ is the electron mobility, Cox is the oxide capacitance, W/L is the transistor
aspect ratio, VT is the thermal voltage, and n is the subthreshold slope factor.
• Gate leakage: This leakage is due to current tunneling through the thin gate oxide.
With transistor scaling, this type of leakage is no longer negligible. Ways to reduce
gate leakage include not reducing the oxide thickness below a limit as the process
scales, and insulating the gate terminal with a material of high dielectric constant.
To first order, gate leakage scales with the effective transistor width:

Igate ∝ Weff   (2.2)
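The exponential sensitivity of subthreshold leakage to Vth described above can be illustrated numerically. The sketch below assumes a room-temperature thermal voltage of about 26 mV and a subthreshold slope factor n ≈ 1.5; both are illustrative textbook values, not measurements from any device:

```python
import math

# Subthreshold leakage scales as exp((Vgs - Vth) / (n * Vt)), so raising Vth
# by delta_vth divides the leakage by exp(delta_vth / (n * Vt)).
# Vt ~ 26 mV at room temperature; n ~ 1.5 (illustrative values).

def leakage_reduction_factor(delta_vth, n=1.5, vt=0.026):
    """How many times smaller the subthreshold current becomes when
    Vth is increased by delta_vth volts (all else held equal)."""
    return math.exp(delta_vth / (n * vt))

# A modest 100 mV increase in Vth cuts subthreshold leakage by roughly 13x.
print(round(leakage_reduction_factor(0.100), 1))
```

This is why multi-Vth libraries are effective: high-Vth cells on non-critical paths slash leakage at a small speed cost, while low-Vth cells are reserved for the critical path.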
2.2.2 Dynamic power
Dynamic power is also called switching power, since it is the power dissipated in charging
and discharging the load capacitance of a CMOS circuit (see figure 2.1). The current into
the load capacitor is

iC(t) = CL dvC/dt   (2.3)
From equation 2.3, the energy drawn from VDD on every input transition is

EVDD = ∫0→∞ iVDD(t) VDD dt = CL VDD ∫0→VDD dvC = CL VDD^2   (2.4)
Figure 2.1. CMOS Circuit for dynamic power consumption analysis [1]
A similar calculation shows that the energy stored in the load capacitor is half of the energy
drawn from VDD:

ECload = CL VDD^2 / 2 = EDynamic   (2.5)

The remaining half of the energy is dissipated in the switching action (see equation 2.5).
Power is energy per unit time, so

Pdynamic = C V^2 fclk / 2   (2.6)
From the above equation for dynamic power dissipation, the following techniques can be
used to reduce Pdynamic:
• The supply voltage has a quadratic relation with Pdynamic. Hence a reduction in supply
voltage is one of the most effective ways to reduce total dynamic power consumption.
However, there is a limit to how far the supply voltage can be reduced, since it has a
direct impact on circuit performance and the highest operating frequency. Chow et
al. [11] propose multiple power supplies and voltage scaling as an alternative to a plain
reduction in supply voltage.
• The operating frequency and the switching activity have a direct impact on dynamic
power dissipation. Pipelined architectures, multiple clock domains, and frequency
scaling are a few of the successful techniques used to reduce the switching component.
However, the latest technologies are capable of working at tremendously high clock
frequencies for performance, so the demand for dynamic power is increasing.
• The last component in equation 2.6 is Cload. Cload is the sum of the output capacitance
and self-capacitance of the CMOS devices in the circuit, and it also includes wire
capacitances. Hence optimally sizing the transistors and minimizing wire capacitances
will help in reducing the overall Cload.
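Equation 2.6 and the levers above can be put in code. The sketch below adds an explicit switching-activity factor alpha (the kind of per-net activity that tools such as ACE 2.0 estimate); all numeric values are illustrative, not measured:

```python
# Dynamic power per equation 2.6, with an explicit switching-activity factor:
#   P = 0.5 * alpha * C_load * Vdd^2 * f_clk
# alpha is the average number of output transitions per clock cycle.

def dynamic_power(c_load, vdd, f_clk, alpha=1.0):
    """Dynamic power in watts (c_load in F, vdd in V, f_clk in Hz)."""
    return 0.5 * alpha * c_load * vdd ** 2 * f_clk

# Illustrative numbers: 10 pF of switched capacitance, 200 MHz clock.
base = dynamic_power(c_load=10e-12, vdd=1.2, f_clk=200e6)
scaled = dynamic_power(c_load=10e-12, vdd=1.0, f_clk=200e6)
# The quadratic Vdd dependence: 1.2 V -> 1.0 V saves about 31% dynamic power.
print(base, scaled, round(1 - scaled / base, 3))  # 0.00144 0.001 0.306
```

The example makes the first bullet concrete: a 17% supply reduction buys a 31% dynamic power saving, which is why voltage scaling dominates the other two levers when timing slack permits it.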
2.2.3 Short circuit power
The rise and fall transitions at the input of a transistor cannot be perfectly sharp edges,
and this input slew is responsible for short circuit current. During a transition from low to
high or high to low at the input of a CMOS logic gate, both the pull-up and pull-down
networks can be active simultaneously, creating a path for current from VDD to Gnd. This
conduction of current from supply to ground is undesirable and is called short circuit current.

ESC = VDD Ipmos tSC   (2.7)

PSC = tSC VDD Ipmos fclk   (2.8)

This power component can be reduced linearly by lowering VDD. Also, designing the output
rise and fall times to be less sensitive to the input slew rate ensures that either the PUN or
the PDN conducts at any one instant, which avoids the short circuit path.
In the early phases of technology scaling, down to 130 nm, dynamic power dominated the
total power dissipation. Moving to smaller process nodes, static leakage has played an
increasingly dominant role. The increase in the static power component is due to the
shrinking of channel dimensions and overall transistor sizes; leakage current increases
because the threshold voltage is continuously lowered as transistor sizes scale down.
In this section, we focus on academic tools for power estimation in FPGAs. We also present
prior work done in this area of research.
Power estimation performed at a coarse-grain architectural level is less accurate than
estimation performed at a fine-grain level. Also, as we go down the levels of abstraction, it
becomes increasingly difficult as well as computationally expensive to perform accurate
power estimation. Power analysis and estimation can be sub-divided into two categories:
1. Simulation based power analysis: In this technique for power analysis,
Simulation Program with Integrated Circuit Emphasis (SPICE) based simulations are
used for estimation. Since SPICE simulations are performed at the device level, these
estimates have high confidence and are accurate. SPICE stimulates the dynamic
behavior at the transistor level and demands a SPICE netlist of the components in
the device. The high granularity and huge computation for large FPGAs make
simulation based analysis very time-consuming. To speed up the execution of SPICE
simulations, transistor-level netlists are often combined to form higher-level
abstraction units. In the case of FPGAs, a technology-mapped CLB-level netlist will
always simulate faster than a Basic Logic Element (BLE) or logic gate level netlist.
However, these simulations are highly dependent on the quality and quantity of
inputs to the circuit. The simulation-based estimation procedure can be described as
below:
(a) Since the input patterns have a major impact on simulation results, decide on a
number of input vectors and simulate the circuit under test with those patterns.
(b) In order to obtain an accurate estimate of dynamic power, it is important
to accurately model the switching behaviour of the nets involved in a design. The
probability that a net will flip from low to high or vice versa is estimated
using an activity estimator block.
(c) A table-based or statistical model is used for estimating the leakage power. The
leakage power is related to temperature and the sizing of the device and is easily
captured by a statistical model.
(d) The total power includes the dynamic, leakage, and routing (global as well as
local) power components.
The major drawback of simulation-based power estimation is its sensitivity to the set of input pattern vectors: a bad choice of vectors yields a large error in the total power measurement. A random vector generator has been proposed in order to estimate the most accurate power profile. The algorithm comes from decision making and risk-impact theory; it is a mathematical model applicable to many problems in which the outcome depends on a large set of inputs. The algorithm starts with a random set of values and calculates the results. The random generation of vectors for simulation can run for many thousands of iterations until a terminating condition is met. Every iteration calculates the mean and variance of the estimated outcome; the loop condition measures confidence in the output and either terminates or iterates with another random vector. The first publication of this Monte Carlo based method appeared in the 1940s. Later on, the Monte Carlo method [2] (see figure 2.2) was used for solving a wide range of problems in the computer programming world. Simulation-based power estimation requires a set of input vectors for SPICE simulation.
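The Monte Carlo loop described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the published algorithm: the `simulate_power` lambda is an assumed stand-in for one SPICE run, and the terminating condition compares the standard error of the running mean against a relative tolerance.

```python
import random
import statistics

random.seed(0)  # reproducible run for this illustration

def monte_carlo_power(simulate_power, n_inputs, eps=0.01,
                      min_iters=30, max_iters=10000):
    """Estimate mean power by applying random input vectors until the
    standard error of the running mean is small relative to the mean.
    simulate_power stands in for one SPICE run on one input vector."""
    samples = []
    for _ in range(max_iters):
        vector = [random.randint(0, 1) for _ in range(n_inputs)]
        samples.append(simulate_power(vector))
        if len(samples) >= min_iters:
            mean = statistics.mean(samples)
            stderr = statistics.stdev(samples) / len(samples) ** 0.5
            if stderr < eps * mean:   # terminating condition: confident enough
                return mean
    return statistics.mean(samples)

# Toy power model (hypothetical): base power plus a term per high input, plus noise.
est = monte_carlo_power(
    lambda v: 1.0 + 0.1 * sum(v) + random.gauss(0.0, 0.05), n_inputs=8)
```

With eight equiprobable inputs the toy model's true mean is 1.4, and the loop stops once the confidence band around the running mean is tight enough.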
2. Probability based estimation: It has been observed that simulation based estimation techniques are very time-consuming, since SPICE execution for large integrated circuit designs is computationally expensive. Most past research on fast simulation and power estimation therefore concerns probabilistic estimation of power. Each net or wire in the circuit has a static probability P1(x), defined as the probability that the value of net x sticks at logic one. The static probability is important for measuring leakage power, since it is a direct measure of how long a transistor holds the same logic value. The dynamic power of a circuit is a direct function of the switching activity of its nets. Hence, the probability that the steady state value on a net transitions from logic high to logic low or vice versa during one clock cycle is defined as the switching probability Ps(x). The transition density As(x) of a net x is the average number of times the net value transitions from low to high or vice versa before attaining a steady state value. Clearly, net activity is required for accurate power characterization, since dynamic power dissipation depends on the switching activity.
Probability based estimation techniques are also called vector-less activity estimation techniques. These approaches are computationally faster than simulation based estimation. A few probabilistic estimation methods ignore the temporal and spatial correlation between two or more nets in the same circuit. In the case of re-convergent fan-out nets, we observe spatial correlation. In sequential feedback loops from flip-flop outputs back to the input-forming network, we observe temporal correlation. Lamoureux et al. [3] quantified the error introduced by ignoring correlations: ignoring spatial correlation may result in 8 to 120 percent error, and ignoring temporal correlations in 15 to 50 percent error. There are techniques which consider temporal as well as spatial correlations [12] [13].
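The vector-less idea can be sketched with a toy propagation of static probabilities through a small gate network. This is an illustrative Python sketch under the independence assumption the text warns about (i.e. ignoring spatial and temporal correlation); the 2·P1·(1 − P1) toggle formula is the standard lag-one independence approximation, not a result from this thesis.

```python
def and_p1(pa, pb):
    """Static probability P1 of an AND output, assuming independent inputs."""
    return pa * pb

def or_p1(pa, pb):
    """Static probability P1 of an OR output, assuming independent inputs."""
    return pa + pb - pa * pb

def not_p1(pa):
    """Static probability P1 of an inverter output."""
    return 1.0 - pa

def switching_probability(p1):
    """Ps under a lag-one independence assumption: the chance that two
    consecutive cycle values differ is 2 * P1 * (1 - P1)."""
    return 2.0 * p1 * (1.0 - p1)

# Example net: y = (a AND b) OR (NOT c), with P1(a) = P1(b) = P1(c) = 0.5
p_y = or_p1(and_p1(0.5, 0.5), not_p1(0.5))   # 0.625
ps_y = switching_probability(p_y)            # 0.46875
```

Re-convergent fan-out (e.g. if `a` and `c` were the same signal) would break the independence assumption, which is exactly the spatial-correlation error source discussed above.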
• Measuring accuracy of the activity estimator
There are a number of activity estimation tools available. The accuracy and execution time of an activity estimator are empirical quantities, and three metrics are used for evaluating accuracy. The average relative error is the average, over every wire in a design, of the error in the estimated activity. It can be calculated as follows (see equation 2.9):

Average relative error = ( Σ_{n ∈ circuit} |As_est(n) − As_sim(n)| / As_sim(n) ) / |circuit|   (2.9)
where As_est(n) is the estimated activity of net n and As_sim(n) is its simulated activity. Another evaluation measure is the r2 correlation, which measures the pairwise relationship between estimated and simulated activity values. The formula for the r2 correlation is:

r2 = ss_xy^2 / (ss_xx · ss_yy)   (2.10)

Here, ss denotes a sum of squares, computed pairwise over the data points x_i (the estimated activity of a net) and y_i (its simulated activity). The desired value of r2 is one: when r2 equals one, the simulated and estimated activity values match exactly, i.e. the estimation is 100% accurate. The third evaluation metric is the activity ratio, which is simply the ratio of the sum of the estimated activities to the sum of the simulated activities; an activity ratio of one describes the most accurate estimation. The formula is expressed in equation 2.11:

Activity ratio = Σ_{n ∈ circuit} As_est(n) / Σ_{n ∈ circuit} As_sim(n)   (2.11)

where As_est(n) and As_sim(n) are the estimated and simulated activities of net n, respectively.
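The three metrics above can be computed directly from paired per-net activity values. The following Python sketch implements equations 2.9 to 2.11 as written (the example activity lists are made-up illustrative numbers, not data from [3]):

```python
def avg_relative_error(est, sim):
    """Equation 2.9: mean per-net relative error (simulated values nonzero)."""
    return sum(abs(e - s) / s for e, s in zip(est, sim)) / len(sim)

def r2_correlation(est, sim):
    """Equation 2.10: r2 = ss_xy^2 / (ss_xx * ss_yy)."""
    n = len(est)
    mx, my = sum(est) / n, sum(sim) / n
    ss_xy = sum((x - mx) * (y - my) for x, y in zip(est, sim))
    ss_xx = sum((x - mx) ** 2 for x in est)
    ss_yy = sum((y - my) ** 2 for y in sim)
    return ss_xy ** 2 / (ss_xx * ss_yy)

def activity_ratio(est, sim):
    """Equation 2.11: total estimated over total simulated activity."""
    return sum(est) / sum(sim)

# Illustrative per-net activities (hypothetical four-net design).
estimated = [0.10, 0.22, 0.31, 0.40]
simulated = [0.10, 0.20, 0.30, 0.45]
```

A perfect estimator gives average relative error 0, r2 of 1, and activity ratio 1, which matches the "Simulation" column of table 2.1.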
Table 2.1. Performance comparison of activity estimators [3]

                  Simulation       Sis 1.2         ACE 1.0
                  Comb    Seq      Comb    Seq     Comb    Seq
Activity ratio    1       1        1.17    1.40    1.21    2.37
Avg error         0       0        0.02    0.09    0.07    0.08
r2                1       1        0.88    0.72    0.89    0.47
Runtime (sec)     122     191      ∞       ∞       31      108
2.2.4 ACE 2.0 Activity estimation tool
The above three parameters serve as metrics for the accuracy of an estimator. Lamoureux et al. [3], who proposed the ACE 2.0 activity estimator, compared the results of two estimators as shown in table 2.1. The authors state that even for the smallest sequential circuits, Sis 1.2 [8] is unable to produce estimates in finite time because of insufficient storage; the required execution time grows exponentially with the number of inputs to the sequential elements. One observation is that even though the tools cannot produce estimates in finite time for many sequential circuits, the estimates for combinational logic are relatively accurate and execute faster than those for sequential circuits. To overcome the slow estimation for sequential circuits, the authors proposed the new ACE-2.0 [3] activity estimator. Since Sis 1.2 is exponentially slow with an increasing number of inputs and ACE-1.0 is not very accurate, power aware CAD tools demand a fast and accurate estimator; ACE-2.0 aims to deliver both speed and accuracy. The proposed algorithm works in three phases, as detailed in figure 2.3.
Static power and switching power are the two most important components in the calculation of power consumption. As seen earlier in the results of table 2.1, sequential feedback takes the longest execution time, and circuits with a large number of flip-flops are usually error prone in the calculation of probabilities. The new ACE tool works in three phases (see figure 2.3). The first phase simplifies the calculations to reduce execution or simulation time. As discussed, sequential feedback is one of the major contributors
Figure 2.4. Simplification phases for reducing time complexity in ACE algorithm [3]
for extra simulation time. This phase ignores the sequential feedback and then simulates the remaining logic. The assumption made here is that the logic involved in the sequential feedback is very small compared to the remaining logic. Another method implemented for reducing time complexity is simulating switching probabilities instead of switching activities. A net can switch more than once during one clock cycle before reaching a steady state value, but observing the final steady state value on the wire after a complete cycle determines the switching probability. It is computationally cheaper to simulate probabilities than to monitor every single change occurring on every wire. We will call the first simplification simplify1 and the second simplify2 (see figure 2.4).
In phase 2 of the algorithm, the switching probabilities ignored in the sequential feedback loops are calculated. A lag-one model is used to calculate the switching probabilities. For faster simulation, the gate level network is collapsed into a smaller network. Binary Decision Diagram (BDD) and partial collapsing methodologies are used for forming the collapsed network. The partial collapsing method is illustrated in figure 2.4. Since
Table 2.2. ACE 2.0 accuracy and performance [8]

        Avg r2   Act rat   Rel err   Runtime   Diff     P save
seq     0.97     0.97      0.03      2.3       −0.1     8.8
comb    0.86     1         0.02      25.9      0.6      14
it is possible that the collapsed network includes spatial correlations, the estimated activities will be imprecise. Collapsing into smaller networks assures more accuracy; however, the trade-off between simulation speed and accuracy is the key consideration. In the BDD pruning method, a pruning threshold probability (min_sw_prob) is chosen: any probability below the threshold is not considered in the calculation, such sub-trees are removed, and Ps = 0 is assigned to the parent node of the removed sub-tree. The algorithm executes recursively from the root to the leaves of the tree. Phase 3 of the algorithm addresses glitches. Many optimization tools ignore propagation delays and race conditions; here the propagation of glitches is studied and, using the lag-one model, the existing algorithm is modified to improve the calculation of switching probabilities. The results shown in table 2.2 describe the ACE-2.0 accuracy and performance.
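The pruning recursion described above can be sketched on a toy probability tree. This Python sketch is only an illustration of the threshold-and-recurse idea, not a real BDD implementation; the `Node` class and its fields are assumptions for the example.

```python
class Node:
    """Toy probability tree node; a stand-in for a BDD sub-tree."""
    def __init__(self, name, sw_prob, children=()):
        self.name = name
        self.sw_prob = sw_prob
        self.children = list(children)

def prune(node, min_sw_prob):
    """Recursively drop sub-trees whose switching probability is below the
    pruning threshold; the dropped branch contributes Ps = 0 to its parent."""
    kept = []
    for child in node.children:
        if child.sw_prob < min_sw_prob:
            child.sw_prob = 0.0   # assign Ps = 0 and ignore the sub-tree
        else:
            prune(child, min_sw_prob)
            kept.append(child)
    node.children = kept
    return node

# Sub-trees "a" and "c" fall below the threshold and are pruned away.
root = Node("root", 0.5, [
    Node("a", 0.001),
    Node("b", 0.3, [Node("c", 0.0005)]),
])
prune(root, min_sw_prob=0.01)
```

The design choice is the same speed/accuracy trade-off the text describes: a larger threshold prunes more aggressively and simulates faster, at the cost of discarding low-probability activity.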
2.2.5 FPGA SPICE for power measurement
Probability based power estimators are less accurate and not very flexible to architectural modifications of the FPGA model. The FPGA-SPICE tool, developed by Xifan Tang and Giovanni De Micheli in 2015 [4], outperforms the existing VersaPower [10] power estimator built inside the VPR engine [7]. The tool is built on top of the VTR suite. In the SPICE flow (see figure 2.5), VersaPower [10] is replaced by the FPGA-SPICE [4] engine, which outputs a SPICE-level netlist to the HSPICE simulator. The simulator also takes pre-built SPICE models for the components in the FPGA. The tool is able to generate the netlist at three levels of abstraction: with reference to an FPGA architecture, a full-chip, grid, or component
level netlist can be generated. A full-chip level netlist requires large memory storage to execute. FPGA-SPICE generates a transistor level netlist of the components present in the FPGA architecture. The granularity of the SPICE simulation models is illustrated in figure 2.6.

Figure 2.5. FPGA-SPICE CAD tool flow [4]
SPICE performs better for power calculation: it predicts power numbers that account for parasitic delays as well as activities. The runtime, memory, and power results of FPGA-SPICE can be analyzed from table 2.3, where the benchmark s298 is evaluated for three different process technologies. Chip level power estimation takes the maximum time and requires the maximum memory compared to grid and component level SPICE simulations.
Figure 2.6. Grid level and Component level simulation testbench blocks in FPGA SPICE [4]
Table 2.3. FPGA-SPICE runtime, memory, and power for benchmark s298 [4]

               Runtime (min)            Memory (Mb)           Power (mW)
Process tech   22nm    45nm    180nm    22nm   45nm   180nm   22nm   45nm   180nm
Chip           129.48  106.15  102.56   4780   4827   4306    1.56   4.13   15.63
Grid           10.27   9.82    8.25     768    768    825     1.41   3.37   18.03
Component      7.42    6.97    6.23     589    584    621     1.45   3.21   17.57
Figure 2.7. Effect of temperature on leakage current in CMOS devices [5]
2.3 Power and Thermal modeling with CAD
Total power consumption in VLSI circuits has two major components: dynamic power and static power. We have seen a detailed description of both power dissipation components in the previous section. Static power consumption is mainly a contribution of leakage current (gate and sub-threshold). The relationship between leakage current and temperature is plotted in figure 2.7. Looking at the equation of dynamic power dissipation (equation 2.6), dynamic power is not strongly related to temperature. However, static dissipation is related to sub-threshold and gate leakage, which are direct functions of temperature (equation 2.7).
From figures 2.8 and 2.7, the direct relationship between temperature and power consumption can be clearly seen. Pdyn is not exponential with temperature, but higher-frequency designs dissipate more dynamic power, which in turn raises the junction temperature (see equation 2.12):

Tj = Tamb + Ptotal · θja   (2.12)
Figure 2.8. Variation in power consumption and temperature with activities at input [3]
Here, Tj is the junction temperature, Tamb is ambient temperature, Ptotal is total power
consumption, and θja is junction to ambient thermal resistance.
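Equation 2.12 is a one-line calculation. The following sketch uses illustrative values that are not from the thesis (25 °C ambient, 2 W total power, 12 °C/W junction-to-ambient resistance):

```python
def junction_temperature(t_ambient, p_total, theta_ja):
    """Equation 2.12: Tj = Tamb + Ptotal * theta_ja."""
    return t_ambient + p_total * theta_ja

# Illustrative (assumed) operating point: 25 C ambient, 2 W, 12 C/W.
tj = junction_temperature(25.0, 2.0, 12.0)   # 49.0 C
```

The feedback loop the text describes follows directly: higher power raises Tj, which raises leakage, which raises total power again.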
The HotSpot thermal simulator [14] is popular for block level thermal simulation and is widely used by the CPU and computer architecture design community. Re-configurability for different architectures is one of the advantages of HotSpot. In its thermal model, the active semiconductor layer is divided into a number of units, and each unit is treated as a node with thermal resistances in the vertical and horizontal directions; this accounts for lateral heat spreading. Earlier thermal simulators could only simulate at the system or package level, but with high active silicon power consumption a more detailed and accurate simulation model was needed. HotSpot solves its RC thermal circuit with a fourth-order Runge-Kutta method [15], and the differential equation solver is self-adaptive for different unit sizes. SESC-Therm [16] is another thermal simulation model, based on the principle of the Finite Difference Method (FDM) [17]: the problem is divided into multiple segments and each segment is evaluated separately, since solving a differential equation over a very large segment is time expensive. However, the complexity of FDM lies in analyzing thermal conduction across the segments. The core of this algorithm is similar to HotSpot: each node is characterized by a source, sink, thermal resistance, and capacitance, and these characteristics update with changes in temperature and boundary conditions. The steady state power blurring algorithm [18] [19] solves thermal models using image processing techniques: a filter, or so-called thermal mask, is convolved with the power value matrix. The thermal mask is implemented with lookup tables to reduce execution time.
2.3.1 3D-ICE thermal simulation tool
We use the 3D-ICE thermal simulator [20] for generating temperature values for the FPGA. This tool was developed by the Embedded Systems Laboratory (ESL) research group at EPFL. The primary contribution of the 3D-ICE tool is a compact transient thermal model (CTTM), mainly for micro- or macro-fluidic cooling; modeling of fluidic cooling is not within the scope of this thesis. The software is written in C, uses the Flex and Bison tools, and is compiled with gcc. The tool takes as input the physical parameters and a description of the IC layer(s), called the stack description file; a floorplan file, which includes the power dissipation of each logic unit; and the simulation time.
The power distribution generated by the VPR [7] and FPGA-SPICE [4] tools is parsed. From the architectural input provided, 3D-ICE [20] generates a thermal grid. The output file (thermal_profile.tmp) dumps the temperature values of the logic elements. Visual plotting can be performed with the help of MATLAB tools.
An example FPGA stack file is described below:

    material SILICON :
       thermal conductivity     1.30e-4 ;
       volumetric heat capacity 1.628e-12 ;

    material BEOL :
       thermal conductivity     2.25e-6 ;
       volumetric heat capacity 2.175e-12 ;

    top heat sink :
       heat transfer coefficient 1.0e-7 ;
       temperature               300 ;

    dimensions :
       chip length 12000, width 12000 ;
       cell length 100,   width 100 ;

    layer PCB :
       height   10 ;
       material BEOL ;

    die BOTTOM_IC :
       layer  10  BEOL ;
       source  2  SILICON ;
       layer 100  SILICON ;

    stack :
       die   FPGA_DIE_1  BOTTOM_IC floorplan "./alu4_golden.dat.flp" ;
       layer CONN_TO_PCB PCB ;

    solver :
       steady ;
       initial temperature 300.0 ;

    output :
       Tmap ( FPGA_DIE_1, "alu4_golden.tmp", final ) ;
2.4 Summary
In this chapter, we discussed the main sources of power dissipation in integrated circuits and FPGAs. We also looked at simulation based and probability based power estimation techniques, and concluded that simulation-based power estimation is more accurate than probability based estimation. Power consumption depends on the activity associated with the nets in a design. The modified ACE activity estimation tool performs better than Sis and the ACE 1.0 version; we will use the ACE 2.0 tool in the thermal and power aware CAD flow for FPGAs. We also described the relation between power and temperature in circuits and surveyed the available thermal models. Comparing the performance and accuracy of the various models, the 3D-ICE thermal simulation tool is easy to use for stacked and 2D FPGAs.
CHAPTER 3
CAD FOR PLACEMENT AND ROUTING ALGORITHMS
3.1 Introduction
We have discussed the island style architecture of FPGAs and the typical FPGA design flow in section 1.2. FPGAs are programmed with a binary stream (bit-stream) which configures hundreds of thousands to millions of programmable switches to form a routing network. To perform this kind of programming by hand, an engineer would need to set the state of every switch and connection configuration; implementing even a small design of a hundred equivalent gates would take an enormous amount of time. In the FPGA implementation flow, an engineer writes an RT-level program in an HDL, and synthesis (e.g. XST) and implementation tools (computer aided design tools) take care of generating the configurations of millions of connections and routes. Researchers have delved into the problem of the most efficient and optimal algorithms for packing, placing, and routing. After extensive research on architecture (the values of K, N, I), most commercial FPGA companies now follow a common architectural trend. The CAD tools are already aware of speed and area efficiency; their focus is now on making the design power and thermal aware at the implementation phase, or even earlier [21] [22]. In this chapter, we will look at the FPGA flow in detail and at the CAD tools implementing each stage of the flow.
3.2 Packing: VPack for logic block packing
Figure 3.1 explains the basic logic element (BLE) and the formation of a cluster of multiple BLEs [7]. The total number of inputs to a single look-up table (LUT) is denoted by k. Each cluster consists of N BLEs. I input wires feed the cluster and propagate further to the BLEs inside it. Mclk is the number of clock inputs to a cluster. Most research and publications conclude that k equal to four and Mclk equal to one work well. The I external inputs are fed in through multiplexers, and the output of each BLE is also fed back to the multiplexers in order to provide local connectivity inside a cluster. Since all inputs and outputs are connected such that every possible connection of an input or output to any BLE can be made, the logic forms a fully connected structure. Adding one more logic element to an existing cluster sometimes even decreases I for the cluster (see figure 3.1).

Figure 3.1. Combining two BLEs for reducing total inputs to a cluster.
VPack [7] [6] packs the LUTs into clusters. It takes a LUT netlist as input, along with the parameters explained above (LUT size, number of BLEs per cluster, input size, Mclk). The algorithm can be explained in two phases. The first phase is called packing of BLEs: a pattern matching algorithm that binds a register and a LUT (or LUTs) is used to pack compatible components into a single BLE. For instance, one register and one LUT can be packed into a single BLE; however, if the design has two LUTs and one register, they are packed into two separate BLEs, one consisting of a register and a LUT and the other of a LUT alone. The second phase packs the newly formed BLEs into clusters. There are a couple of optimization goals tied to this process. The first is to minimize the number of logic clusters needed; stated differently, to fill every used cluster to its maximum capacity (N). However, the FPGA routing network connects the clusters globally. As discussed earlier in section 3.2, the I inputs to a cluster are external, and are potentially the outputs of other clusters or of the same cluster. Optimizing for a minimum number of cluster inputs (I) hugely reduces the routing complexity. The pseudo-code is given in figure 3.2. To form a cluster, a few constraints must hold: the number of basic logic elements (BLEs) cannot exceed N, the number of different clock sources entering a cluster cannot exceed Mclk, and the total number of inputs to a cluster cannot exceed I. Phase two starts with a greedy packing of BLEs into clusters; if the greedy algorithm is unable to fill a cluster to its capacity, a hill climbing step is used to utilize the maximum of the cluster. Initially, an un-clustered BLE is selected; if more than one such BLE exists, the one with the most used inputs is selected. This ensures that the BLE will utilize most of the (I) inputs of the cluster. The attraction of a BLE to a cluster is proportional to the number of inputs and outputs the BLE has in common with the cluster (equation 3.1). After the greedy and hill climbing steps, high attraction BLEs are packed into clusters.

Attraction = | inputs_and_outputs(BLE) ∩ inputs_and_outputs(C) |   (3.1)

The critical step in phase two is adding BLEs such that the input set of the cluster does not grow: the BLE whose addition causes the smallest growth (∆) of the cluster's input set is chosen.
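The greedy phase and the attraction function of equation 3.1 can be sketched as follows. This is a simplified Python illustration, not the actual VPack implementation: each BLE is modeled as a set of net names, the net-set size stands in for the used-input count, and the hill climbing step and clock constraint are omitted.

```python
def attraction(ble_nets, cluster_nets):
    """Equation 3.1: nets a BLE shares with the nets already in a cluster."""
    return len(ble_nets & cluster_nets)

def pack_greedy(bles, N, I):
    """Greedy phase of a VPack-style clusterer: seed each cluster with the
    unclustered BLE using the most nets, then absorb the highest-attraction
    BLE while the cluster stays within N BLEs and I distinct nets."""
    unclustered = sorted(bles, key=len, reverse=True)  # seed: most used inputs
    clusters = []
    while unclustered:
        seed = unclustered.pop(0)
        cluster, nets = [seed], set(seed)
        while len(cluster) < N and unclustered:
            best = max(unclustered, key=lambda b: attraction(b, nets))
            if len(nets | best) > I:   # would exceed the cluster input limit
                break
            unclustered.remove(best)
            cluster.append(best)
            nets |= best
        clusters.append(cluster)
    return clusters

# Hypothetical BLEs: the first two share nets "a" and "b" and pack together.
bles = [{"a", "b", "x"}, {"a", "b", "y"}, {"c", "d", "z"}]
clusters = pack_greedy(bles, N=2, I=4)
```

High attraction here is exactly shared connectivity, so BLEs that communicate end up in the same cluster and consume fewer external cluster inputs.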
Figure 3.2. VPack algorithm pseudo-code
3.3 Placement
The placement stage in the FPGA CAD flow places logic blocks within a physical device. An optimal placement algorithm has different optimization goals depending on the application. A wirelength driven placer optimizes for minimum distance and minimum wire required in the global as well as the local routing network; it packs and places the design densely and ensures dependent blocks are placed close together. A timing driven placer places the blocks considering maximum frequency and circuit speed. A min-cut placer divides the logic blocks into two or more partitions such that the number of nets or wires crossing the partitions is minimized. An analytical placement algorithm [23] solves constraint-based equations to minimize the wirelength; it works with coordinates and geometric equations to solve an objective function. We will discuss the simulated annealing algorithm in detail.
3.3.1 Simulated Annealing
In 1930, the traveling salesman problem (TSP) was presented, which aims at calculating the optimal tour over a set of locations: every location must be visited and the tour ends at the starting point. The nearest neighbor approach does not yield the optimal tour. The problem is NP-complete (non-deterministic polynomial time complete), and its difficulty grows rapidly with the number of locations, since the search space grows as (n − 1)!. In its native form, annealing is a process from condensed matter physics in which a solid in a heat bath is energized by raising the temperature. At high temperature, particles are at a high energy level with high entropy, and at very high temperature solids transform to a liquid state. The solid annealing process starts at such a high temperature (high particle energy), called the initial temperature. The temperature is then reduced in steps, and with each step the material is allowed to reach thermal equilibrium. The probability of the solid being at energy E at a temperature T is given by the Boltzmann distribution [24]:

P(Energy = E) = (1 / Z(T)) · exp(−E / (kβ · T))   (3.2)

Z(T): partition function used for normalization
kβ: Boltzmann constant
exp(−E / (kβ · T)): Boltzmann factor

From equation 3.2, with a reduction in temperature the solid reaches minimum energy, and at temperature T = 0 the solid attains the minimum of all energies. However, if the temperature reduction steps are so fast that the solid does not attain equilibrium at every step, the solid will not reform into a low energy dense crystal but will have a meta-stable, defect prone, brittle structure.
Figure 3.3. Pseudo-code for simulated annealing based placer
Simulated annealing for placement The pseudo-code for SA based placement is described in figure 3.3 [6].

Cost = α · (∆wiring_cost / wiring_cost_old) + (1 − α) · (∆time_cost / time_cost_old)   (3.3)

α decides the weight of the wiring optimization cost, while (1 − α) weights the delay optimization cost of the placement. Whenever the change in cost (∆Cost = ∆C) is negative, the cost of the swap is reduced and the move is accepted. Because the cost landscape may contain local minima, with the global minimum lying beyond a hill, not all cost-increasing moves can be rejected. An increase in cost is therefore accepted with probability

P(acceptance) = e^(−∆C / T)   (3.4)
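The accept/reject rule of equation 3.4 is the core of the annealer. A minimal Python sketch (illustrative only; the counts below just demonstrate the hot-versus-cold behavior):

```python
import math
import random

def accept_move(delta_cost, temperature):
    """Equation 3.4: always accept improving moves; accept uphill moves with
    probability exp(-dC/T), which lets the placer escape local minima."""
    if delta_cost <= 0:
        return True
    return random.random() < math.exp(-delta_cost / temperature)

random.seed(0)
hot = sum(accept_move(1.0, 100.0) for _ in range(1000))   # nearly all accepted
cold = sum(accept_move(1.0, 0.01) for _ in range(1000))   # essentially none
```

At high temperature almost every uphill swap is accepted (exploration); as T falls, only improving swaps survive (refinement), mirroring the physical annealing analogy above.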
The temperature reduction rate of the simulated annealing determines the terminating condition of loop 1, and loop 2 determines the number of swaps (the equilibrium condition). For different designs and circuit netlists, the loop conditions should vary accordingly, so an adaptive annealing schedule is required for the loop termination and temperature update conditions. The first schedule presented here was proposed by Huang et al. [25] and was developed for VLSI standard cell placement. The methodology starts from a random placement and randomly moves blocks N times (N is usually the number of nets). After this finite number of moves, it calculates the standard deviation of the cost over the moves, and the initial temperature for the simulated annealing loop is set to 20σ [26]. The value of λ is typically kept at 0.7. The temperature update step is governed by the following equation [27]:
Tupdate = Told · e^(−λ · Told / σ)   (3.5)
One of the major drawbacks of implementing the Huang and Lam [24] schedulers is their time complexity: the solver takes a prohibitively long time for large netlists. Swartz and Sechen [27] modified the schedule (equation 3.6):

Tnew = [1 − (α − 0.44) / 40] · Told   (3.6)

Here, 0.44 is the statistically derived desired acceptance rate and 40 is a damping coefficient.
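The two update rules can be sketched side by side. Note one assumption: the extracted form of equation 3.5 lost its sign, so the exponent is written negative here so that the temperature actually decreases, consistent with the surrounding discussion.

```python
import math

def lam_huang_update(t_old, sigma, lam=0.7):
    """Equation 3.5 (exponent taken as negative so T always decreases):
    T_new = T_old * exp(-lam * T_old / sigma)."""
    return t_old * math.exp(-lam * t_old / sigma)

def swartz_sechen_update(t_old, alpha):
    """Equation 3.6: damp T according to how far the acceptance rate alpha
    is from the desired 0.44."""
    return (1 - (alpha - 0.44) / 40) * t_old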
3.3.2 Versatile Placement and Routing (VPR)
In the Versatile Placement and Routing (VPR) tool flow [7] (see figure 3.4), an FPGA device is viewed as many empty boxes or blocks in which clusters are placed (refer to figure 3.5). On the command line, the VPR tool takes an .xml input file which describes the underlying architecture of the FPGA in which the design will be placed and routed. This architecture XML includes information about the number of I/O pins, pad limited or core limited I/O pads, and routing flexibility. A few extra details about the size of each component (logic block or I/O pad) are also defined in the XML file. The architecture file also allows setting or changing the locations of the I/O pins on the FPGA. However, many commercial FPGA vendors develop the PCB and FPGA in different teams, so a locked pin configuration limits this ability to change pin locations in the FPGA architecture. The tool uses the simulated annealing algorithm for placement of logic blocks; the algorithm details were explained above, and this section introduces the VPR version of the algorithm. The performance of the simulated annealing algorithm depends on three important parameters: the temperature reduction rate of the annealing process, the number of moves at each temperature step, and the exit condition of the outer annealing loop. Adaptive annealing scheduling has been a popular research topic, and the VPR tool draws on the scheduling work in [28] [29] [27] to derive these three adaptive parameters. The initial temperature is calculated as 20 times the standard deviation of the cost over the first N swaps. The number of moves at each temperature is usually calculated as:
swaps. A number of operations in one temperature change is usually calculated as:
10 ∗ (Number of blocks)4/3 (3.7)
For third parameter to calculate the exit condition of annealing loop is expressed by:
T <ε.Cost
Nnets
(3.8)
ε that is used in VPR is 0.005.
Rupdated = Rold · (1 − 0.44 + α)   (3.9)

The block interchange policy is implemented such that an interchange of two blocks is accepted only when the exchange range R is less than or equal to Rmax. Here, α (the fraction of accepted moves) is the control knob for the range limiter: the update keeps moves small and local when the acceptance rate drops, which keeps the probability of acceptance near the target.
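The three adaptive parameters and the range-limit update can be sketched together. This is an illustrative Python sketch, not VPR source code; the clamping bounds in `update_range_limit` and the sign in equation 3.9 (written as 1 − 0.44 + α, so the range shrinks when acceptance is low) follow the usual VPR formulation.

```python
import statistics

def initial_temperature(first_move_costs):
    """VPR: start at 20 times the standard deviation of the cost over the
    first N random swaps."""
    return 20.0 * statistics.pstdev(first_move_costs)

def moves_per_temperature(num_blocks):
    """Equation 3.7: 10 * (number of blocks)^(4/3)."""
    return int(10 * num_blocks ** (4.0 / 3.0))

def annealing_done(temperature, cost, n_nets, eps=0.005):
    """Equation 3.8: exit the outer loop once T < eps * Cost / Nnets."""
    return temperature < eps * cost / n_nets

def update_range_limit(r_old, alpha, r_max):
    """Equation 3.9: shrink the swap range when the acceptance rate alpha
    falls below the 0.44 target, grow it otherwise (clamped to [1, r_max])."""
    return min(r_max, max(1.0, r_old * (1 - 0.44 + alpha)))
```

For a 100-block design this schedule performs a few thousand swaps per temperature step, and the loop exits once the temperature is negligible relative to the per-net cost.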
Figure 3.4. VPR CAD tool flow [6]
3.4 Linear congestion
Figure 3.5. VPR assumed FPGA model [7]

One of the major issues in globally routing a design is congestion along its paths. As described in the FPGA architecture, a device has a fixed number of routable paths or metal tracks per channel. The wider the routing channel, the easier the design is to route, and the more alternatives exist for routing it efficiently. However, the trade-off of adding extra tracks is area and wirelength. The optimal number of tracks for an FPGA device depends on many other parameters: logic blocks, switch-box, aspect ratio, etc. The optimization goal of a router is to route a complex design with minimum wire and connections. This optimization can create congestion in one or many paths, since the critical path has the least slack and an added connection may not find a free routing channel. The congestion problem is solved by assigning a cost to each path: a higher cost indicates high usage and steers the routing onto other paths. The mathematical equation for the congestion cost is given as:
congestion cost is given as:
Cost of congestion =Nnets∑n=1
q(i)
[bbx(n)
Cavgx(n)β+
bby(n)
Cavgy(n)β
](3.10)
In this equation, Nnets is the total number of nets, and the x and y components are given by the bounding box (BB) model. The factor q(n) compensates for the fact that, for nets with more than three terminals, the bounding-box estimate underestimates the wire length; q is a function of the number of nodes to which a net is connected and equals one for three or fewer nodes. The bounding box estimates the perimeter of the box, i.e., the maximum wire length. Cavg is the average channel width in the x and y directions; it is usually a fixed value for a particular family or a particular FPGA device. The exponent β allows modifying the cost of wide or broad channels. The linear congestion model outperforms previously proposed non-linear models in time complexity and requires fewer channel tracks [30]. The efficiency of this algorithm is further increased by incremental bounding box updates: the old bounding box data is saved, and after a swap of one or more nodes only the congestion of the swapped nodes is recalculated. This addition achieves an average 4.99X speed-up across 10 MCNC benchmarks.
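The linear congestion cost of equation 3.10 can be sketched in a few lines of Python. This is an illustrative reimplementation, not VPR code; the net record layout and the linear ramp used for q(n) are assumptions made for this sketch (VPR uses a tabulated set of q values).

```python
# Illustrative reimplementation of the linear congestion cost (eq. 3.10).

def q_factor(num_terminals):
    """Compensation factor q(n): 1.0 for nets with three or fewer terminals,
    then growing slowly to offset the bounding-box wire-length underestimate."""
    if num_terminals <= 3:
        return 1.0
    return 1.0 + 0.05 * (num_terminals - 3)  # toy ramp, not VPR's table

def congestion_cost(nets, beta=1.0):
    """Sum over nets of q(n) * [bb_x(n)/C_avg_x(n)^beta + bb_y(n)/C_avg_y(n)^beta]."""
    total = 0.0
    for net in nets:
        total += q_factor(net["terminals"]) * (
            net["bb_x"] / net["c_avg_x"] ** beta
            + net["bb_y"] / net["c_avg_y"] ** beta)
    return total

nets = [
    {"terminals": 2, "bb_x": 3.0, "bb_y": 2.0, "c_avg_x": 10.0, "c_avg_y": 10.0},
    {"terminals": 5, "bb_x": 6.0, "bb_y": 4.0, "c_avg_x": 10.0, "c_avg_y": 8.0},
]
total_cost = congestion_cost(nets)
```

Raising β above 1 makes wide channels comparatively cheaper, mirroring the role of the exponent in equation 3.10.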
3.5 CAD for routing algorithms
3.5.1 Introduction
It is a known fact that most of the CAD back-end effort goes into routing algorithm optimization. The VPR developers have implemented two different types of routing algorithms [30]. We have discussed the timing-driven and the routability-driven calculations; the delay and routability calculations are used in the routing algorithms. Figure 3.6 shows the CAD flow, in which the routing stage follows the placement stage. Routing is one of the most important phases in the FPGA CAD flow, since the delay, power, hotspots, and many side-channel parameters are calculated after the routing stage. It is observed that the routing network is a major contributor to power consumption. The routing program requires access to the FPGA device architectural details (see figure 3.6):
• Total I/O pins per logic block
• Orientation of logic blocks and the number of sides on which I/O pins are available
• Core-limited or pad-limited I/O pads
• Fs: flexibility of the routing channel
Figure 3.6. Position of router in VPR CAD
• Widths of the metal used in routing; clock and power routing details
• Widths of the vertical and horizontal routing channels
• Switch-box type and connectivity structure
• Wire lengths between logic blocks
• Length of connections within the switch-box (internal population)
• Connection box internal population wire length
The VPR routability-driven router uses the PathFinder algorithm [7] (see figures 3.7 and 3.8). It utilizes the linear congestion model explained in section 3.4. Each path has an associated cost that is considered when routing the nets. The connection criticality and the slack
Figure 3.7. Traditional router methodologies of expansion and sinking wires
in the routing network are calculated in the PathFinder program. The cost of a logic block (node) is given by,
Cost(n) = ConnCrit(i, j) · delay(n) + [1 − ConnCrit(i, j)] · [b(n) + h(n)] · p(n) (3.11)
b(n): basic fixed cost of the node
h(n): history cost
p(n): present congestion penalty
Details about the PathFinder algorithm can be found in [31].
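Equation 3.11 can be written as a small helper. This is a minimal sketch with illustrative values, not the actual VPR implementation; note that the congestion term (b(n) + h(n)) is multiplied by the present-usage penalty p(n).

```python
# Sketch of the PathFinder node cost of equation 3.11. Values are
# illustrative; a real router updates h(n) and p(n) across iterations.

def node_cost(conn_crit, delay, base_cost, history_cost, present_cost):
    """Cost(n) = ConnCrit*delay(n) + (1 - ConnCrit)*(b(n) + h(n))*p(n)."""
    return (conn_crit * delay
            + (1.0 - conn_crit) * (base_cost + history_cost) * present_cost)

# A fully critical connection is priced purely by its delay, while a
# non-critical one is priced purely by congestion:
critical_cost = node_cost(1.0, 5.0, 1.0, 2.0, 3.0)
noncritical_cost = node_cost(0.0, 5.0, 1.0, 2.0, 3.0)
```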
Delay measurement in VPR is based on the Elmore delay model [32]. The routing network present in an FPGA can be characterized as a network of resistances and capacitances (see figure 3.9). Metal widths and lengths determine the (linear) resistance, while adjacent parallel metal wires are responsible for the capacitances. The RC time delay can then be calculated with the Elmore delay model. The total delay of a routing network can be described as the linear and Elmore delays of buffers, switches, and pass transistors. Elmore delay model based measurement
Figure 3.8. VPR router methodologies of expansion and sinking wires
is an accurate method for calculating delay with high fidelity. The time complexity of the calculation is linear in the number of nets. For circuits with millions of gates, even linear-time recomputation on every change is undesirable, so VPR keeps a history of the capacitance and resistance values; when the logic placement changes, it does not need to recompute from scratch. Result analysis by [6] for 20 MCNC benchmarks concludes that the timing-driven router is very fast.
3.6 Delay measurement techniques
Delay measurement and timing calculations are one of the most important information
needed for performance evaluation of any VLSI system. Two different algorithm imple-
mentations are evaluated on the basis of delay calculation. Best and most accurate way
of characterizing precise delay is SPICE simulation. However, simulating a transistor level
netlist takes an exponential amount of time for thousands of nets. In literature, RC-network
in tree or pi configurations is evaluated. Penfield and Rubinstein [33] proposed a delay model
that considered RC tree network.
Figure 3.9. Circuit representation of routing switches and routing network
The VPR delay extraction algorithm works on the principle of Elmore delay. This algorithm calculates the downstream capacitances. We use the following Elmore delay model based example to explain the delay calculation steps and the time complexity associated with an RC network (see figure 3.10).
Delay(source to sink) = Σ(i ∈ interconnect) [ Ri · C(subtreei) + Td,i ] (3.12)
Here, Td,i is the intrinsic delay of buffer i. Looking at node out1, the delay can be calculated as,
Delay = R ∗ (C +C1 +C2 +C3) +R ∗ (C1 +C2 +C3) +R ∗ (C2 +C3) +R ∗ (C3) (3.13)
This simplifies to,
Delay = R ∗ (C + 2C1 + 3C2 + 4C3) (3.14)
Figure 3.10. Intuitive example to solve for Elmore delay for RC routing network
Also, for the out2 node,
Delay = R ∗ (C + 2C4 + 3C5) (3.15)
The total source-to-sink delay in an FPGA device is the Elmore delay plus the fixed, finite switch-box and connection-box delays.
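The example above can be checked numerically. The sketch below sums, for each resistor on the source-to-sink path, the resistance times its total downstream capacitance; the capacitance values chosen here are arbitrary.

```python
# Numerical check of the Elmore example (equations 3.12-3.14) for the out1
# branch: a chain of identical resistances R, each seeing the total
# capacitance downstream of it.

def elmore_delay(resistances, caps):
    """resistances[i] drives caps[i:] (everything downstream of node i)."""
    return sum(r * sum(caps[i:]) for i, r in enumerate(resistances))

R, C, C1, C2, C3 = 1.0, 4.0, 3.0, 2.0, 1.0
delay = elmore_delay([R, R, R, R], [C, C1, C2, C3])
closed_form = R * (C + 2 * C1 + 3 * C2 + 4 * C3)  # equation 3.14
```

The term-by-term sum and the closed form of equation 3.14 agree, because C1 appears under two resistors, C2 under three, and C3 under four.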
Calculating the critical path of a circuit is also one of the important aspects of timing-driven routing. Static timing analysis and the speed of the design are directly impacted by the critical path. Measuring the critical path involves three steps: the delay of every individual component in the hierarchy (gate or block), the delay between two nodes (interconnection delay), and a path-based analysis using these two values. The CAD tools calculate the delay and static timing after the routing phase. The delay of the logic blocks is a known quantity at the beginning of the placement stage, since the number of logic blocks is known after packing; routing delay, however, can be measured only after a successful routing phase. The CAD flow for delay measurement and routing track evaluation is described in the flow chart (see figure 3.11).
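The three measurement steps above can be sketched as a longest-path computation over a timing graph, where node weights are block delays and edge weights are interconnect delays. All delays below are toy values; a production STA engine would also handle clock domains and setup/hold checks.

```python
# Longest arrival time over a topologically ordered timing graph.

def critical_path_delay(block_delay, fanout, topo_order):
    """fanout[u] = list of (v, wire_delay); topo_order is a topological order."""
    arrival = {n: block_delay[n] for n in topo_order}
    for u in topo_order:
        for v, wire in fanout.get(u, []):
            arrival[v] = max(arrival[v], arrival[u] + wire + block_delay[v])
    return max(arrival.values())

block_delay = {"in": 0.0, "lut_a": 1.0, "lut_b": 1.5, "out": 0.0}
fanout = {"in": [("lut_a", 0.5), ("lut_b", 0.5)],
          "lut_a": [("out", 0.7)], "lut_b": [("out", 0.2)]}
cpd = critical_path_delay(block_delay, fanout, ["in", "lut_a", "lut_b", "out"])
```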
3.7 Motivation for optimization in floor planning for thermo-power aware CAD
Power and thermal aware designs are in demand: with the advancements in FPGA architectures, the need for FPGAs in highly computational workplaces such as search engines and data centers is increasing. Even with high-quality heat sinks and cooling methods, high-performance computing devices are exceeding the maximum junction temperature. Static and dynamic power dissipation is increasing drastically as transistor sizes scale down. Despite low power design methods and cooling techniques, the reliability and life span of integrated chips is falling, mainly because of the thermal gradient.
For academic research purposes, very few accurate power and temperature estimators are available. We utilize open source tools and introduce thermal awareness into the design flow at the floor planning stage. Early estimation and optimization for power and temperature pave the way for high-performance architecture. Traditional placement and routing algorithms considered wire length and delay as the major optimization goals while placing and routing the cluster and slice netlist. We add a power aware constraint for placing the logic blocks to the simulated annealing cost function. The addition comes at the cost of delay and area. We propose an algorithm and analyze the simulation results in the next chapter.
3.8 Summary
In this chapter, we have discussed in detail the packing, placement, and routing algorithms in the FPGA design flow. We then described the Versatile Place and Route (VPR) tool for placing and routing FPGA logic blocks. The FPGA-SPICE tool is used for calculating grid level power numbers; it simulates grid, full-chip, or component level netlists for power estimation at different levels of the hierarchy. The proposed flow replaces VersaPower in the VPR flow with an HSPICE based power simulator. The ACE2.0 activity estimator is modified for accurate power estimation in FPGA CAD designs. We analyzed the timing and routability based routing algorithms. We ran simulated annealing with only wire-length and delay optimization constraints and compared it against the modified constrained cost. Various formulations and complexities for the terminating condition and loop scheduling of the annealing algorithm were studied. This chapter establishes the CAD tool perspective and knowledge that will help in understanding the thermal-aware CAD flow explained in the next chapter.
CHAPTER 4
THERMAL-AWARE FPGA PLACEMENT
4.1 Introduction
Among the many failure modes of electronic components, thermal shock in an integrated circuit chip is the phenomenon in which the thermal gradient across the chip causes irregular expansion due to stress or strain. It sometimes results in an arbitrary power profile, thereby burning or cracking the integrated circuit chip. When these chips are deployed in computationally intense environments, the reliability of the system is greatly impacted by thermal issues.
FPGAs are starting to see greater usage in backend data center applications. Recent attempts by Microsoft Corporation to implement the Bing search engine algorithm on FPGAs [34], Baidu migrating towards Xilinx FPGA based search, and many more are clear evidence that FPGAs will be increasingly placed in the mainstream data center. Workloads associated with search algorithms, deep learning neural networks, and image processing demand extreme performance. Generally, performance comes at the cost of thermal side effects that can impact reliability. The rising thermal and power budget demands of high performance computation systems require improvement in cooling as well as power aware design flows.
4.2 Various thermal aware design methods for FPGAs
There are four different levels of abstraction for low power and thermal aware FPGA design techniques. At the architectural level, LUT sizing, routing network optimization, and switch-box and connection box designs are considered for optimizing power; typically, a designer makes a good trade-off between power and performance. The CAD level deals with power aware design flows, in which the optimization function captures the parameters related to power dissipation and includes them to arrive at low power designs. The system level considers pipelining and low power modes. Device level design is related to low operating voltages and multiple voltage thresholds. To reduce static power consumption, gating strategies are often employed to switch off parts of the device that are not in use. With advancements in process technology, lowering the supply voltage (VDD) alone no longer reduces the total power consumption. For high performance designs, the clock frequency is increasing, and with it the leakage power. Clock gating is used to reduce glitch power, and dynamic voltage scaling (DVS) is another popular technique used to reduce dynamic power [11]. System design techniques such as package and heat sink design also play an important role in ensuring that the design meets the thermal budget. Heat sinks have proven effective at keeping the thermal budget when power density is increasing. Statistical vector-less power estimation techniques are used by designers to gain an idea of the power dissipation and the junction temperature TJ during the design phase.
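Why DVS is effective can be seen from the dynamic power relation P_dyn = a · C · VDD² · f. A back-of-the-envelope sketch, with all numbers hypothetical:

```python
# Dynamic power scales quadratically with supply voltage and linearly with
# frequency; scaling both together gives a roughly cubic saving.

def dynamic_power(activity, cap_farads, vdd_volts, freq_hz):
    return activity * cap_farads * vdd_volts ** 2 * freq_hz

p_nominal = dynamic_power(0.2, 1e-9, 1.0, 200e6)
p_scaled = dynamic_power(0.2, 1e-9, 0.8, 160e6)  # Vdd and f both scaled by 0.8
ratio = p_scaled / p_nominal  # roughly 0.8**3
```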
The term thermal design power (TDP) [35] is defined as the highest heat-producing power for which the cooling system is designed to dissipate the generated heat. TDP may not reflect the maximum power consumption of the chip, but it is the maximum power a chip can handle under the operational ambient conditions and with the available cooling mechanism. Typically, the peak power for a chip is approximately 1.5 times the TDP. Heat sinks and more aggressive cooling mechanisms such as liquid (water) cooling are considered best for managing the heat dissipation associated with high-performance systems. Embedded applications and mobile devices may not be able to use such cooling methods.
4.2.1 Heat-sink based cooling
Convection, radiation, and conduction are the primary ways of heat dissipation in integrated circuits. Heat transfer is observed when two materials with different thermal resistivities
Figure 4.1. Heat-sink structure for stacked or 2D FPGAs
transfer energy from the hot object to the cold one. Thermal resistivity can be defined as the capability of a material to resist heat flow. A heat sink consists of a low thermal resistance material placed on the integrated circuit chip for the purpose of keeping the chip temperature below the maximum thermal capability of the device (Tjmax).
Heat sink design plays an important part in thermal management. The heat sink profile, TIM (thermal interface material), and auxiliary cooling mechanisms all contribute to the effectiveness of the thermal management strategy [36]. Heat sinks are of two types: a passive heatsink consists of a metal plate, while an active heatsink contains an airflow unit, for instance a fan, to circulate air through the heatsink assembly.
The primary components of a typical heat sink based thermal management system are the heat sink, spreader, thermal interface material (TIM), and the package. Since air has a high thermal resistance, the TIM is used at the interfaces between the die and the spreader and between the spreader and the heat sink (see figure 4.1). Some examples of TIMs are grease, adhesives, thermal tape, etc. These materials usually have a thermal resistance of 0.15-1.0 °C/W. The heat dissipation depends on the total junction-to-ambient thermal resistance θJA, as shown in equation 4.1 [37]:
θJA = θJC + θCS + θSA (4.1)
where θJC is the junction-to-case thermal resistance, θCS is the TIM thermal resistance, and θSA is the heat sink-to-ambient thermal resistance. The junction temperature TJ is the actual temperature of the active die. The device manufacturer specifies the maximum junction temperature for each device family (TJmax). The thermal resistance of a device without and with a heatsink is given by equations 4.2 and 4.3 respectively (where P is the device operating power and TA is the ambient temperature).
θJA = θJC + θCA = (TJ − TA)/P (4.2)
θJA = θJC + θCS + θSA = (TJ − TA)/P (4.3)
If an application running on an FPGA with a high thermal resistance dissipates a large amount of power, it will possibly exceed TJmax. This necessitates the use of a heatsink to keep the device within its thermal operating budget. From the above equations, the required thermal resistance of a heatsink for a given design and FPGA device is calculated as:
θSA = (TJmax − TA)/P − θJC − θCS (4.4)
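Equation 4.4 translates directly into code. The device parameters below (TJmax, θJC, θCS) are illustrative, not taken from any specific datasheet:

```python
# Heat sink thermal resistance required to hold the junction at or below
# TJmax (equation 4.4). All units in degC and degC/W.

def required_theta_sa(tj_max_c, t_ambient_c, power_w, theta_jc, theta_cs):
    """theta_SA = (TJmax - TA)/P - theta_JC - theta_CS."""
    return (tj_max_c - t_ambient_c) / power_w - theta_jc - theta_cs

# A 17 W design, TJmax = 100 degC, 25 degC ambient, with assumed
# theta_JC = 0.5 degC/W and theta_CS = 0.3 degC/W for the TIM:
theta_sa = required_theta_sa(100.0, 25.0, 17.0, 0.5, 0.3)
```

A commercial heat sink (as in table 4.3) is adequate only if its θSA is at or below this required value.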
For stacked FPGAs, conventional air cooled heatsinks may not be sufficient, and more effective alternatives have to be considered. One of the most promising
Table 4.1. Vendor recommendations for FPGA cooling solutions
Benchmark    Power dissip.   Therm. resis.   Cooling technique
Low end      1-6 W           8-10 °C/W       bare package
Mid range    4-10 W          5-10 °C/W       passive heatsink + spreader
High end     8-25 W          ≤ 3 °C/W        active heatsink + cooling
Table 4.2. Heatsink selection for benchmarks tested.
Benchmark       Total Power (W)   Required θSA (°C/W)
stereovision1   17.1              3.12
stereovision3   16.08             3.38
blobmerge       11.03             4.92
approaches is thermo-power aware CAD. A number of academic works have looked at various
aspects of placement and routing optimization.
According to vendor documentation [38], the heat sink profiles listed in table 4.1 are suggested depending on the application. Actively cooled heat sinks require an external power supply (5V-12V) for the fan. To compensate for designs with high power dissipation, higher airflow rates and/or larger heat sinks have to be used. This can add to the overall weight of the system and also make it noisy during operation. In conclusion, large heat sink surfaces and an external power supply are necessary for effective cooling (from table 4.2). Active cooling is not a feasible option for mobile high power devices.
4.2.2 Xilinx Power Analyzer for heat sink performance
Very large heat sinks are required to cool computation intensive designs, and the heat sink airflow pump itself consumes a finite amount of power; one research path is therefore to use micro-fluidic cooling techniques. Fluid cooling is not in the scope of this thesis. Another alternative for reducing the junction temperature is placement and routing optimization. An early-stage optimization goal reduces the risks to some extent and also allows using different packaging and cooling techniques. To obtain cooling performance data, we chose a set of large open source benchmarks from the VTR toolset.
Table 4.3. Thermal resistances of commercial heatsinks
Heatsink                  θSA (°C/W)
HS1807EBP3805K52          4.1
HS2132DBP3828K52          6.4
HS2103DBP3828K52          3.9
INC35001-7/1.5K52+Li98    5.9
Table 4.4. Vendor recommendations for heatsinks
Power dissip.   Cooling mechanism
1-4 W           bare package
4-10 W          passive HS + spreader
10-25 W         active HS + cooling
We used the Xilinx FPGA flow to synthesize and implement the designs on a Virtex-6 FPGA. The toolset includes a power analysis utility for thermal management that can help designers choose appropriate thermal management solutions for their designs. For each design, we used the highest performance settings supported by the manufacturer, and we chose the smallest device that could accommodate the design to ensure maximum utilization and highest total power. Table 4.2 summarizes the results of our experiments. The required θSA column in table 4.2 shows the thermal resistance of the heat sink required to maintain the operating temperature within the thermal budget for the benchmarks tested. The thermal resistance values of some commercial heat sinks for the Virtex family of devices (at 200 LFM airflow) are listed in table 4.3. Vendor recommended cooling solutions for ranges of power dissipation are presented in table 4.4.
A comparison of the required θSA with the available heat sink values shows that it may be difficult for the designer to meet the thermal operating budget using active or passive air cooled heat sinks (see figure 4.2). One way to meet the thermal budget is to increase the airflow rate (see figure 4.3), which consumes more power and also increases the system operating noise. This problem will be compounded for 3D FPGAs, since stacking these devices results in much higher power dissipation.
Figure 4.2. Variation of Tj with various VDD for airflow rates in Linear Feet per Minute (LFM)
4.3 Thermal gradient in integrated circuits
Reliability and lifespan of a device are major considerations during the design and packaging phases. Uneven distribution of heating in an integrated circuit causes a thermal gradient, and large thermal gradients are observed in high performance design implementations. In the earlier section we discussed the direct relation between junction temperature and power consumption. The thermal gradient is a result of uneven power dissipation in the local unit blocks, which in turn produces heat. A high percentage of chip failures occur because of the power and thermal gradient. For a transistor under constant power stress, equation 4.5 gives the failure power in terms of the material properties and the effective junction area responsible for conduction.
Pf = (πKρCp)^(1/2) · Aj · (Tc − T0) · tf^(−1/2) (4.5)
Figure 4.3. Variation of Tj with different size and different quality heat-sinks
Pf = Failure power
ρ = Material density
tf = Time to failure
K = Thermal conductivity
Aj = Junction area
T0 = Temperature of the surrounding ambient
Tc = Critical temperature (1415 °C, the melting point of Si, and 675 °C)
Cp = Specific heat capacity at constant pressure
The issues addressed above make thermal gradient aware design a high order optimization goal in the chip design flow. Temperature is a non-negligible parameter in the design flow, since chip reliability and its trade-off with performance are optimization goals. Black's equation (see equation 4.6) gives an abstract mathematical model of the median time to failure (MTTF).
MTTF = (A / j^n) · e^(Q/(kT)) (4.6)
Where,
A= Constant
j= Current density
n= Architecture dependent model parameter
Q= Activation energy
k= Boltzmann’s constant
T= Temperature in K
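To see how strongly MTTF depends on temperature, Black's equation can be evaluated at two junction temperatures. A, j, and n below are placeholders, so only the ratio between the two evaluations is meaningful; Q = 0.7 eV is assumed as a representative activation energy.

```python
# Black's equation (4.6): MTTF = (A / j**n) * exp(Q / (k*T)).

import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K

def mttf(a_const, current_density, n, q_ev, temp_k):
    return (a_const / current_density ** n) * math.exp(
        q_ev / (K_BOLTZMANN_EV * temp_k))

# Lowering the junction temperature from 378 K to 358 K (20 degrees)
# multiplies the expected lifetime roughly threefold:
lifetime_gain = mttf(1.0, 1.0, 2.0, 0.7, 358.0) / mttf(1.0, 1.0, 2.0, 0.7, 378.0)
```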
The above relation shows the exponential dependence of the mean time to failure on temperature. The long interconnect networks and the associated capacitance of the programmable switches can be a significant component of the power budget. A hotspot is a local high temperature area on a part of a chip. High activity blocks consume more power, which in turn produces heat; since heat contributions are additive, placing many high power blocks close together results in a hotspot. From the evaluation of heat-sink based designs, we have seen that even large heat sinks are not efficient enough for high performance designs. Active heat sinks perform better than passive heat sinks; however, an active heat sink needs external power and is difficult to use in embedded applications. We propose a CAD solution centered on the placement algorithm. The CAD solution considers accurate power and thermal characterization: a modified simulated annealing cost function performs power aware placement at the cost of wire length and delay.
4.4 Modified Simulated Annealing cost function
4.4.1 Temperature Flattening
Temperature flattening may be defined as the indirect manipulation of on-chip temperature through various design techniques such that the temperature gradient does not exceed specified levels. Regions of local temperature maxima are called hotspots. Hotspots can hamper
chip performance and in extreme cases can cause the chip to fail. Hotspots can also lead to increased leakage current, which in turn leads to elevated temperature [39]; this cyclic effect can often cause designs that rely solely on external heat dissipation methods to experience thermal failure. In the case of FPGAs, power profiling is usually done at the functional block
level where individual architectural blocks are simulated and are strategically placed during
the floorplanning stage [40]. The place and route stages of the flow usually do not take into
account any thermal or power related criteria since it is difficult to obtain an accurate esti-
mate of temperature at the post placement stage. The FPGA design flow consists of several
steps including design entry, technology mapping, clustering and packing followed by actual
place and route. For the experiments described in this research, we use the open source
academic VPR flow [7]. Consider a set B of packed logic blocks obtained using a packing
engine such as TV-Pack. A valid placement of the set B requires that each block is assigned
a unique location (x, y) on the FPGA while meeting constraints for wire length and timing.
These constraints are set by the designer which enables the EDA tool to narrow down the
set of valid placements for a given design. Consider that each block b ∈ B dissipates power p_b, where p_b is the sum of the dynamic and leakage power associated with the block. We define the high power block set H as the subset of B whose blocks exceed a chosen power threshold, i.e., ∀h ∈ H, p_h > p_threshold. The
on-chip hotspots are caused by these high power blocks, and therefore we focus our optimization strategy on the placement of these high power blocks. Each legal placement of B is characterized by three figures of merit (fmax, Tmax, ∆T), which determine the QoR (quality of result) of the placement. Here fmax is the maximum operating frequency, Tmax is the maximum observed junction temperature, and ∆T is the difference between the maximum and minimum recorded on-chip temperatures. In a conventional FPGA flow, fmax is usually the only figure of merit that is considered, but for thermal-aware design flows it is also important to consider the on-chip spatial variation of temperatures to ensure proper
Figure 4.4. Illustration of the temperature flattening flow during placement.
functionality of the design. Therefore, the goal of our custom flow is to find a solution that
has a (fmax, Tmax, ∆T ) which meets the design specifications.
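Selecting the high power set H can be sketched as a simple threshold rule. The per-block power map and the top-20% cut below are illustrative; in our flow the actual threshold is benchmark dependent.

```python
# Pick the high power block set H as the top `fraction` of blocks by power.

def select_high_power_blocks(block_power, fraction=0.2):
    """Return the set H: block names ranked in the top `fraction` by power."""
    ranked = sorted(block_power, key=block_power.get, reverse=True)
    count = max(1, round(fraction * len(ranked)))
    return set(ranked[:count])

block_power = {"b0": 0.40, "b1": 2.10, "b2": 0.30, "b3": 1.70, "b4": 0.50,
               "b5": 0.20, "b6": 0.60, "b7": 1.90, "b8": 0.10, "b9": 0.35}
H = select_high_power_blocks(block_power)
```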
4.4.2 Thermal benchmarking
The custom thermal optimization flow is shown in figure 4.4. The VPR placement algorithm tries to obtain a trade-off between wire length and timing to achieve the tightest possible placement. Although such a placement might result in a design that gives the best performance, the resulting thermal profile is probably sub-optimal. Timing-aware placements are agnostic of power, and the best placement obtained using VPR might place clusters of high power blocks close together, leading to hotspots. To test our hypothesis, we simulated the MCNC benchmarks with the VPR CAD flow at the default settings to obtain a set of golden designs. A golden design represents the best design that VPR can produce when optimizing delay and wire length for each benchmark. A grid level power profile
is then extracted from the golden placement with the help of FPGA-SPICE [4]. We chose the FPGA-SPICE tool over VPR because the native power estimation (VersaPower) in the VPR tool [10] is overly optimistic, since it does not take parasitics into account, as demonstrated in [4]. The VPR placement algorithm's cost function is modified as described in equation 4.7 to incorporate a spreading factor (γ) for the blocks whose power dissipation exceeds a predetermined power threshold. Cspread is the total separation between all the blocks in H. In each case, the threshold was chosen such that around 20% of the blocks were included in the high power set H. The values ∆Cwire, ∆Cdelay, and ∆Cseparation refer to the wire-length, delay, and high power block separation costs for the current iteration of the simulated annealing loop, with α, β, and γ being the weights assigned to each optimization cost metric. ∆C is the change in the overall placement cost in the simulated annealing loop.
∆C = α · ∆Cwire/Cold,wire + β · ∆Cdelay/Cold,delay + γ · Cold,separation/∆Cseparation (4.7)
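The modified annealing cost of equation 4.7 can be sketched as follows. The weights and cost values are hypothetical; note that the separation term is inverted (C_old / ∆C) so that a larger increase in high power block separation lowers the cost.

```python
# Sketch of the modified simulated annealing cost delta (equation 4.7).

def delta_cost(d_wire, old_wire, d_delay, old_delay, old_sep, d_sep,
               alpha=0.4, beta=0.4, gamma=0.2):
    return (alpha * d_wire / old_wire
            + beta * d_delay / old_delay
            + gamma * old_sep / d_sep)

# A move that perturbs wire and delay by 10% each but doubles the
# separation change relative to the old separation:
change = delta_cost(1.0, 10.0, 1.0, 10.0, 5.0, 10.0)
```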
For the purposes of this work, we define a grid as a CLB along with the associated muxes
and switches. A heat map is generated for the golden power dissipation values using an
academic thermal simulator [20]. This is the baseline thermal characteristic profile of the
golden (or reference) VPR design.
Figure 4.5. Placement of high power blocks from the golden run (a) and the location of the same blocks in the custom run (b) with the new power numbers. The change in the power values is due to the change in dynamic and leakage temperature values. Note that the scale for figure b is much smaller compared to a.
4.5 Custom Thermal-Aware Flow
To improve the thermal profile, we swept γ over various threshold values while maintaining α = β. Figure 4.8 shows the variation of design performance with the threshold for a specimen benchmark (alu4). Each data point on the curve represents a complete, successfully routed design. We chose the worst design in terms of performance in each case, since that design is likely to have the highest separation between the blocks in H, and used the normalized power values [41] from both designs to generate a heat map using a state of the art academic thermal simulator [42]. In almost all cases we observed gains in terms of both junction temperature and hotspot occurrence, as shown in figure 4.5.
Figure 4.6. Comparison of the heatmaps for alu4 showing the reference VPR design (a) against the one with the worst performance but the best thermal profile for γ = 0.5 (b). One of the several clusters of hot spots is circled in (a) for clarity. All temperatures shown are in Kelvin.
4.6 Result analysis
As shown in figure 4.5, the golden designs generated by VPR show the presence of clusters of hotspots, which confirms our initial hypothesis that the VPR designs may not be optimal from a thermal perspective. The thermal map generated for the alu4 benchmark by the 3D-ICE simulator is shown in figure 4.6. We discussed earlier the trade-offs between the delay and power constraints: when we optimize for the power distribution, the delay of the circuit increases. Figure 4.7 depicts the trade-offs with the power aware flow, and figure 4.8 shows the variation of the circuit performance (critical path delay) with varying values of γ.
Figure 4.7. The best and worst delay designs generated for each benchmark. The best (min delay) design corresponds to the reference design using the VPR default flow, and the worst (max delay) design is the one at the power threshold that includes around 20% of high power blocks for each benchmark.
Every data point on the plot represents a circuit with a different QoR. The ideal case would be selecting a solution based on fmax that also has the best ∆T. Since this is not possible, the solution space can be bounded by determining the limit on the circuit performance that the designer is willing to accept. In our study, we chose the worst possible design at each threshold so that we could obtain the lower bound on performance and compare the thermal profile of that solution to the fastest design. It is worth noting that an examination of the sweep results showed diminishing returns for threshold values that included more than 20-30% of all the blocks in the set H. This is due to the fact that VPR sizes the FPGA to fit the circuit, and therefore including too many blocks in H will result in the clusters of
Figure 4.8. Normalized critical path delays versus the high power block spreading factor γ. The best design is obtained for γ = 0.2, 0.875 and the worst is at γ = 0.5 for the custom cost function. The ratio of delays of the custom run to the golden run is represented by the upper plot.
high power blocks merely shifting to a different spatial region on the chip. Identifying the
right value of the threshold is a challenging task and is also benchmark dependent.
Table 4.5 shows the results for the benchmarks tested. nHP is the percentage of high power blocks in each benchmark, and ∆Tmax is the change in the maximum overall on-chip temperature. Benchmark characteristics are tabulated in the appendix (table A.1). ∆Separation is the change in the total separation between all high-power blocks under the two placements. We see a reduction of up to 8% in the junction temperature, with the largest reduction observed for alu4. This data suggests that it is possible to achieve a reduction in chip temperature through the judicious selection and placement of a small subset of high
Table 4.5. Thermal flow simulation results
Circuit    Total cells   % nHP   % ∆Tmax   ∆Separation
alu4       1523          27      -8.0      4.3X
apex2      1879          15      -3.21     2.5X
apex4      1263          8       -2.67     2.6X
frisc      3557          19      -2.12     2.0X
elliptic   3604          21      -6.36     4.1X
bigkey     1708          30      -4.90     3.8X
tseng      1048          25      -4.78     3.4X
power blocks. The power threshold value is the key component that determines the QoR of
this optimization strategy.
CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1 Conclusion
Based on the results obtained, the thermal optimization strategy merits further investigation. An effective simulated annealing scheduler can optimize the placement for power-aware designs. The key points of this research are the modification of the CAD flow, the reduction in junction temperature with increasing γ, the implementation of thermo-power placement on top of the existing VPR tool, and more accurate power estimation than VPR, which in turn provides an accurate temperature profile.
We developed a thermal management scheme for the early stage of the design flow. To extract delay and power values at CLB granularity, SPICE models for DFFs, LUTs, SRAM cells, buffers, inverters, and muxes are used. The static and dynamic power consumption of the combinational and sequential logic, the switch and connection boxes, and the routing network is accurately extracted with the help of the FPGA-SPICE tool.
The ACE 2.0 activity estimation tool, which uses a probability-based activity simulation model for determining net switching probabilities and activities, is explored and used for determining the dynamic power component.
The simulated annealing cost function adds a γ-weighted constraint to optimize for power; it increases the separation between high power blocks. The trade-off with delay and wire length is plotted and the results analyzed. Our final results show an 8% reduction in junction temperature, obtained without normalizing any power values.
5.2 Future work
Further refinement and automation of the flow would be a useful addition for FPGA de-
signers looking to implement high-power designs. One aspect we are investigating is using
a combination of physical design and CAD techniques for more aggressive thermal flatten-
ing. Similar to the power-aware placement, the packing and routing stages of the design
flow can also be modified and optimized for power. SPICE simulations are time-expensive
for large benchmarks; multi-threading can be utilized to speed them up. The CAD flow
can also be extended with packaging and micro-fluidic cooling structures. Since 3D archi-
tectures are gaining popularity, further study and optimization of TSV interconnect network
formation and of placing the circuit on different layers can bring a new dimension to algo-
rithm and tool development. Heterogeneous FPGAs were not considered in the current
placement scheme and remain an avenue for future work.
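As a sketch of the multi-threading suggestion above: per-benchmark SPICE decks are independent, so they can be fanned out across a thread pool. Here `simulate_deck` is a hypothetical stand-in for a real simulator invocation:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_deck(deck_name):
    # Stand-in for one SPICE batch run; a real flow would shell out to
    # the simulator here (a subprocess releases the GIL, so threads are
    # enough to overlap independent runs).
    return f"{deck_name}: done"

def run_all(decks, max_workers=4):
    # Per-benchmark decks share no state, so they map cleanly onto a
    # pool of worker threads; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_deck, decks))
```

Because each job spends most of its time in an external simulator process, even a modest pool size can overlap several benchmark runs.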
Researchers are now looking into image processing algorithms for detecting hot-spots.
Recording the thermal map with a thermal camera to drive an adaptive cooling system can
be a major research focus.
APPENDIX A
Table A.1. Benchmark Characteristics

Benchmark   No. of Cells   No. of Nets   No. of Pads
alu4        1523           1536           22
apex2       1879           1916           41
apex4       1263           1271           28
bigkey      1708           1935          426
clma        8384           8444          144
elliptic    3604           3734          245
frisc       3557           3575          136
s298        1932           1934           10
tseng       1048           1098          174
REFERENCES
[1] Sung-Mo Kang and Yusuf Leblebici. CMOS digital integrated circuits. Tata McGraw-Hill Education, 2003.
[2] Christopher Z. Mooney. Monte Carlo simulation, volume 116. Sage Publications, 1997.
[3] J. Lamoureux and S. J. E. Wilton. Activity estimation for field-programmable gate arrays. In 2006 International Conference on Field Programmable Logic and Applications, pages 1–8, Aug 2006. doi: 10.1109/FPL.2006.311199.
[4] Xifan Tang, Pierre-Emmanuel Gaillardon, and Giovanni De Micheli. FPGA-SPICE: A simulation-based power estimation framework for FPGAs. In Computer Design (ICCD), 2015 33rd IEEE International Conference on, pages 696–703. IEEE, 2015.
[5] Kaushik Roy, Saibal Mukhopadhyay, and Hamid Mahmoodi-Meimand. Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE, 91(2):305–327, 2003.
[6] Vaughn Betz, Jonathan Rose, and Alexander Marquardt. Architecture and CAD for deep-submicron FPGAs, volume 497. Springer Science & Business Media, 2012.
[7] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and routing tool for FPGA research. In International Workshop on Field Programmable Logic and Applications, pages 213–222. Springer, 1997.
[8] Nathalie Chan King Choy and Steven J. E. Wilton. Activity-based power estimation and characterization of DSP and multiplier blocks in FPGAs. In Field Programmable Technology, 2006. FPT 2006. IEEE International Conference on, pages 253–256. IEEE, 2006.
[9] Stephen D. Brown, Robert J. Francis, Jonathan Rose, and Zvonko G. Vranesic. Field-programmable gate arrays, volume 180. Springer Science & Business Media, 2012.
[10] Jeffrey B. Goeders and Steven J. E. Wilton. VersaPower: Power estimation for diverse FPGA architectures. In International Conference on Field-Programmable Technology (FPT), 2012. IEEE, 2012.
[11] Chun Tak Chow, Lai Suen Mandy Tsui, Philip Heng Wai Leong, Wayne Luk, and Steven J. E. Wilton. Dynamic voltage scaling for commercial FPGAs. In Field-Programmable Technology, 2005. Proceedings. 2005 IEEE International Conference on, pages 173–180. IEEE, 2005.
[12] Chi-Ying Tsui, Massoud Pedram, and Alvin M. Despain. Exact and approximate methods for calculating signal and transition probabilities in FSMs. In Proceedings of the 31st annual Design Automation Conference, pages 18–23. ACM, 1994.
[13] Radu Marculescu, Diana Marculescu, and Massoud Pedram. Switching activity analysis considering spatiotemporal correlations. In Proceedings of the 1994 IEEE/ACM international conference on Computer-aided design, pages 294–299. IEEE Computer Society Press, 1994.
[14] Wei Huang, Shougata Ghosh, Sivakumar Velusamy, Karthik Sankaranarayanan, Kevin Skadron, and Mircea R. Stan. HotSpot: A compact thermal modeling methodology for early-stage VLSI design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(5):501–513, 2006.
[15] Runge-Kutta method. From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/Runge-KuttaMethod.html. Accessed: 2017-02-05.
[16] Amirkoushyar Ziabari, Ehsan K. Ardestani, Jose Renau, and Ali Shakouri. Fast thermal simulators for architecture level integrated circuit design. In Semiconductor Thermal Measurement and Management Symposium (SEMI-THERM), 2011 27th Annual IEEE, pages 70–75. IEEE, 2011.
[17] John E. Akin. Finite element analysis for undergraduates. Academic Press, 1986.
[18] Je-Hyoung Park, Ali Shakouri, and Sung-Mo Kang. Fast evaluation method for transient hot spots in VLSI ICs in packages. In Quality Electronic Design, 2008. ISQED 2008. 9th International Symposium on, pages 600–603. IEEE, 2008.
[19] Ali Shakouri, Kerry Maize, Philip Jackson, Xi Wang, Bjorn Vermeersch, and Kazuaki Yazawa. Ultrafast submicron thermal characterization of integrated circuits. In Physical and Failure Analysis of Integrated Circuits (IPFA), 2012 19th IEEE International Symposium on the, pages 1–2. IEEE, 2012.
[20] Arvind Sridhar, Alessandro Vincenzi, Martino Ruggiero, Thomas Brunschwiler, and David Atienza. 3D-ICE: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling. In Proceedings of the International Conference on Computer-Aided Design, pages 463–470. IEEE Press, 2010.
[21] Shilpa Bhoj and Dinesh Bhatia. Thermal modeling and temperature driven placement for FPGAs. In Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, pages 1053–1056. IEEE, 2007.
[22] Shilpa Bhoj. Thermal management methodology for FPGAs. The University of Texas at Dallas, 2008.
[23] Marcel Gort and Jason H. Anderson. Analytical placement for heterogeneous FPGAs. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 143–150. IEEE, 2012.
[24] Peter J. M. van Laarhoven and Emile H. L. Aarts. Simulated annealing. In Simulated Annealing: Theory and Applications, pages 7–15. Springer, 1987.
[25] Wei Zou, Sudhakar M. Reddy, Irith Pomeranz, and Yu Huang. SoC test scheduling using simulated annealing. In VLSI Test Symposium, 2003. Proceedings. 21st, pages 325–330. IEEE, 2003.
[26] William Swartz and Carl Sechen. New algorithms for the placement and routing of macro cells. In Computer-Aided Design, 1990. ICCAD-90. Digest of Technical Papers., 1990 IEEE International Conference on, pages 336–339. IEEE, 1990.
[27] Carl Sechen. VLSI placement and global routing using simulated annealing, volume 54. Springer Science & Business Media, 2012.
[28] M. D. Huang, F. Romeo, and A. Sangiovanni-Vincentelli. An efficient general cooling schedule for simulated annealing. In Proceedings of the IEEE International Conference on Computer Aided Design, 1986.
[29] Jimmy Lam and Jean-Marc Delosme. Performance of a new annealing schedule. In Proceedings of the 25th ACM/IEEE Design Automation Conference, pages 306–311. IEEE Computer Society Press, 1988.
[30] Vaughn Betz. Architecture and CAD for Speed and Area Optimization of FPGAs. PhD thesis, University of Toronto, 1998.
[31] Larry McMurchie and Carl Ebeling. PathFinder: A negotiation-based performance-driven router for FPGAs. In Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays, pages 111–117. ACM, 1995.
[32] Jason Jingsheng Cong and Kwok-Shing Leung. Optimal wiresizing under Elmore delay model. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(3):321–336, 1995.
[33] Jorge Rubinstein, Paul Penfield, and Mark A. Horowitz. Signal delay in RC tree networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2(3):202–211, 1983.
[34] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J. Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger. A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro, 35(3):10–22, May 2015. ISSN 0272-1732. doi: 10.1109/MM.2015.42.
[35] D. Ganapathy and E. J. Warner. Defining thermal design power based on real-world usage models. In 2008 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems, pages 1242–1246, May 2008. doi: 10.1109/ITHERM.2008.4544402.
[36] H. Zhang, H. Y. Hwang, L. Bu, J. J. L. Aw, and D. M. W. Rhee. Thermal modeling and characterization of SiC power module under both air cooling and liquid cooling conditions. In 2014 IEEE 16th Electronics Packaging Technology Conference (EPTC), pages 618–623, Dec 2014. doi: 10.1109/EPTC.2014.7028370.
[37] Thermal management for FPGAs. https://www.altera.com/en_US/pdfs/literature/an/an358.pdf. Accessed: 2017-02-05.
[38] Considerations for heatsink selection, Xilinx thermal data application note. https://www.xilinx.com/support/documentation/white_papers/wp258.pdf. Accessed: 2017-02-05.
[39] Haihua Su, Frank Liu, Anirudh Devgan, Emrah Acar, and Sani Nassif. Full chip leakage estimation considering power supply and temperature variations. In Proceedings of the 2003 international symposium on Low power electronics and design, pages 78–83. ACM, 2003.
[40] Benjamin Carrion Schafer and Taewhan Kim. Hotspots elimination and temperature flattening in VLSI circuits. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(11):1475–1487, 2008.
[41] Zhiyuan Yang and Ankur Srivastava. Physical design of 3D FPGAs embedded with micro-channel-based fluidic cooling. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 54–63. ACM, 2016.
[42] Arvind Sridhar, Alessandro Vincenzi, Martino Ruggiero, Thomas Brunschwiler, and David Atienza. 3D-ICE: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling. In Proceedings of the International Conference on Computer-Aided Design, ICCAD '10.
BIOGRAPHICAL SKETCH
Rohit Dnyaneshwar Somwanshi was born in Pune, Maharashtra, India on 7 November 1993.
He finished high school at C. K. Goyal, Pune, in 2009. He then completed his undergraduate
degree (B.Tech.) in Electronics Engineering with distinction at Vishwakarma Institute of
Technology, Pune, in 2015. He joined The University of Texas at Dallas for his Master of
Science (M.S.) in Electrical Engineering in August 2015 and has worked in the IDEA Lab
since July 2015. His research interests include reconfigurable computing, CAD, and VLSI
systems.
CURRICULUM VITAE
Rohit Dnyaneshwar Somwanshi
May 17, 2017

Contact Information:
Department of Electrical Engineering
The University of Texas at Dallas
800 W. Campbell Rd.
Richardson, TX 75080-3021, U.S.A.
Voice: (469) 734-4775
Email: [email protected]

Educational History:
B.Tech., Electronics Engineering, Vishwakarma Institute of Technology, Pune, India, 2015
M.S., Electrical Engineering, The University of Texas at Dallas, 2017

CAD Flow for Hotspot Management in FPGAs
MS Thesis
Electrical Engineering Department, The University of Texas at Dallas
Advisor: Dr. Dinesh K. Bhatia