Low-Cost Hardware Profiling of Run-Time and Energy in
FPGA Soft Processors
by
Mark Lee Aldham
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Department of Electrical and Computer Engineering
University of Toronto
© Copyright by Mark Lee Aldham 2011
Low-Cost Hardware Profiling of Run-Time and Energy in FPGA Soft
Processors
Mark Lee Aldham
Master of Applied Science, 2011
Graduate Department of Electrical and Computer Engineering
University of Toronto
Abstract
Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware platforms that enable
the acceleration of software code through the use of custom-hardware circuits. Complex
systems combining processors with programmable logic require partitioning to decide which
code segments to accelerate. This thesis provides tools to help determine which software code
sections would most benefit from hardware acceleration.
A low-overhead profiling architecture, called LEAP, is proposed to attain real-time profiles
of an FPGA-based processor. LEAP is designed to be extensible for a variety of profiling tasks,
three of which are investigated and implemented to identify candidate software for acceleration.
1) Cycle profiling determines the most time-consuming functions to maximize speedup. 2) Cache
stall profiling detects memory-intensive code; large memory overheads reduce the benefits of
acceleration. 3) Energy consumption profiling detects energy-inefficient code through the use
of an instruction-level power database to minimize the system’s energy consumption.
Acknowledgments
First, I would like to thank my parents for their unwavering support throughout my many years
of education. They have always encouraged me to do what I enjoy, even if it meant filling the
house with the noise of drumming or resulted in sport-related hospital trips.
I would also like to thank Caliope Music Search Inc., especially Bryan Keith, for the advice,
support, and friendship over the past few years. You’ve always made time to help make sense of
my crazy ideas, and always believed that I can solve the problem at hand, even when I don’t yet
understand the question. Also, thank you Bryan and Sarah for the helpful and much-needed
editing of this thesis.¹
I would like to thank my supervisors Jason Anderson and Stephen Brown for the opportunity
to perform this research. You’ve helped to keep me on track throughout this degree and have
provided indispensable insight to help complete this project.
I would finally like to thank Cathleen. You’ve kept me sane through failed experiments, put
up with my sporadic schedules, and always kept me going. Thank you for always giving me
something to look forward to.
¹ Do Do Do Doooo Doooo!
Contents
Acknowledgments iii
List of Figures vii
List of Tables ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background and Related Work 8
2.1 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Time-Based Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Energy and Power-Based Profiling . . . . . . . . . . . . . . . . . . . . . . 13
2.2 MIPS Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Processor Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Processor Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Communication Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Cycle-Accurate Profiling 26
3.1 Profiler Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Method of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2.1 Operation Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2.2 Call Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2.3 Data Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2.4 Counter Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2.5 Address Hash . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Compilation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.4 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.4.1 Wrapper Function . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.4.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.4.3 Data Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Cycle and Stall Cycle Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Hierarchical Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 Configurable Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Experimental Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 Comparison with Software Profiler . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 Comparison with FPGA-Based Profiler . . . . . . . . . . . . . . . . . . . 45
3.4.4.1 Profiling Results Comparison . . . . . . . . . . . . . . . . . . . . 47
3.4.4.2 Area and Power Comparison . . . . . . . . . . . . . . . . . . . . 48
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Energy Consumption Profiling 60
4.1 Instruction Power Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Implementation in Existing Framework . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.2.1 Overall Energy Estimation . . . . . . . . . . . . . . . . . . . . . 70
4.3.2.2 Function-level Energy Estimation . . . . . . . . . . . . . . . . . 74
4.3.2.3 Energy and Cycle Profile Correlation . . . . . . . . . . . . . . . 76
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Hardware/Software Partitioning 79
5.1 System Creation with LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Conclusions 87
6.1 Summary and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.1 Detailed Function Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.2 Value Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.3 Memory Access Pattern Profiling . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.4 Counter Saturation Detection . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.5 Recursion Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A Complete Benchmark Results 95
A.1 Full Results for Chapter 3: Cycle-Accurate Profiling . . . . . . . . . . . . . . . . 95
A.1.1 Full Results for LEAP vs. gprof . . . . . . . . . . . . . . . . . . . . . . . 95
A.1.2 Full Results for LEAP vs. SnoopP . . . . . . . . . . . . . . . . . . . . . . 104
A.1.3 Full Results for Power Overhead . . . . . . . . . . . . . . . . . . . . . . . 114
A.2 Full Results for Chapter 4: Energy Consumption Profiling . . . . . . . . . . . . . 116
A.2.1 Full Results for Cache Stall Energy Estimates . . . . . . . . . . . . . . . . 116
A.2.2 Full Results for Pipeline Stall Energy Estimates . . . . . . . . . . . . . . 124
A.2.3 Full Results for Energy/Time Correlation . . . . . . . . . . . . . . . . . . 131
B Source Code 140
B.1 Profiling Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2 LEAP Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
B.3 SnoopP Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
References 199
List of Figures
2.1 Architecture of the SnoopP Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Architecture of the Airwolf Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Architecture of the AddressTracer Profiler . . . . . . . . . . . . . . . . . . . . . . 12
2.4 RTL schematic for original version of the Tiger MIPS processor . . . . . . . . . . 20
2.5 Host computer source code to send ELF section to processor . . . . . . . . . . . 21
2.6 FPGA-side code to receive section from host computer . . . . . . . . . . . . . . . 22
2.7 Design flow with LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 High-level architecture overview of LegUp . . . . . . . . . . . . . . . . . . . . . . 25
3.1 High-level flow chart for instruction-count profiling . . . . . . . . . . . . . . . . . 28
3.2 Modular view of profiling architecture . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Opcodes used to determine function context switches . . . . . . . . . . . . . . . . 30
3.4 Code example to illustrate the need for function address hashing . . . . . . . . . 33
3.5 Parameterized hashing algorithm used by the perfect hash generator . . . . . . . 35
3.6 Compilation flow to generate programming files . . . . . . . . . . . . . . . . . . . 35
3.7 Example hash initialization file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.8 An example memory layout of program memory . . . . . . . . . . . . . . . . . . 37
3.9 Application-independent wrapper to enable initialization and data retrieval . . . 39
3.10 Architecture diagrams of Data Counter module for cycle-accurate profiling . . . . 41
3.11 Area overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . . 54
3.12 Power overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . 58
4.1 Instruction-level power database groupings . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Data Counter module configured for power profiling . . . . . . . . . . . . . . . . 68
5.1 Area comparison for hybrid systems in four configurations . . . . . . . . . . . . . 83
5.2 Execution time comparison for hybrid systems in four configurations . . . . . . . 85
5.3 Energy consumption comparison for hybrid systems in four configurations . . . . 86
List of Tables
2.1 Comparison of three candidate soft processors . . . . . . . . . . . . . . . . . . . . 17
2.2 Synthesis results for Tiger processor on Altera Cyclone II FPGA . . . . . . . . . 18
3.1 Maximum depth of nested function calls . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Description of the benchmarks used in experiments . . . . . . . . . . . . . . . . . 44
3.3 Comparative results of gprof vs. LEAP for the benchmark “ADPCM” . . . . . . 46
3.4 Comparative results of gprof vs. LEAP for the benchmark “DFMUL” . . . . . . 46
3.5 Comparative results of LEAP vs. SnoopP for the benchmark “ADPCM” . . . . . 49
3.6 Comparative results of LEAP vs. SnoopP for the benchmark “DFMUL” . . . . . 50
3.7 Repeatability testing for cycle-accurate profiling of benchmark “Motion” . . . . 50
3.8 Area overhead of LEAP for three profiling schemes . . . . . . . . . . . . . . . . . 51
3.9 Area overhead of LEAP for three profiling schemes with Hierarchical Profiling . . 52
3.10 Area overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . . 53
3.11 Area overhead investigation for SnoopP if counters were implemented with mem-
ory bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.12 Power overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . 57
4.1 Pruned instruction-level power database for cache stalls . . . . . . . . . . . . . . 63
4.2 Pruned instruction-level power database for pipeline stalls . . . . . . . . . . . . . 64
4.3 Instruction power database for the MIPS1 instruction set . . . . . . . . . . . . . 66
4.4 Area overhead of LEAP when configured for energy profiling . . . . . . . . . . . 68
4.5 Energy profiling results when only cache stalls are considered . . . . . . . . . . . 71
4.6 Energy profiling results when all pipeline stalls are considered . . . . . . . . . . . 72
4.7 Energy profiling results when all pipeline stalls are considered and correction
factor applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8 Function-level energy estimation results for benchmark “ADPCM” . . . . . . . . 75
4.9 Function-level energy estimation results for benchmark “DFMUL” . . . . . . . . 76
4.10 Comparative results of percentage energy consumption versus percentage execu-
tion time for benchmark “ADPCM” . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.11 Comparative results of percentage energy consumption versus percentage execu-
tion time for benchmark “DFMUL” . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Area results for hybrid systems in four configurations . . . . . . . . . . . . . . . . 82
5.2 Time results for hybrid systems in four configurations . . . . . . . . . . . . . . . 84
5.3 Power and energy results for hybrid systems in four configurations . . . . . . . . 86
A.1 Power overhead results for LEAP, measured in milliwatts, for 16, 32, and 64
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.2 Power overhead results for LEAP, measured in milliwatts, for 128 and 256
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.3 Power overhead results for SnoopP, measured in milliwatts, for 16, 32, and 64
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.4 Power overhead results for SnoopP, measured in milliwatts, for 128 and 256
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Chapter 1
Introduction
Profiling involves the dynamic analysis of software executing on a processor to characterize
aspects of the program’s execution. Profilable aspects can include execution time, data locality,
and energy consumption, the measurement of which can give direction to aid in redesigning
code for improved performance. Software-based profiling approaches, such as GNU’s gprof [1],
provide time-based measurements of a program’s execution that aid in the iterative process of
code improvement. These approaches incur run-time overhead and suffer from inaccuracies due
to sampling, which emphasizes the need for low-overhead and accurate profiling tools.
Field Programmable Gate Arrays (FPGAs) offer a reconfigurable environment to enable faster
time-to-market and ease of development versus application-specific integrated circuits (ASICs).
The advent of soft processors on FPGAs has aided rapid creation of systems that include
both software running on a processor with hardware that accelerates part of the software.
With the ability to combine these computation paradigms comes the need to partition code
into hardware and software portions. A hardware implementation can provide a significant
improvement in speed and energy efficiency over a software implementation (e.g. [22, 35]), but
requires writing complex register transfer level (RTL) code which is error prone and notoriously
difficult to debug. Software design is comparatively straightforward and has mature debugging
and analysis tools freely available. Despite the apparent energy and performance benefits,
hardware design is simply too difficult and costly for most applications, so a software-based
approach is often preferred.
The focus of this thesis is to enable partitioning of software code into a software portion
and a hardware portion to maximize throughput and minimize energy consumption. A Low-
overhead and Extensible Architecture for Profiling (LEAP) is proposed to aid in the process
of compiling software code into hybrid systems. These FPGA-based hybrid systems combine
soft processors with programmable logic (e.g. [15], [50]) and more recently, hard processor cores
included with FPGA chips (e.g. [23], [34]). It is estimated that over 40% of all modern FPGA
designs contain embedded processors [48]. LEAP is used in a larger project called LegUp [20], an
open-source high-level synthesis (HLS) framework which partitions a software application into
a hybrid system; code segments, chosen based on profiling results, are automatically compiled
into hardware accelerators to accelerate a soft processor’s execution.
1.1 Motivation
The advent of Field Programmable Gate Arrays (FPGAs) has narrowed the gap between the
flexible, high-level properties of software programs and the complicated, low-level nature of
hardware circuitry. FPGA chips occupy this middle ground between application-specific
integrated circuits (ASICs) and pure software: they still require relatively low-level
hardware design skills, but offer
the advantage of reconfigurability. Since the first FPGAs appeared in the mid-1980s, access to
the technology has been restricted to those with hardware design skills. However, according
to labor statistics, software engineers outnumber hardware engineers by more than 10× in the
U.S. [46]. To take advantage of the vast number of software engineers, as well as retain the
performance benefits and energy efficiencies of hardware, the FPGA development flow must
be able to use software running on a soft processor in conjunction with specialized hardware
accelerators.
To create such a system, inefficient code segments must be identified to provide the maximum
performance benefit to the system. This decision process, called hardware/software partitioning,
is a non-trivial process due to its dependence on dynamic run-time data. Once chosen, the
software code segments must be re-implemented as hardware accelerators and connected to the
processor-centric system. Therefore, the processor must be profiled to quantify this run-time
data and enable the creation of an effective partition.
Trends in computer architecture indicate that heterogeneous many-core architectures, incor-
porating processor cores, GPUs and FPGAs, will be heavily utilized in the future. Recent
collaborations between FPGA and processor vendors support this direction; Xilinx has
announced the Extensible Processing Platform [23], which combines the ARM Cortex-A9 MPCore
processor with Xilinx 28nm programmable logic, and Intel has announced the configurable
Atom processor E600C series [34], which combines the Intel Atom E600C processor with an on-
package Altera FPGA. These advanced systems further necessitate the partitioning of software
code to take advantage of the wide range of computational elements available.
1.2 LegUp
This research is used within a broader project called LegUp, whose overarching goal is to
create a self-accelerating processor in which code can be accelerated automatically using custom
hardware implementations. LegUp is an open-source HLS framework that has been developed
to provide the performance and energy benefits of hardware while retaining the ease-of-use
associated with software. LegUp automatically compiles a standard C program to target a
hybrid FPGA-based hardware/software system. Some program segments execute on a 32-bit
MIPS soft processor, while other program segments are automatically synthesized into FPGA
circuits (hardware accelerators) that communicate and work in tandem with the soft processor.
The LEAP profiler is incorporated into this processor as the tool to guide the selection of
accelerators, as discussed further in Chapter 3. The LegUp distribution includes a set of
benchmark programs that the user can compile to pure software, pure hardware, or a hybrid
hardware/software system.
The targeted system is a hybrid, as opposed to pure custom hardware, because not all C
program code is appropriate for hardware implementation. Inherently sequential computations
(e.g. traversing a linked list) are well-suited for software, whereas independent or parallel
computations (e.g. addition of integer arrays) are ideally suited for hardware. Incorporating
a processor into the target platform also offers the advantage of increased high-level language
coverage. Program segments that use restricted C language constructs (e.g. calls to malloc/free)
can remain in software to execute on the processor.
1.3 Objectives
The principal objective of this research is to enable hardware/software partitioning decisions
to increase throughput and reduce energy consumption; these decisions are made by creating a
profiling-based flow that produces accurate measurements of desired metrics. Different target
systems may have differing performance objectives, but there are three important goals to
consider when partitioning to choose the best candidates for acceleration:
1. Maximize throughput to execute segments of computation in as few cycles as possible
2. Minimize energy consumption for both battery driven and other applications
3. Maximize data locality, which is necessary for an accelerator to realize its potential
benefit, as hardware experiences the same memory access latencies as software and thus
cannot improve performance while waiting for memory
These three objectives of the target system can each be associated with quantifiable met-
rics based on the result of Amdahl’s Law [18], which is used to find the maximum expected
improvement to a system when only a portion of that system is modified. This is expressed as:
S_sys = 1 / ((1 − P) + P/S)

where S_sys is the improvement of the overall system, P is the fraction of time (or energy) the
modified portion consumes relative to the entire system, and S is the improvement for that
portion. Benefits of hardware versus software, such as expected speedup and power reduction,
are not predeterminable. Guo et al. [28] show that the inherent advantage of FPGAs over
processor models, coined the “instruction efficiency factor,” can range from 6 to 47 on the
tested benchmarks. This demonstrates the impracticality of estimating the performance bene-
fits. Therefore, S is not a known factor and the overall system improvement must be related
to P , the portion of code that can be improved. To maximize throughput or minimize en-
ergy consumption, the quantifiable metrics required are the percentage execution time and the
percentage energy consumption, respectively, of the desired portion. Conversely, data locality
directly affects the potential improvement for a given segment, S; higher memory overhead pro-
longs the completion time, which in turn reduces speedup and increases energy consumption.
Thus the required metric to quantify data locality is the percentage time performing memory
accesses, or alternatively, percentage time attributed to cache stalls. Summarized, the profilable
metrics required for effective partitioning are:
1. Percentage execution time: accelerate long-executing functions to maximize speedup
2. Percentage energy consumption: accelerate energy-inefficient functions to minimize
energy requirements
3. Percentage execution time in cache stalls: accelerate functions with low stall times,
thus high locality, to maximize potential performance benefits
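The role Amdahl’s Law plays in these partitioning decisions can be shown with a short calculation (an illustrative sketch; the 60%/10% fractions and 10× hardware speedup are hypothetical figures, not measured results):

```python
def amdahl_speedup(p, s):
    """Overall system improvement S_sys when a portion consuming
    fraction p of run time (or energy) is improved by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# A function consuming 60% of execution time, accelerated 10x in hardware,
# yields about a 2.17x overall speedup; the same accelerator applied to a
# function consuming only 10% of time yields barely 1.1x.
print(round(amdahl_speedup(0.60, 10.0), 2))  # → 2.17
print(round(amdahl_speedup(0.10, 10.0), 2))  # → 1.1
```

This is why the profiler must deliver P, the per-function percentage of execution time or energy, even though S is unknown in advance.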
The metrics just described outline the deliverable results desired, but they do not indicate
the manner in which the data is collected by the profiler. To provide useful results and to be
easy to use, the profiler must satisfy the following requirements:

• Non-intrusive: the processor’s execution must not be inhibited or affected
• Low overhead: the area usage and power consumption must be minimal compared with that of the system
• Accurate: the data collected must be correct without requiring many iterations of the application; it therefore cannot employ sampling, the act of measuring data only at specific time intervals
• Dynamic: the profiler must be able to profile any application without requiring recompilation of the system
• Flexible: the profiler must be configurable to measure any of the above metrics, and must be extensible for future research
1.4 Thesis Organization
This research focuses on providing an effective mechanism to perform hardware/software par-
titioning through the use of an area- and power-efficient FPGA-based profiler. The remainder
of this thesis is organized as follows:
Chapter 2 provides background material relevant to this research, including previous efforts
in time-, energy-, and power-based profiling, an overview of the LegUp framework, and a
description of the MIPS-based soft processor used in this system.
Chapter 3 describes the design of a scalable and extensible profiling framework. The frame-
work is shown to be both area- and power-efficient when compared to previous profiling systems,
requiring only 12.2% area and 10.7% power consumption when compared to the processor alone
for a typical configuration. This chapter also presents profiling schemes which provide the means
to satisfy two of the performance goals described above, namely throughput and data locality.
Throughput, associated with measuring percentage execution time, is maximized by measuring
the cycles spent in each function and choosing those with the highest cycle count to be ac-
celerated. Data locality, associated with measuring percentage time attributed to cache stalls,
is maximized by measuring the cycles spent in each function while the processor is stalled by
either the instruction or data cache and choosing those with the lowest stall cycle count for
acceleration. Experiments are performed to verify the accuracy of these results.
Chapter 4 presents a profiling scheme that enables the system to be optimized for low energy
consumption. An energy lookup database is created based on the average energy consumption
of individual instructions in the MIPS instruction set. This database is used to supplement the
run-time data provided by the profiler to produce an estimate of the total energy consumption
of the system. Experiments are performed to determine the accuracy of the results, which have
been shown to be within 5.95% error, on average.
Chapter 5 describes the methodology of formulating the hardware/software partition based
on the results of the profiler. The partition is a recommendation in the form of an ordered
text file which indicates which functions are to be accelerated, ordered by the likely advantage
a hardware implementation would provide. Results are presented to show the performance
benefits that are attained when LEAP is used to partition software for the creation of a hybrid
system with LegUp.
Chapter 6 presents concluding remarks and suggestions for future work.
Chapter 2
Background and Related Work
2.1 Prior Work
All application and system developers share two common goals: functionality and performance.
Performance is especially important when developing embedded systems, where real-time con-
straints and limited resources impose challenges in implementation. These requirements em-
phasize the importance of good profiling tools to aid in the development cycle by quickly,
efficiently, and accurately providing run-time statistics to guide possible optimizations. The
following subsections describe prior work on profiling.
2.1.1 Time-Based Profiling
There have been many previous efforts which have attempted to monitor the execution time
characteristics of applications and attribute time to specific segments within them. Relevant
approaches can be categorized as software, hardware, or FPGA approaches. Software-based
approaches, which can involve simulation or the insertion of monitoring code into an application,
create a performance overhead but can be run without special hardware. Hardware-based
approaches make use of real hardware resources, such as performance counters, to gather run-
time data. FPGA-based approaches use custom hardware that interacts with signals on the
target soft processor in order to quickly attain an accurate profile.
A useful and popular software-based profiling tool is gprof [1], which is supported as part
of the GCC [26] distribution. This tool modifies the desired application by inserting code to
update profiling data and set up interrupts. These interrupts are used to sample the processor’s
program counter (PC), typically at intervals of 10ms, and attribute execution time to individual
functions. Sampling reduces accuracy but allows for a quick approximate profile.
Pin [33] is a software-based dynamic instrumentation system through which, using architecture-
generic APIs, instrumentation tools (including profilers) can be created. The API enables the
observation of all architectural states of a processor, including the contents of registers, memory,
and control flow. Pin uses just-in-time (JIT) compilation, taking a native executable as input,
and compiling traces as required for execution.
Many modern processors include hardware counters that can be used to monitor various
performance aspects of execution. Specifically, Intel has included a Time Stamp Counter (TSC)
in all x86 processors since the Pentium; software can access the TSC through the x86 instruction
RDTSC [31]. Similarly, in the MIPS R10000 architecture, two performance counters are used,
each of which can independently count one type of event at a time [38]; the MIPS instruction
MTC0 is used to specify the event for each counter, and MFC0 is used to read the counters’
values. In [19], methods are presented for using performance counters to create a profile with
more context than was previously available.
Several FPGA-based profiling tools have been introduced in recent years. These approaches
aim to alleviate the performance overhead of code instrumentation caused by all software-based
approaches, and the accuracy degradation that can be caused by sampling in approaches like
gprof. Four such tools are reviewed below.
SnoopP [43] is an FPGA-based profiler designed for use with the Xilinx [52] MicroBlaze [51]
soft processor on the Virtex-II 2000 FPGA [3] board. It is a non-intrusive profiler, meaning it
does not disrupt the execution of the processor or change the instruction stream. This profiler
is designed to provide developers with the ability to assess which aspects of their design fail to
meet required real-time constraints by allowing arbitrary code regions to be profiled for cycle
counts. The address ranges of each region, as well as the number of regions to profile, are
entered as parameters in the VHDL code of the profiler that must be set pre-synthesis. During
execution, comparators check the PC of the processor each clock cycle to see if it falls in any of
the specified regions. If it does, the 64-bit counter corresponding to that region is incremented
as shown in Figure 2.1. The circuit for this profiler, configured with the maximum 16 profilable
regions, uses 1129 flip-flops and 1719 look-up tables (LUTs), which is about the same size as the
MicroBlaze processor itself (1173 flip-flops and 1548 LUTs). The Virtex-II 2000 FPGA contains
21,504 of both flip-flops and 4-input LUTs, arranged in 10,752 slices; this means that the profiler
consumes 5.3% and 8.0% of the chip’s available flip-flop and LUT resources, respectively. This
area can be broken down as follows: the 16 64-bit cycle counters consume 1024 flip-flops, or 91%
of the profiler’s total flip-flop count, and the 32 32-bit comparators consume 1024 LUTs, or 60%
of the total LUT count. Due to the limited number of code regions that can be simultaneously
profiled with SnoopP, gprof was used to find regions of interest to choose appropriate code
regions for experiments.
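SnoopP's comparator-and-counter scheme can be modeled in software. The following C sketch is illustrative only (the type and function names are assumptions); it performs, one region at a time, the same range check the hardware comparators perform in parallel each clock cycle.

```c
#include <stdint.h>

#define NUM_REGIONS 16  /* SnoopP supports at most 16 profilable regions */

typedef struct {
    uint32_t lo_addr;   /* region start address (inclusive) */
    uint32_t hi_addr;   /* region end address (inclusive)   */
    uint64_t cycles;    /* 64-bit cycle counter             */
} region_t;

/* Model of one clock cycle: when the PC is valid, every region's
 * comparators check it, and the counter of any region containing
 * the PC is incremented. */
void snoop_cycle(region_t regions[], int n, uint32_t pc, int valid_pc) {
    if (!valid_pc)
        return;
    for (int i = 0; i < n; i++)
        if (pc >= regions[i].lo_addr && pc <= regions[i].hi_addr)
            regions[i].cycles++;
}
```

The valid_pc argument mirrors the valid_PC signal gating the counters; in hardware all sixteen comparisons happen concurrently rather than in a loop.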
Figure 2.1: Architecture of the SnoopP Profiler.
Airwolf [45] is an FPGA-based profiler that connects to the Altera Nios II-fast soft proces-
sor [12]. This profiler minimally impacts the execution of the processor by inserting software
drivers into each function call and return to enable/disable individual counters. These software
drivers consume 20 bytes per function in the instruction stream. Two counters are allocated for
each function to be profiled: a 32-bit hit counter counts the number of times a function is called,
and a 64-bit time counter counts the number of clock cycles spent in that function. As seen in
Figure 2.2, the Time Counter Enable Module connects to the Avalon Interface Bus to read data
from the software drivers and in turn enable or disable counters as required; the appropriate
Time Counter Enabling Lines (TCEL) are set high when a function has been called, and are set
low when the function returns. The Hits Counter Enable Line (HCEL) is held high for a single
clock cycle when the function is called. On a Stratix [16] FPGA, this circuit consumes 3345
logic elements (LEs), which corresponds to almost two times the area of the Nios II-fast (1800
LEs for just the core). Stratix devices range in capacity from 10,570-32,470 LEs, meaning the
profiler consumes 10.3-30.6% of the chip capacity.
AddressTracer [42] is an adaptation of SnoopP. It is a function-level, non-intrusive profiler
that counts the number of times a function is called and the number of cycles spent in the
function. Like SnoopP, AddressTracer profiles the MicroBlaze soft processor, but unlike SnoopP
its cycle counts represent the indicated function plus descendants. As seen in Figure 2.3, each
function to be profiled corresponds to a tracer block which contains two counters: a 32-bit hit
counter that counts the number of times a function is called, and a 64-bit time counter that
counts the number of clock cycles spent in that function and its descendants. During execution,
the PC of the processor is monitored by all tracer blocks, each comparing the PC with the start
and end addresses of the corresponding function. If the PC equals the function’s start address,
the hit counter is incremented and the time counter is enabled. If the PC equals the function’s
end address, the time counter is disabled. It is not specified in [42] how the values for the start
and end addresses are input, but since SnoopP requires these addresses pre-synthesis, it is likely
that AddressTracer functions similarly. The area of the profiler is also not specified but can be
assumed to be comparable to that of SnoopP.

Figure 2.2: Architecture of the Airwolf Profiler.
Figure 2.3: Architecture of the AddressTracer Profiler.
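A tracer block's behavior can likewise be sketched in C. The structure fields are illustrative, and the choice not to count the cycle at the exit address is an assumption, since [42] leaves such ordering details unspecified.

```c
#include <stdint.h>

typedef struct {
    uint32_t start_addr;  /* function entry address  */
    uint32_t end_addr;    /* function exit address   */
    uint32_t hits;        /* 32-bit hit counter      */
    uint64_t time;        /* 64-bit time counter     */
    int      count_en;    /* time counter enable     */
} tracer_t;

/* One clock cycle of a tracer block: compare the PC against the
 * function's start and end addresses, then count while enabled. */
void tracer_cycle(tracer_t *t, uint32_t pc) {
    if (pc == t->start_addr) {
        t->hits++;
        t->count_en = 1;
    } else if (pc == t->end_addr) {
        t->count_en = 0;
    }
    if (t->count_en)
        t->time++;
}
```

Because the enable stays set between start and end addresses, cycles spent in called descendants are attributed to the enclosing function, matching AddressTracer's inclusive counts.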
Comet [24] is an FPGA-based profiler for the Altera Nios [6] soft processor. It requires the
user to modify the application to be profiled with pragma-like labels to specify the start and
end of the regions of interest. The ranges denoted by these pragmas are used to initialize
the profiler at run-time. The profiler counts the total number of clock cycles spent in each
region and the number of times the region iterates. In the reported experiment, which
examined a JPEG decoder, the profiler was configured with 25-bit counters to profile 6 regions;
this consumed 486 LEs on an Altera APEX20K200E [5] FPGA. The APEX20K200E device
contains 8,320 LEs, meaning the profiler consumes 5.8% of the chip capacity.
2.1.2 Energy and Power-Based Profiling
Minimizing the energy consumption of mobile devices has become an important focus for soft-
ware developers in order to maximize battery life. In order to optimize applications for low
power, the energy consumption must be monitored and improved iteratively. Common methods
to measure an application’s power are either invasive, such as directly measuring the current
draw of a system, or very slow, such as simulation. Invasive methods are undesirable because
they require specialized experimental setups and, if a printed board is used, soldering or trace
cutting may be required. Simulation is often prohibitive because the time required to simulate
only seconds of execution can be hours or days for large circuits. Therefore there is a need for
rapid, non-invasive tools to aid in the process of energy consumption profiling and minimization.
Three previous approaches in this direction are described below.
The work in [44] describes a profiling-based methodology for power reduction within non-
FPGA embedded processors. The authors developed and verified instruction-level power models
for the Intel 486DX2 and Fujitsu SPARClite 934 processors by measuring the current drawn
as the processor repeatedly executed certain instructions or instruction sequences. A standard
digital ammeter was used to determine the steady-state current drawn by the processor when
running each instruction; this value became that instruction’s “base cost.” The authors also
describe inter-instruction effects, whereby the mix of the instructions in the processor’s pipeline
affects the current draw. These effects include circuit state, pipeline stalls, and cache misses,
all of which add to the overall energy consumption of the system. Accuracy within 3% error
was achieved for applications containing no stalls or cache misses, but no results were presented
for those containing stalls and cache misses.
A power estimation methodology to predict the power consumption of non-FPGA embedded
processors is presented in [41]. A power data bank, which stores the energy consumption of
built-in library functions, user-defined functions, and basic instructions, is created through the
use of a power simulator and carefully created test code. The authors note that the energy
cost for two instructions is usually different from their average due to inter-instruction effects;
measuring 2-3 instruction sequences increases estimation accuracy. The power data bank is
then used in conjunction with tracing tools to produce a function-level power estimate. The
experimental results reveal a 3.5% average error; however, the tests performed were on simplistic
benchmarks that use 6 library functions and integer/floating point additions and subtractions.
A method to estimate the energy consumption of soft processors on FPGAs, specifically the
MicroBlaze processor, is described in [40]. ModelSim [36] was used to perform timing simu-
lations of software programs representing each instruction in the MicroBlaze instruction set.
These simulations produce Value Change Dump (VCD) files, which are used by XPower [52]
to estimate the energy consumption of the simulated system. VCD files are an IEEE standard
format that use ASCII-based dumpfiles to record a series of time-ordered value changes for sig-
nals in a given simulation model. This data was collected for each instruction, thus creating an
instruction energy look-up table. A maximum difference of 5-6× was found between the energy
dissipations of different instructions in the instruction set. Using the Xilinx Microprocessor
Debugger (XMD) [52] TCL interface, as opposed to on-chip profiling, an instruction trace was
combined with the lookup table to estimate the overall energy consumption of an application.
Through four test programs, an average estimation error of 5.9% was achieved.
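The trace-plus-lookup-table estimation can be illustrated with a small C sketch. The opcodes, energy values, and trace below are invented for illustration; the actual table in [40] holds one entry per MIPS instruction, derived from XPower simulations.

```c
#include <stddef.h>

/* Hypothetical per-instruction energy costs (arbitrary units); a real
 * table would hold one entry per opcode, populated by simulation. */
typedef enum { OP_ADD, OP_MUL, OP_LW, OP_SW, NUM_OPS } opcode_t;

const double energy_table[NUM_OPS] = { 10.0, 35.0, 50.0, 45.0 };

/* Estimate total energy by walking an instruction trace and summing
 * each instruction's look-up-table entry. */
double estimate_energy(const opcode_t *trace, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += energy_table[trace[i]];
    return total;
}
```

The 5-6x spread between the cheapest and most expensive instructions reported above is what makes such a table worthwhile; a flat per-instruction cost would miss it entirely.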
2.2 MIPS Processor
In order to develop and test the hardware profiling architecture, a soft processor was required to
produce real-time profiling data. This processor must be able to interface with the prospective
profiler, so it needed to provide direct access to important signals required by a profiler, such
as the PC and the instruction bus. In addition, since the profiler will be an integral part of
the LegUp framework, the code for the processor must also be open-source. The choice of a
MIPS-based processor was not set as a stringent requirement, but rather a desirable trait due
to the popularity and availability of the MIPS architecture in mobile environments. ARM-
based processors were also considered, but due to the company’s prior issues with proprietary
rights ([21], [47]), open-source versions are not available. Other desirable traits in a prospective
processor were code readability, documentation, a working compilation tool-chain, and high
performance.
2.2.1 Processor Selection
Three open-source soft processors were found to satisfy the above requirements. These proces-
sors were tested, compared and contrasted to select the best candidate. The three processors
were:

- YACC [4], a processor found on OpenCores [2], which provides implementations for use with both Altera and Xilinx tools;
- SPREE [53], developed at the University of Toronto, which contains a C++ processor-generation mechanism to enable custom-processor creation;
- Tiger MIPS [39], developed at Cambridge University for the Altera DE-2 board [10], which utilizes Altera's SOPC Builder [15] to connect the processor to memory and other peripherals on the board.
Table 2.1 compares these processors based on a range of features, including maximum clock
frequency, area, system features, and implementation details. As is evident in the table, each
processor has its own advantages. YACC has the highest clock frequency of the three processors
and provides a clean implementation in Verilog. SPREE (using pipe5 forwardAB mulshift stall load
configuration) is the smallest processor, consuming only 4% of the LEs available on the Cyclone
II EP2C35 FPGA, and contains a C++ processor generator to create a tailored soft processor
implementation in Verilog. The exclusive use of on-chip memory in these processors limits their
use for two reasons: first, the processor’s netlist must be reassembled with memory initialization
files and reprogrammed to the FPGA each time an application is to be run. Second, the limited
on-chip resources set a much tighter bound on the possible application size than an off-chip
memory would. Tiger is the largest and slowest of the three processors, but these downfalls
can be partly attributed to the system features it implements; hardware division alone, imple-
mented with one signed and one unsigned divider, consumes over 2000 LEs. In addition, the
use of off-chip memory is supported through the use of JTAG UARTs, which communicate with
the host computer to enable the dynamic programming of applications.
Tiger was chosen as the target processor for this research, and consequently for use in the
LegUp system. The driving factor for this decision was largely based on the communication in-
terface it provides, which allows an application to be programmed to the processor dynamically,
and also enables real-time debugging using GDB [25]. In addition, it provides a GCC-based
tool-chain, comprehensive Verilog code, full coverage of the MIPS1 [37] instruction set, and a
working SOPC system implementation for the DE-2 board.
2.2.2 Processor Details
Tiger is a 32-bit 5-stage processor that supports the MIPS1 instruction set [37]. The RTL
schematic of the processor can be seen in Figure 2.4. This schematic shows the five pipeline
stages, which are standard for reduced instruction set computing (RISC) processors:

- The Instruction Fetch stage retrieves the instruction to execute from memory,
- The Decode stage determines the operation to perform based on the binary operation code (opcode) fetched,
- The Execute stage performs the computation required by the decoded instruction,
- The Memory Access stage communicates with memory to load or store data,
- The Writeback stage writes instruction results back to the register file.
Table 2.1: Comparison of three candidate soft processors. Note: Area percentages refer to the percent utilization on the Cyclone II EP2C35 FPGA.

Criteria               | YACC                           | SPREE                                  | Tiger
Fmax                   | 96.9 MHz                       | 81.22 MHz                              | 56.29 MHz
Area - total LEs       | 3,543 (11%)                    | 1,235 (4%)                             | 10,073 (30%)
Area - total regs      | 1,191                          | 349                                    | 4,391
Area - total mem bits  | 137,216 (28%)                  | 395,364 (82%)                          | 193,017 (40%)
Area - embed. mult.    | 0                              | 8                                      | 16
Memory types & sizes   | Single on-chip memory: 4096x32 | Instruction memory: 8192x32; data memory: 4096x32 | Instruction & data caches: 512x148; application memory: 8 MB SDRAM
Pipeline depth         | 5 stages                       | Variable (3, 5, 7); 5 used for results | 5 stages
Features               | UART                           | CPU generator, ModelSim benchmarking tools | SDRAM memory, instr./data cache, Avalon bus connections, JTAG debugger, HW division
Compilation tool chain | GCC-based batch file creates 4 hex files for memory init. | GCC-based makefile creates MIF/RIF files for memory init. | GCC-based batch file with SignalTap programmer
Benchmarks included    | UART calculator, int-to-text counter, Dhrystone, pi calculator, Reed-Solomon | 20 benchmarks (incl. crc32, fft, dijkstra, qsort, sha, des, fir, vlc, Dhrystone) | "game of life" hardware via SOPC
MIPS coverage          | Does NOT support: bgezal, bltzal, break, syscall, mthi, mtli | Does NOT support: lwl, lr, swl, swr, div, divu, bgezal, blzal, syscall, break | Full coverage of MIPS1
Documentation          | HTML overview page, structural diagrams | ReadMEs throughout directory structure, thesis, overview PPT | HTML overview page, structural diagrams, descriptions in comments
Code quality           | Well-structured, slightly commented | Well-structured                    | Well-structured, very well-commented
Table 2.2: Synthesis results for Tiger processor on Altera Cyclone II (2C35) FPGA. Values in parentheses indicate chip capacity.

Total logic elements     | 13,422 (33,216)
Total registers          | 6,455 (33,216)
Total memory bits        | 225,280 (483,840)
Embedded multipliers     | 16 (70)
Maximum frequency (MHz)  | 75.59 (402.5)
To facilitate the correct data flow among these pipeline stages, a stall logic module (bottom
of Figure 2.4) processes stall requests and control signals so that it may stall certain pipeline
stages as required. When implemented on an Altera Cyclone II [9] FPGA, Tiger consumes
13,422 logic elements (LEs) running at 75.59 MHz. More details are given in Table 2.2.
Tiger uses two types of memory for instruction and data storage: on-chip block RAM and
off-chip SDRAM on the DE-2 board. The communication between these memories and the
processor is achieved through the use of Altera’s SOPC Builder, which automatically generates
the arbitration logic to connect master and slave peripherals. At startup, reset, or after the
completion of an application, Tiger defaults to the code stored on-chip; this code acts as its boot
loader and is independent of the applications being run. In order to execute a given application,
the application must first be loaded into SDRAM.
In order to reduce the latency required to read and write from SDRAM, on-chip caches are
used. Tiger contains both a data cache and an instruction cache, each implemented in on-chip
block RAM. These caches are direct-mapped, meaning there is only one location in the cache a
given address can map to. Therefore, conflicts arising from the mapping of multiple addresses
to the same cache line are always resolved by replacement. They are also write-through caches,
meaning every write to the cache causes a synchronous write to the SDRAM. In order to
alleviate the inherent penalty of cache misses, these caches aim to reduce misses by performing
bursts when reading from the SDRAM. Every cache miss triggers a read of four consecutive
32-bit data words, stored as a single cache line. Tag and offset bits consume 20 bits, making
each cache line 148 bits wide.
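The address decomposition implied by this organization can be sketched in C. The field widths below are assumptions consistent with 512 lines of four 32-bit words (16 bytes per line); the real Tiger cache may partition the address bits differently.

```c
#include <stdint.h>

/* Direct-mapped cache geometry: 512 lines of four 32-bit words. */
#define LINE_BYTES   16u
#define NUM_LINES    512u
#define OFFSET_BITS  4u   /* log2(16)  */
#define INDEX_BITS   9u   /* log2(512) */

uint32_t cache_offset(uint32_t addr) { return addr & (LINE_BYTES - 1); }
uint32_t cache_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & (NUM_LINES - 1); }
uint32_t cache_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

/* Two addresses conflict (evict one another) when they share an index
 * but differ in tag -- the direct-mapped replacement case noted above. */
int conflicts(uint32_t a, uint32_t b) {
    return cache_index(a) == cache_index(b) && cache_tag(a) != cache_tag(b);
}
```

On a miss, the burst read fills all four words whose addresses share the same tag and index, so subsequent accesses within the 16-byte line hit.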
Multiple modifications were made to the original processor system for integration into the
LegUp framework. First, a UART was added to enable the use of printf in the applications being
run. A more significant modification was to alter the data cache so that it communicates with
the processor over the Avalon interconnect fabric, as opposed to the original implementation
in which it communicated directly with the processor. Due to the high speeds achievable
using AltSyncRams and the Avalon fabric, the single-cycle access latency of the data cache is
maintained in this configuration. This change allowed a shared-memory model to be adopted
among the processor and variable numbers of accelerators for communication with main memory
via the cache. In addition, critical path delays were analyzed and optimized to boost the
maximum frequency of the processor from its initial speed of 43.86 MHz to its current maximum
of 75.59 MHz.
2.2.3 Communication Interface
In order to dynamically program an application to the processor, or perform debugging with
GDB, a communication interface between Tiger and a host computer was created. This interface
enables messages to be passed back and forth between the processor and host computer. To
program an application to the FPGA, the length of the application’s binary data (machine
code), the memory address at which the application should be stored, and the binary data
itself must be sent.
Three software tools were provided with the original Tiger processor to handle all communi-
cation between the processor and the host computer. This communication is performed on the
FPGA through the USB ByteBlaster [8], which allows UART communication via a standard
USB cable. The host computer handles the communication through sockets. First, a server is
needed to connect these two types of communication protocols by intercepting all messages and
relaying them appropriately. Clients must then be created on the FPGA and host computer
in order to send and receive messages. This flow permits programming an application to the
board, as well as debugging the application using GDB. In addition, this flow is used to initialize
and retrieve data from the profiling framework, which will be discussed in Section 3.1.4.

Figure 2.4: RTL schematic for original version of the Tiger MIPS processor.
The server required to enable communication between the FPGA and the host computer is
a GUI-based application developed in C# called “MIPS Communication Server.” It connects
to the FPGA through the ByteBlaster using the SignalTap II Logic Analyzer [17], a tool that
captures internal device signals for debugging, but also provides a TCL interface for JTAG
communication. The server communicates with the processor through blocking sends/receives
over a virtual JTAG UART. In addition, it creates a local server on the host computer that
can be connected to via the computer’s command line, enabling communication through
socket-based sends and receives.
The command-line interface on the host computer, called “MIPSLoad,” is written in C and
is used to disassemble compiled Executable and Linkable Format (ELF) files into readable
sections, send all data to the SDRAM, and verify the contents once programmed. ELF is a
standard binary file format for executables, object code and shared libraries, and is not bound
to any specific computer architecture. This data transfer is initiated by first sending a single
character to indicate the task to be performed; each task is associated with a unique character
and this encoding is known to the server and FPGA client. After the task is established,
task-specific parameters are sent consecutively. For example, a section of the ELF file can be
programmed by first sending “M” to specify the task, and then the required parameters: the
SDRAM address to write to, the length of the section, and the data itself, as seen in Figure 2.5.
An ELF section is verified by sending “m” as the task, followed by the SDRAM address to read
from and the length of data to be read, and then continuously reading transmitted data until
all is received. This received data is then compared to the data previously sent to ensure that
no errors were encountered and the application is correctly programmed.
int sendSection(char *sectionData, unsigned int addr, unsigned int length) {
    if (serverSend("M", 1)) return 1;
    if (serverSend(&addr, 4)) return 1;
    if (serverSend(&length, 4)) return 1;
    if (serverSend(sectionData, length)) return 1;
    return 0;
}
Figure 2.5: Host computer source code to send ELF section to processor.
The communication mechanism on the Tiger processor is called the “debug stub,” which
uses blocking sends and receives through a virtual JTAG UART to communicate with the host
server. The debug stub is contained in the boot loader code that is run both at startup and after
the completion of an application’s execution. When called, the debug stub performs a blocking
UART read to wait on a single-character task identifier to be sent from the host computer. A
case statement is then used to receive data relevant to the required task. For example, an ELF
section is programmed to the SDRAM with the task identifier “M.” The debug stub receives
the data from the host computer and then writes that data to memory mapped addresses using
pointer arithmetic, as seen in Figure 2.6. In addition to the debug stub, the boot loading code
handles tasks such as flushing the caches and resetting registers.
// Receive task identifier
char id = uart_readc(DEBUG_UART);

switch (id) {
    ...
    case 'M':
        // Receive section length
        unsigned int Length = uart_readUInt(DEBUG_UART);

        // Receive and create pointer to desired section address
        volatile unsigned char *Addr =
            (volatile unsigned char *) uart_readUInt(DEBUG_UART);

        // Receive section one byte at a time, store each byte to SDRAM
        int i;
        for (i = 0; i < Length; ++i) {
            *Addr = uart_readc(DEBUG_UART);
            Addr++;
        }

        // Return value of 0 indicates proper execution
        uart_writeUInt(DEBUG_UART, 0);

        // Flush the cache of stale data
        flushICache();
        break;
    ...
}
Figure 2.6: FPGA-side code to receive section from host computer.
2.3 LegUp
Hardware profiling with LEAP is a key component in a larger project at the University of
Toronto, called LegUp. LegUp is an open-source system that consists of an HLS compilation
flow, a hybrid processor/accelerator architecture generator, and LEAP-based hardware/soft-
ware partitioning. These components result in a system that provides a semi-automated self-
acceleration flow for software code running on the MIPS processor previously described.
2.3.1 Design Flow
The LegUp design flow consists of multiple steps, many of them automated, which generate a
hybrid architecture that accelerates a particular software application. Figure 2.7 illustrates
the detailed flow.
Referring to the labels in the figure, at step ➀ the user compiles a standard C program to
a software-only binary executable using the LLVM compiler [30]. At step ➁, the executable is
run on the Tiger MIPS processor described in Section 2.2.2.
Using LEAP, sections of program code that would most benefit from hardware implementa-
tion to improve program throughput and power are selected. Specifically, the profiling results
drive hardware/software partitioning, which is the selection of program code segments to be
re-targeted to custom hardware from the C source.
Having chosen program segments to target to custom hardware, at step ➂ the LegUp HLS tool
reads a configuration file representing the hardware/software partition and compiles the required
segments to synthesizable Verilog RTL. LegUp’s HLS operates at the function level; entire
functions are synthesized to hardware from the C source. Moreover, if a hardware function calls
other functions, these called functions are also synthesized to hardware; hardware-accelerated
functions may not call software functions. The RTL produced by LegUp is synthesized to an
FPGA implementation using standard commercial tools [14] at step ➃. As illustrated in the
figure, LegUp’s hardware synthesis and software compilation are part of the same LLVM-based
compiler framework.
In step ➄, the C source is modified such that the functions implemented as hardware accel-
erators are replaced by wrapper functions that communicate with the accelerators (instead of
performing computations in software). This new modified source is compiled to a MIPS binary
executable. Finally, in step ➅ the hybrid processor/accelerator system executes on the FPGA.
Step ➆ (the dashed line) shows a software-only optimization flow, in which a developer uses
the profiling data to target optimization efforts.
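The wrapper substitution of step ➄ can be sketched as follows. Everything here is hypothetical: the register layout, the accelerator operation, and the names are invented, and the accelerator is simulated in software so the sketch runs on a host machine. On the real system the registers would sit at fixed Avalon addresses and the fabric would compute concurrently with the polling loop.

```c
#include <stdint.h>

/* Hypothetical memory-mapped register file of one accelerator; on the
 * FPGA these would be fixed Avalon addresses, here plain variables. */
typedef struct {
    volatile int32_t arg_a, arg_b;  /* input registers     */
    volatile int32_t start;         /* write 1 to launch   */
    volatile int32_t done;          /* polls to 1 when done */
    volatile int32_t result;        /* output register     */
} accel_regs_t;

accel_regs_t accel;

/* Software stand-in for the hardware accelerator (illustrative op). */
void simulate_accelerator(void) {
    accel.result = accel.arg_a + accel.arg_b;
    accel.done = 1;
}

/* Wrapper that replaces the original software function: it forwards
 * the arguments to the accelerator, starts it, polls for completion,
 * and returns the result instead of computing in software. */
int32_t add_wrapper(int32_t a, int32_t b) {
    accel.arg_a = a;
    accel.arg_b = b;
    accel.done  = 0;
    accel.start = 1;
    simulate_accelerator();  /* real hardware runs concurrently */
    while (!accel.done) { /* spin until the accelerator finishes */ }
    return accel.result;
}
```

The caller's code is unchanged; only the function body is swapped, which is what lets LegUp retarget individual functions without touching the rest of the program.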
Figure 2.7: Design flow with LegUp.
2.3.2 System Architecture
The target system architecture of the LegUp framework is shown in Figure 2.8. The processor
connects to one or more custom hardware accelerators through a standard on-chip interface.
As the initial hardware platform is the Altera DE-2 Development and Education board (con-
taining a 90nm Cyclone II FPGA), the Altera Avalon interface is used for processor/accelerator
communication [7]. Synthesizable RTL code for the Avalon interface is generated automatically
using Altera’s SOPC builder tool. The DE-2 was chosen because of its widespread availability.
As shown in Figure 2.8, a shared memory architecture is adopted. The processor and ac-
celerators share an on-FPGA data cache and off-chip main memory, which are arbitrated by a
memory controller. Such an architecture allows processor/accelerator communication across the
Avalon interface or through memory. The shared single cache obviates the need to implement
cache coherency or automatic cache line invalidation.

Figure 2.8: High-level architecture overview of LegUp.
2.4 Summary
Multiple efforts have been made towards the goal of accurate time-based profiling. Software
profiling methods, such as gprof, suffer both from the overhead caused by inserted instructions
and from accuracy loss due to sampling. FPGA-based profiling tools have been designed
to overcome these issues. Power profiling has been researched on both embedded systems and
FPGAs; however, the results presented in all cases are for small benchmark sets. In addition, no
on-chip power profiling schemes have been presented. Tiger MIPS is the target soft processor
used in this research, as well as the processor used within the hybrid systems produced by
the LegUp HLS framework. LegUp is used to compile an application to software, profile its
execution, choose function candidates for acceleration, and synthesize the software code to
create hardware accelerators which communicate with the processor.
Chapter 3
Cycle-Accurate Profiling
3.1 Profiler Design
This chapter describes the creation of a low-overhead hardware profiler that monitors a proces-
sor as it executes, through the development of an area- and power-efficient architecture called
LEAP. The design goals for LEAP are to compactly gather and store all necessary data while
minimizing the energy expended in doing so. In addition, the profiler architecture has been
designed with extensibility in mind so that the measurement of different dynamic attributes of
a processor, such as run-time, cache stall rates, and energy consumption, may be implemented
easily by replacing a single module. Although LEAP targets a MIPS-based soft processor, there
are very few MIPS-specific aspects in the architecture. The profiler could be easily adapted to
a different processor, given that required signals, such as the program counter (PC), instruction
bus, and cache stall status, are provided.
3.1.1 Method of Operation
Throughout the execution of a program, the processor’s PC contains the address of the currently
executing instruction. LEAP profiles the execution of the program by monitoring the PC and
the instruction bus. During execution, the profiler maintains a single counter, called the Data
Counter, that tracks the number of times an event has occurred. In the case of instruction count
profiling, this involves incrementing the Data Counter whenever the PC changes. For cycle
count profiling, the counter is incremented every clock cycle. LEAP organizes the collected data
on a per-function basis by allocating a storage counter for each software function. Recursion
support is currently limited by the depth of the Call Stack (described below); however, details
on providing full recursion support are discussed in Section 6.2.5.
Function boundaries are identified by decoding the executing instruction to check for function
calls or returns. If a function call is detected, the Data Counter is added to any previously
stored values associated with this function (from previous executions of the function). The
Data Counter is then reset to 0 so that it can start counting the events that have occurred in
this newly called function. If a function return is detected, the Data Counter value is added to
the counter associated with the current function, and once again the Data Counter is reset.
A key feature of LEAP is that data is only stored during function context switches, which
are triggered by calls to a new function or returns from the previous function. Minimal logic
is used to dynamically update the Data Counter (as opposed to updating counters for each
function each clock cycle), only updating the storage corresponding to the relevant function at
these context switches. It is shown in Section 3.4.4.2 that using a single counter value is more
power efficient than techniques described in Section 2.1.1. Another important feature of LEAP
is that the system (profiler and processor) only requires resynthesis if any of the configurable
parameters are modified (discussed in Section 3.4.1); different applications can be run and
profiled without resynthesis or reprogramming the board. A novel aspect of the design is the
use of perfect hashing in hardware to associate code segments with hardware counters that
track dynamic run-time behavior. This use of hashing leads to considerably smaller overhead
versus previously-published hardware profiler designs.
The logical flow used in LEAP can be seen in Figure 3.1 for instruction-count profiling. In
the left branch, the PC is monitored to detect the occurrence of each newly-issued instruction.
The Data Counter is incremented each time a different instruction executes. The middle branch
detects function calls; when a call is made to a new function, the Data Counter is added to
the counter corresponding to the caller function. The Data Counter is then reset so that it
now represents only instructions executed in the callee function. At the same time, the caller
function number is stored to a stack so that when the callee function returns, the profiler is able
to keep track of function context. The right branch is used to detect function returns; when a
function executes a return statement, the Data Counter is added to the counter corresponding
to this function. Popping from the aforementioned stack yields the number of the function that
will be returned to, setting the current function context. The Data Counter is then reset again.
Figure 3.1: High-level flow chart for instruction-count profiling.
3.1.2 Components
To realize the flow in Figure 3.1 efficiently, five key modules were created, organized as shown
in Figure 3.2. The Operation Decoder (labeled “Op Decoder”) monitors the instruction bus
to determine whether a call or return has been executed. The Call Stack is used to keep track
of the currently executing function, since a return instruction does not indicate which function
will be returned to. The Data Counter is used to store scheme-specific profiling data for the
currently-executing function. The Counter Storage module takes data collected from the Data
Counter and updates the relevant function’s stored result. Lastly, the Address Hash module is
used to map the address decoded from a call instruction to a unique function number.
Figure 3.2: Modular view of profiling architecture.
3.1.2.1 Operation Decoder
In order to correctly identify when a function context switch occurs, call and return instructions
must be identified and the target of a call must be decoded.
A function call may be invoked by two MIPS instructions: JAL (jump-and-link), which jumps
to a predefined address and places the return address in register 31, and JALR (jump-and-link
register), which jumps to the address contained in a register, and places the return address in
register “rd.” As seen in Figure 3.3, the JAL instruction can be identified by comparing the
top six bits of the instruction to 000011, while the target can be decoded by extracting the
lower 26 bits of the instruction, shifted left by two. The JALR instruction can be identified by
comparing the top six bits of the instruction to 000000 and the lower six bits to 001001. Since
the target is stored in one of the 32 registers in the standard MIPS register file, it is not as easy
to decode because the profiler does not have access to the processor’s register file; due to the
open-source nature of the processor, these signals could be monitored, but this would restrict
the portability of the profiler. Special logic in the profiler delays use of this unknown jump
target until the actual jump occurs, at which time this new PC is used as the target address.
It should be noted that this does not delay the processor itself, only the profiler’s consumption
of data.
A function return can be invoked only by one instruction, JR (jump register), which jumps
to the address stored in a register. However, this instruction is also commonly used for simple
branching within a function, so the two uses must be differentiated. Since function calls always
store the return address to register 31, returns can be identified by any JR instruction using
register 31; this is determined by comparing the entire instruction to
0000 0011 1110 0000 0000 0000 0000 1000.
JAL: 0000 11ii iiii iiii iiii iiii iiii iiii
JALR: 0000 00ss sss0 0000 dddd d000 0000 1001
JR: 0000 00ss sss0 0000 0000 0000 0000 1000
Figure 3.3: Opcodes used to determine function context switches. Note: i represents bits of an immediate, s represents bits of the source register number, and d represents bits of the destination register number.
3.1.2.2 Call Stack
Since function names are not used at the machine-code level, they cannot be used to represent
functions in the profiler. Therefore a function is uniquely represented by the address of its first
instruction. This is quite useful, since the target of a function call also uses this address; this
target can be used to easily identify the function to which the processor is jumping. When a
function completes, the profiler must also determine which function the processor is returning
to. Two issues exist in regards to function returns: first, the target address is not encoded into
the return instruction, and second, the return address corresponds to the instruction located
just after the original function call, not to the first instruction of that function. This means
that the return address cannot be associated with a function without checking all functions to
see if the address lies between the start and end of the function.
A solution to this issue is to maintain a stack in which function indices (described below in
Section 3.1.2.5) are pushed during function calls and popped during function returns. In this
way, the current function can be tracked easily. To compactly represent function indices, the
Call Stack stores the hashes of the function addresses, which will be described in Section 3.1.2.5.
The stack is implemented using one Altera synchronous RAM (AltSyncRam), which is an
Altera-specific on-chip memory module [11]. The number of entries in the stack represents the
maximum nested call depth the profiler can support; for all experiments in this research, the
number of entries is 32. Since the maximum depth found in all benchmarks considered was only
8, seen in Table 3.1, this value is more than sufficient. This limit on the call depth impedes the
use of recursion in a target application; recursion support is discussed in Section 6.2.5.
Table 3.1: Maximum depth of nested function calls.
Benchmark    Max. Depth
ADPCM        4
AES          4
BLOWFISH     3
DFADD        5
DFDIV        5
DFMUL        3
DFSIN        6
GSM          3
JPEG         8
MIPS         1
MOTION       6
SHA          3
DHRYSTONE    3
3.1.2.3 Data Counter
The Data Counter module performs the actual measurement of any required metrics; the re-
mainder of the architecture is used to properly associate this data with the function it corre-
sponds to. In this way, the implementation of a new profiling task only requires the modification
of this module. For cycle-accurate profiling, the Data Counter increments every time the PC
changes (instruction count profiling), every cycle (cycle count profiling), or every cycle while
stalling (stall cycle profiling). These profiling schemes and the Data Counter are described in
more detail in Section 3.2.
3.1.2.4 Counter Storage
The Counter Storage module consists of one AltSyncRam instance and update logic. At any
function context switch, the RAM entry for the current function is read, added to the current
value of the Data Counter, and written back to the same address in the RAM. The function
address is hashed into a unique number that is used to index into the RAM; hashing details are
described in the following subsection. Through this design, all data for a particular function is
accrued correctly, regardless of the number of calls and returns.
3.1.2.5 Address Hash
To store data for each function in an efficient manner, the indices used to represent the functions
must be continuous over a compact range. The address of the first instruction in a function
is generally used to identify the function, but if these addresses were used to index into the
counter storage RAM, the address space of the RAM would need to be as large as that of the
processor. For a 32-bit PC, this would require 2^32 entries. This large address space is impossible
to realize using memory bits on an FPGA; therefore, the address space must be translated to
a more compact range. This translation is achieved through hashing.
Figure 3.4 illustrates the above issue and how it is resolved. A sample program written in
C is shown in part (a), alongside the corresponding MIPS assembly in part (b), condensed
for brevity. As indicated in part (b), only three function addresses are required, but their
values span the range 0x00800000 to 0x00800094. Part (c) shows the desired mapping of the
function addresses to a continuous set of indices.
(a) C Code Example:

int sqr(int a) {
    return a * a;
}

int sqr_add(int a, int b) {
    return sqr(a) + sqr(b);
}

int main() {
    int x = 5;
    int y = 6;
    int z = sqr_add(x, y);
    return z;
}

(b) Condensed MIPS Assembly:

00800000 <sqr>:
  800004:  mult a0, a0
  ...
  800024:  jr ra
  800028:  nop

0080002c <sqr_add>:
  ...
  800048:  jal 800000 <sqr>
  ...
  80005c:  jal 800000 <sqr>
  800060:  nop
  800064:  addu v0, s0, v0
  ...
  80008c:  jr ra
  800090:  nop

00800094 <main>:
  ...
  80009c:  li v0, 5
  ...
  8000a4:  li a1, 6
  ...
  8000b4:  jal 80002c <sqr_add>
  ...
  8000dc:  jr ra
  8000e0:  nop

(c) Example Function Address Hashing:

00800000 ⇐⇒ Function 0
0080002c ⇐⇒ Function 1
00800094 ⇐⇒ Function 2
Figure 3.4: Code example to illustrate the need for function address hashing.
A hash function is a mathematical function that maps a large set of data to a smaller set. A
perfect hash function is one which performs a unique mapping so that no two inputs map to
the same output. Perfect hashing is used to convert each function address in the program into
an index in the range 0 to N-1, where N is the number of functions in the program rounded to
the next power of two.
The perfect hash generator described in [32] is used to create an application-specific perfect
hash. This generator is based on a static hashing algorithm, shown in Figure 3.5, which uses
6 parameters that can be tailored to the specific application to guarantee collision-free hashes.
During compilation of the application to be profiled, all function addresses are extracted from
the executable’s source and passed through the perfect hash generator, outputting the cus-
tomized set of parameters for use with the algorithm in Figure 3.5. This algorithm is efficient
in hardware because it largely consists of simple logical operations, plus one memory access.
The perfect hash generator first uses normal hashing to determine (A,B) pairs for each input
key (function address) such that (A,B) are distinct for all keys. Parameters A1 and A2 are
determined from the number of bits in A required to find distinct (A,B) pairs; B1 and B2
are found similarly from the number of bits in B. Parameter V is determined from the seed
value that was used to find the distinct pairs. After the distinct pairs are found,
the tab array is populated to produce a mapping such that the hash is perfect. The parameter
sizes are as follows: tab (N-entry, 1-byte array), V (4 bytes), A1, A2, B1, and B2 (each 1 byte).
These parameters, totaling only (8+N) bytes, must be generated for each application that is to
be profiled to perform the unique mapping. At execution time of the program being profiled,
the profiler initializes the Address Hash module with these parameters. This initialization flow
is discussed further in Section 3.1.4.2.
3.1.3 Compilation Flow
Multiple compilation phases are required to generate a profilable executable, as shown in Fig-
ure 3.6. The first phase uses the LLVM [30] compiler infrastructure to compile the C source
code to MIPS assembly. The front-end of the compiler parses the C code into LLVM’s internal
// Input: val, output: rsl
// Parameters: tab[], V, A1, A2, B1, B2
int doHash(unsigned int val) {
    unsigned int a, b, rsl;

    val += V;
    val += (val << 8);
    val ^= (val >> 4);
    b = (val >> B1) & B2;
    a = (val + (val << A1)) >> A2;
    rsl = (a ^ tab[b]);

    return rsl;
}

Figure 3.5: Parameterized hashing algorithm used by the perfect hash generator (implemented in C).
bytecode representation. At this stage, hardware/software partitioning may be performed to
accelerate certain functions using LegUp’s high-level synthesis, as described in Section 2.3. The
partitioned bytecode is then linked with library code, and the back-end compiler generates a
MIPS assembly file. Due to LLVM’s limited support for the MIPS backend, LLVM cannot
create executable files; it can only be used to generate MIPS assembly code.
[Figure 3.6: flow diagram — C Code → LLVM → .s → GNU Binutils (assemble → .o, link → .elf) → Function Extraction → Perfect Hash Generator → .hash]

Figure 3.6: Compilation flow to generate programming files.
The second phase uses a cross-compiled version of GNU Binutils, a collection of binary tools
such as a linker and an assembler. Cross-compilation is the act of compiling an executable for a
computer architecture other than the one the compiler runs on. The cross-compiler used in this
work runs on an x86 processor but targets the MIPS architecture. It is used to first assemble
the MIPS assembly file produced by LLVM into an object file, then link this object file with
library code to produce machine code capable of running on the Tiger processor.
The third phase enables profiling of the executable by analyzing all declared functions and
determining the appropriate values with which the profiler’s perfect hashing algorithm should
be initialized. First, the name and address of each function in the executable are extracted and
stored to two files, with data for each function on a separate line. Then, the list of all function
addresses is input into the perfect hash generator. The generator outputs a “.hash” file, which
is an initialization file containing all parameters required in the perfect hashing algorithm. An
example hash initialization file is shown in Figure 3.7.
tab[] = {17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
V  = 0x62c41909
A1 = 14
A2 = 27
B1 = 2
B2 = 0x1

Figure 3.7: Example hash initialization file for a 32-function program.
3.1.4 Data Transfer
After the processor system has been synthesized and programmed to the FPGA, the target
MIPS application must also be programmed. This involves running the boot loader code on the
Tiger processor to load the entire application into program memory on the FPGA board being
used. To make the discussion in the rest of this section more concrete, this memory is assumed
to be 8MB of off-chip SDRAM. After the application is loaded, a run signal is sent to tell the
processor where to start execution. The application code is always loaded into the beginning
(lowest addresses) of the SDRAM, which is designated as address 0x800000 in the system. The
profiling initialization data, and later the profiling results, are allocated near the end of the
SDRAM, starting at address 0xFFE000. This region dedicated to profiling data consumes less
than 0.05% of the 8MB SDRAM and therefore will not interfere with the application. The
layout of the SDRAM is seen in Figure 3.8.
0x00800000   Application Data
    ...      Stack
0x00FFDFFC   Status Register
0x00FFE000   Profiler Data
0x00FFFFFF   (end of SDRAM)

Figure 3.8: An example memory layout of program memory, assuming 8MB off-chip SDRAM. Note: address ranges are not to scale, and the profiler data region is used for both initialization data and retrieval data.
3.1.4.1 Wrapper Function
Three important steps are required to properly execute and profile an application. First, the
profiler must be initialized with the hashing parameters specific to the program that will be
run. Next, the program must commence execution while the profiler, which is now initialized,
monitors the run-time behavior of the processor. Finally, after the program has completed, the
results of the profiler must be returned to the user. To this end, a wrapper function was created
whose sole purpose is to perform these tasks in order. The C code for the wrapper can be seen
in Figure 3.9.
The initialization data required by the address hash block, discussed in Section 3.1.2.5, is
specific to the application being run and must be reinitialized for every program. The wrapper
allows the processor to delay execution of the program until the profiler is ready, and provides
application-independent instruction addresses to inform the profiler when to begin initialization
and retrieval. Once the application begins, however, the wrapper does not interfere with the
program flow.
First, the processor clears address 0xFFDFFC, which is used as a status register to indicate
the state of execution. Two constants are used to indicate the completion of the initialization
and retrieval routines. The wrapper polls the status register until it equals 1, indicating that
the profiler has completed its initialization. The flushCache function is called before every read
of the status register to flush both the data and instruction caches. This is necessary because
the status register, located in main memory, is modified directly without the use of caching.
Since these updates are not reflected in the caches, a stale cached value would always be
read if the caches were not flushed.
Upon completion of the initialization routine, the details of which will be discussed in the
following subsection, the application is run by performing a call to main. Upon completion
of the program, execution returns to the wrapper and the profile results are gathered. The
retrieval routine, which will be discussed in Section 3.1.4.3, is commenced and again the status
register is polled until it equals 2, indicating the completion of data retrieval. At this point,
the application has completed and the profiling results are available on the SDRAM.
3.1.4.2 Initialization
The wrapper function provides application-independent instruction addresses to indicate when
initialization should be performed. The profiler monitors the processor’s PC until it is equal
to 0x800018, which corresponds to an instruction that precedes the initialization loop which
polls the status register in Figure 3.9. Since the wrapper is independent of the application to
execute, it is precompiled and linked in at compile-time; this guarantees that the monitored
address will always correspond to the same instruction.
By monitoring the PC, the profiler can recognize when the processor is executing the wrap-
per; it can then initialize its hashing constants V, A1, A2, B1, and B2, by directly reading
predetermined addresses from SDRAM, as shown in Figure 3.8. The tab array, consisting of
#define STACK_ADDR 0x00FFDFFC

void wrap() {
    int ret;
    unsigned long int ProfilerStatus;

    // Reset ProfilerStatus
    *(volatile unsigned long int *)(STACK_ADDR) = 0;

    // Wait for profiler to be initialized
    do {
        flushCache();
        ProfilerStatus = *(volatile unsigned long int *)(STACK_ADDR);
    } while (ProfilerStatus != 1);

    // Run actual program
    ret = main();

    // Wait for profiler retrieval to be finished
    do {
        flushCache();
        ProfilerStatus = *(volatile unsigned long int *)(STACK_ADDR);
    } while (ProfilerStatus != 2);
}

Figure 3.9: Application-independent wrapper to enable initialization and data retrieval.
N bytes (where N is the maximum number of functions to profile), is read next and stored
to a designated RAM block within the profiler. The details of these hashing parameters were
discussed previously in Section 3.1.2.5.
Next, all modules must be reset, a step which is especially important for the Call Stack and
Counter Storage modules. The Call Stack is reset by setting the stack pointer to 0, while the
Counter Storage must explicitly write a 0 into each entry.
The final task is to initialize the Call Stack to set the wrapper function as the top level of
hierarchy. This is required to ensure correct behaviour after the application’s main function
returns; if the Call Stack was not initialized properly, main would be located in entry 0 of the
stack when the return instruction executes. This return causes a pop from an empty stack,
which is incorrect behavior. By pushing the wrapper onto the stack first, the return from main
will no longer cause a pop from an empty stack as there is still one element remaining.
Upon completion of the initialization routine, the profiler must instruct the processor to
continue execution. This is accomplished by setting the status register, located at address
0xFFDFFC, equal to 1. When this value has been written successfully, the profiler begins
monitoring the processor to profile its execution.
3.1.4.3 Data Retrieval
Data retrieval begins when execution returns to the wrapper function after completing its call
to main, corresponding to a PC value of 0x800074. The first step in data retrieval is to write
all profiling results, which are stored in RAM blocks within the profiler, to a predetermined
location in SDRAM. This region begins at address 0x00FFE000, which is also the location to
which the hash initialization data is written. These results, read and written in order of hashed
function number, are stored in contiguous locations in memory. Once the writes to SDRAM
have finished, the profiler instructs the processor to complete execution by setting the status
register equal to 2. The profiling results are passed to the host computer using the MIPSLoad
program discussed previously in Section 2.2.3.
3.2 Cycle and Stall Cycle Profiling
As described in Section 1.3, two goals of the proposed hardware/software partitioning scheme
are to maximize throughput and to exploit data locality in a prospective accelerator. Through-
put can be maximized by accelerating functions that consume the largest number of cycles
during execution. Temporal data locality is exploited through the use of a cache, so that
recently-used data can be accessed more quickly. Spatial locality is exploited through the use
of cache bursts, which involve reading four consecutive cache entries for every read request.
Therefore, it is logical to associate high data locality with a high cache-hit rate. Thus, data
locality can be taken advantage of by accelerating those functions that spend the fewest cycles
stalling on cache misses relative to the total number of cycles they execute.
3.2.1 Hardware Implementation
The proposed profiling architecture is easily adapted to measure the data required by the above
schemes, namely clock cycles and cycles spent stalling. In fact, the only component that requires
modification is the Data Counter. The initial framework was designed to measure the number
of instructions executed per function, which is measured by incrementing the counter whenever
the PC changes, as shown in Figure 3.10a. The total number of cycles executed per function
is collected by incrementing the counter every clock cycle, as shown in Figure 3.10b. The total
number of cycles consumed by cache stalls is found by incrementing the counter every clock
cycle in which either cache has its stall signal asserted, as shown in Figure 3.10c.
(a) Instruction count configuration: increment enabled on pc_diff. (b) Cycle count configuration: increment enabled every clock cycle. (c) Stall cycle count configuration: increment enabled on icache_stall or dcache_stall.

Figure 3.10: Architecture diagrams of Data Counter module for cycle-accurate profiling.
3.3 Hierarchical Profiling
The previously described profiling schemes measure the instructions, cycles, or stall cycles
spent within each function of the profile. This flat profile is very useful for determining specific
functions that consume much of the execution time, but does not provide any information about
the flow or structure of an application. Hierarchical cycle profiling, on the other hand, measures
the number of cycles executed within the function plus all of its descendants. This is especially
applicable to hardware/software partitioning within LegUp since any function that is chosen
for acceleration also causes any descendant functions to be accelerated. Since the acceleration
is inherently hierarchical, the profiling tool used to partition should also be able to provide
hierarchical results.
To implement hierarchical profiling, data counts must be bubbled upwards at each function
return to count towards its parent function. This is achieved through the creation of a Hierarchy
Stack module. At every function call, instead of storing the Data Counter for the calling function
immediately into the Counter Storage module, this value is written into the top of the Hierarchy
Stack, indicated by a stack pointer (SP), and SP is incremented. At each function return, the
top element of the Hierarchy Stack is read, added to the current value of the Data Counter,
and SP is decremented. This sum is stored into the Counter Storage module, and the Hierarchy
Stack is updated by adding the sum to the entry pointed to by SP.
3.4 Experiments
3.4.1 Configurable Parameters
This profiling framework contains many parameters that can be modified pre-synthesis to
tailor its operation to the requirements of a given application. These parameters are
implemented with Verilog define statements in a single configuration file for the profiler.
The parameters are:

- Stack depth (S): the maximum call depth the profiler is to support
- Number of functions (N): the maximum number of functions the profiler is to support,
  rounded to the next power of two
- Counter width (CW): the total number of bits used to store each function's results
- Profiling scheme (PROF_SCHEME): a single character to describe one of the four
  available profiling schemes: i for Instruction Count, c for Cycle Count, s for Stall Cycle
  Count (all discussed in Section 3.2), and p for Power Profiling (discussed in Chapter 4)
- Hierarchical profiling (DO_HIER): a binary value indicating whether the gathered
  profile will be flat (0) or hierarchical (1)

For the experiments performed in this section, S is set to 32 entries, N varies from 16 to 256
(at powers of 2), CW varies from 16 to 64 (at powers of 2), PROF_SCHEME is set to match
the scheme being tested, and unless otherwise stated, DO_HIER is set to 0.
3.4.2 Experimental Benchmarks
To verify the accuracy of the profiling framework, an extensive benchmark suite was tested.
These 13 programs are written in C, and each includes self-contained test vectors to drive the
application's execution and ensure correct behavior. The benchmark suite, summarized in
Table 3.2, includes the twelve CHStone [29] benchmarks, which were developed for testing
HLS tools, plus Dhrystone [49], a classic integer computing benchmark. A wide array of
applications is covered, including encoding and decoding, double-precision mathematics,
encryption, image decompression, and a MIPS processor.
3.4.3 Comparison with Software Profiler
This section describes the comparison of the proposed profiling framework, LEAP, to a widely-
used software profiling tool called gprof. Gprof employs sampling to minimize its performance
overhead, which consequently reduces its accuracy. Function-level profiles of each benchmark
described in Section 3.4.2 are generated with LEAP and gprof, and these results are compared
to show the accuracy and performance gains of non-invasive profiling over sampling-based
profiling. Both flat and hierarchical profiling methods are compared.
Table 3.2: Description of the benchmarks used in experiments.

                                                                          Lines of
Benchmark    Description                                                  C Code

CHStone:
ADPCM        Adaptive differential pulse code modulation decoder and
             encoder                                                         541
AES          Advanced encryption standard                                    716
BLOWFISH     Data encryption standard                                      1,406
DFADD        Double-precision floating-point addition                        526
DFDIV        Double-precision floating-point division                        436
DFMUL        Double-precision floating-point multiplication                  376
DFSIN        Sine function for double-precision floating-point numbers       755
GSM          Linear predictive coding analysis of global system for
             mobile communications                                           393
JPEG         JPEG image decompression                                      1,692
MIPS         Simplified MIPS processor                                       232
MOTION       Motion vector decoding of MPEG-2                                583
SHA          Secure hash algorithm                                         1,284

DHRYSTONE    Synthetic integer programming benchmark                         301
Methodology
Using the compilation flow described in Section 3.1.3, the required files to initialize and perform
profiling with LEAP are generated. To profile an application with gprof, GCC is used with the
flag "-pg" to compile the source C code into an executable with inserted profiling commands.
Due to the sampling employed in gprof, which defaults to intervals of 10 ms, each benchmark
was modified slightly so that it executed 10,000,000 times, thus reducing the sampling error
but greatly increasing the execution time. This executable is then run to completion, which
produces an output file “gmon.out.” Finally, gprof is called, which analyzes the executable
and output files to create a call graph profile. The benchmarks were compiled and run on an
x86-based computer for use with gprof, as no MIPS-based computer was available.
Results
The profiling results of gprof and LEAP can be seen in Tables 3.3 and 3.4 for the benchmarks
"ADPCM" and "DFMUL," respectively. The full set of results for all 13 benchmarks is
available in Appendix A.1.1. The results for gprof are shown as the self time that each function
executed, along with the function’s percentage of the total benchmark execution time. The
hierarchical time, which includes the self time plus all time spend in descendant functions, is
also shown along with the function’s hierarchical percentage of total execution time. LEAP
results are given as the number of clock cycles taken to execute each function, along with
the function's percentage of execution time, the hierarchical clock cycles per function, and the
hierarchical percentage of total execution time. The last two columns show the differences in
percent execution time for the self time and hierarchical time.
It can be seen that although discrepancies arise between the percent execution times deter-
mined from gprof and LEAP, most functions are ranked similarly by both profilers for both flat
(self time) and hierarchical profiling. Two sources of error account for these discrepancies:
first, the sampling inaccuracies incurred by gprof produce estimated profiles as opposed
to exact ones. Second, the target architectures of the two profilers differ; LEAP is designed to
profile a MIPS processor, while the gprof used in these experiments targets an x86 processor.
Processor characteristics such as cache properties and organization, in addition to the obvious
micro-architectural differences, therefore result in slightly different benchmark execution.
To verify the accuracy of these results, LEAP is compared with another non-intrusive hardware
profiler in the following section.
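The "ranked similarly" observation can be made concrete by ordering functions by self time under each profiler and comparing the orderings. The sketch below uses the four hottest ADPCM functions from Table 3.3 (gprof self time in seconds, LEAP self time in cycles):

```python
# Self-time results for the four hottest ADPCM functions (Table 3.3).
gprof_self = {"encode": 145.12, "decode": 117.31, "upzero": 116.23, "filtez": 77.93}
leap_self  = {"encode": 62193,  "decode": 63863,  "upzero": 32614,  "filtez": 14708}

def ranking(profile):
    """Return function names ordered from most to least self time."""
    return sorted(profile, key=profile.get, reverse=True)

gprof_rank = ranking(gprof_self)  # ['encode', 'decode', 'upzero', 'filtez']
leap_rank = ranking(leap_self)    # ['decode', 'encode', 'upzero', 'filtez']

# The orderings agree except that the top two functions swap places,
# reflecting the small percentage discrepancies between the profilers.
```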
3.4.4 Comparison with FPGA-Based Profiler
To demonstrate the utility of this profiling framework, experiments were designed to compare
LEAP to another FPGA-based hardware profiler with similar goals. SnoopP, described in Sec-
tion 2.1, was implemented in Verilog with the same initialization and data retrieval mechanisms
as LEAP to allow a fair comparison. SnoopP was chosen as a comparative baseline for several
reasons: it is a popular FPGA-based profiler that has been cited many times and included
in a recent profiler comparison [45]. In addition, its source code is very general, which makes
the profiler easy to use with Tiger. Finally, due to its non-intrusive nature, the cycle-accurate
Table 3.3: Comparative results of gprof vs. LEAP for the benchmark “ADPCM.”

                      --------- (a) gprof ---------   ---------- (b) LEAP ----------   (c) Abs. Difference
Function              Self (s)        Hier. (s)       Self Cycles     Hier. Cycles     Self     Hier.
encode                145.12 22.4%    344.51 53.2%    62193 26.63%    118422  50.7%    4.23%    2.47%
decode                117.31 18.1%    281.81 43.5%    63863 27.34%    102578  43.9%    9.24%    0.42%
upzero                116.23 17.9%    116.23 17.9%    32614 13.96%     32589  14.0%    3.97%    3.99%
filtez                 77.93 12.0%     77.93 12.0%    14708  6.30%     14683   6.3%    5.73%    5.74%
uppol2                 40.22  6.2%     40.22  6.2%    11209  4.80%     11212   4.8%    1.41%    1.41%
quantl                 34.89  5.4%     34.89  5.4%    14177  6.07%     14743   6.3%    0.69%    0.93%
filtep                 28.86  4.5%     28.86  4.5%     3655  1.56%      3655   1.6%    2.89%    2.89%
scalel                 21.48  3.3%     21.48  3.3%     2902  1.24%      2888   1.2%    2.07%    2.08%
uppol1                 21.10  3.3%     21.10  3.3%     8097  3.47%      8097   3.5%    0.21%    0.21%
logscl                 13.92  2.1%     13.92  2.1%     3425  1.47%      3440   1.5%    0.68%    0.68%
adpcm_main             10.06  1.6%    641.72 99.0%     4588  1.96%    228355  97.8%    0.41%    1.27%
logsch                  9.24  1.4%      9.24  1.4%     3100  1.33%      3097   1.3%    0.10%    0.10%
main                    5.74  0.9%    647.46 99.9%     5122  2.19%    233486 100.0%    1.31%    0.04%
reset                   5.34  0.8%      5.34  0.8%     2755  1.18%      2764   1.2%    0.36%    0.36%
abs                     0.49  0.1%      0.49  0.1%     1155  0.49%      1155   0.5%    0.42%    0.42%
Table 3.4: Comparative results of gprof vs. LEAP for the benchmark “DFMUL.”

                            --------- (a) gprof ---------   ---------- (b) LEAP ----------   (c) Abs. Difference
Function                    Self (s)        Hier. (s)       Self Cycles     Hier. Cycles     Self     Hier.
float64_mul                  5.34 33.9%     13.66 86.8%     5051 46.2%       9138  83.5%     12.25%   3.25%
mul64To128                   1.86 11.8%      1.86 11.8%      788  7.2%        788   7.2%      4.61%   4.61%
main                         1.71 10.9%     15.37 97.6%     1815 16.6%      10965 100.2%      5.73%   2.59%
extractFloat64Frac           1.42  9.0%      1.42  9.0%      289  2.6%        289   2.6%      6.38%   6.38%
extractFloat64Exp            1.17  7.4%      1.17  7.4%      287  2.6%        287   2.6%      4.81%   4.81%
roundAndPackFloat64          1.05  6.7%      1.35  8.6%     1212 11.1%       1263  11.5%      4.41%   2.97%
propagateFloat64NaN          0.95  6.0%      1.58 10.0%      722  6.6%       1015   9.3%      0.56%   0.76%
extractFloat64Sign           0.58  3.7%      0.58  3.7%      279  2.6%        279   2.6%      1.13%   1.13%
packFloat64                  0.57  3.6%      0.57  3.6%      129  1.2%        129   1.2%      2.44%   2.44%
float64_is_signaling_nan     0.33  2.1%      0.33  2.1%      171  1.6%        171   1.6%      0.53%   0.53%
float64_is_nan               0.31  2.0%      0.31  2.0%      120  1.1%        120   1.1%      0.87%   0.87%
normalizeFloat64Subnormal    0.24  1.5%      0.24  1.5%        0  0.0%          0   0.0%      1.52%   1.52%
shift64RightJamming          0.10  0.6%      0.10  0.6%        0  0.0%          0   0.0%      0.64%   0.64%
float_raise                  0.09  0.6%      0.09  0.6%       76  0.7%         76   0.7%      0.12%   0.12%
countLeadingZeros64          0.02  0.1%      0.02  0.1%        0  0.0%          0   0.0%      0.13%   0.13%
results are exact, providing a reliable baseline for comparison.
3.4.4.1 Profiling Results Comparison
The first set of experiments between LEAP and SnoopP was performed to verify the cycle-accurate
results produced by LEAP.
Methodology
To perform comparative experiments to verify the results of LEAP, four systems needed to be
created and synthesized: 1) LEAP measuring clock cycles, 2) LEAP measuring stall cycles, 3)
SnoopP measuring clock cycles, and 4) SnoopP measuring stall cycles. Each of these systems
was programmed to the target FPGA board in turn to generate data for the corresponding
configuration. For these experiments, the maximum number of functions to be profiled (N)
was set to 64 and the counter width (CW ) was set to 32 bits.
Using the compilation flow described in Section 3.1.3, the 13 benchmarks were compiled to
ELF format and the hash initialization files for LEAP were created. Although [43] describes
the design of SnoopP to use pre-synthesis parameters for the upper and lower bounds of the
profiling ranges, it was decided that these parameters should be programmable at run-time
for two reasons: 1) to compare the two profilers as fairly as possible, and 2) to eliminate the
need to resynthesize the profiler for each benchmark to be profiled. Therefore, the proper
initialization of SnoopP requires setting values for the upper and lower bounds of each range;
for these experiments, each range corresponds to a function in the benchmark. These ranges
are determined through analysis of each benchmark, and are stored in “.range” files. This
initialization data is loaded into SDRAM in the same way as the hash data (discussed in
Section 3.1.4), and later used to initialize the profiler by reading each range from SDRAM and
writing it to a corresponding register within the profiler.
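The range initialization described above can be sketched in software. The record layout below (one little-endian 32-bit lower/upper bound pair per profiled function) is an illustrative assumption, not the actual “.range” file format used in the thesis:

```python
import struct

# Hypothetical ".range" record: one (lower, upper) PC bound pair per
# profiled function, packed as two 32-bit words for loading into SDRAM.
def pack_ranges(func_bounds):
    """func_bounds: list of (lower_pc, upper_pc) tuples, one per function."""
    data = b""
    for lower, upper in func_bounds:
        assert lower <= upper, "range bounds must be ordered"
        data += struct.pack("<II", lower, upper)
    return data

# Example: three functions occupying contiguous code regions (invented PCs).
ranges = [(0x0040_0000, 0x0040_00FF),
          (0x0040_0100, 0x0040_02FF),
          (0x0040_0300, 0x0040_03FF)]
blob = pack_ranges(ranges)  # 8 bytes per profiled function
assert len(blob) == 8 * len(ranges)
```

At run time, each packed pair would be read back from SDRAM and written to the corresponding bound registers inside the profiler.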
Results
The profiling results of LEAP and SnoopP are shown in Tables 3.5 and 3.6 for the benchmarks
“ADPCM” and “DFMUL,” respectively. The full set of results for all 13 benchmarks can be
found in Appendix A.1.2. These tables show, for each function in the benchmark, the total
number of cycles executed and the total number of cycles stalled, as measured by each profiler.
The difference between the profilers, measured in clock cycles, is shown.
The difference between the results of LEAP and SnoopP, although very small, warrants
explanation. The experiments were performed using only one profiler on-chip at a time, meaning
the results are for different runs of the same benchmark, with all other configurations the same.
These differences can be attributed to small variations in SDRAM memory access times on
the DE2 board. To show that the discrepancies between LEAP and SnoopP are not accuracy
errors, but rather the result of run variations of the system, Table 3.7 shows the cycle count
results for four runs of the benchmark “DFMUL,” for each profiler. The standard deviation of
the number of cycles executed in each function was calculated for LEAP, SnoopP, as was the
deviation among all eight runs. It can be seen that the standard deviation of the eight runs
is less than the individual profiler deviations, indicating that the results of each profiler only
differ by the variations between runs.
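This repeatability argument can be reproduced directly from the float64_mul row of Table 3.7, using sample standard deviations (n−1 denominator):

```python
from statistics import stdev

# Cycle counts for float64_mul over four runs per profiler (Table 3.7).
leap_runs = [5050, 5083, 5074, 5055]
snoopp_runs = [5075, 5088, 5029, 5068]

leap_dev = round(stdev(leap_runs), 2)                   # 15.59
snoopp_dev = round(stdev(snoopp_runs), 2)               # 25.39
overall_dev = round(stdev(leap_runs + snoopp_runs), 2)  # 19.51

# The pooled eight-run deviation is of the same magnitude as the
# per-profiler deviations, so the LEAP/SnoopP differences are within
# normal run-to-run variation rather than measurement error.
```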
3.4.4.2 Area and Power Comparison
The second set of experiments between LEAP and SnoopP was performed to measure
the overhead incurred by the use of each profiler. For these experiments, overhead refers to the
area requirements and power consumption of each profiler.
Methodology
The area of each profiler was found by synthesizing the design with Altera’s Quartus II. As
discussed in Section 3.4.1, the number of functions to be profiled (N) was varied from 16 to
256 in powers of 2, while the counter width (CW) was set to 16, 32, and 64 bits. In addition to these
Table 3.5: Comparative results of LEAP vs. SnoopP for the benchmark “ADPCM.”

              ------- (a) Cycles -------   ---- (b) Stall Cycles ----
Function        LEAP    SnoopP   Diff.      LEAP   SnoopP   Diff.
decode         63863     63863      0       3656     3652      4
encode         62193     62204    -11       4245     4271    -26
upzero         32614     32620     -6        948      942      6
filtez         14708     14708      0        695      691      4
quantl         14177     14175      2        737      723     14
uppol2         11209     11209      0        230      233     -3
uppol1          8097      8097      0        127      126      1
main            5122      5122      0       1814     1829    -15
adpcm_main      4588      4588      0       1641     1626     15
filtep          3655      3655      0         56       55      1
logscl          3425      3425      0        234      234      0
logsch          3100      3112    -12        134      133      1
scalel          2902      2902      0        288      292     -4
reset           2755      2768    -13       2332     2341     -9
abs             1155      1155      0         56       55      1
Total         233563    233603    -40      17193    17203    -10
two parameters, the profiling scheme is also varied to cover instruction count profiling, cycle
profiling, and stall cycle profiling. From the compilation reports generated for each configuration,
the total number of logic elements, registers, and memory bits are obtained. The maximum
frequencies are also gathered.
The power overhead of each profiler in the configurations mentioned above was measured
using ModelSim [36] and the Quartus PowerPlay Analyzer [13]. Quartus is able to produce
timing-based simulation files for use with ModelSim. The system was simulated using timing
simulation for the profiler and functional simulation for the rest of the system, for each of
the benchmarks previously described. The system was simulated in this way so that only the
power consumption of the profiler was measured, not that of the rest of the system. Upon completion of the
simulation, a Value Change Dump (VCD) file is produced containing all signal transitions found
in the timing portion of simulation. The VCD file is read by Quartus PowerPlay to determine
the average dynamic power consumption of the simulation.
Table 3.6: Comparative results of LEAP vs. SnoopP for the benchmark “DFMUL.”

                           ------- (a) Cycles -------   ---- (b) Stall Cycles ----
Function                     LEAP    SnoopP   Error      LEAP   SnoopP   Error
float64_mul                  5051      5077     -26      1855     1830     25
main                         1815      1827     -12      1074     1082     -8
roundAndPackFloat64          1212      1227     -15       479      465     14
mul64To128                    788       788       0       273      288    -15
propagateFloat64NaN           722       724      -2       488      488      0
extractFloat64Frac            289       289       0        49       49      0
extractFloat64Exp             287       287       0        47       47      0
extractFloat64Sign            279       279       0        40       39      1
packFloat64                   129       129       0        39       39      0
float64_is_nan                120       120       0        71       71      0
float64_is_signaling_nan      171       171       0        87       87      0
float_raise                    76        76       0        62       62      0
Total                       10939     10994     -55      4564     4547     17
Table 3.7: Repeatability testing for cycle-accurate profiling of benchmark “DFMUL.”

                           ----------- (a) LEAP -----------    ---------- (b) SnoopP ----------   (c) Overall
Run Number                    1     2     3     4     Dev.       1     2     3     4     Dev.        Dev.
float64_mul                5050  5083  5074  5055    15.59    5075  5088  5029  5068    25.39       19.51
main                       1826  1819  1823  1828     3.92    1819  1839  1822  1827     8.81        6.48
roundAndPackFloat64        1212  1212  1226  1212     7.00    1212  1221  1220  1227     6.16        6.56
mul64To128                  794   788   788   797     4.50     791   788   788   788     1.50        3.49
propagateFloat64NaN         751   724   739   753    13.35     724   724   733   724     4.50       12.40
extractFloat64Frac          289   289   289   289     0.00     289   289   289   289     0.00        0.00
extractFloat64Exp           287   287   287   287     0.00     287   287   287   287     0.00        0.00
extractFloat64Sign          279   279   279   279     0.00     279   279   282   279     1.50        1.06
packFloat64                 129   129   129   129     0.00     129   129   129   129     0.00        0.00
float64_is_nan              120   120   120   120     0.00     120   129   120   120     4.50        3.18
float64_is_signaling_nan    171   171   171   171     0.00     185   171   171   171     7.00        4.95
float_raise                  76    76    76    76     0.00      76    76    76    76     0.00        0.00
Total                     10984 10977 11001 10996    10.97   10986 11020 10946 10985    30.25       21.25
Results
The area overhead and maximum frequency for the LEAP profiler can be seen in Table 3.8 for
three profiling schemes: Part (a) shows results for instruction profiling, part (b) shows results
for cycle profiling, and part (c) shows results for stall cycle profiling. It can be seen that these
three schemes have a very small area variation between them, which is because only the Data
Counter module changes among the configurations, requiring extra comparators to check if the
PC has changed (instruction profiling), or if the processor is stalled (stall cycle profiling).
Table 3.8: Area overhead of LEAP for three profiling schemes.

           ----- (a) Instruction Profiling -----    -------- (b) Cycle Profiling --------    ----- (c) Stall Cycle Profiling -----
CW   N     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)
16   16    1,402   615    4,352       167.3         1,374   608    4,352       169.1         1,394   619    4,352       172.0
16   32    1,444   622    4,608       157.0         1,503   686    4,608       160.6         1,451   645    4,608       157.3
16   64    1,520   671    5,120       158.8         1,481   664    5,120       166.9         1,438   617    5,120       149.2
16   128   1,536   647    6,144       160.3         1,517   655    6,144       165.3         1,508   643    6,144       165.2
16   256   1,553   677    8,192       146.8         1,531   656    8,192       144.9         1,498   648    8,192       161.5
32   16    1,556   700    4,608       163.2         1,494   675    4,608       167.3         1,530   748    4,608       174.7
32   32    1,632   742    5,120       169.2         1,594   714    5,120       152.2         1,559   694    5,120       161.1
32   64    1,660   737    6,144       165.8         1,641   751    6,144       157.3         1,658   783    6,144       161.0
32   128   1,709   757    8,192       151.4         1,656   710    8,192       152.3         1,659   717    8,192       154.1
32   256   1,747   788   12,288       159.4         1,660   747   12,288       148.1         1,719   796   12,288       158.2
64   16    1,648   776    5,120       151.2         1,618   749    5,120       151.1         1,583   734    5,120       150.8
64   32    1,732   801    6,144       147.9         1,649   748    6,144       150.9         1,689   787    6,144       150.9
64   64    1,779   819    8,192       147.3         1,698   770    8,192       150.8         1,678   770    8,192       150.7
64   128   1,801   811   12,288       151.2         1,743   771   12,288       151.2         1,781   803   12,288       150.4
64   256   1,754   810   20,480       150.8         1,793   840   20,480       151.2         1,783   794   20,480       150.2
The area of LEAP can be compared to that of Tiger, which consumes 13,422 LEs. For the
parameter configurations used for the experiments in Section 3.4.4.1, CW=32 and N=64, the
area overhead of LEAP is 12.37%, 12.23%, and 12.35% for instruction, cycle, and stall cycle
profiling, respectively. The maximum frequency of the profiler ranges from 144.9 to 174.7 MHz,
depending on the configuration of the profiler. Although this is much greater than 75.59 MHz,
the maximum frequency of Tiger, it may not be fast enough to use with a soft processor such as
Nios II when that processor is set up to run at its peak speed (Nios II-fast can achieve frequencies
of up to 200 MHz). Upon investigation, it was observed that the critical paths of the profiler
are contained in the Address Hash module and, with larger counter widths, the RAM modules
used to store the Data Counter. If required, the Data Storage module could be pipelined to
alleviate this portion of the critical path.
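The overhead percentages quoted above follow directly from the Table 3.8 LE counts for the CW=32, N=64 configuration, relative to Tiger's 13,422 LEs:

```python
TIGER_LES = 13422  # logic elements consumed by the Tiger processor

# LEAP logic elements for CW=32, N=64 (Table 3.8).
leap_les = {"instruction": 1660, "cycle": 1641, "stall": 1658}

# Area overhead of each profiling scheme as a percentage of Tiger's area.
overhead = {scheme: round(100 * les / TIGER_LES, 2)
            for scheme, les in leap_les.items()}
# {'instruction': 12.37, 'cycle': 12.23, 'stall': 12.35}
```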
Table 3.9 shows the area overhead and maximum frequency of LEAP when configured for
hierarchical profiling. These results are shown for instruction, cycle, and stall cycle profiling.
It can be seen that the extra logic required to implement hierarchical profiling incurs penalties
in logic elements, memory bits, and Fmax. Logic elements, compared to the results of the flat
profiler in Table 3.8, increase by 6.6% in the smallest case (CW=16, N=16) to 18.3% in the
largest case (CW=64, N=256). Extra memory bits are consumed solely by the Hierarchy Stack,
which is a 32×CW AltSyncRam. Fmax decreases by 2-20% for most configurations as a
result of the increased combinational logic complexity. However, in a few cases the maximum
frequency increases despite the expected downward trend; these cases are believed to arise
from random variations in placement and routing.
Table 3.9: Area overhead of LEAP for three profiling schemes with Hierarchical Profiling.

           ----- (a) Instruction Profiling -----    -------- (b) Cycle Profiling --------    ----- (c) Stall Cycle Profiling -----
CW   N     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)
16   16    1,495   672    4,864       162.6         1,440   634    4,864       174.0         1,506   651    4,864       170.6
16   32    1,537   658    5,120       147.7         1,565   691    5,120       155.0         1,546   651    5,120       150.1
16   64    1,656   759    5,632       165.7         1,552   660    5,632       154.8         1,566   672    5,632       160.0
16   128   1,649   700    6,656       157.5         1,587   683    6,656       162.3         1,626   705    6,656       161.7
16   256   1,661   705    8,704       149.4         1,645   715    8,704       153.4         1,670   730    8,704       158.4
32   16    1,770   829    5,632       157.7         1,697   762    5,632       174.4         1,736   775    5,632       174.0
32   32    1,840   873    6,144       159.4         1,797   800    6,144       154.1         1,821   811    6,144       166.9
32   64    1,884   858    7,168       167.3         1,881   843    7,168       153.5         1,890   898    7,168       167.5
32   128   1,975   955    9,216       168.0         1,875   814    9,216       159.1         1,887   852    9,216       159.1
32   256   1,914   867   13,312       165.5         1,887   826   13,312       150.7         1,927   859   13,312       158.0
64   16    1,882   896    7,168       122.2         1,963   934    7,168       124.9         1,857   886    7,168       127.2
64   32    1,954   921    8,192       125.4         1,918   892    8,192       126.0         1,906   885    8,192       124.4
64   64    1,984   911   10,240       127.4         1,941   895   10,240       124.1         1,973   915   10,240       125.6
64   128   2,010   942   14,336       124.6         2,100   977   14,336       124.4         2,083   985   14,336       121.9
64   256   2,075   961   22,528       122.7         2,031   932   22,528       124.7         2,034   934   22,528       124.7
Table 3.10 shows the area overhead for two profilers, LEAP and SnoopP, for flat profiling. It
should be noted that since SnoopP does not require memory bits, the column was omitted. In
part (c) of the table, the profilers are compared by taking the ratio of LEAP over SnoopP in two
ways: first, only logic elements are compared, and second, the “total area” is estimated. When
considering only logic elements, it can be seen that LEAP is at least 11% smaller than SnoopP
in all configurations, and achieves up to 97% area reduction for CW=64 and N=256. This
comparison incurs some inaccuracies, as it does not take memory bits into account; memory
bits are much smaller than LEs, and are used in LEAP but not SnoopP. Memory bits correspond
to one SRAM cell on the FPGA, plus decode logic. Logic elements, however, contain a four-input
look-up table (LUT), a dedicated register, and muxing logic. A four-input LUT alone contains
16 SRAM cells, so we pessimistically assume that each SRAM bit is equivalent to 1/16 of a LUT
in terms of silicon area. It is in fact much less than this, as the memory bit decoding circuitry
is amortized over thousands of bits, whereas multiplexor decoding circuitry is needed for each
LE. Therefore the final column in Table 3.10 shows the ratio of “Total Area,” an approximation
represented by the equation A = L + M/16, where A is the total area, L is the number of logic
elements, and M is the number of memory bits.
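Applying this total-area approximation to the smallest and largest cycle-profiling configurations reproduces the ratios reported in part (c) of Table 3.10:

```python
def total_area(logic_elements, memory_bits):
    """Estimated total area A = L + M/16, counting each memory bit as
    (pessimistically) 1/16 of a logic element."""
    return logic_elements + memory_bits / 16

# Smallest and largest cycle-profiling configurations from Table 3.10.
leap_small = total_area(1402, 4352)    # CW=16, N=16 -> 1674.0 equivalent LEs
snoopp_small = total_area(1572, 0)     # SnoopP uses no memory bits
leap_large = total_area(1754, 20480)   # CW=64, N=256
snoopp_large = total_area(54790, 0)

small_ratio = round(leap_small / snoopp_small, 2)  # 1.06
large_ratio = round(leap_large / snoopp_large, 2)  # 0.06
```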
Table 3.10: Area overhead comparison of LEAP and SnoopP, both configured for cycle profiling.

           ------------ (a) LEAP ------------    ------ (b) SnoopP ------    (c) LEAP/SnoopP
CW   N     LEs    Regs   Mem. Bits  Fmax (MHz)   LEs     Regs    Fmax (MHz)  LEs    Total Area
16   16    1,402   615    4,352      167.3        1,572   1,226   138.3      0.89   1.06
16   32    1,444   622    4,608      157.0        2,865   2,320   130.7      0.50   0.60
16   64    1,520   671    5,120      158.8        5,458   4,502   135.8      0.28   0.34
16   128   1,536   647    6,144      160.3       10,630   8,860   122.8      0.14   0.18
16   256   1,553   677    8,192      146.8       20,984  17,566   116.2      0.07   0.10
32   16    1,556   700    4,608      163.2        2,008   1,498   114.0      0.77   0.92
32   32    1,632   742    5,120      169.2        3,725   2,847   112.8      0.44   0.52
32   64    1,660   737    6,144      165.8        7,179   5,540   109.0      0.23   0.28
32   128   1,709   757    8,192      151.4       14,047  10,921   103.2      0.12   0.16
32   256   1,747   788   12,288      159.4       27,807  21,678    97.6      0.06   0.09
64   16    1,648   776    5,120      151.2        2,855   2,012    86.8      0.58   0.69
64   32    1,732   801    6,144      147.9        5,424   3,873    85.3      0.32   0.39
64   64    1,779   819    8,192      147.3       10,587   7,590    83.6      0.17   0.22
64   128   1,801   811   12,288      151.2       20,872  15,019    79.3      0.09   0.12
64   256   1,754   810   20,480      150.8       54,790  29,872    89.3      0.03   0.06
The comparison in part (c) of Table 3.10 aims to provide a fairer area comparison than
considering LEs alone. It shows that LEAP consumes less area than SnoopP in all but one
configuration, with a maximum reduction of 94%. The single configuration in which LEAP
consumes more resources than SnoopP is the smallest configuration possible, with N=CW=16.
These two comparisons form bounds on the ratio of the true area consumptions of the two
profilers. The first comparison assumes memory bits are of zero area and thus forms an upper
[Figure 3.11 (bar chart): Y-axis: Area Utilization (LEs); X-axis: configurations of Number of Functions (N) over Counter Width (CW); series: LEAP (LEs), LEAP (LEs+Mem.), SnoopP (LEs).]
Figure 3.11: Area overhead comparison of LEAP and SnoopP, both configured for cycle profiling.
bound on the area ratio LEAP/SnoopP. The second comparison assumes that one memory
bit is equal to 1/16th of the area consumed by a logic element, which is greater than its true
area and thus forms a lower bound on the area ratio. Figure 3.11 graphically shows these area
comparisons. The Y-axis represents the total area utilization in equivalent LEs, and the X-axis
represents the configurations of CW and N. It is apparent that as either N or CW increases, the
area requirements of SnoopP grow much more than those of LEAP. This is because the comparators
and counters required by SnoopP for each function increase with N, and the size of each counter
increases with CW. In contrast, LEAP only increases in area to accommodate the transfer of
this data. As the counter width increases, only a small increase in the number of registers is
required to store this data. For increases in N, more bits are needed by the Address Hash and
Call Stack. The actual storage overhead, however, is compactly stored in memory bits.
The implementation of SnoopP used registers to store the count values. It is possible for
SnoopP to use memory bits for this storage, but two drawbacks limit the practicality of such an
approach: 1) instead of simply performing a comparison and incrementing a registered counter
each cycle, the profiler would need to determine which function is currently executing, read
from an AltSyncRam, increment the read value, and write the data back. This process would
require multiple cycles, as each memory read requires one cycle. 2) Table 3.11 shows that the
area requirement of the 2N 32-bit comparators alone dominates the area requirements of the
profiler, consuming over 1000 LEs in the smallest case (column SnoopP Comparators Only).
The area to be saved by converting the counter registers into memory bits is less than CW ×N ,
which when subtracted from the LE requirements of SnoopP is still larger than LEAP in all but
the smallest case (column SnoopP Mem. Bit Counters). This benefit is lessened if the extra
logic that would be required to handle the reads and writes to/from memory is considered.
For these reasons, it is apparent that not only would SnoopP still be larger than LEAP if
the counters were implemented with memory bits, but it is also not obvious how the profiler's
multi-cycle memory accesses could be accommodated.
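The CW × N saving bound used above can be checked against Table 3.11: the “Mem. Bit Counters” column is exactly the register-counter LE count minus CW × N.

```python
def mem_bit_counter_les(register_counter_les, cw, n):
    """Lower bound on SnoopP's LE count if its N CW-bit counters were
    moved into memory bits, saving at most CW*N LEs."""
    return register_counter_les - cw * n

# Rows from Table 3.11 (SnoopP with register counters).
assert mem_bit_counter_les(1572, 16, 16) == 1316    # smallest configuration
assert mem_bit_counter_les(7179, 32, 64) == 5131    # CW=32, N=64
assert mem_bit_counter_les(54790, 64, 256) == 38406 # largest configuration

# Even with this optimistic saving, SnoopP remains larger than LEAP
# (e.g. 1,660 LEs for CW=32, N=64) in all but the smallest configuration.
```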
In addition to comparing the area consumption of the two profilers, the maximum operating
frequencies of each are shown. The Fmax of LEAP ranges from 146.8 MHz to 169.2 MHz,
which is well above the maximum operating frequency of the Tiger processor. SnoopP, on the
other hand, ranges from 79.3 MHz to 138.3 MHz. This results in a 17-91% increase in Fmax
for LEAP over SnoopP, another compelling advantage for this framework.
The power overhead of LEAP and SnoopP are shown in Table 3.12, where each power value
is the average of the power consumed when running each of the 13 benchmarks. The individual
Table 3.11: Area overhead investigation for SnoopP, measured in LEs, if counters were implemented with memory bits.

                     SnoopP      SnoopP      SnoopP
                     Register    Mem. Bit    Comparators
CW   N     LEAP      Counters    Counters    Only
16   16    1,402      1,572       1,316       1,024
16   32    1,444      2,865       2,353       2,048
16   64    1,520      5,458       4,434       4,096
16   128   1,536     10,630       8,582       8,192
16   256   1,553     20,984      16,888      16,384
32   16    1,556      2,008       1,496       1,024
32   32    1,632      3,725       2,701       2,048
32   64    1,660      7,179       5,131       4,096
32   128   1,709     14,047       9,951       8,192
32   256   1,747     27,807      19,615      16,384
64   16    1,648      2,855       1,831       1,024
64   32    1,732      5,424       3,376       2,048
64   64    1,779     10,587       6,491       4,096
64   128   1,801     20,872      12,680       8,192
64   256   1,754     54,790      38,406      16,384
benchmark powers are shown in Appendix A.1.3.¹ Part (a) of Table 3.12 shows the results for
LEAP in both the cycle profiling and stall-cycle profiling schemes. It can be seen that there
is very little variation between configurations; different values of N result in almost no change
in the power overhead, and variations in CW result in a difference of only a few milliwatts. The
range of dynamic power overhead for LEAP is 4.55-6.94 mW and 4.78-7.05 mW for cycle and
stall-cycle profiling, respectively. Part (b) shows the results for SnoopP for the same schemes.
Unlike LEAP, SnoopP’s power overhead drastically increases with both N and CW, resulting
in an overhead range of 3.00-52.32 mW and 2.77-40.69 mW for cycle profiling and stall-cycle
profiling, respectively.
¹For the N=128 and N=256 cases, only three benchmarks were profiled for power overhead due to the large increase in simulation time with increasing system size. Due to the highly consistent results measured among benchmarks in the same configuration for other values of N, the average power overhead from these three benchmarks is sufficient.

The results of parts (a) and (b) are graphically shown in Figure 3.12 for cycle-accurate profil-
Table 3.12: Power overhead comparison of LEAP and SnoopP, configured for both cycle and stall-cycle profiling. Note: “Cycles” implies cycle profiling, “Stalls” implies stall-cycle profiling, all values are measured in milliwatts, “N” is the maximum number of functions profiled, and “CW” is the counter width.

           -- (a) LEAP ---    -- (b) SnoopP --    (c) LEAP/SnoopP
CW   N     Cycles   Stalls    Cycles   Stalls     Cycles   Stalls
16   16     4.55     4.84       3.00     2.77      1.52     1.75
16   32     4.97     4.78       6.33     6.06      0.79     0.79
16   64     5.28     5.10       9.78     8.68      0.54     0.59
16   128    4.99     4.82      16.77    14.91      0.30     0.32
16   256    4.99     5.16      28.55    25.19      0.18     0.21
32   16     6.94     5.63       4.03     4.41      1.72     1.28
32   32     5.85     6.25       7.17     6.10      0.82     1.03
32   64     6.13     6.12      11.94     9.76      0.51     0.63
32   128    6.67     6.53      20.54    16.99      0.33     0.38
32   256    5.87     5.59      35.39    28.88      0.17     0.19
64   16     6.52     6.84       6.07     4.15      1.07     1.65
64   32     6.20     6.35       9.40     6.02      0.66     1.06
64   64     6.23     6.70      16.54    11.22      0.38     0.60
64   128    6.24     6.11      28.01    21.07      0.22     0.29
64   256    6.10     7.05      52.32    40.69      0.12     0.17
ing, in which the Y-axis represents the power overhead measured in milliwatts, and the X-axis
represents the configurations of CW and N. When only 16 functions are profiled, SnoopP
actually consumes slightly less power than LEAP; however, the rapid increase in comparators
required by SnoopP as N increases, and the extra power consumed by larger counters as CW
increases, causes the power consumption of SnoopP to increase much faster and more significantly
than that of LEAP. In fact, LEAP's power shows no correlation with increasing N.
This is due to the efficient use of AltSyncRams to store data at function context switches;
they use dedicated multiplexing logic to read data from these RAMs, so increasing the required
RAM size costs very little in terms of overhead. Part (c) of Table 3.12 shows the ratio
LEAP/SnoopP. As previously mentioned, the smallest configurations (N=16) result in a lower
power overhead for SnoopP; LEAP consumes 51.7-72.2% (1.55-2.91 mW) and 27.6-74.5%
(1.22-2.07 mW) more power for cycle profiling and stall-cycle profiling, respectively. Stall-cycle
profiling also consumes more power in LEAP for N=32 (CW=32, 64), by 2.5-5.6% (0.15-0.33 mW).
However, LEAP is significantly more power-efficient in all other cases, achieving reductions of
up to 88.3% (46.22 mW) and 82.7% (33.64 mW) for cycle profiling and stall-cycle profiling,
respectively.

[Figure 3.12 (bar chart): Y-axis: Power Consumption (mW); X-axis: configurations of Number of Functions (N) over Counter Width (CW); series: LEAP, SnoopP.]
Figure 3.12: Power overhead comparison of LEAP and SnoopP, both configured for cycle profiling.
3.5 Summary
LEAP is a non-intrusive profiler that is both area- and power-efficient. The LLVM compiler
framework is used to create a MIPS assembly file from C source code, after which the GNU
Binutils assemble and link the MIPS assembly to create an ELF file capable of running on the
Tiger processor. Initialization files specific to the profiler are also created, and are programmed
along with the ELF to the system running on a target FPGA through a command-line inter-
face; the profiling results are also retrieved through this interface. Three profiling schemes are
investigated: instruction profiling records the function-level instruction counts, cycle profiling
attributes clock cycles to individual functions during execution, and stall cycle profiling mea-
sures the number of cycles the processor spends stalled per function while waiting on data from
either the instruction or data cache. In addition, each of these configurations can be run with
either a flat or hierarchical profile. LEAP is compared to two profilers, gprof and SnoopP,
to verify the accuracy of the results presented as well as to assess the area, speed and power
characteristics of the architecture.
Chapter 4
Energy Consumption Profiling
As mobile systems increase in complexity, the need to maximize their energy efficiency is be-
coming a priority in order to extend battery life, and more generally to conserve electricity.
Methods to determine the energy consumed in individual regions of an application are important
to aid in this effort. Although two common methods exist to gather such data, direct
measurement and simulation, neither is ideal for rapid development. Direct measurement
requires a specialized experimental setup to monitor the current drawn by the system, whereas
simulation is very slow, taking hours or even days to gather data for only seconds' worth of
execution time. This chapter discusses a profiling tool built within LEAP that aims to quickly
estimate the energy consumption of each function within an application. First, an instruction-
level power database is created offline to characterize the power consumed by instructions in the
MIPS1 instruction set. This database is then combined with run-time cycle counts of individual
instructions to produce a function-level energy profile, shown to be accurate within 5.95%, on
average.
4.1 Instruction Power Database
An instruction-level power database is created to store the average dynamic power each in-
struction consumes while executing on the processor. Using this database, it is then possible
to calculate the total energy consumed by each instruction if the profiler measures the amount
of time that instruction executes. Simulation is used to measure the average dynamic power
but, as previously mentioned, is very slow which prohibits its use in rapid development cycles.
Therefore, the database is created offline once (for the specific processor and device in use) and
is used as a lookup table for profiling.
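Once built, the database serves as a simple lookup table at profile time. The sketch below illustrates the E = P × t calculation; the per-instruction power values and the clock rate are invented placeholders, since the real numbers come from the simulations summarized in Tables 4.1 and 4.2.

```python
# Hypothetical instruction-level power database (average dynamic power, mW).
POWER_DB_MW = {"addu": 12.5, "lw": 18.0, "sw": 17.2, "mult": 21.4}

CLOCK_HZ = 75_000_000  # illustrative soft-processor clock rate

def instruction_energy_nj(opcode, cycles):
    """Energy (nJ) attributed to one instruction type: E = P * t, where
    t is the measured cycle count divided by the clock rate."""
    power_w = POWER_DB_MW[opcode] * 1e-3  # mW -> W
    seconds = cycles / CLOCK_HZ
    return power_w * seconds * 1e9        # J -> nJ

# One full second (75,000,000 cycles) of "addu" at 12.5 mW is ~12.5 mJ.
energy = instruction_energy_nj("addu", CLOCK_HZ)
```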
ModelSim and Quartus are used to determine the average dynamic power consumption of each
instruction in the MIPS1 instruction set. First, Quartus is used to produce the files required
by ModelSim for performing timing simulations of the Tiger system (not including LEAP) for
each of the benchmarks previously described. Individual VCD files for each instruction type
are created to record the switching activity of the system while executing that instruction.
VCD recording is enabled when a given instruction enters the decode stage of the Tiger MIPS
pipeline and is disabled when the instruction exits the pipeline. In addition, VCD recording is
paused during stalls; the dynamic power consumed while stalling is recorded in a separate VCD
file. Two approaches were investigated when considering stalls: 1) cache stalls from either the
instruction or data cache, and 2) pipeline stalls, which can occur due to data hazards, memory
latency, and multi-cycle instructions.
Upon completing the simulation of a benchmark, each VCD file is run through the Quartus
PowerPlay Analyzer to determine the average dynamic power for the corresponding instruction.
The database required many days of computation to simulate the 13 benchmarks, with each
benchmark requiring multiple VCD files totaling over 20 GB. As a result, the four longest-executing
benchmarks, Blowfish, DFSIN, JPEG, and SHA, could not be run to completion
due to both storage and time limitations, as weeks of computation and hundreds of gigabytes
would be required. These partial runs are acceptable, however, because the benchmarks are
used to measure the power consumption of individual instructions, and measurements near the
beginning of execution are no less valuable than those at the end.
The average dynamic power of each instruction, as measured in each benchmark, can be seen
in Tables 4.1 and 4.2, for cache-related stalls and pipeline stalls, respectively. It is important to
4 Energy Consumption Profiling
note that the power attributed to each instruction varies from benchmark to benchmark. These
deviations are due to inter-instruction effects: the power measured for a given instruction also
depends on the other instructions currently in the pipeline. By averaging these power values
across many benchmarks, the average inter-instruction effect is accounted for. To increase the
quality of results, any instruction that is run less than 100 times in a benchmark is “pruned”
from the data set as the dynamic power for such instructions may be inaccurate due to the
limited data that was collected. The power of any instruction that is not actually exercised
by the benchmark suite is estimated, based on the dynamic power of similar instructions, for
use with the power database; these instructions have an average dynamic power of 0 mW in
Tables 4.1 and 4.2.
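The averaging and pruning steps above can be sketched as follows. The sample tuples, benchmark names, and power values here are hypothetical illustrations, not the thesis's measured data:

```python
from collections import defaultdict
from statistics import mean

MIN_EXECUTIONS = 100  # instructions executed fewer times than this are pruned

def build_power_database(samples):
    """Average each instruction's dynamic power across benchmarks,
    discarding measurements backed by too few executions.

    samples: iterable of (benchmark, instruction, avg_power_mw, exec_count).
    """
    per_instr = defaultdict(list)
    for _benchmark, instr, power_mw, count in samples:
        if count >= MIN_EXECUTIONS:  # prune unreliable measurements
            per_instr[instr].append(power_mw)
    # averaging across benchmarks absorbs the average inter-instruction effect
    return {instr: mean(powers) for instr, powers in per_instr.items()}

# hypothetical measurements from two benchmark runs
samples = [
    ("adpcm", "addiu", 61.39, 5000),
    ("aes",   "addiu", 61.88, 3200),
    ("adpcm", "mfhi",  40.00, 150),
    ("aes",   "mfhi",  41.00, 12),   # fewer than 100 executions: pruned
]
db = build_power_database(samples)
# an instruction never exercised by the suite borrows a similar one's power
db.setdefault("add", db["addiu"])
```

The `setdefault` line mirrors the estimation step for unused instructions: the power of a similar instruction stands in for the missing measurement.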
4.2 Implementation in Existing Framework
In order to estimate the function-level energy consumption of an application, run-time data must
be collected. Energy can be calculated by taking the product of power and time, E = P × T .
Since the dynamic power of each instruction is determined offline and stored in the instruction
power database, only the execution time of each instruction per function is required from the
profiler. This measurement is performed by modifying the Data Counter module, as seen
in Figure 4.2. As each instruction enters the execute stage, it must be decoded so that the
correct counters are incremented. In order to reduce the area of the profiler, instructions are
combined into six groups; a single 22-bit counter is used for each group to record the cycle
counts of all instructions in that group. The instruction power database uses these same groups
to provide dynamic power values for the energy calculation; the power for a group is simply
the average dynamic power of all instructions in the group. The groupings were chosen to
minimize (by inspection) the standard deviation within each group, which in turn reduces
the error introduced by averaging. The database, including the instruction groupings, their
average power and standard deviations, is shown in Table 4.3; part (a) shows the database
when considering only cache stalls, while part (b) considers all pipeline stalls. Figure 4.1 shows
Table 4.1: Pruned instruction-level power database for cache stalls, measured in mW. (One
row per MIPS1 instruction, plus a stall row, with one column per benchmark (ADPCM, AES,
Blowfish, DFADD, DFDIV, DFMUL, DFSIN, GSM, JPEG, MIPS, Motion, SHA, and
Dhrystone) and a final across-benchmark Average column. The stall row averages 31.59 mW;
pruned or unused instructions such as add and addi are listed at 0.00 mW.)
Table 4.2: Pruned instruction-level power database for pipeline stalls, measured in mW. (Same
layout as Table 4.1: one row per MIPS1 instruction, plus a stall row, with one column per
benchmark and a final across-benchmark Average column. The stall row averages 51.09 mW;
pruned or unused instructions such as add and sub are listed at 0.00 mW.)
these two database configurations graphically; the vertical axis represents the average dynamic
power consumption of each instruction, the horizontal axis shows each instruction (plus stalls),
and the horizontal lines represent the instruction groupings.
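The quality of a candidate grouping can be checked numerically, even though the thesis's groupings were chosen by inspection. A sketch, using made-up power values rather than the database's:

```python
from statistics import mean, pstdev

def grouping_stats(groups):
    """For each group of per-instruction average powers (mW), return the
    group average and the standard deviation within the group; a good
    grouping keeps every standard deviation small, so that substituting
    the group average for an instruction's power introduces little error."""
    return {label: (mean(p), pstdev(p)) for label, p in groups.items()}

# hypothetical grouping of per-instruction dynamic powers
candidate = {
    "A": [40.0, 46.0, 48.0, 50.0],
    "F": [79.0, 83.0, 84.0],
}
stats = grouping_stats(candidate)
avg_a, dev_a = stats["A"]  # group A average and spread
```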
In addition to the six 22-bit counters used for instructions in the MIPS1 instruction set,
a seventh 28-bit counter is kept to count the number of cycles spent stalling. As will be
further discussed in the following section, both cache stalls and pipeline stalls were investigated;
Figure 4.2 shows the Data Counter module configured for power profiling when cache stalls are
considered, but the input to the module can also be the logical OR of all pipeline stall signals.
These seven counters are concatenated into one 160-bit output from the module. In this way,
the rest of the profiler can treat the output as a single 160-bit counter when storing to RAM
blocks, thus requiring no changes to the storage mechanisms. However, the delay of an adder
grows proportionally with the size of its inputs, limiting a 160-bit adder (and therefore the
profiler) to a maximum frequency of 70 MHz. To remove these adders from the critical path of
the profiler, 160-bit adders throughout the system were replaced with seven smaller additions.
This modification is not required for functionality; it is implemented purely as an optimization.
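The counter concatenation can be mimicked in software. The widths follow the text (six 22-bit group counters plus one 28-bit stall counter, 6 × 22 + 28 = 160 bits); the function names and bit ordering are illustrative assumptions:

```python
GROUP_WIDTH = 22   # per-group cycle counters
STALL_WIDTH = 28   # stall-cycle counter
GROUP_MASK = (1 << GROUP_WIDTH) - 1
STALL_MASK = (1 << STALL_WIDTH) - 1

def pack_counters(groups, stall):
    """Concatenate six 22-bit group counts and a 28-bit stall count into
    one 160-bit word, as the Data Counter module presents its output."""
    assert len(groups) == 6
    word = stall & STALL_MASK
    for count in reversed(groups):          # group 0 lands in the low bits
        word = (word << GROUP_WIDTH) | (count & GROUP_MASK)
    return word

def unpack_counters(word):
    """Split the 160-bit word back into the seven independent counters,
    the same split that lets the optimized profiler perform seven narrow
    additions instead of one slow 160-bit addition."""
    groups = []
    for _ in range(6):
        groups.append(word & GROUP_MASK)
        word >>= GROUP_WIDTH
    return groups, word & STALL_MASK

packed = pack_counters([1, 2, 3, 4, 5, 6], 7)
```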
4.2.1 Overhead
Table 4.4 shows the area overhead of LEAP when configured for power profiling. This configu-
ration is much larger than the cycle-accurate profiling schemes described in Chapter 3, requiring
nearly double the LEs and registers, and more than double the memory bits as the number of
functions increase. This increase in area can be attributed to two aspects of the profiler: 1) To
store the seven counters, the effective counter width as seen by the remainder of the profiler is
160 bits. This is over double the maximum CW used in cycle-accurate profiling, which was 64
bits. This especially increases the number of memory bits required to store the function-level
data. 2) Instruction decoding logic is needed to determine which group the currently-executing
instruction belongs to. This logic requires many comparisons, consuming over 350 LEs and 150
registers.
Table 4.3: Instruction power database for the MIPS1 instruction set, measured in mW. Part
(a) shows the database configured for measuring cache stalls only; part (b) shows the database
configured for measuring all pipeline stalls.

(a) Cache stalls only:
stall 31.59
Group A (average 48.08, std. dev. 3.31): mfhi 40.00, mflo 45.96, slti 48.30, div 49.07, divu 49.07, bltz 50.06, bltzal 50.06, sltiu 50.10, mult 50.13
Group B (average 57.65, std. dev. 2.83): multu 53.10, beq 54.23, slt 54.44, sltu 59.08, xor 59.23, xori 59.30, lb 59.50, bgtz 59.97, blez 59.97
Group C (average 62.57, std. dev. 1.11): bne 60.15, lw 61.26, add 61.68, addu 61.68, or 62.35, jal 62.43, andi 63.07, sra 63.18, bgez 63.21, bgezal 63.21, lh 63.61, sw 63.69, srl 63.93
Group D (average 65.45, std. dev. 1.58): lui 64.19, j 64.44, addi 64.54, addiu 64.54, sll 67.34, srav 67.62
Group E (average 70.70, std. dev. 2.33): and 68.05, ori 68.79, nop 69.05, sub 70.02, subu 70.02, jr 71.78, lbu 73.01, sllv 74.84
Group F (average 82.82, std. dev. 2.83): lhu 78.67, srlv 83.31, sb 84.64, sh 84.64

(b) All pipeline stalls:
stalls 44.34
Group A (average 38.45, std. dev. 1.20): bltz 37.76, bltzal 37.76, mfhi 39.83
Group B (average 57.53, std. dev. 2.62): sltiu 51.87, beq 55.60, slti 57.93, bne 58.24, bgez 58.67, bgezal 58.67, srlv 59.59, lw 59.70
Group C (average 62.82, std. dev. 1.43): xor 60.23, jal 60.99, xori 61.27, lb 61.32, add 61.82, addu 61.82, sra 62.66, lui 62.93, mflo 63.35, srav 63.43, andi 63.71, j 63.75, blez 63.81, srl 64.24, addi 64.88, addiu 64.88
Group D (average 68.03, std. dev. 2.47): sll 65.00, bgtz 65.01, ori 65.91, or 65.98, nop 66.33, slt 66.65, sw 66.96, sb 67.85, sh 67.85, and 68.15, sltu 68.99, jr 69.78, lh 71.31, sub 72.35, subu 72.35
Group E (average 75.88, std. dev. 1.08): lhu 75.03, lbu 75.16, sllv 75.96, mult 77.37
Group F (average 101.60, std. dev. 3.61): div 99.52, divu 99.52, multu 105.77
Figure 4.1: Instruction-level power database groupings. (Each plot shows the average dynamic
power consumption, in mW, of each MIPS instruction plus stalls, with horizontal lines marking
the group power levels; part (a) considers cache stalls only, part (b) all pipeline stalls.)

Figure 4.2: Data Counter module configured for power profiling. (Block diagram: an
Instruction Decoding block drives the enables of six instruction-group counters, countA
through countF, plus a stall counter, countSTALL, fed by the icache_stall and dcache_stall
signals; the counter outputs are concatenated into count_out.)

Despite the large increase in area, the Fmax of the power profiler is relatively high, with
values comparable to those found in many of the cycle-accurate configurations. This is due
to the optimization mentioned in the previous section, which replaces the 160-bit additions
with seven smaller additions. This removes the power-specific modifications from the critical
path, keeping the Address Hash's slowest path as the critical path of the profiler.
Table 4.4: Area overhead of LEAP when configured for energy profiling.
N | Total LEs | Total Registers | Total Mem. Bits | Fmax (MHz)
16 | 2,559 | 1,427 | 6,656 | 161.1
32 | 2,620 | 1,430 | 9,216 | 158.0
64 | 2,703 | 1,478 | 14,336 | 150.6
128 | 2,667 | 1,397 | 24,576 | 157.0
256 | 2,758 | 1,483 | 45,056 | 163.2
4.3 Experimental Results
Through the combined use of the instruction power database and the on-chip profiler, energy
estimation was performed for the set of 13 benchmarks described in Section 3.4.2. These
experiments demonstrate the utility of this approach, as well as highlight the difficulties inherent
in energy estimation.
4.3.1 Methodology
To perform function-level energy estimation, the profiler and Tiger system are synthesized
and programmed to the target FPGA. Each benchmark is run with the corresponding hash
initialization parameters to produce a 160-bit result for each function. This value is the result of
concatenating the seven counters used within the profiler to store the counts of each instruction
grouping and of stalls. All benchmarks were run in two configurations: in the first, a stall
was said to have occurred whenever either the instruction or data cache stalled. In the second
configuration, a stall was said to occur at any time that a pipeline stage was stalled.
Using these results, the energy estimation of each function within a benchmark is calculated
with the formula E = Σ Pi × Ti (summing over the seven groups i = 1 to 7), where E is the
energy estimate, Pi is the average dynamic power for group i in the instruction power database,
and Ti is the time spent executing instructions of group i. Time can be further expressed as
Ti = Ci/f, where Ci is the cycle count reported for group i, and f is the frequency at which
the system was run. Since the instruction database was created based on a system running
at 25 MHz, f is set to this value. This may appear to be a limitation on the frequency the
system can run at, but energy estimation is independent of frequency. In the equation
E = P × T, power is proportional to frequency, so as frequency changes by a factor df, so
does the power. However, the time required for completion will change by 1/df, since the
system is running faster; these two factors of df cancel each other out, so the frequency of
the system does not affect its energy consumption.
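The calculation follows directly from the two formulas. A sketch, with hypothetical group powers and cycle counts (the function name and unit handling are illustrative assumptions):

```python
F_HZ = 25e6  # the power database was characterized at 25 MHz

def estimate_energy_nj(group_cycles, group_powers_mw, stall_cycles, stall_power_mw):
    """E = sum of Pi * Ti over the seven counters, with Ti = Ci / f.

    Powers are in mW, and mW * s = mJ, so the sum is scaled by 1e6
    to report nanojoules, matching the tables in this chapter.
    """
    energy_mj = stall_power_mw * (stall_cycles / F_HZ)
    for cycles, power_mw in zip(group_cycles, group_powers_mw):
        energy_mj += power_mw * (cycles / F_HZ)
    return energy_mj * 1e6  # mJ -> nJ

# hypothetical per-group cycle counts (groups A-F) and group powers
cycles = [1000, 1000, 1000, 1000, 1000, 1000]
powers = [60.0, 60.0, 60.0, 60.0, 60.0, 60.0]
energy = estimate_energy_nj(cycles, powers, stall_cycles=500, stall_power_mw=50.0)
```

Note that scaling `F_HZ` alone would change the result; the frequency independence argued above holds only because the power values would scale by the same factor.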
To test the validity of these energy estimations, timing simulations in ModelSim are performed
on the system (not including the profiler) to produce a VCD for each benchmark. The average
dynamic power, calculated by the Quartus PowerPlay Analyzer, and the total execution time
are recorded and used to calculate the overall energy consumption. This value was taken to be
the golden result, as timing simulation coupled with Quartus PowerPlay is the recommended
flow for power estimation with Altera devices. The function-level energy estimates were summed
to produce the overall energy estimate for the benchmark, which was compared against these
golden results to gauge the accuracy of the profiling method.
To confirm the accuracy of the estimates at a function level, timing simulations in ModelSim
were performed to create a VCD for each function in the benchmark. The PC of the processor
was monitored throughout execution to indicate which function was currently executing; VCD
recording is enabled for the currently-executing function and disabled for all others. These
results were then compared to the function-level energy estimates produced through profiling
to further verify the estimates.
A final experiment was performed to compare energy consumption profiles with cycle count
profiles to determine the degree of correlation between the two. The percentage energy con-
sumption of each function, as compared to the overall energy consumed throughout the entire
application, and the percentage cycle count are calculated; the values are then compared to see
whether a) the numbers correlate, and b) the ordering of functions based on these percentages
is the same for energy profiling and cycle profiling.
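This comparison reduces to normalizing each profile to percentages and checking the sorted orders. A sketch, using invented per-function counts rather than measured data:

```python
def percent_profile(per_function):
    """Convert absolute per-function values to percentages of the total."""
    total = sum(per_function.values())
    return {name: 100.0 * value / total for name, value in per_function.items()}

def same_hot_spot_ranking(profile_a, profile_b):
    """True when sorting functions from largest to smallest share gives
    the same order under both profiles."""
    rank = lambda p: sorted(p, key=p.get, reverse=True)
    return rank(profile_a) == rank(profile_b)

# invented cycle counts and energy estimates for three functions
cycles = {"encode": 100_000, "decode": 60_000, "abs": 5_000}
energy = {"encode": 580_000.0, "decode": 340_000.0, "abs": 31_000.0}

cycle_pct = percent_profile(cycles)
energy_pct = percent_profile(energy)
agree = same_hot_spot_ranking(cycle_pct, energy_pct)
```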
4.3.2 Results
4.3.2.1 Overall Energy Estimation
Table 4.5 shows the overall energy estimates, each of which is the sum of all function-level energy
estimates within a benchmark, of each of the 13 benchmarks in the test suite when considering
only cache stalls. Detailed results for all 13 benchmarks can be found in Appendix A.2.1. Part
(a) shows the results of energy profiling when run on the FPGA, including the percentage of
time the benchmark spent in cache stalls, the total number of cycles required to execute the
benchmark, the average dynamic power estimate (calculated as P = E/T ), and the energy esti-
mate calculated with the aid of the instruction-level power database. Part (b) shows the results
of timing simulation with ModelSim followed by analysis with Quartus PowerPlay, including
total execution time in nanoseconds, average dynamic power, and total energy consumption
(calculated as E = P × T ). Part (c) shows the energy estimation error for each benchmark,
calculated using the standard estimation error formula: error = (estimate − actual) / actual.
The absolute value of the estimation error ranges from 1.8% for the benchmark “DFADD”
to 22.0% for “SHA.” The average estimation error is -0.10%; however, the average of the
absolute values is more meaningful, as under-estimates are just as wrong as over-estimates,
and this absolute average is 8.27%. Two benchmarks, “Blowfish” and “SHA,” have much
higher errors than any others, the next-largest error being only 12.6%. If these two
problematic benchmarks are ignored, the average error drops to only 6.49%. In addition
to having the largest errors, they are also among the least-stalling benchmarks. These results
prompted further investigation, leading to the experiments being repeated with all pipeline
stalls included in the energy profiling.
Table 4.5: Energy profiling results when only cache stalls are considered.
Benchmark | Stalls | Cycles | Power (mW) | Energy (nJ) | Total Time (ns) | Power (mW) | Energy (nJ) | Energy Error
adpcm 7.4% 216,242 60.59 565,890.55 9,312,640 53.95 502,416.93 12.6%
aes 35.8% 47,427 52.36 154,817.45 2,881,320 49.03 141,271.12 9.6%
blowfish 9.5% 916,497 61.76 2,501,841.74 40,507,640 71.91 2,912,904.39 -14.1%
dfadd 29.8% 18,233 54.18 56,266.19 1,011,720 54.61 55,250.03 1.8%
dfdiv 9.9% 72,057 59.88 191,590.02 3,226,360 55.51 179,095.24 7.0%
dfmul 41.8% 6,426 50.25 22,199.31 431,200 49.04 21,146.05 5.0%
dfsin 21.9% 3,003,897 56.22 8,652,870.50 153,907,120 54.80 8,434,110.18 2.6%
gsm 17.9% 41,662 60.42 122,567.17 2,018,160 64.05 129,263.15 -5.2%
jpeg 14.2% 5,277,413 59.26 14,588,694.58 246,168,000 63.06 15,523,354.08 -6.0%
mips 7.1% 40,780 62.42 109,635.61 1,751,480 56.23 98,485.72 11.3%
motion 30.7% 18,251 56.46 59,462.06 1,057,160 60.52 63,979.32 -7.1%
sha 2.5% 1,113,309 63.98 2,923,453.63 45,690,040 82.05 3,748,867.78 -22.0%
dhrystone 16.1% 27,213 59.76 77,537.53 1,280,440 58.68 75,136.22 3.2%
Average -0.10%
Abs. Average 8.27%
(a) Energy Profiling Results (b) Simulation Results (c) Error
Pipeline stalls can consume a considerable chunk of an application’s execution time. Table 4.6
shows the overall energy estimates when all pipeline stalls are considered for both the instruction
database and the profiler counters. This change in stall measurement is apparent from the
increase in stall percentages versus Table 4.5, as all cache stalls cause pipeline stalls. For these
experiments, the absolute estimation error ranges from 1.1% for “DFSIN” to 23.2% for “SHA.”
The absolute average error across all benchmarks is 8.38%, which is very similar to
the result of the cache-only experiments. Detailed results for all 13 benchmarks can be found
in Appendix A.2.2.
Table 4.6: Energy profiling results when all pipeline stalls are considered.
Benchmark | Stalls | Cycles | Power (mW) | Energy (nJ) | Total Time (ns) | Power (mW) | Energy (nJ) | Energy Error
adpcm 37.3% 146,468 57.02 532,740.94 9,312,640 53.95 502,416.93 6.0%
aes 46.0% 39,924 55.28 163,500.43 2,881,320 49.03 141,271.12 15.7%
blowfish 12.3% 888,221 62.48 2,530,356.30 40,497,080 71.91 2,912,145.02 -13.1%
dfadd 44.7% 14,318 55.65 57,556.45 1,011,720 54.61 55,250.03 4.2%
dfdiv 33.9% 52,898 57.26 183,025.38 3,226,360 55.51 179,095.24 2.2%
dfmul 55.0% 4,969 53.57 23,716.31 431,200 49.04 21,146.05 12.2%
dfsin 43.3% 2,180,140 55.40 8,526,386.48 153,920,040 54.80 8,434,818.19 1.1%
gsm 42.2% 29,559 57.17 116,914.53 2,018,160 64.05 129,263.15 -9.6%
jpeg 34.6% 4,034,329 57.57 14,195,382.50 246,571,880 63.06 15,548,822.75 -8.7%
mips 21.1% 34,660 60.04 105,358.20 1,751,480 56.23 98,485.72 7.0%
motion 33.0% 17,647 58.22 61,332.44 1,057,160 60.52 63,979.32 -4.1%
sha 9.5% 1,033,449 63.03 2,880,061.19 45,688,840 82.05 3,748,769.32 -23.2%
dhrystone 27.6% 23,394 59.20 76,520.37 1,280,440 58.68 75,136.22 1.8%
Average -0.65%
Abs. Average 8.38%
(a) Energy Profiling Results (b) Simulation Results (c) Error
The benchmarks “Blowfish” and “SHA” still result in relatively high error in this stall con-
figuration, both under-estimates, but it is important to note that the stall percentages of these
two are now significantly lower than those of any other benchmark. This observation led to an
important realization about the dynamic power of the processor: in an application with few
pipeline stalls the processor is, on average, performing work more often than in one with a
higher stall rate, so it consumes more power. During a pipeline stall caused by pipeline
Table 4.7: Energy profiling results when all pipeline stalls are considered and the correction
factor is applied.

Benchmark | Stalls | Cycles | Power (mW) | Stall Factor | Energy (nJ) | Total Time (ns) | Power (mW) | Energy (nJ) | Energy Error
adpcm 37.3% 146,468 56.88 0.99 531,430.45 9,312,640 53.95 502,416.93 5.8%
aes 46.0% 39,924 54.96 0.97 162,562.87 2,881,320 49.03 141,271.12 15.1%
blowfish 12.3% 888,221 71.45 1.18 2,893,418.97 40,507,640 71.91 2,912,145.02 -0.6%
dfadd 44.7% 14,318 55.35 0.98 57,245.57 1,011,720 54.61 55,250.03 3.6%
dfdiv 33.9% 52,898 57.26 1.00 183,036.18 3,226,360 55.51 179,095.24 2.2%
dfmul 55.0% 4,969 53.20 0.96 23,550.46 431,200 49.04 21,146.05 11.4%
dfsin 43.3% 2,180,140 55.12 0.98 8,483,392.37 153,907,120 54.80 8,434,818.19 0.6%
gsm 42.2% 29,559 56.90 0.98 116,368.08 2,018,160 64.05 129,263.15 -10.0%
jpeg 34.6% 4,034,329 57.53 1.00 14,187,192.49 246,168,000 63.06 15,548,822.75 -8.8%
mips 21.1% 34,660 61.79 1.06 108,434.21 1,751,480 56.23 98,485.72 10.1%
motion 33.0% 17,647 58.27 1.00 61,382.00 1,057,160 60.52 63,979.32 -4.1%
sha 9.5% 1,033,449 79.95 1.26 3,652,738.91 45,690,040 82.05 3,748,769.32 -2.6%
dhrystone 27.9% 23,394 59.69 1.02 77,149.23 1,280,440 58.68 75,136.22 2.7%
Average 1.95%
Abs. Average 5.95%
(a) Energy Profiling Results (b) Simulation Results (c) Error
stage X, all stages before X must also stall, but all stages after may finish execution; this
results in a partial pipeline stall. Therefore, higher stall rates also cause emptier pipelines and
take longer to complete the computation; less work is performed per unit time, resulting in a
lower overall dynamic power consumption. The instruction-level power database, which was
created by averaging the dynamic powers for each instruction across all benchmarks, inherently
assumes a pipeline stall rate equal to the average of the stall rates of Table 4.6,
which is 33.9%, and assumes all pipeline stalls are full stalls. This approach will naturally
incur error because benchmarks such as “Blowfish” and “SHA” only stall for 12.3% and 9.5%
of the time, respectively. This means they have higher dynamic power consumptions than
indicated by the power database; the rest of the benchmarks range between 21.1-55.0% stall
rate, much closer to the database’s average. As a first attempt to compensate for the
assumed stall rate, we scale the estimated energy values by an empirically derived correction
factor that is based on a program’s stall rate; future investigation may lead to a modified
methodology for creating the instruction-level power database. The correction factor is defined
to be: factor = 1 + ((Avg.Stall% − Stall%) / Stall%) × (1 / Stall%), where Avg.Stall% is the
average stall rate across all programs (33.9%) and Stall% is the fraction of time spent stalling
in the program being profiled. Intuitively, this factor increases the power estimate when the
stall rate of the program being profiled is below the average stall rate; likewise, the power
estimate is reduced when the stall rate of the program is above the average stall rate. This
factor is multiplied with
the estimated energy consumptions of Table 4.6 to achieve lower estimation errors, as seen in
Table 4.7. These factors range from 0.96, which corresponds to the highest-stalling benchmark
“DFMUL,” to 1.26, which corresponds to the lowest-stalling benchmark “SHA.” It can be seen
in this table that the absolute average estimation error is now only 5.95%, and is at worst
15.1%, confirming that real-time energy profiling is accurate.
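A literal reading of that definition, with stall rates expressed as percentages, can be sketched as follows. The exact arithmetic behind the published factors is not fully recoverable from the text, so this sketch reproduces only the qualitative behavior, not Table 4.7's values:

```python
AVG_STALL_PCT = 33.9  # average stall rate across the 13 benchmarks

def correction_factor(stall_pct):
    """factor = 1 + ((Avg.Stall% - Stall%) / Stall%) * (1 / Stall%).

    Programs that stall less than average get a factor above 1 (their
    dynamic power is higher than the database assumes); programs that
    stall more than average get a factor below 1."""
    return 1.0 + ((AVG_STALL_PCT - stall_pct) / stall_pct) * (1.0 / stall_pct)
```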
4.3.2.2 Function-level Energy Estimation
The previous section demonstrated the overall accuracy of energy consumption profiling by
summing the energy estimates of each function and comparing this total for each benchmark
against the “golden” result from simulation. Tables 4.8 and 4.9 show the function-level results
of energy profiling for the benchmarks “ADPCM” and “DFMUL”, respectively, versus function-
level “golden” results. Part (a) shows the cycle counts of each instruction group and of pipeline
stalls (discussed in Section 4.2). Due to the composition of the instruction-level power database
in Table 4.3(b), groups A and F tend to have low cycle counts; “ADPCM” executes none of the
instructions in either of these groups. Part (b) shows the estimated energy consumption using
the instruction-level database shown in Table 4.3(b), and the percent energy consumption of
each function as compared to the total estimated energy consumption of the application. Part
(c) shows the average dynamic power as measured from timing simulation, the calculated energy
consumption based on this power, and the percentage energy consumption. Finally, part (d)
shows the estimation error at a function level, as well as the difference between the estimated
Table 4.8: Function-level energy estimation results for benchmark “ADPCM”. Note: “Stalls”
indicates pipeline stalls, and “Energy %” indicates percentage energy compared to the total
energy.

Function | A | B | C | D | E | F | Stalls | Est. Energy (nJ) | Energy % | VCD Power (mW) | VCD Energy (nJ) | Energy % | Error | Energy % Diff.
encode | 0 | 7782 | 20665 | 12490 | 2642 | 0 | 18669 | 144,953.36 | 27.2% | 58.10 | 144,664.35 | 28.2% | 0.2% | -1.0%
decode | 0 | 6872 | 14744 | 13062 | 2547 | 0 | 26632 | 143,372.68 | 26.9% | 57.80 | 147,637.38 | 28.8% | -2.9% | -1.9%
upzero | 0 | 4137 | 5376 | 7153 | 1848 | 0 | 14038 | 73,001.07 | 13.7% | 51.63 | 67,226.39 | 13.1% | 8.6% | 0.6%
filtez | 0 | 2395 | 2195 | 3384 | 1196 | 0 | 5513 | 33,643.84 | 6.3% | 52.99 | 31,122.09 | 6.1% | 8.1% | 0.3%
quantl | 0 | 2183 | 3364 | 2471 | 991 | 0 | 5120 | 32,289.59 | 6.1% | 47.22 | 26,686.86 | 5.2% | 21.0% | 0.9%
uppol2 | 0 | 1597 | 3392 | 1783 | 799 | 0 | 3652 | 25,952.73 | 4.9% | 52.31 | 23,483.01 | 4.6% | 10.5% | 0.3%
uppol1 | 0 | 798 | 1984 | 2184 | 398 | 0 | 2733 | 18,820.20 | 3.5% | 51.32 | 16,621.52 | 3.2% | 13.2% | 0.3%
main | 0 | 713 | 1049 | 875 | 149 | 0 | 2338 | 11,256.73 | 2.1% | 56.25 | 11,529.00 | 2.2% | -2.4% | -0.1%
adpcm_main | 0 | 354 | 1383 | 773 | 200 | 0 | 1871 | 10,318.75 | 1.9% | 50.93 | 9,332.41 | 1.8% | 10.6% | 0.1%
filtep | 0 | 0 | 598 | 799 | 600 | 0 | 1658 | 8,438.64 | 1.6% | 50.92 | 7,444.50 | 1.5% | 13.4% | 0.1%
logscl | 0 | 399 | 1191 | 795 | 99 | 0 | 948 | 8,056.15 | 1.5% | 52.92 | 7,264.86 | 1.4% | 10.9% | 0.1%
logsch | 0 | 400 | 1030 | 630 | 200 | 0 | 837 | 7,314.58 | 1.4% | 53.02 | 6,568.12 | 1.3% | 11.4% | 0.1%
scalel | 0 | 200 | 1591 | 798 | 0 | 0 | 298 | 7,158.11 | 1.3% | 63.72 | 7,358.39 | 1.4% | -2.7% | -0.1%
reset | 0 | 0 | 65 | 96 | 0 | 0 | 2594 | 5,025.29 | 0.9% | 30.64 | 3,376.53 | 0.7% | 48.8% | 0.3%
abs | 0 | 0 | 299 | 748 | 0 | 0 | 108 | 2,978.35 | 0.6% | 60.43 | 2,791.87 | 0.5% | 6.7% | 0.0%
Total | 0 | 27,830 | 58,926 | 48,041 | 11,669 | 0 | 87,009 | 532,580 | | | 513,107 | | 3.8% |

(a) Profiling Results, (b) Energy Estimation, (c) Simulation Results, (d) Error
and actual percentage energy consumptions.
It can be seen that although many functions have seemingly high estimation errors, the
overall energy estimation errors are still quite small, measuring only 3.8% and 2.0% for
“ADPCM” and “DFMUL”, respectively. This result can be explained by noting that the
majority of each program’s execution is spent in a relatively small number of functions; for
“ADPCM”, 74% of all cycles executed are spent in only 4 of 15 functions. Likewise, for
“DFMUL”, 81% of all cycles are spent in only 4 of the 12 functions. Although some
function-level error percentages are high, the actual differences between the energy estimates
and the actual energies are quite small. Therefore, these errors are less critical since they
correspond to very small portions of the code. The right-most column in Tables 4.8 and 4.9
shows that the difference in percentage energy between the estimate and the value calculated
from simulation is at most 1.9% for either benchmark. This confirms that the estimates are
accurate at the function level.
Table 4.9: Function-level energy estimation results for benchmark “DFMUL”. Note: “Stalls”
indicates pipeline stalls, and “Energy %” indicates percentage energy compared to the total
energy.

Function | A | B | C | D | E | F | Stalls | Est. Energy (nJ) | Energy % | VCD Power (mW) | VCD Energy (nJ) | Energy % | Error | Energy % Diff.
float64_mul | 0 | 374 | 789 | 1173 | 32 | 0 | 2674 | 10,875.01 | 46.2% | 52.01 | 10,489.38 | 48.1% | 3.7% | -1.9%
main | 0 | 142 | 260 | 232 | 20 | 0 | 1199 | 3,798.68 | 16.1% | 47.97 | 3,555.54 | 16.3% | 6.8% | -0.2%
roundAndPackFloat64 | 0 | 116 | 198 | 241 | 8 | 0 | 641 | 2,581.46 | 11.0% | 47.82 | 2,303.01 | 10.6% | 12.1% | 0.4%
mul64To128 | 31 | 30 | 123 | 93 | 0 | 31 | 495 | 1,682.78 | 7.1% | 42.49 | 1,364.78 | 6.3% | 23.3% | 0.9%
propagateFloat64NaN | 0 | 41 | 60 | 74 | 0 | 0 | 549 | 1,420.20 | 6.0% | 39.20 | 1,135.23 | 5.2% | 25.1% | 0.8%
extractFloat64Exp | 0 | 0 | 79 | 159 | 0 | 0 | 49 | 718.09 | 3.0% | 60.91 | 699.25 | 3.2% | 2.7% | -0.2%
extractFloat64Sign | 0 | 0 | 39 | 198 | 0 | 0 | 42 | 711.30 | 3.0% | 54.72 | 610.68 | 2.8% | 16.5% | 0.2%
extractFloat64Frac | 0 | 0 | 79 | 140 | 0 | 0 | 70 | 703.63 | 3.0% | 66.32 | 766.66 | 3.5% | -8.2% | -0.5%
float64_is_signaling_nan | 0 | 6 | 17 | 44 | 0 | 0 | 104 | 360.71 | 1.5% | 42.98 | 293.98 | 1.3% | 22.7% | 0.2%
packFloat64 | 0 | 0 | 29 | 30 | 29 | 0 | 41 | 315.25 | 1.3% | 51.78 | 267.18 | 1.2% | 18.0% | 0.1%
float64_is_nan | 0 | 5 | 16 | 18 | 0 | 0 | 81 | 244.35 | 1.0% | 44.69 | 214.51 | 1.0% | 13.9% | 0.1%
float_raise | 0 | 2 | 2 | 7 | 0 | 0 | 65 | 143.96 | 0.6% | 37.35 | 113.54 | 0.5% | 26.8% | 0.1%
Total | 31 | 716 | 1,691 | 2,409 | 89 | 31 | 6,010 | 23,555 | | | 21,813 | | 2.0% |

(a) Profiling Results, (b) Energy Estimation, (c) Simulation Results, (d) Error
4.3.2.3 Energy and Cycle Profile Correlation
A final investigation into energy profiling was to determine how closely the amount of time a
program spends in different parts of its code relates to the amount of energy it consumes. Ta-
bles 4.10 and 4.11 show the function-level results for cycle count profiling and energy profiling for
the benchmarks “ADPCM” and “DFMUL.” The number of clock cycles spent executing within
each function, along with the percentage execution time as compared with the total number of
clock cycles required to execute the program, are labeled Cycles. The energy consumption per
function, along with the percentage energy consumption, are labeled cStall Energy and pStall
Energy; cStall Energy refers to the energy estimate based on cache stalls only, and pStall Energy
refers to the estimate based on all pipeline stalls. By comparing the percentage execution time
to each of the percentage energy consumptions, it can be seen that the maximum difference
between time and cache stall energy for “ADPCM” is 0.3%, and 0.6% for time versus pipeline
stall energy. For “DFMUL,” the maximum difference between time and cache stall energy is
1.8%, and 0.6% for time versus pipeline stall energy. The full set of results
for all 13 benchmarks can be found in Appendix A.2.3. This shows that there is a very strong
Table 4.10: Comparative results of percentage energy consumption versus percentage execution time for benchmark “ADPCM”.

Function | Cycles, % | cStall Energy, % | pStall Energy, %
decode | 63818 27.3% | 155,220.31 27.4% | 143,372.68 26.9%
encode | 62188 26.6% | 151,221.70 26.7% | 144,953.36 27.2%
upzero | 32578 14.0% | 79,712.37 14.1% | 73,001.07 13.7%
filtez | 14707 6.3% | 35,573.37 6.3% | 33,643.84 6.3%
quantl | 14137 6.1% | 34,992.36 6.2% | 32,289.59 6.1%
uppol2 | 11209 4.8% | 27,070.45 4.8% | 25,952.73 4.9%
uppol1 | 8097 3.5% | 20,692.39 3.7% | 18,820.20 3.5%
main | 5115 2.2% | 10,846.07 1.9% | 11,256.73 2.1%
adpcm main | 4596 2.0% | 9,799.36 1.7% | 10,318.75 1.9%
filtep | 3655 1.6% | 9,986.15 1.8% | 8,438.64 1.6%
logscl | 3425 1.5% | 8,382.52 1.5% | 8,056.15 1.5%
logsch | 3097 1.3% | 8,083.42 1.4% | 7,314.58 1.4%
scalel | 2887 1.2% | 7,190.45 1.3% | 7,158.11 1.3%
reset | 2753 1.2% | 4,035.83 0.7% | 5,025.29 0.9%
abs | 1155 0.5% | 2,960.50 0.5% | 2,978.35 0.6%
wrap | 91 0.0% | 123.28 0.0% | 160.89 0.0%
Total | 233508 100.0% | 565,891 100.0% | 532,741 100.0%
correlation between the function-level energy estimates and the function-level execution time.
It is also important to note that, except for a few small differences (e.g., encode and decode in
“ADPCM”), the sorted order of percentage execution time, percentage cache stall energy, and
percentage pipeline stall energy is the same. For hardware/software partitioning purposes,
this means that a partition optimized for throughput is also optimized for energy consumption,
and vice-versa.
4.4 Summary
LEAP has been shown to be an extensible profiling framework through the implementation of
energy profiling. An instruction-level power database was created using timing-based simulation
to characterize the individual power consumption of instructions in the MIPS1 instruction set.
LEAP’s Data Counter module was modified to count, per function, the number of cycles the
Table 4.11: Comparative results of percentage energy consumption versus percentage execution time for benchmark “DFMUL”.

Function | Cycles, % | cStall Energy, % | pStall Energy, %
float64 mul | 5038 45.6% | 10,462.91 47.1% | 10,875.01 45.9%
main | 1834 16.6% | 3,290.03 14.8% | 3,798.68 16.0%
roundAndPackFloat64 | 1213 11.0% | 2,463.48 11.1% | 2,581.46 10.9%
mul64To128 | 788 7.1% | 1,627.69 7.3% | 1,682.78 7.1%
propagateFloat64NaN | 727 6.6% | 1,237.55 5.6% | 1,420.20 6.0%
extractFloat64Frac | 289 2.6% | 677.64 3.1% | 703.63 3.0%
extractFloat64Exp | 287 2.6% | 681.30 3.1% | 718.09 3.0%
extractFloat64Sign | 282 2.6% | 681.14 3.1% | 711.30 3.0%
float64 is signaling nan | 171 1.5% | 321.52 1.4% | 360.71 1.5%
packFloat64 | 129 1.2% | 308.63 1.4% | 315.25 1.3%
float64 is nan | 120 1.1% | 209.88 0.9% | 244.35 1.0%
wrap | 91 0.8% | 123.28 0.6% | 160.89 0.7%
float raise | 76 0.7% | 114.26 0.5% | 143.96 0.6%
Total | 11045 100.0% | 22,199 100.0% | 23,716 100.0%
processor spent executing each instruction and the number of cycles spent stalling. Using the
equation E = P × T , the energy consumption of thirteen benchmarks was estimated within
5.95%, on average. The results of cycle profiling and energy profiling were compared to deter-
mine the relationship between percent execution time and percent energy consumption for each
function in the benchmark suite. It was found that for a given function, the two rarely differ
by more than 1-2%.
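The estimate summarized above can be made concrete with a short sketch. The function below is an illustrative software model, not LEAP's implementation: the array names, the per-class cycle breakdown, and the units (mW, ns, nJ) are assumptions chosen to mirror the tables in this chapter.

```c
#include <stddef.h>

/* Sketch of the per-function estimate E = P x T. `cycles[c]` stands for
 * the cycles a function spent executing instruction class c (as counted
 * by the modified Data Counter), and `power_mw[c]` for that class's
 * average dynamic power from the instruction-level power database.
 * All names and units are illustrative. */
double estimate_energy_nj(const unsigned long cycles[],
                          const double power_mw[],
                          size_t num_classes,
                          double clk_period_ns)
{
    double energy_pj = 0.0;
    for (size_t c = 0; c < num_classes; ++c) {
        /* power (mW) x time (ns) = energy (pJ) for this class's cycles */
        energy_pj += power_mw[c] * (double)cycles[c] * clk_period_ns;
    }
    return energy_pj / 1000.0; /* pJ -> nJ */
}
```

Summing such per-function estimates yields a whole-program figure comparable to the totals reported in this chapter.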
Chapter 5
Hardware/Software Partitioning
This research aims to enable the partitioning of software code into hardware and software
portions. In Chapter 3, a low-overhead profiling framework was described that returned the
cycle count of each function that was called in an application. This chapter demonstrates
the utility of this framework by using the results of Chapter 3 to drive hardware/software
partitioning with the LegUp framework.
5.1 System Creation with LegUp
In order to demonstrate the effectiveness of hardware/software partitioning using LEAP, LegUp
was used to create hybrid systems for comparison. These hybrid systems consist of the Tiger
processor plus hardware accelerators communicating with the processor over the Altera Avalon
Interconnect fabric. Each accelerator is the hardware implementation of a software function
and all descendant functions.
In order to create a hybrid system, the software-only version of the system is first run using
LEAP to gather profiling data. Using these results, functions are chosen to maximize through-
put or minimize energy consumption by placing the name of each function in a configuration
file. LegUp reads this file and, using its HLS tool, compiles each function and all descendants
into Verilog modules.
In order to communicate with these accelerated functions, a software wrapper function is cre-
ated to: 1) pass any parameters required by the function, 2) retrieve the function’s return value,
3) start the hardware accelerator at the correct times in program flow, and 4) determine when
the accelerator has completed and the results are available. These wrappers are automatically
generated by LegUp for each function that is accelerated. Using the LLVM compiler infras-
tructure, all function calls to accelerated functions are replaced with calls to the corresponding
wrapper, which can then pass parameters over the Avalon Interface, send a start signal to the
accelerator, wait until the accelerator completes, and read the return value, if any.
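The four steps above can be sketched in C. Everything in this sketch is an assumption for illustration: the register layout, the names, and the software stand-in that plays the accelerator's role so the handshake can be exercised off-hardware. The real wrappers are generated automatically by LegUp and poll the accelerator's status over the Avalon interface.

```c
#include <stdint.h>

/* Illustrative register file of one accelerator. On hardware this would
 * be a memory-mapped Avalon slave; here it is a plain struct. */
typedef struct {
    uint32_t arg0;   /* first (and only) parameter */
    uint32_t start;  /* write 1 to start; cleared on completion */
    uint32_t retval; /* result, valid once start reads back 0 */
} accel_regs;

/* Software stand-in for the accelerator: squares its argument.
 * On hardware this logic is the HLS-generated circuit. */
static void fake_accel_step(accel_regs *r)
{
    if (r->start) {
        r->retval = r->arg0 * r->arg0;
        r->start = 0;                 /* signal completion */
    }
}

static int32_t accel_wrapper(accel_regs *r, int32_t arg0)
{
    r->arg0 = (uint32_t)arg0;    /* 1) pass parameters             */
    r->start = 1;                /* 3) assert the start signal     */
    while (r->start)             /* 4) poll until completion       */
        fake_accel_step(r);      /*    (hardware needs no call)    */
    return (int32_t)r->retval;   /* 2) retrieve the return value   */
}
```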
Once the accelerators have been compiled and the software wrappers have been generated to
communicate with the accelerators, the hybrid system (not containing LEAP) is automatically
generated. The Tiger processor and all accelerators are placed on the Avalon Interconnect
fabric, and SOPC Builder is used to generate arbitration code for the communication between
them. Quartus is used to synthesize the system and program it to the target FPGA. Users
run applications in the same way they would run software on the Tiger system, except that
the application-specific hybrid system must be used instead of the generic processor, and the
modified software binary containing the software wrapper functions must be run.
5.2 Experimental Results
Multiple configurations of the hybrid systems were generated and compared to investigate the
trends that arise as computation is gradually moved from software to hardware. Data was
collected that measured the area consumption, throughput, and energy consumption of these
configurations.
5.2.1 Methodology
The experiments performed to investigate trends in hardware/software partitioning consist of
four system configurations created with LegUp:
1. Software Only: this system consists of the base Tiger system, running pure software
through the processor.
2. Hardware Only: this system consists of a pure-hardware implementation of the bench-
mark, as created by the LegUp HLS tools. The Tiger processor is not part of this system.
3. Hybrid 1: this hybrid system accelerates the most compute-intensive software function
(and descendants).
4. Hybrid 2: this hybrid system accelerates the second-most compute-intensive software
function (and descendants).
All four configurations were created for each of the 13 benchmarks, using the results of cycle
profiling (Section 3.4.3) to assist in partitioning, where applicable. Using the post-synthesis
results from Quartus, the area consumption of each system was determined, measured in logic
elements (LEs), memory bits, and embedded hardware multipliers. Using the Fmax of the
system (as determined by Quartus), in conjunction with the number of cycles required to execute
the benchmark in the given configuration, the total execution time was determined. Finally,
by using timing simulation with ModelSim followed by analysis with the Quartus PowerPlay
Analyzer, the average dynamic power and total energy consumption of each configuration were
determined.
5.2.2 Results
The area consumption of each benchmark in the four system configurations is shown in Table 5.1.
Each configuration, labeled (a)-(d), shows the number of logic elements (LEs), memory bits
(Mem. Bits), and embedded multipliers (Mults) for that benchmark. The geometric mean,
geomean, is also shown, averaging the results of each metric among all benchmarks to show the
trends among configurations. The ratio indicates the geomean of the given metric divided by
that of software-only. It can be seen that as more computation is performed in hardware (i.e.
from left to right in the table), all metrics increase in area consumption as accelerators must be
added to the system. This does not hold for the hardware-only configuration, as the average
number of LEs is only slightly more than that of software only, and the memory and multiplier
requirements are the lowest of any configuration. This can be explained by noting that the
Tiger processor is no longer needed in this configuration, as no computation is performed by
software. The trends in area consumption, specifically LE count, are shown in Figure 5.1.
Table 5.1: Area results for hybrid systems in four configurations: (a) SW-Only, (b) Hybrid 2, (c) Hybrid 1, (d) HW-Only. Each configuration lists LEs, Mem. Bits, and Mults.

Benchmark | (a) LEs, Mem. Bits, Mults | (b) LEs, Mem. Bits, Mults | (c) LEs, Mem. Bits, Mults | (d) LEs, Mem. Bits, Mults
adpcm | 12,243 226,009 16 | 25,628 242,944 152 | 46,301 242,944 300 | 22,605 29,120 300
aes | 12,243 226,009 16 | 56,042 244,800 32 | 68,031 245,824 40 | 28,490 38,336 0
blowfish | 12,243 226,009 16 | 25,030 341,888 16 | 31,020 342,752 16 | 15,064 150,816 0
dfadd | 12,243 226,009 16 | 22,544 233,664 16 | 26,148 233,472 16 | 8,881 17,120 0
dfdiv | 12,243 226,009 16 | 28,583 225,600 46 | 36,946 233,472 78 | 20,159 12,416 62
dfmul | 12,243 226,009 16 | 16,149 225,280 48 | 20,284 233,472 48 | 4,861 12,032 32
dfsin | 12,243 226,009 16 | 34,695 233,472 78 | 54,450 233,632 116 | 38,933 12,864 100
gsm | 12,243 226,009 16 | 15,220 225,280 16 | 30,808 233,296 142 | 19,131 11,168 70
jpeg | 12,243 226,009 16 | 25,148 232,576 114 | 64,441 354,544 254 | 46,224 253,936 172
mips | 12,243 226,009 16 | 46,432 338,096 252 | 18,857 230,304 24 | 4,479 4,480 8
motion | 12,243 226,009 16 | 18,857 230,304 24 | 18,013 242,880 16 | 13,238 34,752 0
sha | 12,243 226,009 16 | 28,761 243,104 16 | 29,754 359,136 16 | 12,483 134,368 0
dhrystone | 12,243 226,009 16 | 20,382 359,136 16 | 16,310 225,280 16 | 4,985 82,008 0
Geomean: | 12,243 226,009 16 | 26,593 248,671 43 | 33,629 261,260 51 | 15,646 28,822 12
Ratio: | 1.00 1.00 1.00 | 2.17 1.10 2.68 | 2.75 1.16 3.18 | 1.28 0.13 0.72
The throughput of each system is shown in Table 5.2. Each configuration shows the number
of cycles required to run the benchmark to completion (Cycles), the maximum frequency at
which the system can be run (Freq.), and the time required to run the benchmark to completion,
measured in microseconds (Time (µs)). It can be seen that, on average, the execution time is
cut in half as more computation is performed in hardware, up to an 88% reduction in execution
time for the hardware-only configuration. The maximum frequency of each system remains
fairly constant, as the LegUp HLS tool orders the execution of operations so as to minimize
its critical path. As one might expect, the dual-precision floating point benchmarks achieved
the greatest speedups as more hardware acceleration was used. The benchmark “DFDIV,”
for example, requires 962.9µs, 190.7µs, 68.8µs, and 29.0µs for the software-only, hybrid 2,
Figure 5.1: Area comparison for hybrid systems in four configurations, averaged across 13 benchmarks. (Bar chart: y-axis “Number of LEs Ratio”; x-axis SW-Only, Hybrid 2, Hybrid 1, HW-Only.)
hybrid 1, and hardware-only configurations, respectively. When comparing to the software-
only configuration for this benchmark, these execution times correspond to reductions of 80.2%,
92.9%, and 97.0% for the hybrid 2, hybrid 1, and hardware-only configurations, respectively.
These results show that partition choices, based on profiling results from LEAP, greatly reduce
the execution time requirements of a system. Moreover, by making incorrect partitioning
choices, such as choosing the second most time-consuming function for acceleration as opposed
to the most time-consuming, it is shown that the generated hybrid system will not be as
beneficial as one that was based on good profiling results. There exists a cost to using the hybrid
systems due to the hardware-software communication overhead. Data must be transferred
over the Avalon fabric each time an accelerator is called. Although this overhead is usually
more than compensated for by the gains in throughput from hardware acceleration, incorrect
partitioning can actually increase the total execution time if ineffective accelerators are chosen.
The benchmark “DFMUL,” for example, actually has a longer execution time for the hybrid
2 system than for pure software, due to the communication overhead of each function call.
The trends in throughput, specifically execution time, are shown in Figure 5.2. For the two
hybrid cases, the bottom portion of each bar represents the ideal speedup, using Amdahl’s Law,
assuming a speedup equal to that achieved for the hardware-only case.
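The ideal portions of those bars follow Amdahl's Law. A minimal sketch of the calculation, with an illustrative function name, assuming the hardware-only speedup s applies to the accelerated fraction f of the software-only time:

```c
/* Ideal hybrid execution time under Amdahl's Law: a fraction f of the
 * software-only time t_sw is accelerated by factor s (here taken to be
 * the speedup observed for the hardware-only configuration). */
double amdahl_ideal_time(double t_sw, double f, double s)
{
    /* unaccelerated portion + accelerated portion */
    return t_sw * ((1.0 - f) + f / s);
}
```

For example, if half of a 100 µs program is accelerated by 2x (f = 0.5, s = 2), the ideal execution time is 75 µs.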
Table 5.2: Time results for hybrid systems in four configurations: (a) SW-Only, (b) Hybrid 2, (c) Hybrid 1, (d) HW-Only. Each configuration lists Cycles, Freq., and Time (µs).

Benchmark | (a) Cycles, Freq., Time | (b) Cycles, Freq., Time | (c) Cycles, Freq., Time | (d) Cycles, Freq., Time
adpcm | 193,607 74.26 2,607.2 | 159,883 61.61 2,595.1 | 96,948 57.19 1,695.2 | 36,795 45.79 804.0
aes | 73,777 74.26 993.5 | 55,014 54.97 1,000.8 | 26,878 49.52 542.8 | 14,022 60.72 231.0
blowfish | 954,563 74.26 12,854.3 | 680,343 63.21 10,763.2 | 319,931 63.70 5,022.5 | 209,866 65.41 3,208.0
dfadd | 16,496 74.26 222.1 | 14,672 83.14 176.5 | 5,649 83.65 67.5 | 2,330 124.05 19.0
dfdiv | 71,507 74.26 962.9 | 15,973 83.78 190.7 | 4,538 65.92 68.8 | 2,144 74.72 29.0
dfmul | 6,796 74.26 91.5 | 10,784 85.46 126.2 | 2,471 83.53 29.6 | 347 85.62 4.0
dfsin | 2,993,369 74.26 40,309.3 | 293,031 65.66 4,462.9 | 80,678 68.23 1,182.4 | 67,466 62.64 1,077.0
gsm | 39,108 74.26 526.6 | 29,500 61.46 480.0 | 18,505 61.14 302.7 | 6,656 58.93 113.0
jpeg | 29,802,639 74.26 401,328.3 | 16,072,954 51.20 313,924.9 | 15,978,127 46.65 342,510.8 | 5,861,516 47.09 124,475.0
mips | 43,384 74.26 584.2 | 6,463 84.50 76.5 | 6,463 84.50 76.5 | 6,443 90.09 72.0
motion | 36,753 74.26 494.9 | 34,859 73.34 475.3 | 17,017 83.98 202.6 | 8,578 91.79 93.0
sha | 1,209,523 74.26 16,287.7 | 358,405 84.52 4,240.5 | 265,221 81.89 3,238.8 | 247,738 86.93 2,850.0
dhrystone | 28,855 74.26 388.6 | 25,599 82.26 311.2 | 25,509 83.58 305.2 | 10,202 85.38 119.0
Geomean: | 173,332.0 74.26 2,334.1 | 86,258.3 69.98 1,232.6 | 42,700.5 67.78 630.0 | 20,853.8 71.56 291.7
Ratio: | 1.00 1.00 1.00 | 0.50 0.94 0.53 | 0.25 0.91 0.27 | 0.12 0.96 0.12
The energy consumption of each system is shown in Table 5.3. Each configuration shows the
average dynamic power consumption of the system, measured in milliwatts (Power (mW)), and
the total energy consumed by the system while running the indicated benchmark, measured
in nanojoules (Energy (nJ)). It can be seen that the dynamic powers of the software-only and
hybrid systems differ by less than 12% on average, whereas the hardware-only case consumes
less than half of this power. This extra power is consumed by the processor; even though much
of the execution is performed by the accelerators, the processor is configured to poll until the
accelerator has completed, and thus is always running. The design decision for the processor to poll
until the accelerator completes, rather than stall, is so that future work may enable concurrent
processing of the accelerators and processor using threads. Although the dynamic power results
do not look promising for the hybrid systems, the energy consumptions of each configuration
show much better results: as more computation is performed in hardware, each configuration
Figure 5.2: Execution time comparison for hybrid systems in four configurations, averaged across 13 benchmarks. (Bar chart: y-axis “Execution Time Ratio”; x-axis SW-Only, Hybrid 2, Hybrid 1, HW-Only; series “Exec. Time” and “Exec. Time Ideal”.)
halves in energy consumption. This result can be explained through the equation E = P × T ,
where E is energy, P is dynamic power, and T is execution time. As just discussed, P does not
change much for software-only or the hybrid systems. T , however, drastically improves from
software to hardware, as shown in Table 5.2, leading to excellent energy reductions. The trends
in energy consumption are shown in Figure 5.3.
Table 5.3: Power and energy results for hybrid systems in four configurations: (a) SW-Only, (b) Hybrid 2, (c) Hybrid 1, (d) HW-Only. Each configuration lists Power (mW) and Energy (nJ).

Benchmark | (a) Power, Energy | (b) Power, Energy | (c) Power, Energy | (d) Power, Energy
adpcm | 264.57 689,783.0 | 205.80 534,073.1 | 175.80 298,018.1 | 157.83 126,894.4
aes | 231.63 230,125.2 | 221.84 222,014.5 | 226.23 122,789.4 | 123.09 28,434.1
blowfish | 127.77 1,642,398.5 | 274.46 2,954,049.3 | 214.16 1,075,608.0 | 121.30 389,118.5
dfadd | 157.61 35,011.1 | 221.09 39,015.7 | 178.98 12,086.6 | 85.84 1,631.0
dfdiv | 214.70 206,741.0 | 257.54 49,101.0 | 193.20 13,299.9 | 83.66 2,426.0
dfmul | 122.50 11,210.6 | 171.13 21,593.8 | 202.21 5,981.8 | 40.00 160.0
dfsin | 240.75 9,704,502.3 | 274.49 1,224,986.7 | 240.06 283,857.4 | 129.14 139,082.3
gsm | 253.35 133,420.8 | 167.91 80,594.0 | 178.90 54,145.6 | 102.70 11,605.4
jpeg | 281.92 113,142,738.7 | 207.3 65,082,605.3 | 204.74 70,124,803.7 | 261.37 32,533,823.6
mips | 239.86 140,130.3 | 166.74 12,752.7 | 166.74 12,752.7 | 41.44 2,983.7
motion | 274.44 135,824.3 | 316.77 150,562.9 | 158.22 32,060.0 | 73.91 6,873.5
sha | 383.21 6,241,622.4 | 227.70 965,543.0 | 211.64 685,437.1 | 152.23 433,860.6
dhrystone | 260.15 101,084.8 | 230.82 71,830.7 | 243.18 74,220.9 | 116.19 13,826.0
Geomean: | 221.67 517,413.28 | 221.61 273,162.62 | 194.46 122,511.37 | 100.75 29,390.23
Ratio: | 1.00 1.00 | 1.00 0.53 | 0.88 0.24 | 0.45 0.06
Figure 5.3: Energy consumption comparison for hybrid systems in four configurations. (Bar chart: y-axis “Energy Consumption Ratio”; x-axis SW-Only, Hybrid 2, Hybrid 1, HW-Only.)
Chapter 6
Conclusions
6.1 Summary and Contributions
In recent years, SoCs implemented on reconfigurable FPGAs have become popular due to the
relative ease of development versus traditional ASIC designs, and the performance benefits
versus pure software. These systems combine multiple computational modules, including cus-
tom logic, communication interfaces, vendor intellectual property code, and soft processors. In
order to broaden the user-base of SoCs, and FPGAs in general, software developers need to
be able to create these systems using software development principles and limited hardware
knowledge. To this end, hardware/software partitioning is a process in which certain software
code segments are chosen to be converted into a hardware implementation to gain from the
performance benefits of hardware. Once partitioned, a framework such as LegUp can be used
to create the hardware accelerators for hybrid systems in which software code runs on a soft
processor and calls the accelerators. In order for this partitioning to provide performance ben-
efits, the choices of which code to convert to hardware must be based on real-time data about
the program’s execution.
Profiling is a form of dynamic code analysis in which real-time data is gathered about a
processor’s execution. Software approaches incur time overhead, resulting in very slow execution
of the application being profiled. To alleviate this overhead, sampling is often employed but
this leads to less accurate results. FPGA-based approaches are generally non-intrusive, meaning
they do not slow down the application’s execution. However, they add area and power overhead,
and may require resynthesis when targeting a new application.
This thesis has contributed a low-overhead and extensible architecture for profiling, called
LEAP, to aid in hardware/software partitioning for FPGA-based systems:

• Chapter 3 showed the extensibility of LEAP through the implementation of three profiling schemes: instruction profiling, cycle profiling, and stall cycle profiling. The accuracy of the profiling results was verified through comparisons with existing profiling tools, and the overhead was shown to be minimal as compared to another FPGA-based profiler; up to a 94% reduction in area and an 88% reduction in dynamic power consumption were achieved.

• Chapter 4 introduced an instruction-level power database that characterizes the dynamic power consumption of instructions in the MIPS1 instruction set. LEAP was shown to be versatile by implementing instruction-specific profiling to measure the time spent executing different instructions per function. The instruction power database was combined with the real-time profiling results to estimate the function-level energy consumption of 13 complex C benchmarks, achieving an accuracy of 5.95%, on average, when compared to simulation-based energy estimates. Cycle profiling and energy profiling results were used to compare the percent execution time of each function within a benchmark to the percent energy consumption. It was found that the two are closely related, with the ordering from high to low percentages being nearly identical for both profiles for all benchmarks.

• Chapter 5 demonstrated the utility of LEAP through the use of its profiling results to drive hardware/software partitioning and the creation of hybrid processor/accelerator systems. The throughput, area, and power consumption of four systems created with LegUp were presented to show the effectiveness of partitioning for reducing execution time and energy consumption.
6.2 Future Work
The research proposed in this thesis aims to enable efficient hardware/software partitioning for
optimizing throughput, energy consumption, and locality. LEAP was created to facilitate these
optimizations and shown to be extensible to capture the desired metrics. Three additional
metrics that could provide further insight into enhancing the partitioned system are left as
future work, as are two possible improvements to the profiling framework.
6.2.1 Detailed Function Profiling
The current implementation of the framework performs all measurements at the function level.
This is useful when determining which functions should be accelerated, but the technique gives
no indication of how they should be accelerated. This question of “how” can be answered
by performing detailed analyses of a function’s execution, especially by focusing on branch
conditions.
Branch conditions dictate the flow of execution and can be represented with conditional
statements, loops, and gotos. Dynamic optimizations such as loop unrolling and branch predic-
tion can be very effective when applied to hardware circuitry to leverage available parallelism,
but they require knowledge of branch outcomes to drive the optimization. Loop unrolling, for
example, replicates the body of a loop a number of times, attempting to exploit intra-iteration
parallelism. This replication obviously increases the area required to implement the loop, so
the optimization must be applied strategically such that long-running loops are unrolled many
times, while infrequently iterating loops are not unrolled.
A new profiling metric to enable hardware-oriented optimizations such as loop unrolling
could therefore be the measurement of the total or average number of iterations of a given loop.
All natural loops can be determined by short-backwards-branches (sbb), which are statically-
detectable branches from the tail to the head of a loop. By monitoring the outcomes of these
particular branches and identifying a given loop by its sbb instruction’s address, loop iterations
can be counted.
The current framework can be extended to implement this scheme by noting the similarities
between it and cycle-accurate profiling: each uses an instruction’s address as an index to store
data, and each simply increments a counter when an event occurs. Cycle-accurate profiling uses
the address of the first instruction in the function as its index into the storage RAM to store the
count of cycles executed. Iteration measurement would use the address of the sbb instruction
as its index into a storage RAM to store the number of times that instruction occurs.
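A software model of this counting scheme might look as follows. The table size, the word-aligned hash of the branch address, and all names are illustrative assumptions; the real scheme would be a hardware counter RAM indexed by the sbb instruction's address.

```c
#include <stdint.h>

/* Counter RAM indexed by (a hash of) the sbb's address; size is an
 * illustrative assumption. */
#define LOOP_TABLE_SIZE 256

static uint32_t loop_counts[LOOP_TABLE_SIZE];

/* A taken branch whose target precedes its own address closes a natural
 * loop (tail -> head), so each such outcome counts one loop iteration. */
static void on_branch(uint32_t branch_pc, uint32_t target_pc, int taken)
{
    if (taken && target_pc < branch_pc) {                   /* sbb */
        uint32_t idx = (branch_pc >> 2) % LOOP_TABLE_SIZE;  /* word index */
        loop_counts[idx]++;                                 /* one more iteration */
    }
}
```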
6.2.2 Value Profiling
Standard compiler optimizations include evaluating and propagating constant variables and
expressions to reduce the amount of computation required by the compiled binary. In order
to enable these optimizations, however, the compiler must be able to determine that a given
variable or expression is indeed constant. A beneficial use of dynamic profiling would be to
enable the extension of this optimization to variables and expressions that are likely to be
constant.
A semi-static parameter is an argument to a function that is highly probable to have the
same value whenever that function is called. If it could be determined that a given parameter
is semi-static, a source-code transformation could be performed to enable further optimization.
The transformation would involve creating a second implementation of the applicable function
in which the semi-static parameter is removed as a variable and replaced as a constant with the
value it is highly likely to hold. Any call to this function would be redirected to one of the two
implementations available. If the parameter equals the highly-probable case, the newly-created
constant version is called; otherwise, the original version is used. In this way, the execution of
the code is potentially accelerated while maintaining correctness.
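As an invented example, suppose profiling found that the `scale` parameter of a hypothetical function `shade` is 8 in nearly every call; the transformation described above would produce a specialized copy plus a dispatch:

```c
/* Original, general version. */
static int shade(int x, int scale)
{
    return x * scale + scale / 2;
}

/* Specialized copy for the semi-static value scale == 8; the compiler
 * can fold the constants (x * 8 + 4) and strength-reduce. */
static int shade_scale8(int x)
{
    return x * 8 + 4;
}

/* Replaces the original call sites: take the fast path for the
 * highly-probable value, fall back to the general version otherwise. */
static int shade_dispatch(int x, int scale)
{
    return (scale == 8) ? shade_scale8(x) : shade(x, scale);
}
```

The dispatch preserves correctness for the rare other values while letting the common case benefit from constant folding.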
Dynamic profiling with the proposed framework can be used to determine semi-static pa-
rameters by using caches to store data for each function, as opposed to the currently-used data
counters. The caches could be indexed by the actual value of the desired parameter; if a param-
eter was in fact semi-static, only one of these blocks would be updated often. The associativity
and replacement policy of the caches would need to be investigated to reduce the chances of
thrashing, which would occur when other values that hash to the same block overwrite the
desired semi-static value.
6.2.3 Memory Access Pattern Profiling
Nearly all modern processors incorporate instruction and/or data caches to help reduce the
overhead involved in accessing memory. There are many parameters to set when implementing
a cache, including its size, associativity, coherency protocol, and its ability to pre-fetch data
through bursting. These parameters, once chosen, are generally fixed characteristics of the
processor, even though software applications vary widely in the ways they access memory. One
possible avenue of research is to profile applications to determine their memory access patterns,
thus enabling a customization of the processor’s caches to the required application to reduce
memory access time overhead.
Code that iterates through an array many times will have a fairly consistent memory access
pattern because the data is located in contiguous memory addresses. This type of access
pattern would benefit from bursting since it is highly probable that an address in memory will
be accessed soon after the one which precedes it. In addition, high associativity is not necessary
due to the use of modulus in mapping an address to a cache line. For example, in a direct-
mapped cache containing M cache lines, the line that a given address A will be mapped to is
A mod M ; any set of addresses that differ by less than M will never be mapped to the same
line.
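A one-line helper makes this mapping concrete (the line count M is an illustrative parameter):

```c
/* Direct-mapped line index for address addr in a cache of num_lines (M)
 * lines, per the A mod M mapping described above. */
static unsigned map_line(unsigned addr, unsigned num_lines)
{
    return addr % num_lines;
}
```

For M = 64, addresses 100 and 130 differ by 30 < M and therefore map to distinct lines (36 and 2).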
Code that makes heavy use of dynamically allocated memory, such as linked list iteration,
may have a more random access pattern. This type of pattern would likely benefit little from
bursting, since there is low probability that subsequent reads or writes would be to nearby
addresses. Higher associativity, however, would be beneficial in reducing the likelihood of
replacement, in which a cache-line conflict causes new data to overwrite the data currently
stored in the cache.
One way to extend this work in order to create a profiling scheme capable of determining the
memory access patterns of an application is to measure the spatial characteristics of memory
accesses. This would require access to the data bus, which would be monitored to discover the
locality, or lack thereof, exhibited by the application. High locality would indicate a contiguous
access pattern, while a lower locality would imply a more random one.
6.2.4 Counter Saturation Detection
LEAP contains multiple parameters that can be set to tailor the profiler to the user’s needs.
One such parameter is the width of the data counters used to store the profiling results, which
ranges from 16 to 64 bits and defines the largest integer that can be stored. For long-executing
functions, larger counters are required to avoid counter overflow, also known as saturation; this
increases the area requirements of the profiler.
In the context of hardware/software partitioning, the relative values of the data counters
are as useful as the absolute values, since a percentage of an overall resource is often all that is
desired. This means that any scaling factor could be applied to the results without affecting
the partition. One way to help reduce the area usage attributed to large counter values is to
implement smaller counters that contain logic to detect saturation. If at any time a counter
saturates, all counters are divided by two through a simple right shift. In this way, a final scaling
factor of 1/2^S is applied to all counters, where S is the total number of saturations detected, and
a relative data set is maintained. This idea is presented in [27], but the authors do not address
a key issue with this approach. After a saturation has occurred, all values added to the counter
(which has been divided by 2) are worth twice as much as a value added prior to the saturation.
In this way, values measured near the beginning of execution are counted as less than those
profiled near the end of execution. This issue must be solved before implementation to achieve
correct counter results.
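A software sketch of the scheme follows; the counter width, counter count, and names are illustrative assumptions, and the comment marks the open issue described above.

```c
#include <stdint.h>

#define NUM_COUNTERS 4
#define COUNTER_MAX  0xFFFFu  /* illustrative 16-bit counters */

static uint32_t counters[NUM_COUNTERS];
static unsigned saturations;  /* S: total saturation events */

static void count_event(unsigned idx)
{
    if (counters[idx] == COUNTER_MAX) {
        /* Halve every counter: relative order (the partitioning input)
         * is preserved, with overall scale 1/2^S. */
        for (int i = 0; i < NUM_COUNTERS; i++)
            counters[i] >>= 1;
        saturations++;
        /* Open issue noted above: increments after this point weigh
         * twice as much as increments made before it. */
    }
    counters[idx]++;
}
```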
6.2.5 Recursion Support
Currently, LEAP provides limited support for the profiling of recursive applications. This
limitation is due to the static depth of the Call Stack; the depth of the recursive function calls
can grow very large, causing the Call Stack’s stack pointer to overflow and incorrectly maintain
function contexts. A limited solution to this problem is to simply make the stack extremely
large. If N ≤ 256, where N is the maximum number of functions to profile, the number of bits
required to store a function number to the stack is log₂(N) ≤ 8. Therefore, the number of stack
entries available in a single 4096-bit M4K AltSyncRam is 4096/8 = 512. This would provide
support for many recursive applications, but cannot be guaranteed to work under extreme uses
of recursion.
In order to provide full support for recursive programs, modifications to the Call Stack
module are required. First, it must be able to determine when an application begins recursion,
or equivalently, when a function already on the stack is called. This can be achieved through the
use of a bit vector of width N, where the position of each bit corresponds to a hashed function
number, and the value of the bit indicates its presence in the stack; a function currently in the
stack is represented with a one. During each function call, the bit corresponding to the called
function is checked to ensure it is not already in the stack, and the bit is set to a 1 to reflect the
function being pushed onto the stack. If a return occurs, the bit corresponding to the returning
function is cleared. If a checked bit is a 1 during a call, the function is being
called recursively and a “recursion mode” must be entered. Recursion mode differs from
normal operation in that the stack pointer is not changed; instead, a “recursion
depth counter” is incremented on calls and decremented on returns to track the current
depth of recursion. In addition, the bit vector is not modified in this mode. Upon entering
recursion mode, the depth counter is set to 1 to reflect the current function call.
If any return causes the depth counter to reach 0, the recursion has completed and normal
operation continues. This approach lumps all cycles accrued during recursion into the top-level
recursive function when returning results, as opposed to the simple approach of extending the stack,
which attributes the cycles to each function that executes.
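The proposed bit-vector scheme can be modelled in software as follows. This is a hypothetical sketch of the behaviour described above, not the hardware implementation; the class and method names are invented for illustration:

```python
class CallStack:
    """Software model of the proposed recursion-aware Call Stack (sketch)."""

    def __init__(self, n_funcs):
        self.stack = []                    # models the hardware stack + pointer
        self.on_stack = [False] * n_funcs  # bit vector: one bit per function
        self.recursion_depth = 0           # "recursion depth counter"

    def call(self, f):
        if self.recursion_depth > 0 or self.on_stack[f]:
            # Recursive call detected (or already in recursion mode):
            # the stack pointer and bit vector are frozen; only depth changes.
            self.recursion_depth += 1
        else:
            self.on_stack[f] = True
            self.stack.append(f)

    def ret(self):
        if self.recursion_depth > 0:
            # When the depth counter returns to 0, recursion has completed
            # and normal operation resumes.
            self.recursion_depth -= 1
        else:
            self.on_stack[self.stack.pop()] = False

    def current(self):
        # All cycles accrued during recursion are attributed to this function.
        return self.stack[-1] if self.stack else None

# main (function 0) calls fib (function 1), which then recurses twice.
cs = CallStack(n_funcs=4)
cs.call(0)
cs.call(1)
cs.call(1)           # recursive call: enters recursion mode, depth = 1
cs.call(1)           # depth = 2; stack pointer unchanged
print(cs.current(), cs.recursion_depth)  # → 1 2
cs.ret(); cs.ret()   # depth back to 0: recursion complete
cs.ret()             # normal return: pops fib
print(cs.current())  # → 0
```

As the trace shows, every cycle spent inside the recursive calls is charged to the top-level recursive function (function 1), matching the attribution policy described above.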
Appendix A
Complete Benchmark Results
This appendix contains the full set of results for the experiments described in Chapters 3, 4,
and 5.
A.1 Full Results for Chapter 3: Cycle-Accurate Profiling
A.1.1 Full Results for LEAP vs. gprof
ADPCM
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
abs 0.49 0.1% 0.49 0.1% 1155 0.49% 1155 0.5% 0.42% 0.42%
adpcm_main 10.06 1.6% 641.72 99.0% 4588 1.96% 228355 97.8% 0.41% 1.27%
decode 117.31 18.1% 281.81 43.5% 63863 27.34% 102578 43.9% 9.24% 0.42%
encode 145.12 22.4% 344.51 53.2% 62193 26.63% 118422 50.7% 4.23% 2.47%
filtep 28.86 4.5% 28.86 4.5% 3655 1.56% 3655 1.6% 2.89% 2.89%
filtez 77.93 12.0% 77.93 12.0% 14708 6.30% 14683 6.3% 5.73% 5.74%
logsch 9.24 1.4% 9.24 1.4% 3100 1.33% 3097 1.3% 0.10% 0.10%
logscl 13.92 2.1% 13.92 2.1% 3425 1.47% 3440 1.5% 0.68% 0.68%
main 5.74 0.9% 647.46 99.9% 5122 2.19% 233486 100.0% 1.31% 0.04%
quantl 34.89 5.4% 34.89 5.4% 14177 6.07% 14743 6.3% 0.69% 0.93%
reset 5.34 0.8% 5.34 0.8% 2755 1.18% 2764 1.2% 0.36% 0.36%
scalel 21.48 3.3% 21.48 3.3% 2902 1.24% 2888 1.2% 2.07% 2.08%
uppol1 21.1 3.3% 21.1 3.3% 8097 3.47% 8097 3.5% 0.21% 0.21%
uppol2 40.22 6.2% 40.22 6.2% 11209 4.80% 11212 4.8% 1.41% 1.41%
upzero 116.23 17.9% 116.23 17.9% 32614 13.96% 32589 14.0% 3.97% 3.99%
(a) gprof (b) LEAP (c) Difference
AES
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
AddRoundKey 3.74 1.1% 3.74 1.1% 1248 1.7% 1228 1.7% 0.58% 0.55%
AddRoundKey_InversMixColumn 226.2 67.2% 226.2 67.2% 26069 35.3% 26058 35.3% 31.94% 31.96%
aes_main 0.18 0.1% 336.33 100.0% 1032 1.4% 73620 99.7% 1.34% 0.29%
ByteSub_ShiftRow 7.46 2.2% 7.46 2.2% 8942 12.1% 8970 12.1% 9.89% 9.93%
decrypt 2.56 0.8% 259.13 77.0% 1514 2.1% 39028 52.9% 1.29% 24.18%
encrypt 1.97 0.6% 77.02 22.9% 1916 2.6% 33551 45.4% 2.01% 22.54%
frame_dummy 0.01 0.0% 0.01 0.0% 0.0% 0.0% 0.00% 0.00%
InversShiftRow_ByteSub 6.62 2.0% 6.62 2.0% 5010 6.8% 5020 6.8% 4.82% 4.83%
KeySchedule 38.28 11.4% 43.76 13.0% 14079 19.1% 15357 20.8% 7.69% 7.79%
main 0.05 0.0% 336.33 100.0% 219 0.3% 73839 100.0% 0.28% 0.01%
MixColumn_AddRoundKey 43.83 13.0% 43.83 13.0% 12544 17.0% 12521 17.0% 3.96% 3.93%
SubByte 5.48 1.6% 5.48 1.6% 1271 1.7% 1271 1.7% 0.09% 0.09%
(a) gprof (b) LEAP (c) Difference
BLOWFISH
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
BF_cfb64_encrypt 924.25 23.2% 2639.39 66.3% 166292 16.4% 546458 54.0% 6.78% 12.30%
BF_encrypt 2426.87 60.9% 2426.87 60.9% 674439 66.6% 674327 66.6% 5.68% 5.67%
BF_set_key 35.83 0.9% 793.89 19.9% 18402 1.8% 327706 32.4% 0.92% 12.44%
blowfish_main 548.51 13.8% 3981.79 100.0% 137899 13.6% 1012071 100.0% 0.15% 0.01%
frame_dummy 0.03 0.0% 0.03 0.0% 0.0% 0.0% 0.00% 0.00%
main 0.13 0.0% 3981.88 100.0% 142 0.0% 1012213 100.0% 0.01% 0.00%
memcpy 46.32 1.2% 46.32 1.2% 15085 1.5% 15085 1.5% 0.33% 0.33%
(a) gprof (b) LEAP (c) Difference
DFADD
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
addFloat64Sigs 7.79 16.9% 16.31 35.4% 4714 18.3% 9583 37.1% 1.33% 1.67%
countLeadingZeros32 0.72 1.6% 0.72 1.6% 388 1.5% 388 1.5% 0.06% 0.06%
countLeadingZeros64 0.49 1.1% 1.21 2.6% 453 1.8% 841 3.3% 0.69% 0.63%
extractFloat64Exp 2.23 4.8% 2.23 4.8% 575 2.2% 575 2.2% 2.62% 2.62%
extractFloat64Frac 1.93 4.2% 1.93 4.2% 591 2.3% 591 2.3% 1.90% 1.90%
extractFloat64Sign 2.3 5.0% 2.3 5.0% 591 2.3% 591 2.3% 2.71% 2.71%
float_raise 0.21 0.5% 0.21 0.5% 53 0.2% 53 0.2% 0.25% 0.25%
float64_add 4.86 10.6% 40.88 88.8% 3122 12.1% 22123 85.7% 1.53% 3.15%
float64_is_nan 1.39 3.0% 1.39 3.0% 262 1.0% 272 1.1% 2.01% 1.97%
float64_is_signaling_nan 1.34 2.9% 1.34 2.9% 397 1.5% 397 1.5% 1.37% 1.37%
frame_dummy 0.05 0.1% 0.05 0.1% 0.0% 0.0% 0.11% 0.11%
main 5.07 11.0% 45.96 99.9% 3666 14.2% 25775 99.8% 3.18% 0.04%
normalizeRoundAndPackFloat64 0.6 1.3% 3.55 7.7% 665 2.6% 2450 9.5% 1.27% 1.78%
packFloat64 1.29 2.8% 1.29 2.8% 195 0.8% 195 0.8% 2.05% 2.05%
propagateFloat64NaN 3.96 8.6% 6.69 14.5% 1587 6.1% 2248 8.7% 2.46% 5.83%
roundAndPackFloat64 3.01 6.5% 3.91 8.5% 2227 8.6% 2384 9.2% 2.08% 0.74%
shift64RightJamming 1.29 2.8% 1.29 2.8% 1500 5.8% 1884 7.3% 3.01% 4.49%
subFloat64Sigs 7.49 16.3% 17.42 37.9% 4324 16.7% 8820 34.2% 0.47% 3.69%
__lshrdi3 0.0% 0.0% 191 0.7% 191 0.7% 0.74% 0.74%
__ashldi3 0.0% 0.0% 318 1.2% 303 1.2% 1.23% 1.17%
(a) gprof (b) LEAP (c) Difference
DFDIV
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
add128 0.26 0.8% 0.26 0.8% 0 0.0% 0 0.0% 0.78% 0.78%
countLeadingZeros64 0.01 0.0% 0.01 0.0% 0 0.0% 0 0.0% 0.03% 0.03%
estimateDiv128To64 8.43 25.4% 12.28 37.1% 2789 3.5% 66972 83.8% 21.95% 46.73%
extractFloat64Exp 1.16 3.5% 1.16 3.5% 287 0.4% 287 0.4% 3.14% 3.14%
extractFloat64Frac 0.98 3.0% 0.98 3.0% 329 0.4% 329 0.4% 2.55% 2.55%
extractFloat64Sign 0.95 2.9% 0.95 2.9% 303 0.4% 303 0.4% 2.49% 2.49%
float_raise 0.2 0.6% 0.2 0.6% 138 0.2% 131 0.2% 0.43% 0.44%
float64_div 6.94 20.9% 30.06 90.7% 6251 7.8% 77933 97.5% 13.12% 6.79%
float64_is_nan 0.17 0.5% 0.17 0.5% 115 0.1% 115 0.1% 0.37% 0.37%
float64_is_signaling_nan 0.73 2.2% 0.73 2.2% 165 0.2% 165 0.2% 2.00% 2.00%
main 2.56 7.7% 32.62 98.4% 1961 2.5% 79909 100.0% 5.27% 1.54%
mul64To128 5.1 15.4% 5.1 15.4% 1574 2.0% 1577 2.0% 13.42% 13.42%
normalizeFloat64Subnormal 0.26 0.8% 0.26 0.8% 0 0.0% 0 0.0% 0.78% 0.78%
packFloat64 1 3.0% 1 3.0% 141 0.2% 141 0.2% 2.84% 2.84%
propagateFloat64NaN 1.34 4.0% 2.23 6.7% 719 0.9% 1005 1.3% 3.14% 5.47%
roundAndPackFloat64 1.74 5.3% 2.45 7.4% 1603 2.0% 1690 2.1% 3.25% 5.28%
shift64RightJamming 0.0% 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
sub128 1.31 4.0% 1.31 4.0% 1110 1.4% 1110 1.4% 2.56% 2.56%
__udivdi3 0.0% 0.0% 4174 5.2% 62435 78.1% 5.22% 78.11%
__udivsi3 0.0% 0.0% 1438 1.8% 29464 36.9% 1.80% 36.86%
__umodsi3 0.0% 0.0% 1368 1.7% 28819 36.1% 1.71% 36.05%
__udivmodsi4 0.0% 0.0% 55470 69.4% 55470 69.4% 69.39% 69.39%
(a) gprof (b) LEAP (c) Difference
DFMUL
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
countLeadingZeros64 0.02 0.1% 0.02 0.1% 0 0.0% 0 0.0% 0.13% 0.13%
extractFloat64Exp 1.17 7.4% 1.17 7.4% 287 2.6% 287 2.6% 4.81% 4.81%
extractFloat64Frac 1.42 9.0% 1.42 9.0% 289 2.6% 289 2.6% 6.38% 6.38%
extractFloat64Sign 0.58 3.7% 0.58 3.7% 279 2.6% 279 2.6% 1.13% 1.13%
float_raise 0.09 0.6% 0.09 0.6% 76 0.7% 76 0.7% 0.12% 0.12%
float64_is_nan 0.31 2.0% 0.31 2.0% 120 1.1% 120 1.1% 0.87% 0.87%
float64_is_signaling_nan 0.33 2.1% 0.33 2.1% 171 1.6% 171 1.6% 0.53% 0.53%
float64_mul 5.34 33.9% 13.66 86.8% 5051 46.2% 9138 83.5% 12.25% 3.25%
main 1.71 10.9% 15.37 97.6% 1815 16.6% 10965 100.2% 5.73% 2.59%
mul64To128 1.86 11.8% 1.86 11.8% 788 7.2% 788 7.2% 4.61% 4.61%
normalizeFloat64Subnormal 0.24 1.5% 0.24 1.5% 0 0.0% 0 0.0% 1.52% 1.52%
packFloat64 0.57 3.6% 0.57 3.6% 129 1.2% 129 1.2% 2.44% 2.44%
propagateFloat64NaN 0.95 6.0% 1.58 10.0% 722 6.6% 1015 9.3% 0.56% 0.76%
roundAndPackFloat64 1.05 6.7% 1.35 8.6% 1212 11.1% 1263 11.5% 4.41% 2.97%
shift64RightJamming 0.1 0.6% 0.1 0.6% 0 0.0% 0 0.0% 0.64% 0.64%
(a) gprof (b) LEAP (c) Difference
DFSIN
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
add128 2.83 0.2% 2.83 0.2% 0 0.0% 0 0.0% 0.17% 0.17%
addFloat64Sigs 33.57 2.0% 78.45 4.6% 14567 0.4% 74173 1.9% 1.61% 2.71%
countLeadingZeros32 43.03 2.5% 43.03 2.5% 52029 1.4% 52036 1.4% 1.19% 1.19%
countLeadingZeros64 16.52 1.0% 33.98 2.0% 27119 0.7% 58368 1.5% 0.27% 0.49%
estimateDiv128To64 257.07 15.2% 345.05 20.4% 77938 2.0% 2985616 77.5% 13.18% 57.15%
extractFloat64Exp 99.13 5.9% 99.13 5.9% 13335 0.3% 13335 0.3% 5.51% 5.51%
extractFloat64Frac 60.68 3.6% 60.68 3.6% 10128 0.3% 10128 0.3% 3.32% 3.32%
extractFloat64Sign 70.2 4.2% 70.2 4.2% 13342 0.3% 13342 0.3% 3.80% 3.80%
float_raise 0.0% 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
float64_abs 6.89 0.4% 6.89 0.4% 12033 0.3% 12059 0.3% 0.09% 0.09%
float64_add 50.8 3.0% 372.37 22.0% 16488 0.4% 353666 9.2% 2.58% 12.83%
float64_div 145.1 8.6% 652.04 38.6% 56192 1.5% 3158964 82.0% 7.12% 43.50%
float64_ge 8.63 0.5% 88.51 5.2% 18409 0.5% 79901 2.1% 0.03% 3.16%
float64_le 50.39 3.0% 79.88 4.7% 55222 1.4% 61634 1.6% 1.55% 3.12%
float64_mul 139.22 8.2% 373.68 22.1% 58336 1.5% 177966 4.6% 6.72% 17.47%
float64_neg 5.17 0.3% 5.17 0.3% 1046 0.0% 1054 0.0% 0.28% 0.28%
frame_dummy 1.22 0.1% 1.22 0.1% 0.0% 0.0% 0.07% 0.07%
int32_to_float64 39.83 2.4% 88.89 5.3% 13962 0.4% 44197 1.1% 1.99% 4.11%
main 5.62 0.3% 1659.96 98.1% 2365 0.1% 3849898 100.0% 0.27% 1.85%
mul64To128 118.72 7.0% 118.72 7.0% 133633 3.5% 133843 3.5% 3.55% 3.54%
normalizeFloat64Subnormal 5.79 0.3% 5.79 0.3% 0 0.0% 0 0.0% 0.34% 0.34%
normalizeRoundAndPackFloat64 14.75 0.9% 105.27 6.2% 51587 1.3% 138573 3.6% 0.47% 2.62%
packFloat64 59.47 3.5% 59.47 3.5% 6687 0.2% 6687 0.2% 3.34% 3.34%
propagateFloat64NaN 21.51 1.3% 21.51 1.3% 0 0.0% 0 0.0% 1.27% 1.27%
roundAndPackFloat64 222.69 13.2% 258.32 15.3% 158701 4.1% 163792 4.3% 9.04% 11.02%
shift64ExtraRightJamming 0.0% 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
shift64RightJamming 14.69 0.9% 14.69 0.9% 94591 2.5% 134740 3.5% 1.59% 2.63%
sin 66.79 3.9% 1654.34 97.8% 19729 0.5% 3847539 99.9% 3.44% 2.12%
sub128 37.86 2.2% 37.86 2.2% 30995 0.8% 30870 0.8% 1.43% 1.44%
subFloat64Sigs 93.15 5.5% 237.61 14.0% 30683 0.8% 259812 6.7% 4.71% 7.30%
__lshrdi3 0.0% 0.0% 20098 0.5% 20005 0.5% 0.52% 0.52%
__ashldi3 0.0% 0.0% 36042 0.9% 36113 0.9% 0.94% 0.94%
__udivdi3 0.0% 0.0% 280963 7.3% 2824428 73.4% 7.30% 73.36%
__udivsi3 0.0% 0.0% 33760 0.9% 1325113 34.4% 0.88% 34.42%
__umodsi3 0.0% 0.0% 38029 1.0% 1218137 31.6% 0.99% 31.64%
__udivmodsi4 0.0% 0.0% 2472067 64.2% 2471591 64.2% 64.21% 64.20%
(a) gprof (b) LEAP (c) Difference
GSM
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
Autocorrelation 76.18 44.3% 118.73 69.0% 28406 56.0% 33783 66.6% 11.68% 2.46%
frame_dummy 0.45 0.3% 0.45 0.3% 0.0% 0.0% 0.26% 0.26%
gsm_abs 23.67 13.8% 23.67 13.8% 2596 5.1% 2596 5.1% 8.65% 8.65%
gsm_add 4.67 2.7% 4.67 2.7% 1744 3.4% 1744 3.4% 0.72% 0.72%
gsm_div 9.72 5.7% 9.72 5.7% 1531 3.0% 1531 3.0% 2.63% 2.63%
Gsm_LPC_Analysis 0.11 0.1% 156.45 91.0% 212 0.4% 44089 86.9% 0.35% 4.09%
gsm_mult 2.96 1.7% 2.96 1.7% 187 0.4% 187 0.4% 1.35% 1.35%
gsm_mult_r 28.96 16.8% 28.96 16.8% 3639 7.2% 3639 7.2% 9.67% 9.67%
gsm_norm 0.52 0.3% 0.52 0.3% 245 0.5% 245 0.5% 0.18% 0.18%
main 15.07 8.8% 171.52 99.7% 6684 13.2% 50746 100.0% 4.41% 0.27%
Quantization_and_coding 1.56 0.9% 5.46 3.2% 1250 2.5% 1771 3.5% 1.56% 0.32%
Reflection_coefficients 6.85 4.0% 29.81 17.3% 3476 6.8% 7608 15.0% 2.87% 2.34%
Transformation_to_Log_Area_Ratios 1.26 0.7% 2.34 1.4% 615 1.2% 715 1.4% 0.48% 0.05%
legup_memset_4 0.0% 0.0% 161 0.3% 161 0.3% 0.32% 0.32%
(a) gprof (b) LEAP (c) Difference
JPEG
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
BoundIDctMatrix 459.66 2.6% 459.66 2.6% 203370 3.3% 203370 3.3% 0.70% 0.70%
buf_getb 1869.29 10.6% 2091.38 11.9% 1437809 23.4% 1479621 24.0% 12.75% 12.17%
buf_getv 769.36 4.4% 914.62 5.2% 641506 10.4% 673810 10.9% 6.06% 5.76%
ChenIDct 2895.91 16.4% 2895.91 16.4% 619944 10.1% 619944 10.1% 6.36% 6.36%
decode_block 42.21 0.2% 13081.3 74.2% 12492 0.2% 4752166 77.2% 0.04% 2.96%
decode_start 89.4 0.5% 16341.6 92.7% 5719 0.1% 5650316 91.8% 0.41% 0.95%
DecodeHuffman 2919.39 16.6% 5010.76 28.4% 748738 12.2% 2228359 36.2% 4.40% 7.77%
DecodeHuffMCU 2352.81 13.4% 8278.19 47.0% 547850 8.9% 3450019 56.1% 4.45% 9.07%
first_marker 0.15 0.0% 0.21 0.0% 343 0.0% 567 0.0% 0.00% 0.01%
frame_dummy 5.66 0.0% 5.66 0.0% 0.0% 0.0% 0.03% 0.03%
get_dht 27.7 0.2% 41.79 0.2% 5260 0.1% 11872 0.2% 0.07% 0.04%
get_dqt 13.83 0.1% 18.37 0.1% 4578 0.1% 6730 0.1% 0.00% 0.01%
get_sof 2.35 0.0% 3.26 0.0% 2248 0.0% 2505 0.0% 0.02% 0.02%
get_sos 1.4 0.0% 1.91 0.0% 1207 0.0% 1371 0.0% 0.01% 0.01%
huff_make_dhuff_tb 115.04 0.7% 115.04 0.7% 15761 0.3% 15761 0.3% 0.40% 0.40%
IQuantize 502.41 2.9% 502.41 2.9% 199028 3.2% 199028 3.2% 0.38% 0.38%
IZigzagMatrix 497.81 2.8% 497.81 2.8% 165262 2.7% 165262 2.7% 0.14% 0.14%
jpeg_init_decompress 0.71 0.0% 115.75 0.7% 1075 0.0% 16836 0.3% 0.01% 0.38%
jpeg_read 0.14 0.0% 16530.2 93.8% 183 0.0% 5692492 92.5% 0.00% 1.33%
jpeg2bmp_main 1081.02 6.1% 17611.2 100.0% 462095 7.5% 6154750 100.0% 1.37% 0.04%
main 1.11 0.0% 17612.4 100.0% 249 0.0% 6155008 100.0% 0.00% 0.04%
next_marker 1.4 0.0% 2.49 0.0% 448 0.0% 947 0.0% 0.00% 0.00%
pgetc 367.35 2.1% 367.35 2.1% 74116 1.2% 74116 1.2% 0.88% 0.88%
PostshiftIDctMatrix 405.1 2.3% 405.1 2.3% 102051 1.7% 102051 1.7% 0.64% 0.64%
read_byte 19.35 0.1% 19.35 0.1% 9580 0.2% 9580 0.2% 0.05% 0.05%
read_dword 0.04 0.0% 0.04 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
read_markers 4.69 0.0% 72.73 0.4% 1165 0.0% 25157 0.4% 0.01% 0.00%
read_word 1.87 0.0% 1.87 0.0% 328 0.0% 328 0.0% 0.01% 0.01%
Write4Blocks 28.7 0.2% 1244.3 7.1% 15925 0.3% 222711 3.6% 0.10% 3.44%
WriteBlock 0.92 0.0% 0.92 0.0% 0 0.0% 0 0.0% 0.01% 0.01%
WriteOneBlock 1215.6 6.9% 1215.6 6.9% 206786 3.4% 206786 3.4% 3.54% 3.54%
YuvToRgb 1926.62 10.9% 1926.62 10.9% 669720 10.9% 669720 10.9% 0.05% 0.05%
(a) gprof (b) LEAP (c) Difference
MIPS
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
main 115.82 100.0% 115.82 100.0% 43417 99.0% 43891 100.1% 1.01% 0.07%
legup_memset_4 0.0% 443 1.0% 435 1.0% 1.01% 0.99%
(a) gprof (b) LEAP (c) Difference
MOTION
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
decode_motion_vector 0.3 0.4% 0.3 0.4% 312 1.2% 312 1.2% 0.84% 0.84%
Fill_Buffer 0.11 0.1% 80 93.8% 406 1.5% 19952 75.9% 1.42% 17.88%
Flush_Buffer 2 2.3% 82.01 96.2% 1921 7.3% 21883 83.3% 4.97% 12.88%
Get_Bits 0.15 0.2% 59 69.2% 365 1.4% 1303 5.0% 1.21% 64.23%
Get_Bits1 0.1 0.1% 35.5 41.6% 109 0.4% 440 1.7% 0.30% 39.96%
Get_dmvector 0.02 0.0% 0.02 0.0% 0 0.0% 0 0.0% 0.02% 0.02%
Get_motion_code 0.28 0.3% 47.55 55.8% 430 1.6% 1076 4.1% 1.31% 51.67%
Initialize_Buffer 0.04 0.0% 11.76 13.8% 263 1.0% 21082 80.2% 0.95% 66.45%
main 1.68 2.0% 85.24 100.0% 1222 4.7% 26273 100.0% 2.68% 0.04%
motion_vector 0.17 0.2% 59.81 70.1% 740 2.8% 2757 10.5% 2.62% 59.65%
motion_vectors 0.2 0.2% 71.81 84.2% 866 3.3% 3970 15.1% 3.06% 69.10%
read 79.89 93.7% 79.89 93.7% 19561 74.5% 19546 74.4% 19.23% 19.29%
Show_Bits 0.33 0.4% 0.33 0.4% 77 0.3% 77 0.3% 0.09% 0.09%
(a) gprof (b) LEAP (c) Difference
SHA
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
frame_dummy 0.01 0.0% 0.01 0.0% 0.0% 0.0% 0.00% 0.00%
main 0.34 0.0% 4262.22 100.0% 362 0.0% 1147523 100.0% 0.02% 0.00%
memcpy 721.09 16.9% 721.09 16.9% 120372 10.5% 120393 10.5% 6.43% 6.43%
memset 1.85 0.0% 1.85 0.0% 248 0.0% 248 0.0% 0.02% 0.02%
sha_final 0.15 0.0% 15.65 0.4% 402 0.0% 4598 0.4% 0.03% 0.03%
sha_init 7.78 0.2% 7.78 0.2% 229 0.0% 229 0.0% 0.16% 0.16%
sha_stream 0.26 0.0% 4261.88 100.0% 255 0.0% 1147163 100.0% 0.02% 0.02%
sha_transform 3508.55 82.3% 3508.55 82.3% 1020909 89.0% 1020907 89.0% 6.65% 6.65%
sha_update 22.2 0.5% 4238.19 99.4% 4729 0.4% 1142081 99.5% 0.11% 0.09%
(a) gprof (b) LEAP (c) Difference
DHRYSTONE
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
frame_dummy 0.14 0.1% 0.14 0.1% 0.0% 0.0% 0.14% 0.14%
Func_1 5.14 5.0% 5.14 5.0% 535 1.7% 535 1.7% 3.38% 3.38%
Func_2 2.18 2.1% 30.26 29.6% 1530 4.7% 9240 28.6% 2.60% 1.04%
Func_3 1.14 1.1% 1.14 1.1% 136 0.4% 136 0.4% 0.70% 0.70%
main 8.57 8.4% 102.02 99.9% 5493 17.0% 32262 99.8% 8.60% 0.08%
Proc_1 6 5.9% 12.15 11.9% 2377 7.4% 7093 21.9% 1.48% 10.05%
Proc_2 0.95 0.9% 0.95 0.9% 477 1.5% 477 1.5% 0.55% 0.55%
Proc_3 1.01 1.0% 1.95 1.9% 858 2.7% 978 3.0% 1.67% 1.12%
Proc_4 1.99 1.9% 1.99 1.9% 468 1.4% 468 1.4% 0.50% 0.50%
Proc_5 0.31 0.3% 0.31 0.3% 208 0.6% 208 0.6% 0.34% 0.34%
Proc_6 2.13 2.1% 3.27 3.2% 1234 3.8% 1379 4.3% 1.73% 1.06%
Proc_7 2.81 2.8% 2.81 2.8% 488 1.5% 460 1.4% 1.24% 1.33%
Proc_8 6.86 6.7% 6.86 6.7% 1984 6.1% 1984 6.1% 0.58% 0.58%
strcmp 30.32 29.7% 30.32 29.7% 9103 28.2% 9095 28.1% 1.52% 1.55%
strcpy 32.6 31.9% 32.6 31.9% 5176 16.0% 5176 16.0% 15.90% 15.90%
legup_memcpy_4 0.0% 0.0% 2261 7.0% 2261 7.0% 6.99% 6.99%
(a) gprof (b) LEAP (c) Difference
A.1.2 Full Results for LEAP vs. SnoopP
ADPCM
Function LEAP SnoopP Error LEAP SnoopP Error
main 5122 5122 0 1814 1829 -15
encode 62193 62204 -11 4245 4271 -26
decode 63863 63863 0 3656 3652 4
upzero 32614 32620 -6 948 942 6
filtez 14708 14708 0 695 691 4
uppol2 11209 11209 0 230 233 -3
quantl 14177 14175 2 737 723 14
filtep 3655 3655 0 56 55 1
uppol1 8097 8097 0 127 126 1
scalel 2902 2902 0 288 292 -4
logscl 3425 3425 0 234 234 0
logsch 3100 3112 -12 134 133 1
reset 2755 2768 -13 2332 2341 -9
abs 1155 1155 0 56 55 1
adpcm_main 4588 4588 0 1641 1626 15
Total 233563 233603 -40 17193 17203 -10
(a) Cycles (b) Stall Cycles
AES
Function LEAP SnoopP Error LEAP SnoopP Error
main 219 219 0 203 202 1
decrypt 1514 1500 14 955 941 14
AddRoundKey_InversMixColumn 26069 26058 11 1531 1508 23
encrypt 1916 1925 -9 1281 1294 -13
MixColumn_AddRoundKey 12544 12521 23 6420 6460 -40
KeySchedule 14079 14065 14 5405 5384 21
ByteSub_ShiftRow 8942 8970 -28 6593 6547 46
InversShiftRow_ByteSub 5010 5020 -10 2625 2634 -9
SubByte 1271 1271 0 78 71 7
AddRoundKey 1248 1228 20 467 481 -14
aes_main 1032 1035 -3 919 918 1
Total 73844 73812 32 26477 26440 37
(a) Cycles (b) Stall Cycles
BLOWFISH
Function LEAP SnoopP Error LEAP SnoopP Error
main 142 142 0 134 133 1
BF_cfb64_encrypt 166292 166354 -62 21059 21059 0
BF_encrypt 674439 674394 45 42075 42034 41
BF_set_key 18402 18397 5 3631 3628 3
memcpy 15085 15091 -6 6175 6172 3
blowfish_main 137899 137911 -12 22657 22656 1
Total 1012259 1012289 -30 95731 95682 49
(a) Cycles (b) Stall Cycles
DFADD
Function LEAP SnoopP Error LEAP SnoopP Error
main 3666 3649 17 1964 1986 -22
float64_add 3122 3118 4 369 368 1
subFloat64Sigs 4324 4335 -11 1327 1291 36
addFloat64Sigs 4714 4729 -15 1346 1356 -10
propagateFloat64NaN 1587 1594 -7 572 555 17
roundAndPackFloat64 2227 2233 -6 523 519 4
normalizeRoundAndPackFloat64 665 665 0 248 248 0
extractFloat64Exp 575 575 0 23 23 0
extractFloat64Sign 591 591 0 39 39 0
extractFloat64Frac 591 591 0 41 39 2
float64_is_signaling_nan 397 397 0 87 87 0
float64_is_nan 262 262 0 64 64 0
shift64RightJamming 1500 1485 15 390 403 -13
countLeadingZeros64 453 453 0 215 215 0
packFloat64 195 195 0 41 39 2
countLeadingZeros32 388 388 0 172 187 -15
float_raise 53 53 0 39 39 0
__lshrdi3 191 191 0 71 71 0
__ashldi3 318 303 15 104 112 -8
Total 25819 25807 12 7635 7641 -6
(a) Cycles (b) Stall Cycles
DFDIV
Function LEAP SnoopP Error LEAP SnoopP Error
main 1961 1981 -20 1136 1139 -3
float64_div 6251 6210 41 2050 2021 29
estimateDiv128To64 2789 2804 -15 953 952 1
mul64To128 1574 1582 -8 284 274 10
roundAndPackFloat64 1603 1603 0 443 449 -6
sub128 1110 1132 -22 239 238 1
propagateFloat64NaN 719 713 6 467 451 16
extractFloat64Sign 303 303 0 39 39 0
extractFloat64Exp 287 287 0 24 23 1
packFloat64 141 141 0 39 39 0
extractFloat64Frac 329 329 0 65 65 0
float64_is_signaling_nan 165 165 0 87 87 0
float64_is_nan 115 115 0 65 64 1
add128 0 0 0 0 0 0
normalizeFloat64Subnormal 0 0 0 0 0 0
float_raise 138 131 7 111 110 1
countLeadingZeros64 0 0 0 0 0 0
shift64RightJamming 0 0 0 0 0 0
countLeadingZeros32 0 0 0 0 0 0
__lshrdi3 0 0 0 0 0 0
__ashldi3 0 0 0 0 0 0
__udivdi3 4174 4190 -16 917 915 2
__udivsi3 1438 1438 0 187 171 16
__umodsi3 1368 1365 3 101 101 0
__udivmodsi4 55470 55453 17 595 600 -5
Total 79935 79942 -7 7802 7738 64
(a) Cycles (b) Stall Cycles
DFMUL
Function LEAP SnoopP Error LEAP SnoopP Error
main 1815 1827 -12 1074 1082 -8
float64_mul 5051 5077 -26 1855 1830 25
mul64To128 788 788 0 273 288 -15
propagateFloat64NaN 722 724 -2 488 488 0
roundAndPackFloat64 1212 1227 -15 479 465 14
extractFloat64Frac 289 289 0 49 49 0
extractFloat64Exp 287 287 0 47 47 0
extractFloat64Sign 279 279 0 40 39 1
packFloat64 129 129 0 39 39 0
float64_is_nan 120 120 0 71 71 0
float64_is_signaling_nan 171 171 0 87 87 0
normalizeFloat64Subnormal 0 0 0 0 0 0
float_raise 76 76 0 62 62 0
shift64RightJamming 0 0 0 0 0 0
countLeadingZeros64 0 0 0 0 0 0
countLeadingZeros32 0 0 0 0 0 0
__lshrdi3 0 0 0 0 0 0
__ashldi3 0 0 0 0 0 0
Total 10939 10994 -55 4564 4547 17
(a) Cycles (b) Stall Cycles
DFSIN
Function LEAP SnoopP Error LEAP SnoopP Error
main 2365 2379 -14 1291 1291 0
sin 19729 19718 11 1578 1578 0
float64_div 56192 56189 3 1557 1557 0
float64_add 16488 16482 6 304 322 -18
float64_mul 58336 58353 -17 1199 1201 -2
estimateDiv128To64 77938 77970 -32 35321 35321 0
subFloat64Sigs 30683 30696 -13 1176 1176 0
roundAndPackFloat64 158701 158717 -16 73201 73201 0
mul64To128 133633 133652 -19 94579 94579 0
extractFloat64Exp 13335 13335 0 39 39 0
int32_to_float64 13962 13950 12 225 225 0
normalizeRoundAndPackFloat64 51587 51582 5 41515 41515 0
float64_ge 18409 18405 4 14389 14389 0
extractFloat64Sign 13342 13342 0 46 46 0
float64_le 55222 55241 -19 31891 31891 0
addFloat64Sigs 14567 14569 -2 969 969 0
extractFloat64Frac 10128 10128 0 48 48 0
countLeadingZeros32 52029 52053 -24 38452 38452 0
packFloat64 6687 6687 0 39 39 0
countLeadingZeros64 27119 27105 14 22727 22727 0
sub128 30995 31001 -6 16755 16755 0
shift64RightJamming 94591 94609 -18 78640 78640 0
propagateFloat64NaN 0 0 0 0 0 0
float64_neg 1046 1046 0 830 830 0
float64_abs 12033 12041 -8 10425 10425 0
normalizeFloat64Subnormal 0 0 0 0 0 0
add128 0 0 0 0 0 0
shift64ExtraRightJamming 0 0 0 0 0 0
float_raise 0 0 0 0 0 0
float64_is_nan 0 0 0 0 0 0
float64_is_signaling_nan 0 0 0 0 0 0
__lshrdi3 20098 20114 -16 16633 16633 0
__ashldi3 36042 36016 26 26879 26879 0
__udivdi3 280963 280979 -16 204026 204026 0
__udivsi3 33760 33753 7 204 204 0
__umodsi3 38029 38027 2 6257 6257 0
__udivmodsi4 2472067 2472113 -46 122484 122484 0
Total 3850076 3850252 -176 843679 843699 -20
(a) Cycles (b) Stall Cycles
GSM
Function LEAP SnoopP Error LEAP SnoopP Error
main 6684 6668 16 1327 1322 5
Gsm_LPC_Analysis 212 212 0 198 182 16
Autocorrelation 28406 28416 -10 3415 3397 18
gsm_mult_r 3639 3639 0 71 71 0
Reflection_coefficients 3476 3474 2 1845 1835 10
gsm_abs 2596 2596 0 110 106 4
gsm_div 1531 1534 -3 119 132 -13
Quantization_and_coding 1250 1240 10 1017 1015 2
gsm_add 1744 1744 0 97 87 10
gsm_mult 187 187 0 75 71 4
Transformation_to_Log_Area_Ratios 615 615 0 364 359 5
gsm_norm 245 245 0 198 204 -6
legup_memset_2 0 0 0 0 0 0
legup_memset_4 161 176 -15 105 105 0
Total 50746 50746 0 8941 8886 55
(a) Cycles (b) Stall Cycles
JPEG
Function LEAP SnoopP Error LEAP SnoopP Error
buf_getb 1437809 1437809 0 527188 527188 0
DecodeHuffman 748738 748738 0 22722 22722 0
buf_getv 641506 641506 0 67232 67232 0
pgetc 74116 74116 0 10064 10064 0
read_byte 9580 9580 0 1117 1119 -2
WriteOneBlock 206786 206786 0 29098 29098 0
DecodeHuffMCU 547850 547850 0 109599 109598 1
decode_block 12492 12492 0 482 482 0
BoundIDctMatrix 203370 203370 0 126 126 0
ChenIDct 619944 619944 0 7203 7203 0
IQuantize 199028 199028 0 23484 23484 0
IZigzagMatrix 165262 165262 0 8158 8158 0
PostshiftIDctMatrix 102051 102051 0 62 62 0
YuvToRgb 669720 669720 0 2347 2347 0
Write4Blocks 15925 15925 0 1951 1951 0
read_word 328 328 0 135 120 15
next_marker 448 448 0 113 113 0
get_dht 5260 5260 0 993 993 0
huff_make_dhuff_tb 15761 15761 0 2794 2816 -22
get_dqt 4578 4578 0 1073 1079 -6
decode_start 5719 5719 0 1514 1514 0
first_marker 343 343 0 315 315 0
get_sof 2248 2248 0 1801 1800 1
get_sos 1207 1207 0 832 832 0
jpeg_init_decompress 1075 1075 0 932 932 0
jpeg_read 183 183 0 156 156 0
main 249 249 0 232 247 -15
read_markers 1165 1165 0 788 783 5
read_dword 0 0 0 0 0 0
WriteBlock 0 0 0 0 0 0
jpeg2bmp_main 462095 462098 -3 54716 54819 -103
Total 6,154,836 6,154,839 -3 877,227 877,353 -126
(a) Cycles (b) Stall Cycles
MIPS
Function LEAP SnoopP Error LEAP SnoopP Error
main 43404 43410 -6 2918 2945 -27
legup_memset_4 443 443 0 97 97 0
Total 43847 43853 -6 3015 3042 -27
(a) Cycles (b) Stall Cycles
MOTION
Function LEAP SnoopP Error LEAP SnoopP Error
main 1222 1227 -5 1021 1024 -3
Flush_Buffer 1921 1933 -12 911 898 13
Fill_Buffer 406 404 2 364 366 -2
read 19561 19561 0 3140 3131 9
motion_vectors 866 866 0 748 748 0
motion_vector 740 740 0 650 635 15
Get_Bits 365 362 3 222 222 0
Get_motion_code 430 436 -6 346 346 0
Get_Bits1 109 109 0 76 76 0
Initialize_Buffer 263 254 9 228 227 1
decode_motion_vector 312 312 0 248 264 -16
Show_Bits 77 77 0 29 44 -15
Get_dmvector 0 0 0 0 0 0
Total 26272 26281 -9 7983 7981 2
(a) Cycles (b) Stall Cycles
SHA
Function LEAP SnoopP Error LEAP SnoopP Error
main 362 362 0 330 291 39
sha_stream 255 255 0 223 223 0
sha_update 4729 4729 0 512 512 0
sha_transform 1020909 1020956 -47 3256 3244 12
memcpy 120372 120387 -15 23859 23839 20
sha_final 402 402 0 317 319 -2
memset 248 248 0 144 158 -14
sha_init 229 229 0 206 189 17
Total 1147506 1147568 -62 28847 28775 72
(a) Cycles (b) Stall Cycles
DHRYSTONE
Function LEAP SnoopP Error LEAP SnoopP Error
main 5493 5519 -26 2074 2112 -38
strcpy 5176 5177 -1 249 263 -14
strcmp 9103 9107 -4 267 252 15
Func_2 1530 1530 0 522 505 17
Proc_1 2377 2364 13 612 603 9
Proc_8 1984 1993 -9 400 399 1
Func_1 535 535 0 56 54 2
Proc_7 488 451 37 87 72 15
Proc_6 1234 1234 0 276 276 0
Proc_3 858 858 0 241 250 -9
Proc_4 468 468 0 108 108 0
Proc_2 477 477 0 79 79 0
Func_3 136 136 0 17 16 1
Proc_5 208 208 0 49 48 1
legup_memcpy_4 2261 2261 0 136 135 1
Total 32328 32318 10 5173 5172 1
(a) Cycles (b) Stall Cycles
A.1.3 Full Results for Power Overhead
Table A.1: Power overhead results for LEAP, measured in milliwatts, for 16, 32, and 64 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 16 32 64
CW 16 32 64 16 32 64 16 32 64
Scheme C S C S C S C S C S C S C S C S C S
adpcm 4.48 4.72 6.85 5.55 6.43 6.82 4.89 4.69 5.76 6.16 6.15 6.25 5.23 5.01 6.06 6.02 6.17 6.62
aes 4.31 4.61 6.67 5.41 6.29 6.69 4.74 4.60 5.57 6.05 5.94 6.17 5.02 4.87 5.88 5.90 5.97 6.46
blowfish 4.81 5.00 7.18 5.88 6.69 7.10 5.16 4.94 6.02 6.39 6.41 6.50 5.47 5.23 6.33 6.27 6.38 6.83
dfadd 4.55 4.84 6.98 5.66 6.55 6.95 5.00 4.82 5.86 6.30 6.25 6.39 5.29 5.13 6.15 6.18 6.25 6.75
dfdiv 4.52 4.76 6.92 5.59 6.48 6.85 4.95 4.73 5.85 6.21 6.22 6.29 5.31 5.06 6.11 6.06 6.24 6.74
dfmul 4.40 4.70 6.80 5.51 6.41 6.81 4.85 4.71 5.70 6.17 6.09 6.27 5.12 4.99 6.00 6.04 6.09 6.60
dfsin 4.39 4.65 6.78 5.47 6.36 6.74 4.83 4.64 5.71 6.11 6.07 6.20 5.16 4.97 5.98 5.96 6.12 6.62
gsm 4.44 4.74 6.81 5.54 6.45 6.85 4.89 4.72 5.72 6.19 6.08 6.30 5.14 5.03 6.03 6.06 6.12 6.58
jpeg 4.58 4.85 6.99 5.53 6.55 6.95 5.00 4.78 5.95 6.27 6.25 6.37 5.32 5.11 6.13 6.14 6.26 6.72
mips 4.66 5.01 7.03 5.73 6.62 7.02 5.08 4.86 5.95 6.35 6.30 6.44 5.42 5.20 6.26 6.20 6.35 6.78
motion 4.59 4.91 6.97 5.71 6.58 7.01 5.02 4.87 5.89 6.31 6.23 6.43 5.31 5.21 6.18 6.19 6.30 6.78
sha 4.79 5.20 7.16 5.86 6.70 7.08 5.15 4.94 6.11 6.39 6.40 6.49 5.52 5.27 6.34 6.28 6.44 6.87
dhrystone 4.63 4.91 7.03 5.72 6.62 6.03 5.08 4.87 5.93 6.36 6.27 6.46 5.34 5.20 6.20 6.25 6.32 6.78
average 4.55 4.84 6.94 5.63 6.52 6.84 4.97 4.78 5.85 6.25 6.20 6.35 5.28 5.10 6.13 6.12 6.23 6.70
Table A.2: Power overhead results for LEAP, measured in milliwatts, for 128 and 256 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 128 256
CW 16 32 64 16 32 64
Scheme C S C S C S C S C S C S
dfadd 5.04 4.85 6.72 6.60 6.30 6.16 5.04 5.21 5.89 5.93 6.16 7.09
dfmul 4.90 4.72 6.55 6.36 6.16 6.02 4.86 5.06 5.75 5.83 6.02 6.91
motion 5.03 4.88 6.73 6.64 6.27 6.16 5.06 5.22 5.98 5.00 6.11 7.15
average 4.99 4.82 6.67 6.53 6.24 6.11 4.99 5.16 5.87 5.59 6.10 7.05
Table A.3: Power overhead results for SnoopP, measured in milliwatts, for 16, 32, and 64 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 16 32 64
CW 16 32 64 16 32 64 16 32 64
Scheme C S C S C S C S C S C S C S C S C S
adpcm 2.90 2.61 3.94 4.20 5.99 3.86 6.46 5.89 6.97 5.71 9.38 5.47 9.37 9.00 11.54 9.01 16.46 10.93
aes 2.75 2.58 3.80 4.26 5.83 4.13 5.92 5.56 6.66 5.85 8.84 6.03 8.83 7.88 10.99 9.30 15.38 11.32
blowfish 2.98 2.71 4.01 4.30 6.04 3.96 6.21 5.69 7.13 5.89 9.20 5.72 9.77 8.43 11.91 9.43 16.12 10.55
dfadd 2.95 2.75 3.98 4.44 6.03 4.27 6.48 6.07 7.07 6.16 9.37 6.25 9.53 8.53 11.72 9.86 16.41 11.68
dfdiv 3.02 2.79 4.04 4.40 6.09 4.07 6.37 6.26 7.28 6.06 9.72 5.84 9.97 8.66 12.11 9.70 17.13 10.91
dfmul 2.85 2.69 3.89 4.40 5.92 4.30 5.96 5.94 6.86 6.11 8.87 6.35 9.11 8.27 11.31 9.75 16.02 11.88
dfsin 2.91 2.74 3.94 4.39 5.99 4.15 6.15 6.11 7.05 6.07 9.05 6.04 9.61 8.49 11.74 9.71 15.83 11.34
gsm 2.93 2.69 3.98 4.32 6.02 4.05 6.13 5.87 7.02 5.92 9.28 5.82 9.50 8.29 11.66 9.43 16.26 10.89
jpeg 2.95 2.74 3.97 4.37 6.02 4.11 6.26 5.75 7.16 6.07 9.18 5.97 9.72 8.53 11.88 9.67 15.98 11.05
mips 3.30 3.00 4.29 4.61 6.34 4.27 6.80 6.75 7.70 6.46 10.20 6.19 10.86 9.52 13.02 10.51 18.22 10.59
motion 3.08 2.90 4.11 4.59 6.15 4.39 6.41 6.53 7.29 6.38 9.80 6.44 10.00 9.02 12.18 10.33 17.31 12.09
sha 3.05 2.76 4.07 4.33 6.11 3.96 6.31 5.73 7.26 5.93 9.25 5.62 10.01 8.54 12.10 9.47 16.25 10.49
dhrystone 3.33 3.09 4.33 4.73 6.40 4.43 6.87 6.62 7.76 6.63 10.00 6.46 10.87 9.68 13.02 10.76 17.71 12.12
average 3.00 2.77 4.03 4.41 6.07 4.15 6.33 6.06 7.17 6.10 9.40 6.02 9.78 8.68 11.94 9.76 16.54 11.22
Table A.4: Power overhead results for SnoopP, measured in milliwatts, for 128 and 256 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 128 256
CW 16 32 64 16 32 64
Scheme C S C S C S C S C S C S
dfadd 16.71 14.80 20.51 16.76 27.98 20.64 28.63 25.04 35.53 28.56 52.62 40.08
dfmul 15.87 14.24 19.64 16.51 27.12 21.01 26.63 23.73 33.36 27.78 50.09 40.09
motion 17.72 15.70 21.47 17.71 28.92 21.57 30.40 26.80 37.27 30.31 54.25 41.91
average 16.77 14.91 20.54 16.99 28.01 21.07 28.55 25.19 35.39 28.88 52.32 40.69
A.2 Full Results for Chapter 4: Energy Consumption Profiling
A.2.1 Full Results for Cache Stall Energy Estimates
ADPCM
Function | A | B | C | D | E | F | Cache Stalls | Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
abs 0 100 0 900 100 0 55 2,960.50
reset 0 0 382 48 0 0 2323 4,035.83
filtez 1200 3992 0 8800 0 0 715 35,573.37
filtep 400 0 0 2200 0 1000 55 9,986.15
quantl 993 1664 3126 6632 471 521 730 34,992.36
logscl 200 300 700 1891 100 0 234 8,382.52
scalel 0 200 600 1200 600 0 287 7,190.45
upzero 2496 6722 6839 14375 1200 0 946 79,712.37
uppol2 2792 1000 200 6198 586 200 233 27,070.45
uppol1 600 600 400 5371 1000 0 126 20,692.39
logsch 200 300 200 1668 100 496 133 8,083.42
decode 3746 10328 20271 22541 2050 1200 3682 155,220.31
encode 3992 9227 13358 29370 700 1300 4241 151,221.70
adpcm_main 0 356 958 1467 0 200 1615 9,799.36
main 0 901 302 1656 300 150 1806 10,846.07
Total 16,619 35,692 47,337 121,586 7,207 5,067 17,266 565,891
AES
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
encrypt 1 128 127 359 0 16 1279 3,242.21
ByteSub ShiftRow 0 440 780 880 0 320 6521 14,651.42
SubByte 0 80 240 560 160 160 86 3,359.29
KeySchedule 204 2861 1740 3668 2 190 5370 28,441.28
InversShiftRow ByteSub 0 451 715 892 0 320 2640 9,602.37
MixColumn AddRoundKey 43 2413 1396 1733 198 333 6441 23,544.89
AddRoundKey InversMixColumn 52 8356 5429 7257 1026 2421 1513 64,871.36
AddRoundKey 8 326 118 238 36 36 465 2,499.79
decrypt 1 119 108 327 0 16 981 2,710.19
aes main 0 5 64 51 0 0 915 1,470.95
main 0 3 5 9 0 0 202 300.42
Total 309 15,184 10,723 15,977 1,422 3,812 26,498 154,817
BLOWFISH
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
memcpy 2 2086 1042 5781 0 0 6173 30,508.65
BF encrypt 0 158114 196728 221319 0 56208 42034 1,677,869.71
BF cfb64 encrypt 0 24452 20486 60749 18868 20819 21194 417,796.34
BF set key 0 1822 2656 9479 147 593 3686 42,913.71
blowfish main 130 21455 796 71654 15860 5236 22889 332,437.50
main 0 1 2 6 0 0 133 192.55
Total 132 207,932 221,711 368,991 34,875 82,856 96,194 2,501,842
DFADD
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
shift64RightJamming 0 224 283 585 8 0 388 3,279.09
countLeadingZeros32 0 56 64 96 0 0 172 760.15
countLeadingZeros64 0 70 32 136 0 0 215 873.12
float raise 0 4 2 8 0 0 39 84.96
float64 is nan 0 105 24 69 0 0 64 564.17
float64 is signaling nan 0 56 132 122 0 0 94 898.47
propagateFloat64NaN 0 220 312 506 0 0 555 3,323.38
extractFloat64Frac 0 0 276 276 0 0 39 1,464.67
extractFloat64Exp 0 0 184 368 0 0 23 1,458.57
extractFloat64Sign 0 0 92 460 0 0 39 1,493.23
packFloat64 0 0 0 104 0 52 39 498.53
roundAndPackFloat64 0 414 522 750 0 18 534 4,968.14
normalizeRoundAndPackFloat64 0 48 116 245 8 0 263 1,403.25
addFloat64Sigs 24 610 1120 1504 28 98 1342 10,317.65
subFloat64Sigs 0 582 1028 1305 42 88 1294 9,398.98
float64 add 0 322 1065 1390 0 0 377 7,536.89
main 0 556 188 758 138 46 1992 6,828.17
lshrdi3 0 24 0 56 16 16 71 392.45
ashldi3 0 32 0 104 28 12 103 599.04
Total 24 3,325 5,441 8,845 268 330 7,728 56,266
DFDIV
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
sub128 0 240 118 394 120 0 238 2,528.47
mul64To128 160 180 120 861 0 0 268 3,628.84
estimateDiv128To64 0 420 488 914 0 0 987 5,847.33
float raise 0 6 3 12 0 0 110 193.05
float64 is nan 0 27 6 18 0 0 64 205.89
float64 is signaling nan 0 14 33 31 0 0 87 306.84
propagateFloat64NaN 0 55 84 120 0 0 451 1,226.40
extractFloat64Frac 0 0 132 132 0 0 65 759.54
extractFloat64Exp 0 0 88 176 0 0 23 712.86
extractFloat64Sign 0 0 44 220 0 0 39 740.08
packFloat64 0 0 0 68 0 34 39 343.16
roundAndPackFloat64 0 280 337 519 0 12 458 3,474.05
float64 div 0 761 1366 1989 14 88 2025 13,307.71
main 0 268 92 374 66 22 1185 3,601.89
udivdi3 48 786 1119 1116 72 96 915 9,313.76
udivsi3 0 603 428 240 0 0 171 3,298.16
umodsi3 0 585 439 240 0 0 102 3,196.33
udivmodsi4 0 20749 9751 21145 280 2848 621 138,782.39
Total 208 24,976 14,649 28,572 552 3,100 7,933 191,590
DFMUL
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
mul64To128 64 72 46 333 0 0 273 1,627.69
float raise 0 4 2 8 0 0 62 114.26
float64 is nan 0 26 6 17 0 0 71 209.88
float64 is signaling nan 0 16 36 32 0 0 87 321.52
propagateFloat64NaN 0 56 79 116 0 0 476 1,237.55
extractFloat64Frac 0 0 120 120 0 0 49 677.64
extractFloat64Exp 0 0 80 160 0 0 47 681.30
extractFloat64Sign 0 0 40 200 0 0 42 681.14
packFloat64 0 0 0 60 0 30 39 308.63
roundAndPackFloat64 0 184 217 330 0 8 474 2,463.48
float64 mul 0 575 1057 1540 4 32 1830 10,462.91
main 0 244 84 342 60 20 1084 3,290.03
Total 64 1,179 1,768 3,261 64 90 4,619 22,199
DFSIN
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
shift64RightJamming 0 2772 3234 9722 231 0 78701 140,991.99
sub128 0 3513 1752 7197 1752 0 16656 57,616.95
mul64To128 4752 5648 2496 26193 0 0 94754 218,031.13
estimateDiv128To64 0 9345 12636 20595 0 0 35302 152,202.19
countLeadingZeros32 0 3157 3824 6112 0 484 38459 83,503.61
countLeadingZeros64 0 1281 732 2379 0 0 22721 39,986.11
extractFloat64Frac 0 0 5040 5040 0 0 48 25,899.99
extractFloat64Exp 0 0 4432 8864 0 0 39 34,476.29
extractFloat64Sign 0 0 2216 11080 0 0 46 34,829.15
packFloat64 0 0 0 4432 0 2216 39 19,177.23
roundAndPackFloat64 0 19956 27847 36131 0 836 73170 306,416.55
normalizeRoundAndPackFloat64 0 1098 2013 6798 183 0 41501 78,867.23
int32 to float64 0 3553 2678 6988 536 0 225 35,068.72
addFloat64Sigs 84 2491 4500 6036 89 336 972 35,613.92
subFloat64Sigs 0 6,364 9,951 11,877 603 736 1,191 76,391.01
float64 add 0 1876 6128 8169 0 0 292 41,480.29
float64 mul 0 11795 18012 26330 0 1208 1194 146,921.82
float64 div 0 10381 17804 24297 267 1890 1547 141,286.79
float64 le 0 5360 5941 12079 0 0 31818 99,493.93
float64 ge 0 268 536 3216 0 0 14247 28,590.92
float64 neg 0 36 72 108 0 0 838 1,614.35
float64 abs 0 0 804 804 0 0 10451 17,436.48
sin 338 1736 4520 11560 0 0 1578 48,395.20
main 0 364 114 488 72 36 1285 4,369.11
lshrdi3 0 693 0 1617 462 462 16540 29,792.17
ashldi3 0 1405 0 4632 1293 611 26950 55,510.91
udivdi3 1068 21365 22975 27813 1602 2136 204219 453,453.01
udivsi3 0 18476 9709 5339 0 0 204 80,856.23
umodsi3 0 16697 9652 5340 0 0 6242 84,330.70
udivmodsi4 0 885250 420277 899521 16786 127290 122467 6,080,143.23
Total 6,242 1,034,882 599,896 1,200,760 23,876 138,241 843,781 8,652,870
GSM
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
gsm add 79 158 158 1104 79 79 87 4,420.06
gsm mult 8 8 16 76 0 8 71 391.21
gsm mult r 223 223 446 1561 0 1115 71 9,989.43
gsm abs 0 446 356 1504 90 90 110 6,577.01
gsm norm 0 10 12 24 2 0 197 372.80
gsm div 0 323 154 612 105 218 119 3,919.96
Autocorrelation 1405 812 6980 11279 1199 3175 3434 70,074.92
Reflection coefficients 39 132 235 836 118 263 1855 6,748.06
Transformation to Log Area Ratios 8 29 21 143 27 24 363 1,131.08
Quantization and coding 16 18 26 123 16 18 1023 1,870.07
Gsm LPC Analysis 0 3 7 20 0 0 182 308.97
main 0 657 322 2754 640 968 1351 16,362.96
legup memset 4 0 12 10 32 0 2 105 277.35
Total 1,778 2,833 8,744 20,071 2,276 5,960 9,053 122,567
JPEG
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
read byte 0 603 1206 6030 631 0 1117 23,522.31
read word 0 20 30 80 25 38 120 682.94
first marker 0 5 5 18 0 0 315 472.74
next marker 0 52 43 240 0 0 122 1,015.38
get sof 1 56 106 218 25 41 1790 3,458.37
get sos 1 50 93 200 16 15 828 2,026.50
get dht 16 516 1008 2644 8 76 987 12,236.50
get dqt 2 546 694 1873 2 388 1087 10,617.50
read markers 0 64 62 240 0 11 774 1,957.61
ChenIDct 129011 48960 44057 266226 112868 11520 7203 1,537,415.41
IZigzagMatrix 0 27648 36864 74160 0 18441 8149 423,033.82
IQuantize 55292 27652 27648 55728 0 9216 23511 444,672.79
PostshiftIDctMatrix 9216 18432 27648 37454 0 9216 62 258,277.35
BoundIDctMatrix 9216 36790 18447 129486 0 9305 122 520,926.38
WriteOneBlock 288 36864 1152 120178 1152 18054 29078 506,111.68
Write4Blocks 0 2146 4462 6718 0 720 1943 38,641.07
YuvToRgb 24672 60613 72995 447561 55292 6144 2344 1,729,816.63
pgetc 0 9148 18296 32026 4582 0 10079 176,900.12
buf getb 0 293289 194476 402265 0 20756 526837 2,958,965.09
buf getv 1,995 87,890 204,327 241,263 26,190 12,600 67,165 1,552,590.86
huff make dhuff tb 303 1971 1206 8412 0 1075 2804 37,478.92
DecodeHuffman 0 133512 175044 379172 5750 32256 22702 1,895,934.53
DecodeHuffMCU 15880 61639 99121 229894 144 31615 109551 1,270,764.75
decode block 0 864 4592 6199 144 144 482 31,272.85
decode start 0 555 917 2731 0 3 1504 12,690.62
jpeg init decompress 1 11 58 67 4 2 946 1,571.51
jpeg read 0 2 9 16 0 0 156 267.96
jpeg2bmp main 0 68936 42285 211695 79341 5207 54686 1,134,902.81
main 0 3 5 9 0 0 238 346.29
Total 245,894 918,839 976,857 2,662,806 286,174 186,843 876,787 14,588,695
MIPS
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
main 0 6316 7672 24387 172 1889 2947 108,528.60
legup memset 4 0 68 66 202 0 2 97 983.73
Total 0 6,386 7,739 24,592 172 1,891 3,129 109,636
MOTION
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
read 1 2049 0 10253 2048 2048 3147 48,468.90
Fill Buffer 0 5 20 17 0 0 364 569.81
Show Bits 0 6 12 24 6 0 29 160.97
Flush Buffer 30 149 277 490 57 0 895 3,683.25
Get Bits 0 15 42 78 0 5 222 644.37
Get Bits1 0 3 6 24 0 0 76 182.00
Get motion code 2 13 14 50 3 2 346 656.47
decode motion vector 2 12 7 28 6 0 252 460.75
motion vector 0 20 31 52 1 1 635 1,075.43
motion vectors 0 22 44 41 0 2 776 1,263.42
Initialize Buffer 0 1 13 13 0 0 232 364.51
main 0 53 39 96 8 4 1021 1,808.89
Total 35 2,350 506 11,169 2,129 2,062 8,080 59,462.06
SHA
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 94 134.75
memset 0 16 16 68 3 1 144 451.38
memcpy 258 16898 12798 24584 25344 16384 23892 293,333.64
sha transform 0 255210 135389 539703 20560 61680 3243 2,616,217.93
sha init 0 0 19 5 0 0 205 321.60
sha update 0 274 1061 2880 0 2 512 11,530.65
sha final 0 10 29 29 1 1 317 581.65
sha stream 0 3 11 18 0 0 223 365.86
main 0 17 14 17 0 0 312 516.16
Total 258 272,430 149,338 567,307 45,908 78,068 28,942 2,923,454
DHRYSTONE
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
strcpy 0 682 0 2882 682 682 260 13,729.38
strcmp 0 1519 23 6189 1069 40 248 23,373.74
Func 3 0 40 0 80 0 0 16 323.37
Proc 6 0 220 140 578 0 20 276 2,797.53
Proc 7 0 0 60 369 0 0 72 1,215.40
Proc 8 60 119 260 950 0 196 399 4,706.37
Func 1 0 60 120 300 0 0 55 1,298.23
Func 2 40 160 178 607 40 0 504 3,243.83
Proc 3 0 120 158 339 0 0 241 1,870.21
Proc 1 0 260 463 1029 0 0 623 5,258.31
Proc 2 0 60 40 258 40 0 79 1,132.73
Proc 4 0 80 80 180 20 0 108 1,052.11
Proc 5 0 0 40 100 0 20 48 491.67
main 81 484 1128 1572 104 60 2073 11,356.06
legup memcpy 4 0 520 260 1346 0 0 135 5,565.31
Total 181 4,326 2,951 16,782 1,955 1,018 5,222 77,538
A.2.2 Full Results for Pipeline Stall Energy Estimates
ADPCM
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
abs 0 0 299 748 0 0 108 2,978.35
reset 0 0 65 96 0 0 2594 5,025.29
filtez 0 2395 2195 3384 1196 0 5513 33,643.84
filtep 0 0 598 799 600 0 1658 8,438.64
quantl 0 2183 3364 2471 991 0 5120 32,289.59
logscl 0 399 1191 795 99 0 948 8,056.15
scalel 0 200 1591 798 0 0 298 7,158.11
upzero 0 4137 5376 7153 1848 0 14038 73,001.07
uppol2 0 1597 3392 1783 799 0 3652 25,952.73
uppol1 0 798 1984 2184 398 0 2733 18,820.20
logsch 0 400 1030 630 200 0 837 7,314.58
decode 0 6872 14744 13062 2547 0 26632 143,372.68
encode 0 7782 20665 12490 2642 0 18669 144,953.36
adpcm main 0 354 1383 773 200 0 1871 10,318.75
main 0 713 1049 875 149 0 2338 11,256.73
Total 0 27,831 58,927 135,138 11,669 0 87,097 532,741
AES
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
encrypt 0 93 279 162 16 0 1364 3,823.67
ByteSub ShiftRow 0 337 444 536 296 0 7334 17,255.80
SubByte 0 80 638 312 159 0 82 3,264.29
KeySchedule 80 1296 1120 2028 192 99 9264 28,854.28
InversShiftRow ByteSub 0 316 530 674 311 0 3181 10,478.86
MixColumn AddRoundKey 0 1262 1706 1917 259 0 7434 26,378.71
AddRoundKey InversMixColumn 0 4805 9934 6479 2424 0 2413 65,287.42
AddRoundKey 0 158 113 256 38 0 666 2,640.75
decrypt 0 84 255 135 16 0 1018 3,055.52
aes main 0 5 31 35 0 0 968 1,901.49
main 0 1 3 8 0 0 207 398.74
Total 80 8,438 15,054 12,542 3,711 99 34,019 163,500
BLOWFISH
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
memcpy 0 2086 3906 2100 0 0 6976 32,702.59
BF encrypt 0 99515 294703 180006 56207 0 44132 1,708,240.37
BF cfb64 encrypt 0 13254 42732 50993 19495 0 39836 406,465.08
BF set key 0 1672 4337 3365 663 0 8341 40,708.50
blowfish main 0 10923 49435 37223 15599 0 24712 341,822.71
main 0 1 0 4 0 0 137 256.17
Total 0 127,452 395,114 273,691 91,964 0 124,222 2,530,356
DFADD
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
shift64RightJamming 0 128 233 437 0 0 687 3,287.69
countLeadingZeros32 0 38 54 83 0 0 217 833.87
countLeadingZeros64 0 37 61 85 0 0 270 948.60
float raise 0 2 3 7 0 0 41 103.91
float64 is nan 0 24 69 77 0 0 92 601.32
float64 is signaling nan 0 24 79 156 0 0 138 923.01
propagateFloat64NaN 0 164 263 329 0 0 823 3,393.23
extractFloat64Frac 0 0 183 365 0 0 43 1,529.35
extractFloat64Exp 0 0 183 368 0 0 24 1,503.82
extractFloat64Sign 0 0 91 459 0 0 41 1,550.43
packFloat64 0 0 51 50 51 0 43 495.27
roundAndPackFloat64 0 265 458 550 18 0 939 4,977.42
normalizeRoundAndPackFloat64 0 48 139 140 0 0 338 1,440.19
addFloat64Sigs 8 462 724 1233 96 0 2190 10,425.61
subFloat64Sigs 14 392 635 1154 87 0 2028 9,520.49
float64 add 0 321 753 887 0 0 1161 7,103.72
main 0 324 598 531 46 0 2176 7,692.21
lshrdi3 0 23 48 37 0 0 75 407.25
ashldi3 0 28 82 61 0 0 125 658.18
Total 22 2,281 4,708 7,009 298 0 11,539 57,556
DFDIV
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
sub128 0 156 138 433 0 0 383 2,563.36
mul64To128 79 79 313 258 0 79 778 3,492.80
estimateDiv128To64 0 293 482 596 0 0 1413 6,013.39
float raise 0 3 5 10 0 0 113 247.10
float64 is nan 0 5 17 18 0 0 75 236.23
float64 is signaling nan 0 6 19 36 0 0 104 343.97
propagateFloat64NaN 0 36 64 76 0 0 534 1,397.58
extractFloat64Frac 0 0 87 154 0 0 88 793.76
extractFloat64Exp 0 0 88 174 0 0 25 738.96
extractFloat64Sign 0 0 43 219 0 0 41 776.72
packFloat64 0 0 34 34 32 0 41 347.80
roundAndPackFloat64 0 176 315 376 12 0 728 3,547.34
float64 div 0 500 1114 1470 85 0 3055 13,626.43
main 0 156 286 255 22 0 1274 4,097.91
udivdi3 0 474 894 688 143 0 1953 9,107.31
udivsi3 0 237 238 431 0 0 536 3,266.96
umodsi3 0 240 194 429 0 0 502 3,097.56
udivmodsi4 0 16556 2566 19549 1424 0 15280 129,169.32
Total 79 18,918 6,898 25,206 1,718 79 27,011 183,025
DFMUL
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
mul64To128 31 30 123 93 0 31 495 1,682.78
float raise 0 2 2 7 0 0 65 143.96
float64 is nan 0 5 16 18 0 0 81 244.35
float64 is signaling nan 0 6 17 44 0 0 104 360.71
propagateFloat64NaN 0 41 60 74 0 0 549 1,420.20
extractFloat64Frac 0 0 79 140 0 0 70 703.63
extractFloat64Exp 0 0 79 159 0 0 49 718.09
extractFloat64Sign 0 0 39 198 0 0 42 711.30
packFloat64 0 0 29 30 29 0 41 315.25
roundAndPackFloat64 0 116 198 241 8 0 641 2,581.46
float64 mul 0 374 789 1173 32 0 2674 10,875.01
main 0 142 260 232 20 0 1199 3,798.68
Total 31 717 1,692 2,409 89 31 6,098 23,716
DFSIN
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
shift64RightJamming 0 1848 3446 5011 0 0 84327 176,110.31
sub128 0 2333 1811 5730 0 0 21025 62,802.32
mul64To128 2074 1308 8715 6059 0 2074 113611 254,515.93
estimateDiv128To64 0 6669 10405 12002 0 0 48831 160,759.75
countLeadingZeros32 0 1987 2889 4615 484 0 42039 100,420.04
countLeadingZeros64 0 915 1097 1828 0 0 23249 51,071.07
extractFloat64Frac 0 0 3358 6147 0 0 623 26,270.23
extractFloat64Exp 0 0 4431 8862 0 0 42 35,324.12
extractFloat64Sign 0 0 2216 11077 0 0 49 35,798.43
packFloat64 0 0 2216 2214 2215 0 42 18,390.51
roundAndPackFloat64 0 11913 21526 27004 836 0 96710 329,051.91
normalizeRoundAndPackFloat64 0 731 2562 2745 0 0 45534 96,348.83
int32 to float64 0 1338 5847 3215 0 0 3581 32,871.20
addFloat64Sigs 84 1795 3122 5078 334 0 4130 34,262.09
subFloat64Sigs 90 3,800 6,000 11,385 735 0 8,706 72,613.20
float64 add 0 1872 4467 5273 0 0 4853 38,488.92
float64 mul 0 7275 13803 22779 1207 0 13490 141,002.24
float64 div 0 6614 14975 19401 1888 0 13309 134,979.52
float64 le 0 2752 5960 6572 0 0 39895 109,951.01
float64 ge 0 268 1716 1340 0 0 14952 35,093.94
float64 neg 0 0 107 73 0 0 863 1,998.13
float64 abs 0 0 305 840 0 0 10918 22,416.40
sin 0 1734 8159 5283 268 0 4299 47,306.39
main 0 183 396 328 36 0 1416 4,929.44
lshrdi3 0 462 693 1155 0 0 17432 36,864.98
ashldi3 0 990 3161 2697 0 0 28340 67,824.10
udivdi3 0 10129 16019 12364 2937 0 239711 531,273.10
udivsi3 0 5340 5338 9609 0 0 13443 75,693.16
umodsi3 0 5072 4272 9611 0 0 18964 82,195.32
udivmodsi4 0 715228 105213 833886 63912 0 753291 5,709,598.97
Total 2,248 792,557 264,226 1,044,183 74,852 2,074 1,667,763 8,526,386
GSM
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
gsm add 0 237 156 856 79 0 416 4,244.40
gsm mult 0 8 15 26 15 0 123 390.54
gsm mult r 0 222 668 891 445 0 1413 8,470.80
gsm abs 0 355 444 1062 90 0 645 6,239.74
gsm norm 0 6 9 18 2 0 210 463.93
gsm div 0 323 168 578 217 0 245 3,831.53
Autocorrelation 0 812 5320 7374 2361 0 12398 64,458.12
Reflection coefficients 0 139 650 562 47 0 2073 7,301.83
Transformation to Log Area Ratios 0 36 69 84 7 0 419 1,249.20
Quantization and coding 0 17 83 53 0 0 1094 2,332.22
Gsm LPC Analysis 0 37 72 89 7 0 477 1,375.51
main 0 320 1923 1752 805 0 1857 16,072.95
legup memset 4 0 9 24 13 2 0 113 322.88
Total 0 2,522 9,602 13,358 4,077 0 21,571 116,915
JPEG
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
read byte 0 603 1808 5383 602 0 1198 24,531.19
read word 0 10 39 75 39 0 150 709.52
first marker 0 3 4 11 0 0 339 648.14
next marker 0 50 86 127 0 0 185 1,004.88
get sof 0 42 183 98 18 0 1922 4,286.66
get sos 0 604 1808 5377 600 0 1317 24,722.15
get dht 0 515 1999 1437 82 0 1236 12,559.63
get dqt 0 544 1441 681 372 0 1545 10,595.25
read markers 0 72 240 160 23 0 2119 5,032.22
ChenIDct 0 57912 214544 107384 48376 0 191699 1,451,413.10
IZigzagMatrix 0 27647 64308 27934 18432 0 26938 404,951.28
IQuantize 0 36863 36244 27719 18432 0 79753 448,731.02
PostshiftIDctMatrix 0 27648 27790 18718 9216 0 18670 245,478.29
BoundIDctMatrix 0 46003 37169 55611 9305 0 55282 476,886.79
WriteOneBlock 0 36287 78610 55669 2411 0 33805 499,798.27
Write4Blocks 0 1583 1446 4837 144 0 7906 34,898.29
YuvToRgb 0 60610 173445 159234 30815 0 245628 1,537,795.97
pgetc 0 9147 17881 22876 4580 0 19641 176,968.37
buf getb 0 124534 287817 252246 20756 0 752268 3,093,447.70
buf getv 0 84,462 137,450 196,352 7,652 0 215,473 1,479,462.47
huff make dhuff tb 0 2156 4295 3290 1072 0 4965 36,766.45
DecodeHuffman 0 133244 151886 255078 32120 0 176365 1,792,712.27
DecodeHuffMCU 0 73468 152048 116265 32188 0 173975 1,273,775.99
decode block 0 864 4744 3013 144 0 3741 29,179.98
decode start 0 413 1634 919 0 0 2729 12,397.24
jpeg init decompress 0 7 50 36 1 0 988 1,995.06
jpeg read 0 1 8 9 0 0 165 339.54
jpeg2bmp main 0 52007 142434 120623 37064 0 110147 1,113,682.89
main 0 1 3 7 0 0 238 451.01
Total 0 777,301 1,541,415 1,441,169 274,444 0 2,130,475 14,195,382
MIPS
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
main 0 5800 16235 10416 1879 0 9019 104,185.73
legup memset 4 0 65 193 69 1 0 105 1,011.57
Total 0 5,866 16,429 10,485 1,880 0 9,212 105,358
MOTION
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
read 0 2049 8064 4104 2048 0 3284 48,186.70
Fill Buffer 0 4 9 7 0 0 386 735.48
Show Bits 0 6 18 23 0 0 30 174.83
Flush Buffer 0 125 340 315 30 0 1085 4,014.60
Get Bits 0 20 38 40 0 0 264 718.59
Get Bits1 0 3 11 14 0 0 81 216.30
Get motion code 0 13 16 24 0 0 377 804.08
decode motion vector 0 9 18 18 0 0 264 583.15
motion vector 0 15 18 27 1 0 692 1,383.59
motion vectors 0 12 7 38 2 0 812 1,594.85
Initialize Buffer 0 1 11 6 0 0 236 464.84
main 0 24 55 59 3 0 1089 2,294.54
Total 0 2,282 8,606 4,675 2,084 0 8,688 61,332.44
SHA
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
memset 0 14 46 25 1 0 162 506.19
memcpy 0 4866 6143 42240 32767 0 34159 301,618.58
sha transform 0 128736 464374 288781 61678 0 72190 2,564,198.26
sha init 0 0 7 9 0 0 213 419.86
sha update 0 269 2339 1052 2 0 1081 11,282.42
sha final 0 7 15 19 1 0 354 736.39
sha stream 0 3 10 8 0 0 234 468.82
main 0 6 14 15 0 0 327 669.77
Total 0 133,902 472,949 332,149 94,449 0 108,808 2,880,061
A.2.3 Full Results for Energy/Time Correlation
ADPCM
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
abs 1155 0.5% 2,960.50 0.5% 2,978.35 0.6%
reset 2753 1.2% 4,035.83 0.7% 5,025.29 0.9%
filtez 14707 6.3% 35,573.37 6.3% 33,643.84 6.3%
filtep 3655 1.6% 9,986.15 1.8% 8,438.64 1.6%
quantl 14137 6.1% 34,992.36 6.2% 32,289.59 6.1%
logscl 3425 1.5% 8,382.52 1.5% 8,056.15 1.5%
scalel 2887 1.2% 7,190.45 1.3% 7,158.11 1.3%
upzero 32578 14.0% 79,712.37 14.1% 73,001.07 13.7%
uppol2 11209 4.8% 27,070.45 4.8% 25,952.73 4.9%
uppol1 8097 3.5% 20,692.39 3.7% 18,820.20 3.5%
logsch 3097 1.3% 8,083.42 1.4% 7,314.58 1.4%
decode 63818 27.3% 155,220.31 27.4% 143,372.68 26.9%
encode 62188 26.6% 151,221.70 26.7% 144,953.36 27.2%
adpcm main 4596 2.0% 9,799.36 1.7% 10,318.75 1.9%
main 5115 2.2% 10,846.07 1.9% 11,256.73 2.1%
Total 233508 100.0% 565,891 100.0% 532,741 100.0%
AES
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.1% 123.28 0.1% 160.89 0.1%
encrypt 1910 2.6% 3,242.21 2.1% 3,823.67 2.3%
ByteSub ShiftRow 8941 12.1% 14,651.42 9.5% 17,255.80 10.6%
SubByte 1286 1.7% 3,359.29 2.2% 3,264.29 2.0%
KeySchedule 14035 19.0% 28,441.28 18.4% 28,854.28 17.6%
InversShiftRow ByteSub 5018 6.8% 9,602.37 6.2% 10,478.86 6.4%
MixColumn AddRoundKey 12557 17.0% 23,544.89 15.2% 26,378.71 16.1%
AddRoundKey InversMixColumn 26054 35.2% 64,871.36 41.9% 65,287.42 39.9%
AddRoundKey 1227 1.7% 2,499.79 1.6% 2,640.75 1.6%
decrypt 1552 2.1% 2,710.19 1.8% 3,055.52 1.9%
aes main 1035 1.4% 1,470.95 1.0% 1,901.49 1.2%
main 219 0.3% 300.42 0.2% 398.74 0.2%
Total 73925 100.0% 154,817 100.0% 163,500 100.0%
BLOWFISH
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
memcpy 15084 1.5% 30,508.65 1.2% 32,702.59 1.3%
BF encrypt 674403 66.6% 1,677,869.71 67.1% 1,708,240.37 67.5%
BF cfb64 encrypt 166568 16.4% 417,796.34 16.7% 406,465.08 16.1%
BF set key 18383 1.8% 42,913.71 1.7% 40,708.50 1.6%
blowfish main 138020 13.6% 332,437.50 13.3% 341,822.71 13.5%
main 142 0.0% 192.55 0.0% 256.17 0.0%
Total 1012691 100.0% 2,501,842 100.0% 2,530,356 100.0%
DFADD
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.4% 123.28 0.2% 160.89 0.3%
shift64RightJamming 1488 5.7% 3,279.09 5.8% 3,287.69 5.7%
countLeadingZeros32 388 1.5% 760.15 1.4% 833.87 1.4%
countLeadingZeros64 453 1.7% 873.12 1.6% 948.60 1.6%
float raise 53 0.2% 84.96 0.2% 103.91 0.2%
float64 is nan 262 1.0% 564.17 1.0% 601.32 1.0%
float64 is signaling nan 404 1.6% 898.47 1.6% 923.01 1.6%
propagateFloat64NaN 1593 6.1% 3,323.38 5.9% 3,393.23 5.9%
extractFloat64Frac 591 2.3% 1,464.67 2.6% 1,529.35 2.7%
extractFloat64Exp 575 2.2% 1,458.57 2.6% 1,503.82 2.6%
extractFloat64Sign 591 2.3% 1,493.23 2.7% 1,550.43 2.7%
packFloat64 195 0.8% 498.53 0.9% 495.27 0.9%
roundAndPackFloat64 2238 8.6% 4,968.14 8.8% 4,977.42 8.6%
normalizeRoundAndPackFloat64 680 2.6% 1,403.25 2.5% 1,440.19 2.5%
addFloat64Sigs 4726 18.2% 10,317.65 18.3% 10,425.61 18.1%
subFloat64Sigs 4339 16.7% 9,398.98 16.7% 9,520.49 16.5%
float64 add 3154 12.1% 7,536.89 13.4% 7,103.72 12.3%
main 3678 14.2% 6,828.17 12.1% 7,692.21 13.4%
lshrdi3 183 0.7% 392.45 0.7% 407.25 0.7%
ashldi3 279 1.1% 599.04 1.1% 658.18 1.1%
Total 25961 100.0% 56,266 100.0% 57,556 100.0%
DFDIV
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.1% 123.28 0.1% 160.89 0.1%
sub128 1110 1.4% 2,528.47 1.3% 2,563.36 1.4%
mul64To128 1589 2.0% 3,628.84 1.9% 3,492.80 1.9%
estimateDiv128To64 2809 3.5% 5,847.33 3.1% 6,013.39 3.3%
float raise 131 0.2% 193.05 0.1% 247.10 0.1%
float64 is nan 115 0.1% 205.89 0.1% 236.23 0.1%
float64 is signaling nan 165 0.2% 306.84 0.2% 343.97 0.2%
propagateFloat64NaN 710 0.9% 1,226.40 0.6% 1,397.58 0.8%
extractFloat64Frac 329 0.4% 759.54 0.4% 793.76 0.4%
extractFloat64Exp 287 0.4% 712.86 0.4% 738.96 0.4%
extractFloat64Sign 303 0.4% 740.08 0.4% 776.72 0.4%
packFloat64 141 0.2% 343.16 0.2% 347.80 0.2%
roundAndPackFloat64 1606 2.0% 3,474.05 1.8% 3,547.34 1.9%
float64 div 6243 7.8% 13,307.71 6.9% 13,626.43 7.4%
main 2007 2.5% 3,601.89 1.9% 4,097.91 2.2%
udivdi3 4152 5.2% 9,313.76 4.9% 9,107.31 5.0%
udivsi3 1442 1.8% 3,298.16 1.7% 3,266.96 1.8%
umodsi3 1366 1.7% 3,196.33 1.7% 3,097.56 1.7%
udivmodsi4 55394 69.3% 138,782.39 72.4% 129,169.32 70.6%
Total 79990 100.0% 191,590 100.0% 183,025 100.0%
DFMUL
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.8% 123.28 0.6% 160.89 0.7%
mul64To128 788 7.1% 1,627.69 7.3% 1,682.78 7.1%
float raise 76 0.7% 114.26 0.5% 143.96 0.6%
float64 is nan 120 1.1% 209.88 0.9% 244.35 1.0%
float64 is signaling nan 171 1.5% 321.52 1.4% 360.71 1.5%
propagateFloat64NaN 727 6.6% 1,237.55 5.6% 1,420.20 6.0%
extractFloat64Frac 289 2.6% 677.64 3.1% 703.63 3.0%
extractFloat64Exp 287 2.6% 681.30 3.1% 718.09 3.0%
extractFloat64Sign 282 2.6% 681.14 3.1% 711.30 3.0%
packFloat64 129 1.2% 308.63 1.4% 315.25 1.3%
roundAndPackFloat64 1213 11.0% 2,463.48 11.1% 2,581.46 10.9%
float64 mul 5038 45.6% 10,462.91 47.1% 10,875.01 45.9%
main 1834 16.6% 3,290.03 14.8% 3,798.68 16.0%
Total 11045 100.0% 22,199 100.0% 23,716 100.0%
DFSIN
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
shift64RightJamming 94660 2.5% 140,991.99 1.6% 176,110.31 2.1%
sub128 30870 0.8% 57,616.95 0.7% 62,802.32 0.7%
mul64To128 133843 3.5% 218,031.13 2.5% 254,515.93 3.0%
estimateDiv128To64 77878 2.0% 152,202.19 1.8% 160,759.75 1.9%
countLeadingZeros32 52036 1.4% 83,503.61 1.0% 100,420.04 1.2%
countLeadingZeros64 27113 0.7% 39,986.11 0.5% 51,071.07 0.6%
extractFloat64Frac 10128 0.3% 25,899.99 0.3% 26,270.23 0.3%
extractFloat64Exp 13335 0.3% 34,476.29 0.4% 35,324.12 0.4%
extractFloat64Sign 13342 0.3% 34,829.15 0.4% 35,798.43 0.4%
packFloat64 6687 0.2% 19,177.23 0.2% 18,390.51 0.2%
roundAndPackFloat64 157940 4.1% 306,416.55 3.5% 329,051.91 3.9%
normalizeRoundAndPackFloat64 51593 1.3% 78,867.23 0.9% 96,348.83 1.1%
int32 to float64 13980 0.4% 35,068.72 0.4% 32,871.20 0.4%
addFloat64Sigs 14508 0.4% 35,613.92 0.4% 34,262.09 0.4%
subFloat64Sigs 30722 0.8% 76,391.01 0.9% 72,613.20 0.9%
float64 add 16465 0.4% 41,480.29 0.5% 38,488.92 0.5%
float64 mul 58539 1.5% 146,921.82 1.7% 141,002.24 1.7%
float64 div 56186 1.5% 141,286.79 1.6% 134,979.52 1.6%
float64 le 55198 1.4% 99,493.93 1.1% 109,951.01 1.3%
float64 ge 18267 0.5% 28,590.92 0.3% 35,093.94 0.4%
float64 neg 1054 0.0% 1,614.35 0.0% 1,998.13 0.0%
float64 abs 12059 0.3% 17,436.48 0.2% 22,416.40 0.3%
sin 19732 0.5% 48,395.20 0.6% 47,306.39 0.6%
main 2359 0.1% 4,369.11 0.1% 4,929.44 0.1%
lshrdi3 19774 0.5% 29,792.17 0.3% 36,864.98 0.4%
ashldi3 34891 0.9% 55,510.91 0.6% 67,824.10 0.8%
udivdi3 281178 7.3% 453,453.01 5.2% 531,273.10 6.2%
udivsi3 33728 0.9% 80,856.23 0.9% 75,693.16 0.9%
umodsi3 37931 1.0% 84,330.70 1.0% 82,195.32 1.0%
udivmodsi4 2471591 64.2% 6,080,143.23 70.3% 5,709,598.97 67.0%
Total 3847678 100.0% 8,652,870 100.0% 8,526,386 100.0%
GSM
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.2% 123.28 0.1% 160.89 0.1%
gsm add 1744 3.4% 4,420.06 3.6% 4,244.40 3.6%
gsm mult 187 0.4% 391.21 0.3% 390.54 0.3%
gsm mult r 3639 7.2% 9,989.43 8.2% 8,470.80 7.2%
gsm abs 2596 5.1% 6,577.01 5.4% 6,239.74 5.3%
gsm norm 245 0.5% 372.80 0.3% 463.93 0.4%
gsm div 1531 3.0% 3,919.96 3.2% 3,831.53 3.3%
Autocorrelation 28284 55.8% 70,074.92 57.2% 64,458.12 55.1%
Reflection coefficients 3478 6.9% 6,748.06 5.5% 7,301.83 6.2%
Transformation to Log Area Ratios 615 1.2% 1,131.08 0.9% 1,249.20 1.1%
Quantization and coding 1240 2.4% 1,870.07 1.5% 2,332.22 2.0%
Gsm LPC Analysis 212 0.4% 308.97 0.3% 1,375.51 1.2%
main 6692 13.2% 16,362.96 13.4% 16,072.95 13.7%
legup memset 4 161 0.3% 277.35 0.2% 322.88 0.3%
Total 50715 100.0% 122,567 100.0% 116,915 100.0%
JPEG
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
read byte 9587 0.2% 23,522.31 0.2% 24,531.19 0.2%
read word 313 0.0% 682.94 0.0% 709.52 0.0%
first marker 343 0.0% 472.74 0.0% 648.14 0.0%
next marker 457 0.0% 1,015.38 0.0% 1,004.88 0.0%
get sof 2237 0.0% 3,458.37 0.0% 4,286.66 0.0%
get sos 1203 0.0% 2,026.50 0.0% 24,722.15 0.2%
get dht 5255 0.1% 12,236.50 0.1% 12,559.63 0.1%
get dqt 4592 0.1% 10,617.50 0.1% 10,595.25 0.1%
read markers 1151 0.0% 1,957.61 0.0% 5,032.22 0.0%
ChenIDct 619845 10.1% 1,537,415.41 10.5% 1,451,413.10 10.2%
IZigzagMatrix 165262 2.7% 423,033.82 2.9% 404,951.28 2.9%
IQuantize 199047 3.2% 444,672.79 3.0% 448,731.02 3.2%
PostshiftIDctMatrix 102028 1.7% 258,277.35 1.8% 245,478.29 1.7%
BoundIDctMatrix 203366 3.3% 520,926.38 3.6% 476,886.79 3.4%
WriteOneBlock 206766 3.4% 506,111.68 3.5% 499,798.27 3.5%
Write4Blocks 15989 0.3% 38,641.07 0.3% 34,898.29 0.2%
YuvToRgb 669621 10.9% 1,729,816.63 11.9% 1,537,795.97 10.8%
pgetc 74131 1.2% 176,900.12 1.2% 176,968.37 1.2%
buf getb 1437623 23.4% 2,958,965.09 20.3% 3,093,447.70 21.8%
buf getv 641430 10.4% 1,552,590.86 10.6% 1,479,462.47 10.4%
huff make dhuff tb 15771 0.3% 37,478.92 0.3% 36,766.45 0.3%
DecodeHuffman 748436 12.2% 1,895,934.53 13.0% 1,792,712.27 12.6%
DecodeHuffMCU 547844 8.9% 1,270,764.75 8.7% 1,273,775.99 9.0%
decode block 12425 0.2% 31,272.85 0.2% 29,179.98 0.2%
decode start 5710 0.1% 12,690.62 0.1% 12,397.24 0.1%
jpeg init decompress 1089 0.0% 1,571.51 0.0% 1,995.06 0.0%
jpeg read 183 0.0% 267.96 0.0% 339.54 0.0%
jpeg2bmp main 462150 7.5% 1,134,902.81 7.8% 1,113,682.89 7.8%
main 255 0.0% 346.29 0.0% 451.01 0.0%
Total 6154200 100.0% 14,588,695 100.0% 14,195,382 100.0%
MIPS
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.2% 123.28 0.1% 160.89 0.2%
main 43383 98.8% 108,528.60 99.0% 104,185.73 98.9%
legup memset 4 435 1.0% 983.73 0.9% 1,011.57 1.0%
Total 43909 100.0% 109,636 100.0% 105,358 100.0%
MOTION
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.3% 123.28 0.2% 160.89 0.3%
read 19546 74.2% 48,468.90 81.5% 48,186.70 78.6%
Fill Buffer 406 1.5% 569.81 1.0% 735.48 1.2%
Show Bits 77 0.3% 160.97 0.3% 174.83 0.3%
Flush Buffer 1898 7.2% 3,683.25 6.2% 4,014.60 6.5%
Get Bits 362 1.4% 644.37 1.1% 718.59 1.2%
Get Bits1 109 0.4% 182.00 0.3% 216.30 0.4%
Get motion code 430 1.6% 656.47 1.1% 804.08 1.3%
decode motion vector 307 1.2% 460.75 0.8% 583.15 1.0%
motion vector 740 2.8% 1,075.43 1.8% 1,383.59 2.3%
motion vectors 885 3.4% 1,263.42 2.1% 1,594.85 2.6%
Initialize Buffer 259 1.0% 364.51 0.6% 464.84 0.8%
main 1221 4.6% 1,808.89 3.0% 2,294.54 3.7%
Total 26331 100.0% 59,462.06 100.0% 61,332.44 100.0%
SHA
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 100 0.0% 134.75 0.0% 160.89 0.0%
memset 248 0.0% 451.38 0.0% 506.19 0.0%
memcpy 120158 10.5% 293,333.64 10.0% 301,618.58 10.5%
sha transform 1015785 88.9% 2,616,217.93 89.5% 2,564,198.26 89.0%
sha init 229 0.0% 321.60 0.0% 419.86 0.0%
sha update 4729 0.4% 11,530.65 0.4% 11,282.42 0.4%
sha final 387 0.0% 581.65 0.0% 736.39 0.0%
sha stream 255 0.0% 365.86 0.0% 468.82 0.0%
main 360 0.0% 516.16 0.0% 669.77 0.0%
Total 1142251 100.0% 2,923,454 100.0% 2,880,061 100.0%
DHRYSTONE
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.3% 123.28 0.2% 176.85 0.2%
strcpy 5188 16.0% 13,729.38 17.7% 13,341.05 17.4%
strcmp 9088 28.0% 23,373.74 30.1% 22,661.35 29.6%
Func 3 136 0.4% 323.37 0.4% 336.71 0.4%
Proc 6 1234 3.8% 2,797.53 3.6% 2,854.12 3.7%
Proc 7 501 1.5% 1,215.40 1.6% 1,080.85 1.4%
Proc 8 1984 6.1% 4,706.37 6.1% 4,503.72 5.9%
Func 1 535 1.6% 1,298.23 1.7% 1,236.44 1.6%
Func 2 1529 4.7% 3,243.83 4.2% 3,334.40 4.4%
Proc 3 858 2.6% 1,870.21 2.4% 1,950.79 2.5%
Proc 1 2375 7.3% 5,258.31 6.8% 5,160.73 6.7%
Proc 2 477 1.5% 1,132.73 1.5% 1,139.33 1.5%
Proc 4 468 1.4% 1,052.11 1.4% 1,145.54 1.5%
Proc 5 208 0.6% 491.67 0.6% 496.31 0.6%
main 5502 17.0% 11,356.06 14.6% 11,548.83 15.1%
legup memcpy 4 2261 7.0% 5,565.31 7.2% 5,553.35 7.3%
Total 32435 100.0% 77,538 100.0% 76,520 100.0%
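Each percentage column in these tables is simply a function's count divided by the benchmark total, rounded to one decimal place. As a worked check against two Dhrystone rows above (an illustrative Python sketch, not part of the thesis tooling):

```python
# Reproduce the percentage columns from raw counts (two Dhrystone cycle rows).
cycles = {"strcmp": 9088, "main": 5502}
total = 32435  # Dhrystone cycle total from the table

def pct(count, total):
    """Share of the benchmark total, rounded to one decimal as in the tables."""
    return round(100.0 * count / total, 1)

print(pct(cycles["strcmp"], total), pct(cycles["main"], total))  # 28.0 17.0
```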
Appendix B
Source Code
B.1 Profiling Definitions

`define STACK_ADDR 'hFFDFFC
`define PROF_ADDR  'hFFE000
`define S   32 // Address Stack Size
`define S2   5 // log2(stack depth)
`define BUFF_DEPTH 4 // Buffer Depth
`define N   64 // Number of Functions
`define N2   4 // log2(`N)
`define ICW 32 // instruction counter width
`define CCW 32 // cycle counter width
`define SCW 32 // stall cycle counter width
`define PW  22 // power counter individual width
`define PSW 28 // power stall counter width
`define PGROUPS 6 // number of groupings for power
`define PCW (`PW*`PGROUPS+`PSW) // power counter [total] width
`define PROF_TYPE "L"   // Choose profiler (L -- LEAP, S -- SnoopP)
`define PROF_METHOD "c" // Choose profiling method (i -- instruction count,
                        //   c -- cycle count, s -- stall cycle count,
                        //   p -- power count)
`define CW ((`PROF_METHOD=="i") ? `ICW : (`PROF_METHOD=="c") ? `CCW : (`PROF_METHOD=="s") ? `SCW : (`PROF_METHOD=="p") ? `PCW : 0)
// `define CW64 // uncomment this line if ICW = 64
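The `CW macro above selects the active counter width from `PROF_METHOD; in the power configuration the packed word is six `PW-bit group lanes plus one `PSW-bit stall lane. A quick sketch of that selection (illustrative Python, not part of the thesis tooling):

```python
# Mirror of the `CW / `PCW macro logic from B.1 (illustrative sketch only).
ICW, CCW, SCW = 32, 32, 32    # instruction / cycle / stall-cycle counter widths
PW, PSW, PGROUPS = 22, 28, 6  # per-group power width, stall width, group count
PCW = PW * PGROUPS + PSW      # packed power-counter width

def counter_width(prof_method):
    """Width chosen by `PROF_METHOD, falling back to 0 as the macro does."""
    return {"i": ICW, "c": CCW, "s": SCW, "p": PCW}.get(prof_method, 0)

print(PCW, counter_width("c"), counter_width("p"))  # 160 32 160
```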
`define START_ADDR 32'b0000_11_0000_1000_0000_0000_0000_0000_00
// 'h00800000 = 'b0000_0000|1000_0000_0000_0000_0000_0000|00
`define DO_HIER 1'b1

// addresses corresponding to the jal and ra of wrap.c
`define WRAP_MAIN_BEGIN 'h800018
`define WRAP_MAIN_END   'h800074

// power profiling constants
`define A 1
`define B 2
`define C 3
`define D 4
`define E 5
`define F 6
`define STALL 7

// pStalls (BRAVO)
`define P_ADD    `C
`define P_ADDI   `C
`define P_ADDIU  `C
`define P_ADDU   `C
`define P_AND    `D
`define P_ANDI   `C
`define P_BEQ    `B
`define P_BGEZ   `B
`define P_BGEZAL `B
`define P_BGTZ   `D
`define P_BLEZ   `C
`define P_BLTZ   `A
`define P_BLTZAL `A
`define P_BNE    `B
`define P_DIV    `F
`define P_DIVU   `F
`define P_J      `C
`define P_JAL    `C
`define P_JR     `D
`define P_LB     `C
`define P_LBU    `E
`define P_LH     `D
`define P_LHU    `E
`define P_LUI    `C
`define P_LW     `B
`define P_MFHI   `A
`define P_MFLO   `C
`define P_MULT   `E
`define P_MULTU  `F
`define P_NOP    `D
`define P_OR     `D
`define P_ORI    `D
`define P_SB     `D
`define P_SH     `D
`define P_SLL    `D
`define P_SLLV   `E
`define P_SLT    `D
`define P_SLTI   `B
`define P_SLTIU  `B
`define P_SLTU   `D
`define P_SRA    `C
`define P_SRAV   `C
`define P_SRL    `C
`define P_SRLV   `B
`define P_SUB    `D
`define P_SUBU   `D
`define P_SW     `D
`define P_XOR    `C
`define P_XORI   `C
`define P_ADD_ADDU   `C
`define P_DIV_DIVU   `F
`define P_ADDI_ADDIU `C
`define P_SB_SH      `D
B.2 LEAP Source Code
// Top-level module for LEAP profiler
module LEAP (
    input clk,
    input glob_reset,
    input [25:0] pc_in,
    input [31:0] instr_in2,
    input [31:0] instrEx_in,
    input stall_in,
    input insValid_in,
    // Handshaking ports
    input init_start_in,
    output init_done,
    input retrieve_start_in,
    output retrieve_done,
    // Avalon Profile Master ports
    output avm_profileMaster_read,
    output avm_profileMaster_write,
    output [31:0] avm_profileMaster_address,
    output [31:0] avm_profileMaster_writedata,
    output [3:0]  avm_profileMaster_byteenable,
    input  [31:0] avm_profileMaster_readdata,
    input avm_profileMaster_waitrequest,
    input avm_profileMaster_readdatavalid
);

// Delay profiler's consumption of inputs to give extra cycle to Address Hash
reg [25:0] pc;
reg [25:0] pc_minus_4;
reg [31:0] instr_in;
reg [31:0] instrEx;
reg stall;
reg insValid;
reg init_start;
reg retrieve_start;

always @(posedge clk) begin
    if (glob_reset) begin
        pc <= 'b0;
        pc_minus_4 <= 'b0;
        instr_in <= 'b0;
        instrEx <= 'b0;
        stall <= 'b0;
        insValid <= 'b0;
        init_start <= 'b0;
        retrieve_start <= 'b0;
    end else begin
        pc <= pc_in;
        pc_minus_4 <= pc_in - 'h4;
        instr_in <= instr_in2;
        instrEx <= instrEx_in;
        stall <= stall_in;
        insValid <= insValid_in;
        init_start <= init_start_in;
        retrieve_start <= retrieve_start_in;
    end
end
// wires for OpDecode
wire [25:0] opTarget;
wire [25:0] opJALrTarget;
wire opCall;
wire opJALr;
wire opRet;
wire opJALr_done;
wire opPC8;

// wires for DataCounter
wire [`CW-1:0] dcCountOut;
reg dcPCdiff;

// wires for CounterStorage
wire [`N2-1:0] csFuncNum;
wire [`CW-1:0] csCountIn;
wire csRdReq;
wire csEmpty;
wire csSave;
wire [`CW-1:0] csSaveData;
wire [`N2-1:0] csNumFuncs;
wire csRamWren;
wire [`N2-1:0] csRamAddr;
wire [`CW-1:0] csRamWrData;
wire [`CW-1:0] csRamRdData;

// wires for CallStack
wire [`N2-1:0] asFuncNumOut;
wire asRamWren;
wire [`S2-1:0] asRamAddr;
wire [`N2-1:0] asRamWrData;
wire [`N2-1:0] asRamRdData;

// wires for HierarchyStack
wire hsStackRamWren;
wire [`S2-1:0] hsStackRamAddr;
wire [`CW-1:0] hsStackRamWrData;
wire [`CW-1:0] hsStackRamRdData;
wire [`CW-1:0] hsChildCount;

// wires for AddressHash
wire [25:0] ahTarget;
wire [`N2-1:0] ahFuncNum;
wire ahJALr;
wire [`N2-1:0] ahRamAddr;
wire [7:0] ahRamRdData;

// wires for DataStorage
wire dsHashRamWren;
wire [`N2-1:0] dsHashRamAddr;
wire [7:0] dsHashRamWrData;
wire [7:0] dsHashRamRdData;
wire dsStackRamWren;
wire [`S2-1:0] dsStackRamAddr;
wire [`N2-1:0] dsStackRamWrData;
wire [`N2-1:0] dsStackRamRdData;
wire dsStorageRamWren;
wire [`N2-1:0] dsStorageRamAddr;
wire [`CW-1:0] dsStorageRamWrData;
wire [`CW-1:0] dsStorageRamRdData;
// wires for Initializer
wire iInit;
wire iHashRamWren;
wire [`N2-1:0] iHashRamAddr;
wire [7:0] iHashRamWrData;
wire [7:0] iHashRamRdData;
wire init_hash_start;
wire init_hash_done;
wire hash_first_start;
wire init_avm_profileMaster_write;
wire [31:0] init_avm_profileMaster_address;
wire [31:0] init_avm_profileMaster_writedata;
wire [3:0] init_avm_profileMaster_byteenable;
wire init_reset;

// wires for Retriever
wire rRetrieve;
wire [`CW-1:0] rProfData;
wire [`N2-1:0] rProfIndex;
wire retrieve_avm_profileMaster_write;
wire [31:0] retrieve_avm_profileMaster_address;
wire [31:0] retrieve_avm_profileMaster_writedata;
wire [3:0] retrieve_avm_profileMaster_byteenable;

// general registers
reg [25:0] old_pc;
reg [31:0] old_instr;
reg pc_diff;

// Keep track of current function
reg [`N2-1:0] cur_func_reg;
reg cur_func_src;
wire [`N2-1:0] cur_func = (cur_func_src) ? ahFuncNum : cur_func_reg;
reg [1:0] state;
reg [2:0] delay_count;
reg jalr_prev;
reg saved_call;
reg hash_first_done;

// combine global & init resets
wire reset = glob_reset | init_reset;
reg instr_init;
reg store_first_call;
wire [31:0] instr = instr_in;

// Create lagged signal to determine when JALr target is found
always @(posedge clk) begin
    jalr_prev <= opJALr;
end

// update pc_diff, store first jal
always @(posedge clk) begin
    if (reset) begin
        cur_func_reg <= 'h0;
        cur_func_src <= 1'b0;
        state <= 2'b00;
        hash_first_done <= 1'b0;
        delay_count <= 'b0;
        old_pc <= 'b0;
        old_instr <= 'b0;
        store_first_call <= 1'b0;
        instr_init <= 'b0;
    // initialize with hash of `START_ADDR
    end else if (init_start) begin
        // State 0: Wait for init_hash_done
        if (state <= 2'b00) begin
            if (hash_first_start & !hash_first_done) begin
                state <= 2'b01;
                delay_count <= 'b0;
                instr_init <= 1'b1; // pretend we start with a jump to 0x8003000 (GXemul) or 0x800000 (Tiger)
            end else if (hash_first_start & hash_first_done) begin
                hash_first_done <= 1'b0;
            end
        // State 1: Delay until hash finished (2 cycles)
        end else if (state == 2'b01) begin
            pc_diff <= 1'b0;
            // stay in this state until delay finished
            if (delay_count < 3'b111) begin // 2 cycles
                delay_count <= delay_count + 1'b1;
            end else if (delay_count < 3'b011) begin // 2 cycles
                instr_init <= 1'b0;
                delay_count <= delay_count + 1'b1;
            // update cur_func
            end else begin
                instr_init <= 32'b0;
                hash_first_done <= 1'b1;
                store_first_call <= 1'b1;
                cur_func_reg <= ahFuncNum;
                state <= 2'b10;
            end
        end else if (state == 2'b10) begin
            store_first_call <= 1'b0;
            state <= 2'b00;
        end
    // set/clear pc_diff signal so modules know when PC changes
    end else begin
        pc_diff <= (pc != old_pc);
        old_pc <= pc;
        old_instr <= instr;
        // State 0: Wait for ret/call
        if (state <= 2'b00) begin
            if (opCall | opRet) begin
                state <= 2'b01;
                delay_count <= opCall ? 'b01 : 'b10;
                saved_call <= opCall;
            // Handle JALr differently
            end else if (!opJALr & jalr_prev) begin // use jalr_prev to indicate we WERE in jalr --
                                                    // once opJALr goes low, the target is ready
                state <= 2'b01;
                delay_count <= 1'b0;
                saved_call <= 1'b1;
            end
        // State 1: Delay until hash finished (2 cycles)
        end else if (state == 2'b01) begin
            // stay in this state until delay finished
            if (delay_count < 3'b10) begin // 2 cycles
                cur_func_src <= 1'b1; // make mux use ahFuncNum (for bypass of call-then-ret issue)
                delay_count <= delay_count + 1'b1;
            // update cur_func
            end else begin
                if (opRet) cur_func_reg <= asFuncNumOut;
                else cur_func_reg <= saved_call ? ahFuncNum : asFuncNumOut;
                cur_func_src <= 1'b0; // make mux use cur_func_reg, not directly ahFuncNum (for bypass of call-then-ret issue)
                delay_count <= 'b0;
                // this is a bypass condition to avoid the case where a hash is available in the same cycle that another call happens
                if (!opCall & !opRet) state <= 2'b00;
                else state <= 2'b10;
            end
        // State 2: Wait until call/ret is low
        end else if (state == 2'b10) begin
            if (!opCall & !opRet) state <= 2'b00;
        end
    end
end

// Make sure calls arrive at the right time to the counter & counter storage modules
// (for cases when it stalls between the jal instr and the actual jump)
wire call_jalr = opCall | opJALr;
wire call_jalr_pc;
reg call_jalr_pc_state;
reg [31:0] call_jalr_target;

always @(posedge clk) begin
    if (reset) begin
        call_jalr_pc_state <= 1'b0;
        call_jalr_target <= 'b0;
    end else begin
        if (call_jalr & hash_first_done) begin
            call_jalr_pc_state <= 1'b1;
            call_jalr_target <= opTarget;
        end
        if (call_jalr_pc_state & (pc == call_jalr_target | (opJALr & !opPC8))) begin
            call_jalr_pc_state <= 1'b0;
        end
    end
end
assign call_jalr_pc = (call_jalr_pc_state & (pc == call_jalr_target | (opJALr & !opPC8)));

// Make sure returns arrive at the right time to the counter & counter storage modules
// (for cases when it stalls between the ret instr and the actual jump)
wire ret_pc;
reg ret_pc_state;
reg [31:0] ret_pc_val;
reg [31:0] ret_instr_val;

always @(posedge clk) begin
    if (reset) begin
        ret_pc_state <= 1'b0;
        ret_pc_val <= 'b0;
        ret_instr_val <= 'b0;
    end else begin
        if (opRet & hash_first_done) begin
            ret_pc_state <= 1'b1;
            ret_pc_val <= pc;
        end
        if (ret_pc_state & (pc != ret_pc_val) & (pc_minus_4 != ret_pc_val)) begin
            ret_pc_state <= 1'b0;
        end
    end
end
assign ret_pc = (ret_pc_state & (pc != ret_pc_val) & (pc_minus_4 != ret_pc_val)); // +4 to make sure it's not just the branch delay slot

// Make sure counter storage uses correct "cur_func" by storing it
reg [`N2-1:0] cur_func_pc;
always @(posedge clk) begin
    if (reset) begin
        cur_func_pc <= 'b0;
    end else begin
        if (!ret_pc_state & !call_jalr_pc_state) begin // if neither call/ret is waiting for the actual jump, keep track of current function
            cur_func_pc <= (delay_count == 3'b010) ? ahFuncNum : cur_func; // a second bypass to have hash value go straight to counter storage so it can read the data a cycle earlier
        end
    end
end
// New way of doing "pc_diff" -- just send prev_pc to instr_counter and have it compare
reg [31:0] prev_pc;
always @(posedge clk) begin
    if (reset) begin
        prev_pc <= 'b0;
    end else begin
        prev_pc <= pc;
    end
end

// Connect modules
assign ahTarget = opTarget;
assign csCountIn = dcCountOut;
assign dsHashRamWren = iHashRamWren; // Hash RAM is only written by Initializer
assign dsHashRamAddr = (init_start & !init_done & !init_hash_start & !init_hash_done &
                        !hash_first_start & !hash_first_done) ? iHashRamAddr : ahRamAddr;
assign ahRamRdData = dsHashRamRdData;
assign dsHashRamWrData = iHashRamWrData;
assign dsStackRamWren = asRamWren; // Stack RAM is only accessed by CallStack
assign dsStackRamAddr = asRamAddr;
assign asRamRdData = dsStackRamRdData;
assign dsStackRamWrData = asRamWrData;
assign dsStorageRamWren = csRamWren; // Storage RAM is only written by CounterStorage
assign dsStorageRamAddr = (retrieve_start & !retrieve_done) ? rProfIndex : csRamAddr;
assign csRamRdData = dsStorageRamRdData;
assign rProfData = dsStorageRamRdData;
assign dsStorageRamWrData = csRamWrData;

// mux avalon signals
assign avm_profileMaster_write = (init_start & !init_done) ? init_avm_profileMaster_write :
                                 retrieve_avm_profileMaster_write;
assign avm_profileMaster_address = (init_start & !init_done) ? init_avm_profileMaster_address :
                                   retrieve_avm_profileMaster_address;
assign avm_profileMaster_writedata = (init_start & !init_done) ?
    init_avm_profileMaster_writedata : retrieve_avm_profileMaster_writedata;
assign avm_profileMaster_byteenable = (init_start & !init_done) ?
    init_avm_profileMaster_byteenable : retrieve_avm_profileMaster_byteenable;

// Instantiate the modules
OpDecode decoder (
    .clk(clk),
    .reset(reset),
    .pc(pc),
    .pc_nolag(pc_in),
    .instr(instr),
    .instr_nolag(instr_in2),
    .insValid(insValid),
    .ret(opRet),
    .call(opCall),
    .jalr(opJALr),
    .jalr_done(opJALr_done),
    .target(opTarget),
    .jalr_target(opJALrTarget),
    .pc_within_8(opPC8)
);

parameter PROF_METHOD = `PROF_METHOD;
generate
if (PROF_METHOD=="i") begin
    InstructionCounter instr_counter (
        .clk(clk),
        .reset(reset),
        .prev_pc(prev_pc),
        .pc(pc),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end else if (PROF_METHOD=="c") begin
    CycleCounter cycle_counter (
        .clk(clk),
        .reset(reset),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end else if (PROF_METHOD=="s") begin
    StallCounter stall_counter (
        .clk(clk),
        .reset(reset),
        .stall(stall),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end else if (PROF_METHOD=="p") begin
    PowerCounter power_counter (
        .clk(clk),
        .reset(reset),
        .pc_diff(pc_diff),
        .stall(stall),
        .instr_in(instr),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end
endgenerate

CounterStorage counter_storage (
    .clk(clk),
    .reset(reset),
    .call(call_jalr_pc),
    .ret(ret_pc),
    .cur_func_num(cur_func_pc),
    .count(csCountIn),
    .child_count(hsChildCount),
    .init_start(init_start),
    .pc_diff(pc_diff),
    // For accessing RAM
    .ram_wren(csRamWren),
    .ram_addr(csRamAddr),
    .ram_wr_data(csRamWrData),
    .ram_rd_data(csRamRdData)
);

CallStack call_stack (
    .clk(clk),
    .reset(reset),
    .ret(opRet),
    .call(call_jalr),
    .func_num_in(cur_func),
    .func_num_out(asFuncNumOut),
    .pc_diff(pc_diff),
    .init_start(init_start),
    .store_first_call(store_first_call),
    // For accessing RAM
    .ram_wren(asRamWren),
    .ram_addr(asRamAddr),
    .ram_wr_data(asRamWrData),
    .ram_rd_data(asRamRdData)
);

parameter DO_HIER = `DO_HIER;
generate
if (DO_HIER) begin
    HierarchyStack hier_stack (
        .clk(clk),
        .reset(reset),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .child_count(hsChildCount),
        .count_in(dcCountOut),
        .init_start(init_start),
        // For accessing RAM
        .ram_wren(hsStackRamWren),
        .ram_addr(hsStackRamAddr),
        .ram_wr_data(hsStackRamWrData),
        .ram_rd_data(hsStackRamRdData)
    );
end
endgenerate
AddressHash addr_hash (
    .clk(clk),
    .reset(reset),
    .instr_nolag(instr_in2),
    .instrValid_nolag(insValid_in),
    .addr_in(opJALrTarget),
    .funcNum(ahFuncNum),
    .jalr(ahJALr),
    .init_start(init_hash_start),
    .init_done(init_hash_done),
    .instr_init(instr_init),
    // For accessing RAM
    .ram_addr(ahRamAddr),
    .ram_rd_data(ahRamRdData)
);

DataStorage data_storage (
    .clk(clk),
    .reset(reset),
    .init_start(init_start),
    .retrieve_start(retrieve_start),
    .hash_wren(dsHashRamWren),
    .hash_addr(dsHashRamAddr),
    .hash_wr_data(dsHashRamWrData),
    .hash_rd_data(dsHashRamRdData),
    .callstack_wren(dsStackRamWren),
    .callstack_addr(dsStackRamAddr),
    .callstack_wr_data(dsStackRamWrData),
    .callstack_rd_data(dsStackRamRdData),
    .hierstack_wren(hsStackRamWren),
    .hierstack_addr(hsStackRamAddr),
    .hierstack_wr_data(hsStackRamWrData),
    .hierstack_rd_data(hsStackRamRdData),
    .storage_wren(dsStorageRamWren),
    .storage_addr(dsStorageRamAddr),
    .storage_wr_data(dsStorageRamWrData),
    .storage_rd_data(dsStorageRamRdData)
);

vInitializer init (
    .clk(clk),
    .reset(glob_reset),
    .reset_out(init_reset),
    .pc(pc),
    .init_start(init_start),
    .init_done(init_done),
    .init_hash_start(init_hash_start),
    .init_hash_done(init_hash_done),
    .hash_first_start(hash_first_start),
    .hash_first_done(hash_first_done),
    .hashWren(iHashRamWren),
    .hashAddr(iHashRamAddr),
    .hashWrData(iHashRamWrData),
    // Avalon Bus side signals
    .avm_profileMaster_read(avm_profileMaster_read),
    .avm_profileMaster_write(init_avm_profileMaster_write),
    .avm_profileMaster_address(init_avm_profileMaster_address),
    .avm_profileMaster_writedata(init_avm_profileMaster_writedata),
    .avm_profileMaster_byteenable(init_avm_profileMaster_byteenable),
    .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest),
    .avm_profileMaster_readdata(avm_profileMaster_readdata),
    .avm_profileMaster_readdatavalid(avm_profileMaster_readdatavalid)
);

generate
if (PROF_METHOD!="p") begin
    vRetriever retrieve (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .retrieve_start(retrieve_start),
        .retrieve_done(retrieve_done),
        .prof_data(rProfData),
        .prof_index(rProfIndex),
        // Avalon Bus side signals
        .avm_profileMaster_write(retrieve_avm_profileMaster_write),
        .avm_profileMaster_address(retrieve_avm_profileMaster_address),
        .avm_profileMaster_writedata(retrieve_avm_profileMaster_writedata),
        .avm_profileMaster_byteenable(retrieve_avm_profileMaster_byteenable),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest)
    );
end else begin
    vPowerRetriever retrieve (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .retrieve_start(retrieve_start),
        .retrieve_done(retrieve_done),
        .prof_data(rProfData),
        .prof_index(rProfIndex),
        // Avalon Bus side signals
        .avm_profileMaster_write(retrieve_avm_profileMaster_write),
        .avm_profileMaster_address(retrieve_avm_profileMaster_address),
        .avm_profileMaster_writedata(retrieve_avm_profileMaster_writedata),
        .avm_profileMaster_byteenable(retrieve_avm_profileMaster_byteenable),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest)
    );
end
endgenerate
endmodule
module OpDecode (
    input clk,
    input reset,
    input [25:0] pc,
    input [25:0] pc_nolag,
    input [31:0] instr,
    input [31:0] instr_nolag,
    input insValid,
    output ret,
    output call,
    output reg jalr,
    output reg jalr_done,
    output [25:0] target,
    output [25:0] jalr_target,
    output reg pc_within_8
);

reg [25:0] pc_jalr;
reg [2:0] pc_diff_count;

// Detect calls and returns
wire tCall = insValid & (instr[31:26] == 6'b000011);
wire tRet  = insValid & (instr[31:0] == 32'b0000_0011_1110_0000_0000_0000_0000_1000); // $rs = $ra = 31
wire tJALR = (instr_nolag[31:26] == 6'b0 & instr_nolag[20:16] == 5'b0 & instr_nolag[10:6] == 5'b0 & instr_nolag[5:0] == 6'b001001);
wire [25:0] tTarget = {instr[23:0], 2'b0};

assign call = tCall;
assign ret = tRet;
assign target = tTarget;
assign jalr_target = pc_jalr;

// On a JALr, find when the actual jump occurs, use this PC value as the target address
always @(posedge clk) begin
    if (reset) begin
        jalr_done <= 1'b0;
        pc_diff_count <= 'b0;
        jalr <= 1'b0;
        pc_jalr <= 'b0;
    end else begin
        if (jalr_done) begin
            if (pc_diff_count < 3'b010) pc_diff_count <= pc_diff_count + 1'b1;
            else begin
                pc_diff_count <= 'b0;
                jalr_done <= 1'b0;
            end
        end else if (tJALR) begin
            pc_diff_count <= 'b1;
            pc_jalr <= pc_nolag;
            jalr <= 1'b1;
            jalr_done <= 1'b0;
        end else if (jalr & !pc_within_8) begin
            pc_diff_count <= 'b0;
            pc_jalr <= pc;
            jalr <= 1'b0;
            jalr_done <= 1'b1;
        end
    end
    pc_within_8 <= (pc_nolag == pc) | (pc_nolag == pc + 'h4);
end
endmodule
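OpDecode classifies instructions purely from their MIPS-I encodings: jal carries opcode 000011, and a return is matched only as the exact word jr $ra (0x03E00008). A small Python sketch of that decode (illustrative model, not the thesis RTL):

```python
# Illustrative model of OpDecode's call/return detection (not thesis RTL).
JR_RA = 0x03E00008  # exact encoding of "jr $ra", the MIPS return idiom

def is_call(word):
    """jal: opcode field (bits 31:26) == 6'b000011."""
    return (word >> 26) == 0b000011

def is_ret(word):
    """A return is matched only as the exact 'jr $ra' word."""
    return word == JR_RA

def call_target(word):
    """Byte target of a jal; like OpDecode, keep instr[23:0] and shift left 2."""
    return (word & 0x00FFFFFF) << 2

# hypothetical jal whose target is `WRAP_MAIN_BEGIN ('h800018)
jal_main = (0b000011 << 26) | (0x800018 >> 2)
print(is_call(jal_main), is_ret(JR_RA), hex(call_target(jal_main)))  # True True 0x800018
```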
module InstructionCounter (
    input clk,
    input reset,
    input [31:0] prev_pc,
    input [31:0] pc,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

reg [`CW-1:0] counter;
assign count_out = counter;

wire pc_diff = (prev_pc != pc);
wire [31:0] counter_plus_one = counter + 1'b1;

always @(posedge clk) begin
    if (init_start | call | ret) begin
        counter <= 'b1;
    end else if (reset) begin
        counter <= 'b0;
    end else if (pc_diff) begin
        counter <= counter_plus_one;
    end
end
endmodule
module CycleCounter (
    input clk,
    input reset,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

reg [`CW-1:0] counter;
assign count_out = counter;
wire [31:0] counter_plus_one = counter + 1'b1;

always @(posedge clk) begin
    if (init_start | call | ret) begin
        counter <= 'b1;
    end else if (reset) begin
        counter <= 'b0;
    end else begin
        counter <= counter_plus_one;
    end
end
endmodule
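CycleCounter restarts at 1 rather than 0 on a call or return, so the cycle consumed by the call instruction itself is attributed to the newly entered function while the elapsed count is handed off to CounterStorage. A behavioural sketch of that restart rule (illustrative Python, not thesis code):

```python
# Behavioural model of CycleCounter's restart-on-event semantics (illustrative only).
def cycle_counter(events):
    """events: one flag per clock edge, each "tick", "call", or "ret".
    Returns the counter value observed after each edge; on call/ret the
    counter restarts at 1, mirroring the RTL (init_start behaves like call)."""
    counter = 0
    out = []
    for ev in events:
        if ev in ("call", "ret"):
            counter = 1   # restart; the elapsed count was captured by CounterStorage
        else:
            counter += 1  # plain cycle: increment
        out.append(counter)
    return out

print(cycle_counter(["tick", "tick", "call", "tick", "ret", "tick"]))  # [1, 2, 1, 2, 1, 2]
```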
module StallCounter (
    input clk,
    input reset,
    input stall,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

reg [`CW-1:0] counter;
assign count_out = counter;
wire [31:0] counter_plus_one = counter + 1'b1;

always @(posedge clk) begin
    if (init_start | call | ret) begin
        counter <= stall;
    end else if (reset) begin
        counter <= 'b0;
    end else if (stall) begin
        counter <= counter_plus_one;
end
end
endmodule
module PowerCounter (
    input clk,
    input reset,
    input pc_diff,
    input stall,
    input [31:0] instr_in,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

// This assumes 6 groups + stalls (not parameterized)
reg [`PW-1:0] A;
reg [`PW-1:0] B;
reg [`PW-1:0] C;
reg [`PW-1:0] D;
reg [`PW-1:0] E;
reg [`PW-1:0] F;
reg [`PSW-1:0] STALL;
reg [`PW-1:0] UNKNOWN;

assign count_out = {A, B, C, D, E, F, STALL};

wire [5:0] op = instr_in[31:26];
wire [5:0] funct = instr_in[5:0];
reg [3:0] cur_group;

always @(posedge clk) begin
    if (reset) begin
        A <= 'b0;
        B <= 'b0;
        C <= 'b0;
        D <= 'b0;
        E <= 'b0;
        F <= 'b0;
        STALL <= 'b0;
        cur_group <= 'b0;
    end else begin
        cur_group = 'b0;
        // PC_DIFF
        if (~stall & pc_diff) begin
            // NOP
            if (instr_in == 'b0) begin
                cur_group = `P_NOP; // NOP
            // BEQ, BNE, BLEZ, BGTZ
            end else if (op[5:2] == 4'b0001) begin
                if (op[1:0] == 2'b00) cur_group = `P_BEQ;  // BEQ
                if (op[1:0] == 2'b01) cur_group = `P_BNE;  // BNE
                if (op[1:0] == 2'b10) cur_group = `P_BLEZ; // BLEZ
                if (op[1:0] == 2'b11) cur_group = `P_BGTZ; // BGTZ
            // BLTZ, BGEZ, BGEZAL, BLTZAL
            end else if (op == 6'b000001) begin
                if (instr_in[16] == 1'b0) cur_group = `P_BLTZ; // BLTZ, BLTZAL
                else cur_group = `P_BGEZ; // BGEZ, BGEZAL
            // R-type instruction
            end else begin
                if (op == 'b000000) begin
                    // ADD, ADDU, SUB, SUBU
                    if (funct[5:2] == 4'b1000) begin
                        if (funct[1] == 1'b0) cur_group = `P_ADD_ADDU; // ADD, ADDU
                        else begin
                            if (funct[0] == 1'b0) cur_group = `P_SUB; // SUB
                            else cur_group = `P_SUBU; // SUBU
                        end
                    // MULT, MULTU, DIV, DIVU
                    end else if (funct[5:2] == 4'b0110) begin
                        if (funct[1:0] == 2'b00) cur_group = `P_MULT;  // MULT
                        if (funct[1:0] == 2'b01) cur_group = `P_MULTU; // MULTU
                        if (funct[1] == 1'b1) cur_group = `P_DIV_DIVU; // DIV, DIVU
                    // AND, OR, XOR
                    end else if (funct[5:2] == 4'b1001) begin
                        if (funct[1:0] == 2'b00) cur_group = `P_AND; // AND
                        if (funct[1:0] == 2'b01) cur_group = `P_OR;  // OR
                        if (funct[1:0] == 2'b10) cur_group = `P_XOR; // XOR
                    // MFHI, MFLO
                    end else if (funct[5:2] == 'b0100) begin
                        if (funct[1] == 1'b0) cur_group = `P_MFHI; // MFHI
                        else cur_group = `P_MFLO; // MFLO
                    // SLL, SRA, SRL, SLLV, SRLV
                    end else if (funct[5:3] == 3'b000 && (funct[1] || ~funct[0])) begin
                        if (funct[2:0] == 3'b000) cur_group = `P_SLL;  // SLL
                        if (funct[2:0] == 3'b100) cur_group = `P_SLLV; // SLLV
                        if (funct[1:0] == 2'b11)  cur_group = `P_SRA;  // SRA
                        if (funct[2:0] == 3'b010) cur_group = `P_SRL;  // SRL
                        if (funct[2:0] == 3'b110) cur_group = `P_SRLV; // SRLV
                    // SLT, SLTU
                    end else if (funct[5:1] == 5'b10101) begin
                        if (funct[0] == 1'b0) cur_group = `P_SLT; // SLT
                        else cur_group = `P_SLTU; // SLTU
                    // JR, JALR
                    end else if (funct[5:1] == 5'b00100) begin
                        cur_group = `P_JR; // JR, JALR
                    end else begin
                        UNKNOWN <= UNKNOWN + 1'b1;
                    end
                // ADDI, ADDIU, SLTI, SLTIU, ANDI, ORI, XORI, LUI
                end else if (op[5:3] == 3'b001) begin
                    if (op[2:0] == 3'b101) cur_group = `P_ORI; // ORI
                    else if (op[2:0] == 3'b111) cur_group = `P_LUI; // LUI
                    else if (op[2:1] == 2'b00) cur_group = `P_ADDI_ADDIU; // ADDI, ADDIU
                    else if (op[2:1] == 2'b10) cur_group = `P_ANDI; // ANDI
                    else if (op[2:0] == 3'b010) cur_group = `P_SLTI; // SLTI
                    else if (op[1:0] == 2'b11) cur_group = `P_SLTIU; // SLTIU
                    else if (op[1:0] == 2'b10) cur_group = `P_XORI; // XORI
                // LBU, LHU
                end else if (op[5:1] == 'b10010) begin
                    if (op[0] == 1'b0) cur_group = `P_LBU; // LBU
                    else cur_group = `P_LHU; // LHU
                // LH
                end else if (op == 'b100001) begin
                    cur_group = `P_LH; // LH
                // LB, LW, SB, SH, SW
                end else if (op[5:4] == 'b10 && ~op[2]) begin
                    if (op[1:0] == 2'b11) begin
                        if (op[3] == 1'b0) cur_group = `P_LW; // LW
                        else cur_group = `P_SW; // SW
                    end else begin
                        if (op[3] == 1'b0) cur_group = `P_LB; // LB
                        else cur_group = `P_SB_SH; // SB, SH
                    end
                // J, JAL
                end else if (op[5:1] == 'b00001) begin
                    if (op[0] == 1'b0) cur_group = `P_J; // J
                    else cur_group = `P_JAL; // JAL
                end
            end
        // Cache Stall
        end else if (stall) begin
            cur_group = `STALL;
        // Pipeline Stall
        end else begin
            cur_group = `STALL;
        end

        if (init_start | call | ret) begin
            A <= (cur_group == `A);
            B <= (cur_group == `B);
            C <= (cur_group == `C);
            D <= (cur_group == `D);
            E <= (cur_group == `E);
            F <= (cur_group == `F);
            STALL <= (cur_group == `STALL);
        end else begin
            case (cur_group)
                `A: A <= A + 1'b1;
                `B: B <= B + 1'b1;
                `C: C <= C + 1'b1;
                `D: D <= D + 1'b1;
                `E: E <= E + 1'b1;
                `F: F <= F + 1'b1;
                `STALL: STALL <= STALL + 1'b1;
            endcase
end
end
end
endmodule
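PowerCounter only bins retired instructions into groups A-F (plus a stall lane); energy is obtained afterwards by weighting each group's count with a per-group power figure from the instruction-level power database. The Python sketch below illustrates that tally-and-weight idea; the group assignments loosely follow the `P_* defines in B.1, but the weights are made-up placeholders, not the thesis's measured values:

```python
# Illustrative tally of power groups, mirroring PowerCounter's binning idea.
# Weights are invented placeholders, NOT the thesis's instruction-level power database.
GROUP_OF = {"beq": "B", "lw": "B", "add": "C", "and": "D", "mult": "E", "div": "F"}
WEIGHT = {"A": 1.0, "B": 1.2, "C": 1.1, "D": 1.3, "E": 1.8, "F": 2.5, "STALL": 0.6}

def tally(instrs):
    """Count executed instructions per power group; stalls fall into the STALL lane."""
    counts = {g: 0 for g in WEIGHT}
    for ins in instrs:
        counts[GROUP_OF.get(ins, "STALL")] += 1
    return counts

def energy(counts):
    """Weighted sum over group counts, as the profiler's post-processing would compute."""
    return sum(counts[g] * WEIGHT[g] for g in counts)

c = tally(["add", "lw", "beq", "stall", "mult"])
print(c["B"], c["STALL"], round(energy(c), 2))  # 2 1 5.9
```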
module CounterStorage (
    input clk,
    input reset,
    input call,
    input ret,
    input [`N2-1:0] cur_func_num,
    input [`CW-1:0] count,
    input [`CW-1:0] child_count,
    input init_start,
    input pc_diff,
    // For accessing RAM
    output reg ram_wren,
    output reg [`N2-1:0] ram_addr,
    output reg [`CW-1:0] ram_wr_data,
    input [`CW-1:0] ram_rd_data
);

reg main_func;
reg [1:0] state;
wire call_ret;
reg prev_call_ret;
reg [`CW-1:0] prev_child_count;
reg [`CW-1:0] rd_data;
reg [`CW-1:0] wr_data;

assign call_ret = (`DO_HIER) ? ret : (ret | call); // store only on ret if doing hierarchical

// perform additions separately to avoid critical path
165
B Source Code
always @(posedge ( c l k ) ) begin
ram addr <= cur func num ;
i f (`PROF METHOD != ``p ' ' )i f (`DO HIER) ram wr data <= ram rd data + wr data + p r ev ch i l d coun t ;
else ram wr data <= ram rd data + wr data ;
else begin
ram wr data <= { rd data [ 1 5 9 : 1 3 8 ] + wr data [ 1 5 9 : 1 3 8 ] , rd data [ 1 3 7 : 1 1 6 ] + wr data
[ 1 3 7 : 1 1 6 ] ,
rd data [ 1 1 5 : 9 4 ] + wr data [ 1 1 5 : 9 4 ] , rd data [ 9 3 : 7 2 ] + wr data [ 9 3 : 7 2 ] ,
rd data [ 7 1 : 5 0 ] + wr data [ 7 1 : 5 0 ] , rd data [ 4 9 : 2 8 ] + wr data [ 4 9 : 2 8 ] , rd data [ 2 7 : 0 ]
+ wr data [ 2 7 : 0 ] } ;
end
end
// whenever c a l l or r e t go high , load o l d va lue and add count to i t
always @(posedge ( c l k ) ) begin
i f ( r e s e t | i n i t s t a r t ) begin
ram wren <= 1 'b0 ;
p r e v c a l l r e t <= 1 ' b0 ;
s t a t e <= 2 ' b00 ;
p r ev ch i l d coun t <= ' b0 ;
// main func used to count f i r s t c y c l e in ``main ' ' s ince i t s u s u a l l y compensated f o r by
the l a s t i n s t r in prev i ous func t i on
main func <= (`DO HIER) ? 1 'b0 : 1 'b1 ;
end else begin
i f ( s t a t e == 2 ' b00 ) begin
p r e v c a l l r e t <= c a l l r e t ;
p r ev ch i l d coun t <= ch i l d coun t ;
ram wren <= 1 'b0 ;
// i f we have a *new* c a l l or ret , read from RAM @ cur func num ( shou l d be a l ready
there )
i f ( c a l l r e t & ! p r e v c a l l r e t ) begin
i f (`PROF METHOD != ``p ' ' ) wr data <= count + ( p c d i f f & main func ) ;
else wr data <= count ;
s t a t e <= 2 ' b01 ;
main func <= 1 ' b0 ;
end
166
B.2 LEAP Source Code
end else i f ( s t a t e == 2 ' b01 ) begin
ram wren <= 1 'b1 ;
s t a t e <= 2 ' b00 ;
end
end
end
endmodule
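In power-profiling mode, CounterStorage above treats the 160-bit counter word as seven independent fields and sums each field separately, so one field overflowing cannot carry into its neighbour. As a minimal software model of that packed addition (the field boundaries are read directly off the bit slices in the Verilog; this Python sketch is illustrative, not part of the LEAP sources):

```python
# Software model of CounterStorage's packed-field addition (power mode).
# The 160-bit word holds six 22-bit fields and one 28-bit field; each is
# summed independently so carries never cross field boundaries.
FIELDS = [(138, 22), (116, 22), (94, 22), (72, 22), (50, 22), (28, 22), (0, 28)]

def packed_add(a: int, b: int) -> int:
    """Add two 160-bit packed counter words field by field."""
    out = 0
    for lsb, width in FIELDS:
        mask = (1 << width) - 1
        field_sum = (((a >> lsb) & mask) + ((b >> lsb) & mask)) & mask
        out |= field_sum << lsb
    return out
```

Note that a field simply wraps modulo its own width, mirroring the truncating additions in the hardware.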
module CallStack (
  input clk,
  input reset,
  input ret,
  input call,
  input  [`N2-1:0] func_num_in,
  output [`N2-1:0] func_num_out,
  input pc_diff,
  input init_start,
  input store_first_call,
  // For accessing RAM
  output reg ram_wren,
  output reg [`S2-1:0] ram_addr,
  output [`N2-1:0] ram_wr_data,
  input  [`N2-1:0] ram_rd_data
);

reg [`S2-1:0] sp;   // stack pointer
reg init_call_done; // so we only do a call to that function once

// Assign wires for RAM
assign ram_wr_data = func_num_in;
assign func_num_out = ram_rd_data;

reg state;

always @(posedge (clk)) begin
  if (reset | (init_start & !store_first_call & !init_call_done)) begin
    sp <= 'b0;
    ram_wren <= 1'b0;
    init_call_done <= 1'b0;
    state <= 1'b0;
  end else if (state == 1'b0) begin
    // pop on a return -- decrement sp then return value @ new sp
    if (ret) begin
      sp <= sp - 1'b1;
      ram_addr <= sp - 1'b1;
      ram_wren <= 1'b0;
      state <= 1'b1;
    // push on call (when ready) -- store value @ current sp then increment sp
    end else if (call) begin
      sp <= sp + 1'b1;
      ram_addr <= sp;
      ram_wren <= 1'b1;
      init_call_done <= 1'b1;
      state <= 1'b1;
    // if not call or ret, stop write
    end else begin
      ram_wren <= 1'b0;
      ram_addr <= sp - 1'b1;
    end
  end else if (state == 1'b1) begin
    if (!call & !ret) state <= 1'b0;
    ram_wren <= 1'b0;
  end
end

endmodule
module HierarchyStack (
  input clk,
  input reset,
  input ret,
  input call,
  output reg [`CW-1:0] child_count,
  input      [`CW-1:0] count_in,
  input init_start,
  // For accessing RAM
  output reg ram_wren,
  output reg [`S2-1:0] ram_addr,
  output reg [`CW-1:0] ram_wr_data,
  input      [`CW-1:0] ram_rd_data
);

reg [`S2-1:0] sp; // stack pointer
reg [1:0] state;
reg prev_ret;
reg [`CW-1:0] prev_count;
reg [`CW-1:0] rd_data;

// grab the counter value before it's reset @ ret/call!
always @(posedge (clk)) begin
  if (reset | init_start) begin
    sp <= 'b0;
    ram_wren <= 1'b0;
    child_count <= 'b0;
    state <= 2'b00;
    prev_count <= 'b0;
  end else if (state == 2'b00) begin
    // pop on a return -- decrement sp then return value @ new sp
    if (ret) begin
      sp <= sp - 1'b1;
      ram_addr <= sp - 1'b1;
      ram_wren <= 1'b0;
      prev_ret <= 1'b1;
      state <= 2'b01;
      child_count <= child_count + count_in;
      rd_data <= ram_rd_data;
    // push on call (when ready) -- store value @ current sp then increment sp
    end else if (call) begin
      sp <= sp + 1'b1;
      ram_addr <= sp;
      ram_wren <= 1'b1;
      prev_ret <= 1'b0;
      state <= 2'b01;
      ram_wr_data <= child_count + count_in;
    // if not call or ret, stop write
    end else begin
      ram_wren <= 1'b0;
      ram_addr <= sp - 1'b1;
    end
  end else if (state == 2'b01) begin
    ram_wren <= 1'b0;
    state <= 2'b00;
    if (prev_ret) child_count <= child_count + rd_data;
    else          child_count <= 'b0;
  end
end

endmodule
module AddressHash (
  input clk,
  input reset,
  input [31:0] instr_nolag,
  input instrValid_nolag,
  input [25:0] addr_in,
  output [`N2-1:0] funcNum,
  input enbl_in,
  input init_start,
  output reg init_done,
  input instr_init,
  // For accessing RAM
  output [`N2-1:0] ram_addr,
  input [7:0] ram_rd_data
);

reg [31:0] V1;
reg [7:0] B1;
reg [7:0] B2;
reg [7:0] A1;
reg [7:0] A2;
reg [31:0] a;
wire [31:0] b;
wire [7:0] tab_b;
reg [31:0] val;
reg [`N2-1:0] rsl;
reg [3:0] state;
reg [3:0] iState;
reg [3:0] next_state;
reg [`N2-1:0] init_addr;
reg [1:0] rd_cnt;

assign funcNum = rsl[`N2-1:0];
assign ram_addr = (init_start & !init_done) ? init_addr : b[`N2-1:0];
assign tab_b = ram_rd_data;

// decode non-lagged instr
wire call = (instrValid_nolag & (instr_nolag[31:26] == 6'b000011));
wire [25:0] addr = call ? jalr ? addr_in : {6'b0, instr_nolag[23:0], 2'b0} : 32'h00800000;

// initialize hash parameters
always @(posedge (clk)) begin
  if (reset) begin
    iState <= 4'b0000;
    init_done <= 1'b0;
  end else if (init_start & !init_done) begin
    // State 0: Start reading V1
    if (iState == 4'b0000) begin
      init_addr <= `N;
      iState <= 4'b1111;
      next_state <= 4'b0001;
      rd_cnt <= 2'b00;
    // State 1: Read next byte of V1
    end else if (iState == 4'b0001) begin
      init_addr <= init_addr + 1'b1;
      rd_cnt <= rd_cnt + 1'b1;
      case (rd_cnt)
        2'b00: V1[31:24] <= ram_rd_data;
        2'b01: V1[23:16] <= ram_rd_data;
        2'b10: V1[15:8]  <= ram_rd_data;
        2'b11: V1[7:0]   <= ram_rd_data;
      endcase
      if (rd_cnt == 2'b11) next_state <= 4'b0010;
      else                 next_state <= 4'b0001;
      iState <= 4'b1111; // go to intermediate state to wait for read data
    // State 2: Read A/B
    end else if (iState == 4'b0010) begin
      init_addr <= init_addr + 1'b1;
      rd_cnt <= rd_cnt + 1'b1;
      case (rd_cnt)
        2'b00: A1 <= ram_rd_data;
        2'b01: A2 <= ram_rd_data;
        2'b10: B1 <= ram_rd_data;
        2'b11: B2 <= ram_rd_data;
      endcase
      if (rd_cnt == 2'b11) next_state <= 4'b0011;
      else                 next_state <= 4'b0010;
      iState <= 4'b1111; // go to intermediate state to wait for read data
    // State 3: Hash initialization is done
    end else if (iState == 4'b0011) begin
      init_done <= 1'b1;
      iState <= 4'b0000;
    // Delay one cycle for RAM to read data
    end else if (iState == 4'b1111) begin
      iState <= next_state;
    end
  end else if (!init_start & init_done) begin // clear init_done at end of handshake
    init_done <= 1'b0;
  end
end

wire enbl = instr_init | call | jalr;
assign b = (val >> B1) & B2;

// state machine for hashing
always @(posedge (clk)) begin
  if (reset) begin
    state <= 3'b000;
  end else begin
    if (state == 3'b000) begin
      if (enbl) begin
        val = addr + V1;
        val = val + (val << 8);
        val = val ^ (val >> 4);
        state <= 3'b001;
      end else begin
        state <= state;
      end
    end else if (state == 3'b001) begin
      a = (val + (val << A1));
      state <= 3'b010;
    end else if (state == 3'b010) begin
      rsl = ((a >> A2) ^ tab_b);
      state <= 3'b000;
    end
  end
end

endmodule
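The hashing state machine above computes a function index from a call target in three cycles: an add-shift-xor mix, a shift-add, then a combine with a table lookup. A software model of the same arithmetic, with explicit 32-bit masking standing in for Verilog's fixed-width truncation (the real V1/A1/A2/B1/B2 and tab[] values are loaded from SDRAM at init time, so the parameters in the usage below are illustrative only):

```python
# Software model of AddressHash's three-cycle hash pipeline.
MASK32 = (1 << 32) - 1

def hash_addr(addr, V1, A1, A2, B1, B2, tab):
    # Cycle 1 (state 000): initial mix of the target address
    val = (addr + V1) & MASK32
    val = (val + (val << 8)) & MASK32
    val = val ^ (val >> 4)
    # Cycle 2 (state 001): shift-add
    a = (val + (val << A1)) & MASK32
    # Cycle 3 (state 010): combine with table entry at b = (val >> B1) & B2
    b = (val >> B1) & B2
    return ((a >> A2) ^ tab[b]) & MASK32
```

With all parameters zero and a one-entry table, the result is just the mixed address xored with tab[0], which makes the data path easy to check by hand.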
module DataStorage (
  input clk,
  input reset,
  input init_start,
  input retrieve_start,
  // Hash I/Os
  input hash_wren,
  input [`N2-1:0] hash_addr,
  input [7:0] hash_wr_data,
  output [7:0] hash_rd_data,
  // Address Stack I/Os
  input callstack_wren,
  input [`S2-1:0] callstack_addr,
  input [`N2-1:0] callstack_wr_data,
  output [`N2-1:0] callstack_rd_data,
  // Hierarchy Stack I/Os
  input hierstack_wren,
  input [`S2-1:0] hierstack_addr,
  input [`CW-1:0] hierstack_wr_data,
  output [`CW-1:0] hierstack_rd_data,
  // Storage I/Os
  input storage_wren,
  input [`N2-1:0] storage_addr,
  input [`CW-1:0] storage_wr_data,
  output [`CW-1:0] storage_rd_data
);

// 9-bit address for indexing into M4K (2^9 = 512, 512*8 = 4096)
wire [8:0] hash_9bit_addr      = { 1'b0, {(8-`N2){1'b0}}, hash_addr[`N2-1:0] };
wire [8:0] callstack_9bit_addr = { 1'b1, {(8-`S2){1'b0}}, callstack_addr[`S2-1:0] };

// Extend stack data to 8 bits
wire [7:0] callstack_8bit_wr_data = { {(8-`N2){1'b0}}, callstack_wr_data[`N2-1:0] };
wire [7:0] callstack_8bit_rd_data;
assign callstack_rd_data = callstack_8bit_rd_data[`N2-1:0];

reg init_done;

// muxing wires for resetting storage RAM
wire storage_wren_mux;
wire [`N2-1:0] storage_addr_mux;
wire [`CW-1:0] storage_wr_data_mux;
reg [`N2-1:0] reset_cnt;

assign storage_wren_mux    = (init_start & !init_done) ? 1'b1 : (retrieve_start) ? 1'b0 : storage_wren;
assign storage_addr_mux    = (init_start & !init_done) ? reset_cnt : storage_addr;
assign storage_wr_data_mux = (init_start & !init_done) ? 'b0 : storage_wr_data;

// state machine to reset all storage RAM data
always @(posedge (clk)) begin
  if (reset) begin
    init_done <= 1'b0;
    reset_cnt <= 'b0;
  end else if (init_start & !init_done) begin
    if (reset_cnt < `N-1) reset_cnt <= reset_cnt + 1'b1;
    else                  init_done <= 1'b1;
  end else if (!init_start & init_done) begin
    init_done <= 1'b0;
    reset_cnt <= 'b0;
  end
end

altsyncram hier_stack (
  .wren_a (hierstack_wren),
  .clock0 (clk),
  .clocken0 (1'b1),
  .address_a (hierstack_addr),
  .data_a (hierstack_wr_data),
  .q_a (hierstack_rd_data),
  .aclr0 (1'b0),
  .aclr1 (1'b0),
  .address_b (1'b1),
  .addressstall_a (1'b0),
  .addressstall_b (1'b0),
  .byteena_a (1'b1),
  .byteena_b (1'b1),
  .clock1 (1'b1),
  .clocken1 (1'b1),
  .clocken2 (1'b1),
  .clocken3 (1'b1),
  .data_b (1'b1),
  .eccstatus (),
  .q_b (),
  .rden_a (1'b1),
  .rden_b (1'b1),
  .wren_b (1'b0)
);
defparam
  hier_stack.clock_enable_input_a = "BYPASS",
  hier_stack.clock_enable_output_a = "BYPASS",
  hier_stack.intended_device_family = "Cyclone II",
  hier_stack.lpm_hint = "ENABLE_RUNTIME_MOD=NO",
  hier_stack.lpm_type = "altsyncram",
  hier_stack.numwords_a = `S,
  hier_stack.operation_mode = "SINGLE_PORT",
  hier_stack.outdata_aclr_a = "NONE",
  hier_stack.outdata_reg_a = "UNREGISTERED",
  hier_stack.power_up_uninitialized = "FALSE",
  hier_stack.ram_block_type = "M4K",
  hier_stack.widthad_a = `S2,
  hier_stack.width_a = `CW,
  hier_stack.width_byteena_a = 1;

// Instantiate the Hash Array & Call Stack using AltSyncRam
altsyncram hash_stack (
  // Hash Signals (Port A)
  .wren_a (hash_wren),
  .address_a (hash_9bit_addr),
  .data_a (hash_wr_data),
  .q_a (hash_rd_data),
  .byteena_a (1'b1),
  // Stack Signals (Port B)
  .wren_b (callstack_wren),
  .address_b (callstack_9bit_addr),
  .data_b (callstack_8bit_wr_data),
  .q_b (callstack_8bit_rd_data),
  .byteena_b (1'b1),
  // Common Signals
  .clock0 (clk),
  .clocken0 (1'b1),
  .aclr0 (1'b0),
  .aclr1 (1'b0),
  .addressstall_a (1'b0),
  .addressstall_b (1'b0),
  .clock1 (1'b1),
  .clocken1 (1'b1),
  .clocken2 (1'b1),
  .clocken3 (1'b1),
  .eccstatus (),
  .rden_a (1'b1),
  .rden_b (1'b1)
);
defparam
  hash_stack.address_reg_b = "CLOCK0",
  hash_stack.clock_enable_input_a = "BYPASS",
  hash_stack.clock_enable_input_b = "BYPASS",
  hash_stack.clock_enable_output_a = "BYPASS",
  hash_stack.clock_enable_output_b = "BYPASS",
  hash_stack.indata_reg_b = "CLOCK0",
  hash_stack.intended_device_family = "Cyclone II",
  hash_stack.lpm_type = "altsyncram",
  hash_stack.numwords_a = 512,
  hash_stack.numwords_b = 512,
  hash_stack.operation_mode = "BIDIR_DUAL_PORT",
  hash_stack.outdata_aclr_a = "NONE",
  hash_stack.outdata_aclr_b = "NONE",
  hash_stack.outdata_reg_a = "UNREGISTERED",
  hash_stack.outdata_reg_b = "UNREGISTERED",
  hash_stack.power_up_uninitialized = "FALSE",
  hash_stack.read_during_write_mode_mixed_ports = "DONT_CARE",
  hash_stack.widthad_a = 9, // (2^9)*8 = 4096. Just use the whole M4K(s)
  hash_stack.widthad_b = 9, // (2^9)*8 = 4096. Just use the whole M4K(s)
  hash_stack.width_a = 8, // if N <= 256, widthad can be 8. if more will need separate widths for each port
  hash_stack.width_b = 8, // if N <= 256, widthad can be 8. if more will need separate widths for each port
  hash_stack.width_byteena_a = 1,
  hash_stack.width_byteena_b = 1,
  hash_stack.wrcontrol_wraddress_reg_b = "CLOCK0";

// Instantiate the Storage RAM
altsyncram storage (
  .wren_a (storage_wren_mux),
  .clock0 (clk),
  .clocken0 (1'b1),
  .address_a (storage_addr_mux),
  .data_a (storage_wr_data_mux),
  .q_a (storage_rd_data),
  .aclr0 (1'b0),
  .aclr1 (1'b0),
  .address_b (1'b1),
  .addressstall_a (1'b0),
  .addressstall_b (1'b0),
  .byteena_a (1'b1),
  .byteena_b (1'b1),
  .clock1 (1'b1),
  .clocken1 (1'b1),
  .clocken2 (1'b1),
  .clocken3 (1'b1),
  .data_b (1'b1),
  .eccstatus (),
  .q_b (),
  .rden_a (1'b1),
  .rden_b (1'b1),
  .wren_b (1'b0)
);
defparam
  storage.clock_enable_input_a = "BYPASS",
  storage.clock_enable_output_a = "BYPASS",
  storage.intended_device_family = "Cyclone II",
  storage.lpm_hint = "ENABLE_RUNTIME_MOD=NO",
  storage.lpm_type = "altsyncram",
  storage.numwords_a = `N,
  storage.operation_mode = "SINGLE_PORT",
  storage.outdata_aclr_a = "NONE",
  storage.outdata_reg_a = "UNREGISTERED",
  storage.power_up_uninitialized = "FALSE",
  storage.ram_block_type = "M4K",
  storage.widthad_a = `N2,
  storage.width_a = `CW,
  storage.width_byteena_a = 1;

endmodule
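DataStorage packs the hash table and the call stack into a single 512-entry, 8-bit-wide M4K, using the ninth address bit to select which structure a port touches. A sketch of that address map in Python, assuming the maximal case `N2 = `S2 = 8 so the zero-padding terms in hash_9bit_addr/callstack_9bit_addr vanish (smaller configurations just use fewer low bits):

```python
# Software model of DataStorage's shared-M4K address map.
def hash_9bit_addr(hash_addr: int) -> int:
    # MSB = 0 selects the hash half (entries 0..255)
    return hash_addr & 0xFF

def callstack_9bit_addr(callstack_addr: int) -> int:
    # MSB = 1 selects the call-stack half (entries 256..511)
    return 0x100 | (callstack_addr & 0xFF)
```

Because the two halves differ in the MSB, the hash and stack ranges can never collide, which is what lets both structures share one physical block RAM.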
module vInitializer (
  input clk,
  input reset,
  output reset_out,
  input [25:0] pc,
  input init_start,
  output reg init_done,
  output reg init_hash_start,
  input init_hash_done,
  output reg hash_first_start,
  input hash_first_done,
  output reg hashWren,
  output reg [`N2-1:0] hashAddr,
  output reg [7:0] hashWrData,
  // Avalon bus side signals
  output reg avm_profileMaster_read,
  output reg avm_profileMaster_write,
  output reg [31:0] avm_profileMaster_address,
  output reg [31:0] avm_profileMaster_writedata,
  output reg [3:0] avm_profileMaster_byteenable,
  input avm_profileMaster_waitrequest,
  input [31:0] avm_profileMaster_readdata,
  input avm_profileMaster_readdatavalid
);

reg [2:0] state;
reg [2:0] next_state;
reg [`N2:0] read_count;
reg [2:0] hashByte;
reg [31:0] hashWrData32;
reg needData;

assign reset_out = (init_start & !init_done & state == 3'b000);

// Read each value from counter storage's RAM, write to SDRAM
always @(posedge (clk)) begin
  if (reset) begin
    state <= 3'b000;
    avm_profileMaster_write <= 1'b0;
    avm_profileMaster_read <= 1'b0;
    init_done <= 1'b0;
    init_hash_start <= 1'b0;
    hash_first_start <= 1'b0;
    hashByte <= 3'b000;
    needData <= 1'b0; // so we can start the next sdram read immediately
  end else begin
    if (init_start & !init_done) begin
      // State 0: Reset profiler
      if (state == 3'b000) begin
        // get sdram read ready
        avm_profileMaster_byteenable <= 4'b1111;
        // get ready to read V1/A/B from SDRAM
        avm_profileMaster_address <= `PROF_ADDR + `N;
        avm_profileMaster_read <= 1'b1;
        read_count <= 'b0;
        needData <= 1'b1;
        state <= 3'b001;
      // State 1: Read V1/A1/A2/B1/B2 from SDRAM, write to hash RAM
      end else if (state == 3'b001) begin
        if (!avm_profileMaster_waitrequest) begin
          if (needData) begin
            avm_profileMaster_read <= 1'b1;
            avm_profileMaster_address <= avm_profileMaster_address + 'h4;
            needData <= 1'b0;
          end else begin
            avm_profileMaster_read <= 1'b0;
          end
        end
        if (avm_profileMaster_readdatavalid) begin
          // prepare write-out to hash BRAM
          state <= 3'b111;
          hashAddr <= read_count + `N;
          hashByte <= 3'b001; // since i'm passing in the first byte now, don't re-send it next cycle
          hashWrData32 <= avm_profileMaster_readdata;
          hashWrData <= avm_profileMaster_readdata[31:24];
          read_count <= read_count + 'h4;
          // stop after V1/A/B are fully read & written
          if (read_count >= 'h8) begin // `N-4 b/c when reading at 28, we read the 28th, 29th, 30th, and 31st bits so we're done
            read_count <= 'b0;
            hashWren <= 1'b0;
            next_state <= 3'b010;
          end else begin
            hashWren <= 1'b1;
            next_state <= 3'b001;
            avm_profileMaster_read <= 1'b1; // get next sdram read started
          end
          if (read_count >= 'h4) needData <= 1'b0;
          else                   needData <= 1'b1;
        end
      // State 2: Initialize hash (put V1/A/B into registers)
      end else if (state == 3'b010) begin
        if (!init_hash_done) begin
          init_hash_start <= 1'b1;
        end else if (!avm_profileMaster_waitrequest) begin
          init_hash_start <= 1'b0;
          avm_profileMaster_read <= 1'b1;
          avm_profileMaster_address <= `PROF_ADDR;
          state <= 3'b011;
        end
      // State 3: Read tab[] from sdram, put to hash RAM
      end else if (state == 3'b011) begin
        // if we're not done getting tab[], keep reading
        if (!avm_profileMaster_waitrequest) begin
          if (needData) begin
            avm_profileMaster_read <= 1'b1;
            avm_profileMaster_address <= avm_profileMaster_address + 'h4;
            needData <= 1'b0;
          end else begin
            avm_profileMaster_read <= 1'b0;
          end
        end
        if (avm_profileMaster_readdatavalid) begin
          // prepare write-out to hash BRAM
          state <= 3'b111;
          hashAddr <= read_count[`N2-1:0];
          hashByte <= 3'b001; // since i'm passing in the first byte now, don't re-send it next cycle
          hashWrData32 <= avm_profileMaster_readdata;
          hashWrData <= avm_profileMaster_readdata[31:24];
          // if we've written all 4 bytes AND we're done reading from sdram, start reading V1
          if (read_count >= (`N-4)) begin // `N-4 b/c when reading at 28, we read the 28th, 29th, 30th, and 31st bits so we're done
            read_count <= 'b0;
            hashWren <= 1'b0;
            next_state <= 3'b100;
          end else begin // otherwise, come back here when bram's done
            read_count <= read_count + 'h4;
            hashWren <= 1'b1;
            next_state <= 3'b011;
            needData <= 1'b1;
          end
        end
      // State 4: Add first function (main/wrapper) to stack
      end else if (state == 3'b100) begin
        hash_first_start <= 1'b1;
        if (hash_first_done) begin
          hash_first_start <= 1'b0;
          state <= 3'b101;
        end
      // State 5: Tell processor to continue
      end else if (state == 3'b101) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_read <= 1'b0;
          avm_profileMaster_write <= 1'b1;
          avm_profileMaster_address <= `STACK_ADDR;
          avm_profileMaster_writedata <= 32'hDEADBEEF;
          state <= 3'b110;
        end
      // State 6: Wait until write finished
      end else if (state == 3'b110) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_write <= 1'b0;
          if (pc >= `WRAP_MAIN_BEGIN & pc <= `WRAP_MAIN_END) begin
            init_done <= 1'b1;
            state <= 3'b000;
          end
        end
      // State 7: Write the 4 bytes to the hash RAM
      end else if (state == 3'b111) begin
        // if we haven't written all 4 bytes, keep going
        if (hashByte < 3'b100) begin
          hashByte <= hashByte + 1'b1;
          hashAddr <= hashAddr + 1'b1;
          // choose byte to send (3'b000 is already sent)
          case (hashByte)
            3'b001: hashWrData <= hashWrData32[23:16];
            3'b010: hashWrData <= hashWrData32[15:8];
            3'b011: hashWrData <= hashWrData32[7:0];
          endcase
        end else begin
          hashWren <= 1'b0;
          state <= next_state;
        end
      end
    end else if (!init_start & init_done) begin
      init_done <= 1'b0; // reset init_done so handshaking can occur again
    end
  end
end

endmodule
module vRetriever ( // cycles
  input clk,
  input reset,
  input [25:0] pc,
  input retrieve_start,
  output reg retrieve_done,
  input [`CW-1:0] prof_data,
  output reg [`N2-1:0] prof_index,
  // Avalon bus side signals
  output reg avm_profileMaster_write,
  output reg [31:0] avm_profileMaster_address,
  output reg [31:0] avm_profileMaster_writedata,
  output reg [3:0] avm_profileMaster_byteenable,
  input avm_profileMaster_waitrequest
);

reg [2:0] state;
reg [`N2:0] prof_count; // used b/c prof_index can only go up to `N-1, so is always < `N (circular)

// Read each value from counter storage's RAM, write to SDRAM
always @(posedge (clk)) begin
  if (reset) begin
    retrieve_done <= 1'b0;
    state <= 3'b000;
    prof_index <= 'b0;
    avm_profileMaster_write <= 1'b0;
  end else begin
    if (retrieve_start & !retrieve_done) begin
      // State 0: Setup muxes
      if (state == 3'b000) begin
        avm_profileMaster_byteenable <= 4'b1111;
        prof_index <= 'b0;
        prof_count <= 'b0;
        state <= 3'b001;
      end
      // State 1: Prepare/write data to sdram -- PROBLEM, 64 bits will require 2 writes/reads
      else if (state == 3'b001) begin
        if (!avm_profileMaster_waitrequest) begin
          if (prof_count < `N) begin
`ifdef CW64
            avm_profileMaster_address <= `PROF_ADDR + {prof_index, 3'b000};
`else
            avm_profileMaster_address <= `PROF_ADDR + {prof_index, 2'b00};
`endif
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= prof_data;
            prof_index <= prof_index + 1'b1;
            prof_count <= prof_count + 1'b1;
`ifdef CW64
            state <= 3'b011;
`endif
          end else begin
            avm_profileMaster_address <= `STACK_ADDR;
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= 32'hCADFEED;
            state <= 3'b010;
          end
        end
      end
      // State 2: Wait until unstall write is finished
      else if (state == 3'b010) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_write <= 0;
          state <= 3'b000;
          retrieve_done <= 1'b1;
        end
      end
      // State 3: Write out top 32 bits for a 64-bit counter
`ifdef CW64
      else if (`CW > 32 & state == 3'b011) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_address <= `PROF_ADDR + {prof_index, 3'b100};
          avm_profileMaster_write <= 1'b1;
          avm_profileMaster_writedata <= prof_data[63:32];
          state <= 3'b001;
        end
      end
`endif
    end else if (!retrieve_start & retrieve_done) begin
      retrieve_done <= 1'b0; // reset retrieve_done so handshaking can occur again
    end
  end
end

endmodule
// Retrieve power results (assuming entry size of 160 = 32x5 bits)
module vPowerRetriever ( // power
  input clk,
  input reset,
  input [25:0] pc,
  input retrieve_start,
  output reg retrieve_done,
  input [`CW-1:0] prof_data,
  output reg [`N2-1:0] prof_index,
  // Avalon bus side signals
  output reg avm_profileMaster_write,
  output reg [31:0] avm_profileMaster_address,
  output reg [31:0] avm_profileMaster_writedata,
  output reg [3:0] avm_profileMaster_byteenable,
  input avm_profileMaster_waitrequest
);

reg [2:0] state;
reg [`N2:0] prof_count; // used b/c prof_index can only go up to `N-1, so is always < `N (circular)
reg [31:0] prof_addr; // this is to just add 32 bits to the address instead of doing shifting/etc.
reg [2:0] repeat_count; // used to count out all 32-bit words to write to sdram
reg [159:0] prof_data_write_out; // so we can start the next RAM read and not lose the current data

// Read each value from counter storage's RAM, write to SDRAM
always @(posedge (clk)) begin
  if (reset) begin
    retrieve_done <= 1'b0;
    state <= 3'b000;
    prof_index <= 'b0;
    avm_profileMaster_write <= 1'b0;
  end else begin
    if (retrieve_start & !retrieve_done) begin
      // State 0: Setup muxes
      if (state == 3'b000) begin
        avm_profileMaster_byteenable <= 4'b1111;
        prof_index <= 'b0;
        prof_count <= 'b0;
        prof_addr <= `PROF_ADDR;
        state <= 3'b001;
      end
      // State 1: Prepare/write data to sdram -- PROBLEM, 64 bits will require 2 writes/reads
      else if (state == 3'b001) begin
        if (!avm_profileMaster_waitrequest) begin
          if (prof_count < `N) begin
            avm_profileMaster_address <= prof_addr;
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= prof_data[159:128]; // just at intervals of 32 for sdram block size
            prof_data_write_out <= prof_data;
            prof_addr <= prof_addr + 32'h4; // 4 bytes = 32 bits = one sdram entry
            prof_index <= prof_index + 1'b1; // start the next RAM read so its ready in time
            repeat_count <= 'b0;
            state <= 3'b011;
          end else begin
            avm_profileMaster_address <= `STACK_ADDR;
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= 32'hCADFEED;
            state <= 3'b010;
          end
        end
      end
      // State 2: Wait until unstall write is finished
      else if (state == 3'b010) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_write <= 0;
          state <= 3'b000;
          retrieve_done <= 1'b1;
        end
      end
      // State 3: Write all 160 bits to SDRAM
      else if (state == 3'b011) begin
        if (!avm_profileMaster_waitrequest) begin
          if (repeat_count < 3'b100) begin
            avm_profileMaster_address <= prof_addr;
            avm_profileMaster_write <= 1'b1;
            prof_addr <= prof_addr + 32'h4;
            // choose word to send
            case (repeat_count)
              3'b000: avm_profileMaster_writedata <= prof_data_write_out[127:96];
              3'b001: avm_profileMaster_writedata <= prof_data_write_out[95:64];
              3'b010: avm_profileMaster_writedata <= prof_data_write_out[63:32];
              3'b011: avm_profileMaster_writedata <= prof_data_write_out[31:0];
            endcase
            repeat_count <= repeat_count + 1'b1;
          end else begin
            prof_count <= prof_count + 1'b1;
            avm_profileMaster_write <= 1'b0;
            state <= 3'b001;
          end
        end
      end
    end else if (!retrieve_start & retrieve_done) begin
      retrieve_done <= 1'b0; // reset retrieve_done so handshaking can occur again
    end
  end
end

endmodule
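vPowerRetriever streams each 160-bit power-profile entry to SDRAM as five consecutive 32-bit words, most-significant word first: state 1 writes bits [159:128], and state 3 writes the remaining four words at successive 4-byte addresses. That split can be modelled as a one-liner (an illustrative Python sketch, not part of the LEAP sources):

```python
# Software model of vPowerRetriever's write-out order for one 160-bit entry.
def split_entry(entry_160: int) -> list[int]:
    # Words in the order written to SDRAM: [159:128], [127:96], ..., [31:0]
    mask = (1 << 32) - 1
    return [(entry_160 >> shift) & mask for shift in (128, 96, 64, 32, 0)]
```

The host-side software that post-processes the SDRAM image would reassemble each entry by concatenating the five words in the same order.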
B.3 SnoopP Source Code
// Top-level module for SnoopP profiler
module SnoopP (
    input clk,
    input reset,
    input [25:0] pc,
    input [31:0] instr_in,
    input stall,
    // Handshaking ports
    input init_start,
    output init_done,
    input retrieve_start,
    output retrieve_done,
    // Avalon Profile Master ports
    output avm_profileMaster_read,
    output avm_profileMaster_write,
    output [31:0] avm_profileMaster_address,
    output [31:0] avm_profileMaster_writedata,
    input [31:0] avm_profileMaster_readdata,
    input avm_profileMaster_waitrequest,
    input avm_profileMaster_readdatavalid
);
    wire [`N2-1:0] addr_a;
    wire [`N2-1:0] count_a;
    wire [25:0] addr_lo_data;
    wire [25:0] addr_hi_data;
    wire [`CW-1:0] count_data;
    wire prof_init_start;
    wire prof_init_done;

    // mux avalon signals
    wire init_avm_profileMaster_write;
    wire [31:0] init_avm_profileMaster_address;
    wire [31:0] init_avm_profileMaster_writedata;
    wire retrieve_avm_profileMaster_write;
    wire [31:0] retrieve_avm_profileMaster_address;
    wire [31:0] retrieve_avm_profileMaster_writedata;

    assign avm_profileMaster_write = (init_start & !init_done) ?
        init_avm_profileMaster_write : retrieve_avm_profileMaster_write;
    assign avm_profileMaster_address = (init_start & !init_done) ?
        init_avm_profileMaster_address : retrieve_avm_profileMaster_address;
    assign avm_profileMaster_writedata = (init_start & !init_done) ?
        init_avm_profileMaster_writedata : retrieve_avm_profileMaster_writedata;

    // Instantiate the sub-modules
    Profiler profile (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .stall(stall),
        .addr_a(addr_a),
        .count_a(count_a),
        .addr_lo_data(addr_lo_data),
        .addr_hi_data(addr_hi_data),
        .count_data(count_data),
        .global_init(init_start | retrieve_start),
        .init_start(prof_init_start),
        .init_done(prof_init_done)
    );

    sInitializer init (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .init_start(init_start),
        .init_done(init_done),
        .addr_a(addr_a),
        .addr_lo_data(addr_lo_data),
        .addr_hi_data(addr_hi_data),
        .prof_init_start(prof_init_start),
        .prof_init_done(prof_init_done),
        // Avalon Bus side signals
        .avm_profileMaster_read(avm_profileMaster_read),
        .avm_profileMaster_write(init_avm_profileMaster_write),
        .avm_profileMaster_address(init_avm_profileMaster_address),
        .avm_profileMaster_writedata(init_avm_profileMaster_writedata),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest),
        .avm_profileMaster_readdata(avm_profileMaster_readdata),
        .avm_profileMaster_readdatavalid(avm_profileMaster_readdatavalid)
    );

    sRetriever retrieve (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .retrieve_start(retrieve_start),
        .retrieve_done(retrieve_done),
        .count_a(count_a),
        .count_data(count_data),
        // Avalon Bus side signals
        .avm_profileMaster_write(retrieve_avm_profileMaster_write),
        .avm_profileMaster_address(retrieve_avm_profileMaster_address),
        .avm_profileMaster_writedata(retrieve_avm_profileMaster_writedata),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest)
    );
endmodule
module Profiler (
    input clk,
    input reset,
    input [25:0] pc,
    input stall,
    // addresses to index into register arrays
    input [`N2-1:0] addr_a,
    input [`N2-1:0] count_a,
    // values returned from register arrays
    input [25:0] addr_lo_data,
    input [25:0] addr_hi_data,
    output [`CW-1:0] count_data,
    // init handshaking signals
    input global_init,  // to tell if the system is initializing
    input init_start,   // tell if new data is ready to write into reg's
    output reg init_done
);
    // Register arrays to store addr_lo/addr_hi/counters
    reg [25:0] addr_lo [`N-1:0];
    reg [25:0] addr_hi [`N-1:0];
    reg [`CW-1:0] count [`N-1:0];

    assign count_data = count[count_a];

    // initialize addr ranges
    always @(posedge (clk)) begin
        if (reset) begin
            init_done <= 1'b0;
        end else begin
            if (init_start & !init_done) begin
                addr_lo[addr_a] <= addr_lo_data;
                addr_hi[addr_a] <= addr_hi_data;
                init_done <= 1'b1;
            end else if (!init_start & init_done) begin
                init_done <= 1'b0;
            end
        end
    end

    // setup each addr range profiler
    integer i;
    reg [`CW-1:0] count_val;
    reg [`N2-1:0] count_cur;
    always @(posedge (clk)) begin
        if (reset) begin
            // Reset counters
            for (i = 0; i < `N; i = i + 1)
                count[i] <= 'b0;
        end else begin
            if (!global_init) begin
                // only count on stalls (or always if not measuring stalls)
                if (stall | (`PROF_METHOD != "s")) begin
                    for (i = 0; i < `N; i = i + 1) begin
                        // Check which function range the PC lies in
                        if (pc >= addr_lo[i] & pc <= addr_hi[i]) begin
                            count[i] <= count[i] + 1'b1;
                            count_val <= count[i] + 1'b1;
                            count_cur <= i;
                        end
                    end
                end
            end
        end
    end
endmodule
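For readers without a Verilog simulator on hand, the counting rule implemented by the Profiler module above can be sketched behaviorally in Python. This is an illustrative model only, not part of the thesis source: the address ranges and PC trace below are made up, and the `count_stalls_only` flag stands in for building with `PROF_METHOD` set to "s".

```python
def profile(ranges, pc_trace, stalls=None, count_stalls_only=False):
    """Model of the Profiler counting loop.

    ranges:   list of (addr_lo, addr_hi) pairs, one per counter.
    pc_trace: program-counter value observed on each cycle.
    stalls:   optional per-cycle stall flags (models the `stall` input).
    """
    counts = [0] * len(ranges)
    for cycle, pc in enumerate(pc_trace):
        # When profiling stalls, only stalled cycles are counted;
        # otherwise every cycle is counted.
        if count_stalls_only and not (stalls and stalls[cycle]):
            continue
        for i, (lo, hi) in enumerate(ranges):
            # Same comparison as the Verilog: addr_lo[i] <= pc <= addr_hi[i]
            if lo <= pc <= hi:
                counts[i] += 1
    return counts

# Two hypothetical function ranges; the PC spends 3 cycles in the
# first range and 2 in the second.
ranges = [(0x100, 0x1FF), (0x200, 0x2FF)]
trace = [0x100, 0x104, 0x108, 0x200, 0x204]
print(profile(ranges, trace))  # [3, 2]
```

Note that, as in the hardware, overlapping ranges each increment their own counter for the same cycle, since every range is checked in parallel.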
module sInitializer (
    input clk,
    input reset,
    input [25:0] pc,
    input init_start,
    output reg init_done,
    output reg [`N2-1:0] addr_a,
    output reg [25:0] addr_lo_data,
    output reg [25:0] addr_hi_data,
    output reg prof_init_start,
    input prof_init_done,
    // Avalon Bus side signals
    output reg avm_profileMaster_read,
    output reg avm_profileMaster_write,
    output reg [31:0] avm_profileMaster_address,
    output reg [31:0] avm_profileMaster_writedata,
    input avm_profileMaster_waitrequest,
    input [31:0] avm_profileMaster_readdata,
    input avm_profileMaster_readdatavalid
);
    reg [2:0] state;
    reg [2:0] next_state;
    reg [25:0] sdram_addr;
    reg [`N2+1:0] read_count;
    // used to divide by 2 since lo and hi reg's alternate
    wire [`N2-1:0] reg_addr = read_count[`N2:1];

    // Read each address range from SDRAM, write into the profiler's registers
    always @(posedge (clk)) begin
        if (reset) begin
            state <= 3'b000;
            avm_profileMaster_write <= 1'b0;
            avm_profileMaster_read <= 1'b0;
            avm_profileMaster_writedata <= 'b0;
            init_done <= 1'b0;
            prof_init_start <= 1'b0;
        end else begin
            if (init_start & !init_done) begin
                // State 0: Reset profiler
                if (state == 3'b000) begin
                    // get ready to read ranges from SDRAM
                    avm_profileMaster_address <= `PROF_ADDR;
                    sdram_addr <= `PROF_ADDR;
                    avm_profileMaster_read <= 1'b1;
                    prof_init_start <= 1'b1;
                    read_count <= 'b0;
                    state <= 3'b001;
                end else if (state == 3'b001) begin
                    // Handshaking with profiler module
                    if (prof_init_start & prof_init_done)
                        prof_init_start <= 1'b0;
                    // If waitrequest is low we can give another address to read from
                    if (!avm_profileMaster_waitrequest) begin
                        // If we've given addresses for all the blocks we want, stop reading
                        if (avm_profileMaster_address == `PROF_ADDR + (2*4*`N))
                            avm_profileMaster_read <= 1'b0;
                        else
                            avm_profileMaster_address <= avm_profileMaster_address + 4'h4;
                    end
                    // If we have valid data
                    if (avm_profileMaster_readdatavalid) begin
                        // if we've read all addr_lo/addr_hi, tell processor to unstall
                        if (read_count == (2*`N)) begin
                            avm_profileMaster_read <= 1'b0;
                            avm_profileMaster_write <= 1'b1;
                            avm_profileMaster_address <= `STACK_ADDR;
                            avm_profileMaster_writedata <= 32'hDEADBEEF;
                            state <= 3'b011;
                        end else begin
                            read_count <= read_count + 1'b1;
                            // this variable is only used for debug output
                            sdram_addr <= sdram_addr + 4'h4;
                            // store data
                            if (read_count[0]) begin
                                addr_hi_data <= avm_profileMaster_readdata; // store to addr_hi registers
                                addr_a <= reg_addr;
                                prof_init_start <= 1'b1;
                            end else begin
                                addr_lo_data <= avm_profileMaster_readdata; // store to addr_lo registers
                            end
                        end
                    end
                // State 3: Wait until write finished
                end else if (state == 3'b011) begin
                    if (!avm_profileMaster_waitrequest) begin
                        avm_profileMaster_write <= 1'b0;
                        init_done <= 1'b1;
                        state <= 3'b000;
                    end
                end
            end else if (!init_start & init_done) begin
                init_done <= 1'b0; // reset init_done so handshaking can occur again
            end
        end
    end
endmodule
module sRetriever (
    input clk,
    input reset,
    input [25:0] pc,
    input retrieve_start,
    output reg retrieve_done,
    output reg [`N2-1:0] count_a,
    input [`CW-1:0] count_data,
    // Avalon Bus side signals
    output reg avm_profileMaster_write,
    output reg [31:0] avm_profileMaster_address,
    output reg [31:0] avm_profileMaster_writedata,
    input avm_profileMaster_waitrequest
);
    reg [2:0] state;
    // used b/c prof index can only go up to `N-1, so is always < `N (circular)
    reg [`N2:0] prof_count;

    // Read each value from counter storage's RAM, write to SDRAM
    always @(posedge (clk)) begin
        if (reset) begin
            retrieve_done <= 1'b0;
            state <= 3'b000;
            count_a <= 'b0;
            avm_profileMaster_write <= 1'b0;
        end else begin
            if (retrieve_start & !retrieve_done) begin
                // State 0: Setup muxes
                if (state == 3'b000) begin
                    count_a <= 'h0;
                    prof_count <= 'b0;
                    state <= 3'b001;
                end
                // State 1: Prepare/write data to sdram
                else if (state == 3'b001) begin
                    if (!avm_profileMaster_waitrequest) begin
                        if (prof_count < `N) begin
`ifdef CW64
                            avm_profileMaster_address <= `PROF_ADDR + {count_a, 3'b000};
`else
                            avm_profileMaster_address <= `PROF_ADDR + {count_a, 2'b00};
`endif
                            avm_profileMaster_write <= 1'b1;
                            avm_profileMaster_writedata <= count_data;
                            count_a <= count_a + 1'b1;
                            prof_count <= prof_count + 1'b1;
`ifdef CW64
                            state <= 3'b011;
`endif
                        end else begin
                            avm_profileMaster_address <= `STACK_ADDR;
                            avm_profileMaster_write <= 1'b1;
                            avm_profileMaster_writedata <= 32'hCADFEED;
                            state <= 3'b010;
                        end
                    end
                end
                // State 2: Wait until unstall write is finished
                else if (state == 3'b010) begin
                    if (!avm_profileMaster_waitrequest) begin
                        avm_profileMaster_write <= 0;
                        state <= 3'b000;
                        retrieve_done <= 1'b1;
                    end
                end
                // State 3: Write out top 32 bits for a 64-bit counter
`ifdef CW64
                else if (`CW > 32 & state == 3'b011) begin
                    if (!avm_profileMaster_waitrequest) begin
                        avm_profileMaster_address <= `PROF_ADDR + {count_a, 3'b100};
                        avm_profileMaster_write <= 1'b1;
                        avm_profileMaster_writedata <= count_data[63:32];
                        state <= 3'b001;
                    end
                end
`endif
            end else if (!retrieve_start & retrieve_done) begin
                retrieve_done <= 1'b0; // reset retrieve_done so handshaking can occur again
            end
        end
    end
endmodule
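The address concatenations in sRetriever ({count_a, 2'b00} for 32-bit counters, or byte offsets within an 8-byte slot when CW64 is defined) amount to a simple strided layout in SDRAM. The Python sketch below illustrates one plausible reading of that layout; it is not from the thesis, and the base address and counter values are placeholders.

```python
PROF_ADDR = 0x1000  # placeholder base; the real `PROF_ADDR is configured elsewhere

def counter_words(counts, cw64=False):
    """Map each counter to the 32-bit SDRAM word(s) holding it.

    32-bit counters: one word at PROF_ADDR + 4*i ({count_a, 2'b00}).
    64-bit counters: low word at PROF_ADDR + 8*i, high word 4 bytes above
    (the {count_a, 3'b000} / {count_a, 3'b100} addressing).
    """
    mem = {}
    for i, c in enumerate(counts):
        if cw64:
            mem[PROF_ADDR + (i << 3)] = c & 0xFFFFFFFF             # low 32 bits
            mem[PROF_ADDR + (i << 3) + 4] = (c >> 32) & 0xFFFFFFFF  # high 32 bits
        else:
            mem[PROF_ADDR + (i << 2)] = c & 0xFFFFFFFF
    return mem

print(counter_words([10, 20]))  # {4096: 10, 4100: 20}
```

After the last counter is written out, the hardware writes the sentinel value 0xCADFEED to `STACK_ADDR so the stalled processor can resume.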