Low-Cost Hardware Profiling of Run-Time and Energy in
FPGA Soft Processors
by
Mark Lee Aldham
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Department of Electrical and Computer Engineering
University of Toronto
© Copyright by Mark Lee Aldham 2011
Low-Cost Hardware Profiling of Run-Time and Energy in FPGA Soft
Processors
Mark Lee Aldham
Master of Applied Science, 2011
Graduate Department of Electrical and Computer Engineering
University of Toronto
Abstract
Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware platforms that enable
the acceleration of software code through the use of custom-hardware circuits. Complex
systems combining processors with programmable logic require partitioning to decide which
code segments to accelerate. This thesis provides tools to help determine which software code
sections would most benefit from hardware acceleration.
A low-overhead profiling architecture, called LEAP, is proposed to attain real-time profiles
of an FPGA-based processor. LEAP is designed to be extensible for a variety of profiling tasks,
three of which are investigated and implemented to identify candidate software for acceleration.
1) Cycle profiling determines the most time-consuming functions to maximize speedup. 2) Cache
stall profiling detects memory-intensive code; large memory overheads reduce the benefits of
acceleration. 3) Energy consumption profiling detects energy-inefficient code through the use
of an instruction-level power database to minimize the system’s energy consumption.
Acknowledgments
First, I would like to thank my parents for their unwavering support throughout my many years
of education. They have always encouraged me to do what I enjoy, even if it meant filling the
house with the noise of drumming or resulted in sport-related hospital trips.
I would also like to thank Caliope Music Search Inc., especially Bryan Keith, for the advice,
support, and friendship over the past few years. You’ve always made time to help make sense of
my crazy ideas, and always believed that I can solve the problem at hand, even when I don’t yet
understand the question. Also, thank you Bryan and Sarah for the helpful and much-needed
editing of this thesis.¹
I would like to thank my supervisors Jason Anderson and Stephen Brown for the opportunity
to perform this research. You’ve helped to keep me on track throughout this degree and have
provided indispensable insight to help complete this project.
I would finally like to thank Cathleen. You’ve kept me sane through failed experiments, put
up with my sporadic schedules, and always kept me going. Thank you for always giving me
something to look forward to.
¹ Do Do Do Doooo Doooo!
Contents
Acknowledgments iii
List of Figures vii
List of Tables ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background and Related Work 8
2.1 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Time-Based Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Energy and Power-Based Profiling . . . . . . . . . . . . . . . . . . . . . . 13
2.2 MIPS Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Processor Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Processor Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Communication Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Cycle-Accurate Profiling 26
3.1 Profiler Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Method of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2.1 Operation Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2.2 Call Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2.3 Data Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2.4 Counter Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2.5 Address Hash . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Compilation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.4 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.4.1 Wrapper Function . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.4.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.4.3 Data Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Cycle and Stall Cycle Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Hierarchical Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 Configurable Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Experimental Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 Comparison with Software Profiler . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 Comparison with FPGA-Based Profiler . . . . . . . . . . . . . . . . . . . 45
3.4.4.1 Profiling Results Comparison . . . . . . . . . . . . . . . . . . . . 47
3.4.4.2 Area and Power Comparison . . . . . . . . . . . . . . . . . . . . 48
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Energy Consumption Profiling 60
4.1 Instruction Power Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Implementation in Existing Framework . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.2.1 Overall Energy Estimation . . . . . . . . . . . . . . . . . . . . . 70
4.3.2.2 Function-level Energy Estimation . . . . . . . . . . . . . . . . . 74
4.3.2.3 Energy and Cycle Profile Correlation . . . . . . . . . . . . . . . 76
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Hardware/Software Partitioning 79
5.1 System Creation with LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Conclusions 87
6.1 Summary and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.1 Detailed Function Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.2 Value Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.3 Memory Access Pattern Profiling . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.4 Counter Saturation Detection . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.5 Recursion Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A Complete Benchmark Results 95
A.1 Full Results for Chapter 3: Cycle-Accurate Profiling . . . . . . . . . . . . . . . . 95
A.1.1 Full Results for LEAP vs. gprof . . . . . . . . . . . . . . . . . . . . . . . 95
A.1.2 Full Results for LEAP vs. SnoopP . . . . . . . . . . . . . . . . . . . . . . 104
A.1.3 Full Results for Power Overhead . . . . . . . . . . . . . . . . . . . . . . . 114
A.2 Full Results for Chapter 4: Energy Consumption Profiling . . . . . . . . . . . . . 116
A.2.1 Full Results for Cache Stall Energy Estimates . . . . . . . . . . . . . . . . 116
A.2.2 Full Results for Pipeline Stall Energy Estimates . . . . . . . . . . . . . . 124
A.2.3 Full Results for Energy/Time Correlation . . . . . . . . . . . . . . . . . . 131
B Source Code 140
B.1 Profiling Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2 LEAP Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
B.3 SnoopP Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
References 199
List of Figures
2.1 Architecture of the SnoopP Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Architecture of the Airwolf Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Architecture of the AddressTracer Profiler . . . . . . . . . . . . . . . . . . . . . . 12
2.4 RTL schematic for original version of the Tiger MIPS processor . . . . . . . . . . 20
2.5 Host computer source code to send ELF section to processor . . . . . . . . . . . 21
2.6 FPGA-side code to receive section from host computer . . . . . . . . . . . . . . . 22
2.7 Design flow with LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 High-level architecture overview of LegUp . . . . . . . . . . . . . . . . . . . . . . 25
3.1 High-level flow chart for instruction-count profiling . . . . . . . . . . . . . . . . . 28
3.2 Modular view of profiling architecture . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Opcodes used to determine function context switches . . . . . . . . . . . . . . . . 30
3.4 Code example to illustrate the need for function address hashing . . . . . . . . . 33
3.5 Parameterized hashing algorithm used by the perfect hash generator . . . . . . . 35
3.6 Compilation flow to generate programming files . . . . . . . . . . . . . . . . . . . 35
3.7 Example hash initialization file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.8 An example memory layout of program memory . . . . . . . . . . . . . . . . . . 37
3.9 Application-independent wrapper to enable initialization and data retrieval . . . 39
3.10 Architecture diagrams of Data Counter module for cycle-accurate profiling . . . . 41
3.11 Area overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . . 54
3.12 Power overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . 58
4.1 Instruction-level power database groupings . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Data Counter module configured for power profiling . . . . . . . . . . . . . . . . 68
5.1 Area comparison for hybrid systems in four configurations . . . . . . . . . . . . . 83
5.2 Execution time comparison for hybrid systems in four configurations . . . . . . . 85
5.3 Energy consumption comparison for hybrid systems in four configurations . . . . 86
List of Tables
2.1 Comparison of three candidate soft processors . . . . . . . . . . . . . . . . . . . . 17
2.2 Synthesis results for Tiger processor on Altera Cyclone II FPGA . . . . . . . . . 18
3.1 Maximum depth of nested function calls . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Description of the benchmarks used in experiments . . . . . . . . . . . . . . . . . 44
3.3 Comparative results of gprof vs. LEAP for the benchmark “ADPCM” . . . . . . 46
3.4 Comparative results of gprof vs. LEAP for the benchmark “DFMUL” . . . . . . 46
3.5 Comparative results of LEAP vs. SnoopP for the benchmark “ADPCM” . . . . . 49
3.6 Comparative results of LEAP vs. SnoopP for the benchmark “DFMUL” . . . . . 50
3.7 Repeatability testing for cycle-accurate profiling of benchmark “Motion” . . . . 50
3.8 Area overhead of LEAP for three profiling schemes . . . . . . . . . . . . . . . . . 51
3.9 Area overhead of LEAP for three profiling schemes with Hierarchical Profiling . . 52
3.10 Area overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . . 53
3.11 Area overhead investigation for SnoopP if counters were implemented with mem-
ory bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.12 Power overhead comparison of LEAP and SnoopP . . . . . . . . . . . . . . . . . 57
4.1 Pruned instruction-level power database for cache stalls . . . . . . . . . . . . . . 63
4.2 Pruned instruction-level power database for pipeline stalls . . . . . . . . . . . . . 64
4.3 Instruction power database for the MIPS1 instruction set . . . . . . . . . . . . . 66
4.4 Area overhead of LEAP when configured for energy profiling . . . . . . . . . . . 68
4.5 Energy profiling results when only cache stalls are considered . . . . . . . . . . . 71
4.6 Energy profiling results when all pipeline stalls are considered . . . . . . . . . . . 72
4.7 Energy profiling results when all pipeline stalls are considered and correction
factor applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8 Function-level energy estimation results for benchmark “ADPCM” . . . . . . . . 75
4.9 Function-level energy estimation results for benchmark “DFMUL” . . . . . . . . 76
4.10 Comparative results of percentage energy consumption versus percentage execu-
tion time for benchmark “ADPCM” . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.11 Comparative results of percentage energy consumption versus percentage execu-
tion time for benchmark “DFMUL” . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Area results for hybrid systems in four configurations . . . . . . . . . . . . . . . . 82
5.2 Time results for hybrid systems in four configurations . . . . . . . . . . . . . . . 84
5.3 Power and energy results for hybrid systems in four configurations . . . . . . . . 86
A.1 Power overhead results for LEAP, measured in milliwatts, for 16, 32, and 64
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.2 Power overhead results for LEAP, measured in milliwatts, for 128 and 256
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.3 Power overhead results for SnoopP, measured in milliwatts, for 16, 32, and 64
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.4 Power overhead results for SnoopP, measured in milliwatts, for 128 and 256
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Chapter 1
Introduction
Profiling involves the dynamic analysis of software executing on a processor to characterize
aspects of the program’s execution. Profilable aspects can include execution time, data locality,
and energy consumption, the measurement of which can give direction to aid in redesigning
code for improved performance. Software-based profiling approaches, such as GNU’s gprof [1],
provide time-based measurements of a program’s execution that aid in the iterative process of
code improvement. These approaches incur run-time overhead and suffer from inaccuracies due
to sampling, which emphasizes the need for low-overhead and accurate profiling tools.
Field Programmable Gate Arrays (FPGAs) offer a reconfigurable environment to enable faster
time-to-market and ease of development versus application-specific integrated circuits (ASICs).
The advent of soft processors on FPGAs has aided rapid creation of systems that include
both software running on a processor with hardware that accelerates part of the software.
With the ability to combine these computation paradigms comes the need to partition code
into hardware and software portions. A hardware implementation can provide a significant
improvement in speed and energy efficiency over a software implementation (e.g. [22, 35]), but
requires writing complex register transfer level (RTL) code which is error prone and notoriously
difficult to debug. Software design is comparatively straightforward and has mature debugging
and analysis tools freely available. Despite the apparent energy and performance benefits,
hardware design is simply too difficult and costly for most applications, so a software-based
approach is often preferred.
The focus of this thesis is to enable partitioning of software code into a software portion
and a hardware portion to maximize throughput and minimize energy consumption. A Low-
overhead and Extensible Architecture for Profiling (LEAP) is proposed to aid in the process
of compiling software code into hybrid systems. These FPGA-based hybrid systems combine
soft processors with programmable logic (e.g. [15], [50]) and more recently, hard processor cores
included with FPGA chips (e.g. [23], [34]). It is estimated that over 40% of all modern FPGA
designs contain embedded processors [48]. LEAP is used in a larger project called LegUp [20], an
open-source high-level synthesis (HLS) framework which partitions a software application into
a hybrid system; code segments, chosen based on profiling results, are automatically compiled
into hardware accelerators to accelerate a soft processor’s execution.
1.1 Motivation
The advent of Field Programmable Gate Arrays (FPGAs) has narrowed the gap between the
flexible, high-level properties of software programs and the complicated, low-level nature of
hardware circuitry. FPGA chips occupy this middle ground between application-specific
integrated circuits (ASICs) and pure software: they still require relatively low-level
hardware design skills, but offer
the advantage of reconfigurability. Since the first FPGAs appeared in the mid-1980s, access to
the technology has been restricted to those with hardware design skills. However, according
to labor statistics, software engineers outnumber hardware engineers by more than 10× in the
U.S. [46]. To take advantage of the vast number of software engineers, as well as retain the
performance benefits and energy efficiencies of hardware, the FPGA development flow must
be able to use software running on a soft processor in conjunction with specialized hardware
accelerators.
To create such a system, inefficient code segments must be identified to provide the maximum
performance benefit to the system. This decision process, called hardware/software partitioning,
is a non-trivial process due to its dependence on dynamic run-time data. Once chosen, the
software code segments must be re-implemented as hardware accelerators and connected to the
processor-centric system. Therefore, the processor must be profiled to quantify this run-time
data and enable the creation of an effective partition.
Trends in computer architecture indicate that heterogeneous many-core architectures, incor-
porating processor cores, GPUs and FPGAs, will be heavily utilized in the future. Recent
collaborations between FPGA and processor vendors support this direction; Xilinx has
announced the Extensible Processing Platform [23], which combines the ARM Cortex-A9 MPCore
processor with Xilinx 28nm programmable logic, and Intel has announced the configurable
Atom processor E600C series [34], which combines the Intel Atom E600C processor with an on-
package Altera FPGA. These advanced systems further necessitate the partitioning of software
code to take advantage of the wide range of computational elements available.
1.2 LegUp
This research is used within a broader project called LegUp, whose overarching goal is to
create a self-accelerating processor in which code can be accelerated automatically using custom
hardware implementations. LegUp is an open-source HLS framework that has been developed
to provide the performance and energy benefits of hardware while retaining the ease-of-use
associated with software. LegUp automatically compiles a standard C program to target a
hybrid FPGA-based hardware/software system. Some program segments execute on a 32-bit
MIPS soft processor, while other program segments are automatically synthesized into FPGA
circuits (hardware accelerators) that communicate and work in tandem with the soft processor.
The LEAP profiler is incorporated into this processor as the tool to guide the selection of
accelerators, as discussed further in Chapter 3. The LegUp distribution includes a set of
benchmark programs that the user can compile to pure software, pure hardware, or a hybrid
hardware/software system.
The targeted system is a hybrid, as opposed to pure custom hardware, because not all C
program code is appropriate for hardware implementation. Inherently sequential computations
(e.g. traversing a linked list) are well-suited for software, whereas independent or parallel
computations (e.g. addition of integer arrays) are ideally suited for hardware. Incorporating
a processor into the target platform also offers the advantage of increased high-level language
coverage. Program segments that use restricted C language constructs (e.g. calls to malloc/free)
can remain in software to execute on the processor.
1.3 Objectives
The principal objective of this research is to enable hardware/software partitioning decisions
to increase throughput and reduce energy consumption; these decisions are made by creating a
profiling-based flow that produces accurate measurements of desired metrics. Different target
systems may have differing performance objectives, but there are three important goals to
consider when partitioning to choose the best candidates for acceleration:
1. Maximize throughput to execute segments of computation in as few cycles as possible
2. Minimize energy consumption for both battery driven and other applications
3. Maximize data locality, which is necessary for an accelerator to realize its potential
benefit, as hardware experiences the same memory access latencies as software and thus
cannot improve performance while waiting for memory
These three objectives of the target system can each be associated with quantifiable met-
rics based on the result of Amdahl’s Law [18], which is used to find the maximum expected
improvement to a system when only a portion of that system is modified. This is expressed as:
S_sys = 1 / ((1 − P) + P/S)

where S_sys is the improvement of the overall system, P is the fraction of time (or energy) the
modified portion consumes relative to the entire system, and S is the improvement for that
portion. Benefits of hardware versus software, such as expected speedup and power reduction,
are not predeterminable. Guo et al. [28] show that the inherent advantage of FPGAs over
processor models, coined the “instruction efficiency factor,” can range from 6 to 47 on the
tested benchmarks. This demonstrates the impracticality of estimating the performance bene-
fits. Therefore, S is not a known factor and the overall system improvement must be related
to P , the portion of code that can be improved. To maximize throughput or minimize en-
ergy consumption, the quantifiable metrics required are the percentage execution time and the
percentage energy consumption, respectively, of the desired portion. Conversely, data locality
directly affects the potential improvement for a given segment, S; higher memory overhead pro-
longs the completion time, which in turn reduces speedup and increases energy consumption.
Thus the required metric to quantify data locality is the percentage time performing memory
accesses, or alternatively, percentage time attributed to cache stalls. Summarized, the profilable
metrics required for effective partitioning are:
1. Percentage execution time: accelerate long-executing functions to maximize speedup
2. Percentage energy consumption: accelerate energy-inefficient functions to minimize
energy requirements
3. Percentage execution time in cache stalls: accelerate functions with low stall times,
thus high locality, to maximize potential performance benefits
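The role Amdahl’s Law plays in these partitioning decisions can be shown with a short calculation (an illustrative sketch; the 60%/10% fractions and 10× hardware speedup are hypothetical figures, not measured results):

```python
def amdahl_speedup(p, s):
    """Overall system improvement S_sys when a portion consuming
    fraction p of run time (or energy) is improved by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# A function consuming 60% of execution time, accelerated 10x in hardware,
# yields about a 2.17x overall speedup; the same accelerator applied to a
# function consuming only 10% of time yields barely 1.1x.
print(round(amdahl_speedup(0.60, 10.0), 2))  # → 2.17
print(round(amdahl_speedup(0.10, 10.0), 2))  # → 1.1
```

This is why the profiler must deliver P, the per-function percentage of execution time or energy, even though S is unknown in advance.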
The metrics just described outline the deliverable results desired, but they do not indicate
the manner in which the data is collected by the profiler. To provide useful results and to be
easy to use, the profiler must satisfy the following requirements:

• Non-intrusive: the processor’s execution must not be inhibited or affected
• Low overhead: the area usage and power consumption must be minimal compared with that of the system
• Accurate: the data collected must be correct without requiring many iterations of the application; it therefore cannot employ sampling, the act of measuring data only at specific time intervals
• Dynamic: the profiler must be able to profile any application without requiring recompilation of the system
• Flexible: the profiler must be configurable to measure any of the above metrics, and must be extensible for future research
1.4 Thesis Organization
This research focuses on providing an effective mechanism to perform hardware/software par-
titioning through the use of an area- and power-efficient FPGA-based profiler. The remainder
of this thesis is organized as follows:
Chapter 2 provides background material relevant to this research, including previous efforts
in time-, energy-, and power-based profiling, an overview of the LegUp framework, and a
description of the MIPS-based soft processor used in this system.
Chapter 3 describes the design of a scalable and extensible profiling framework. The frame-
work is shown to be both area- and power-efficient when compared to previous profiling systems,
requiring only 12.2% area and 10.7% power consumption when compared to the processor alone
for a typical configuration. This chapter also presents profiling schemes which provide the means
to satisfy two of the performance goals described above, namely throughput and data locality.
Throughput, associated with measuring percentage execution time, is maximized by measuring
the cycles spent in each function and choosing those with the highest cycle count to be ac-
celerated. Data locality, associated with measuring percentage time attributed to cache stalls,
is maximized by measuring the cycles spent in each function while the processor is stalled by
either the instruction or data cache and choosing those with the lowest stall cycle count for
acceleration. Experiments are performed to verify the accuracy of these results.
Chapter 4 presents a profiling scheme that enables the system to be optimized for low energy
consumption. An energy lookup database is created based on the average energy consumption
of individual instructions in the MIPS instruction set. This database is used to supplement the
run-time data provided by the profiler to produce an estimate of the total energy consumption
of the system. Experiments are performed to determine the accuracy of the results, which have
been shown to be within 5.95% error, on average.
Chapter 5 describes the methodology of formulating the hardware/software partition based
on the results of the profiler. The partition is a recommendation in the form of an ordered
text file which indicates which functions are to be accelerated, ordered by the likely advantage
a hardware implementation would provide. Results are presented to show the performance
benefits that are attained when LEAP is used to partition software for the creation of a hybrid
system with LegUp.
Chapter 6 presents concluding remarks and suggestions for future work.
Chapter 2
Background and Related Work
2.1 Prior Work
All application and system developers share two common goals: functionality and performance.
Performance is especially important when developing embedded systems, where real-time con-
straints and limited resources impose challenges in implementation. These requirements em-
phasize the importance of good profiling tools to aid in the development cycle by quickly,
efficiently, and accurately providing run-time statistics to guide possible optimizations. The
following subsections describe prior work on profiling.
2.1.1 Time-Based Profiling
There have been many previous efforts which have attempted to monitor the execution time
characteristics of applications and attribute time to specific segments within them. Relevant
approaches can be categorized as software, hardware, or FPGA approaches. Software-based
approaches, which can involve simulation or the insertion of monitoring code into an application,
create a performance overhead but can be run without special hardware. Hardware-based
approaches make use of real hardware resources, such as performance counters, to gather run-
time data. FPGA-based approaches use custom hardware that interacts with signals on the
target soft processor in order to quickly attain an accurate profile.
A useful and popular software-based profiling tool is gprof [1], which is supported as part
of the GCC [26] distribution. This tool modifies the desired application by inserting code to
update profiling data and set up interrupts. These interrupts are used to sample the processor’s
program counter (PC), typically at intervals of 10ms, and attribute execution time to individual
functions. Sampling reduces accuracy but allows for a quick approximate profile.
Pin [33] is a software-based dynamic instrumentation system through which, using architecture-
generic APIs, instrumentation tools (including profilers) can be created. The API enables the
observation of all architectural states of a processor, including the contents of registers, memory,
and control flow. Pin uses just-in-time (JIT) compilation, taking a native executable as input,
and compiling traces as required for execution.
Many modern processors include hardware counters that can be used to monitor various
performance aspects of execution. Specifically, Intel has included a Time Stamp Counter (TSC)
in all x86 processors since the Pentium; software can access the TSC through the x86 instruction
RDTSC [31]. Similarly, in the MIPS R10000 architecture, two performance counters are used,
each of which can independently count one type of event at a time [38]; the MIPS instruction
MTC0 is used to specify the event for each counter, and MFC0 is used to read the counters’
values. In [19], methods are presented for using performance counters to create a profile with
more context than was previously available.
Several FPGA-based profiling tools have been introduced in recent years. These approaches
aim to alleviate the performance overhead of code instrumentation caused by all software-based
approaches, and the accuracy degradation that can be caused by sampling in approaches like
gprof. Four such tools are reviewed below.
SnoopP [43] is an FPGA-based profiler designed for use with the Xilinx [52] MicroBlaze [51]
soft processor on the Virtex-II 2000 FPGA [3] board. It is a non-intrusive profiler, meaning it
does not disrupt the execution of the processor or change the instruction stream. This profiler
is designed to provide developers with the ability to assess which aspects of their design fail to
meet required real-time constraints by allowing arbitrary code regions to be profiled for cycle
counts. The address ranges of each region, as well as the number of regions to profile, are
entered as parameters in the VHDL code of the profiler that must be set pre-synthesis. During
execution, comparators check the PC of the processor each clock cycle to see if it falls in any of
the specified regions. If it does, the 64-bit counter corresponding to that region is incremented
as shown in Figure 2.1. The circuit for this profiler, configured with the maximum 16 profilable
regions, uses 1129 flip-flops and 1719 look-up tables (LUTs), which is about the same size as the
MicroBlaze processor itself (1173 flip-flops and 1548 LUTs). The Virtex-II 2000 FPGA contains
21,504 of both flip-flops and 4-input LUTs, arranged in 10,752 slices; this means that the profiler
consumes 5.3% and 8.0% of the chip’s available flip-flop and LUT resources, respectively. This
area can be broken down as follows: the 16 64-bit cycle counters consume 1024 flip-flops, or 91%
of the profiler’s total flip-flop count, and the 32 32-bit comparators consume 1024 LUTs, or 60%
of the total LUT count. Due to the limited number of code regions that can be simultaneously
profiled with SnoopP, gprof was used to find regions of interest to choose appropriate code
regions for experiments.
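SnoopP's comparator-and-counter scheme can be modeled in software. The following C sketch is illustrative only (the type and function names are assumptions); it performs, one region at a time, the same range check the hardware comparators perform in parallel each clock cycle.

```c
#include <stdint.h>

#define NUM_REGIONS 16  /* SnoopP supports at most 16 profilable regions */

typedef struct {
    uint32_t lo_addr;   /* region start address (inclusive) */
    uint32_t hi_addr;   /* region end address (inclusive)   */
    uint64_t cycles;    /* 64-bit cycle counter             */
} region_t;

/* Model of one clock cycle: when the PC is valid, every region's
 * comparators check it, and the counter of any region containing
 * the PC is incremented. */
void snoop_cycle(region_t regions[], int n, uint32_t pc, int valid_pc) {
    if (!valid_pc)
        return;
    for (int i = 0; i < n; i++)
        if (pc >= regions[i].lo_addr && pc <= regions[i].hi_addr)
            regions[i].cycles++;
}
```

The valid_pc argument mirrors the valid_PC signal gating the counters; in hardware all sixteen comparisons happen concurrently rather than in a loop.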
Figure 2.1: Architecture of the SnoopP Profiler.
Airwolf [45] is an FPGA-based profiler that connects to the Altera Nios II-fast soft proces-
sor [12]. This profiler minimally impacts the execution of the processor by inserting software
drivers into each function call and return to enable/disable individual counters. These software
drivers consume 20 bytes per function in the instruction stream. Two counters are allocated for
each function to be profiled: a 32-bit hit counter counts the number of times a function is called,
and a 64-bit time counter counts the number of clock cycles spent in that function. As seen in
Figure 2.2, the Time Counter Enable Module connects to the Avalon Interface Bus to read data
from the software drivers and in turn enable or disable counters as required; the appropriate
Time Counter Enabling Lines (TCEL) are set high when a function has been called, and are set
low when the function returns. The Hits Counter Enable Line (HCEL) is held high for a single
clock cycle when the function is called. On a Stratix [16] FPGA, this circuit consumes 3345
logic elements (LEs), which corresponds to almost two times the area of the Nios II-fast (1800
LEs for just the core). Stratix devices range in capacity from 10,570-32,470 LEs, meaning the
profiler consumes 10.3-30.6% of the chip capacity.
AddressTracer [42] is an adaptation of SnoopP. It is a function-level, non-intrusive profiler
that counts the number of times a function is called and the number of cycles spent in the
function. Like SnoopP, AddressTracer profiles the MicroBlaze soft processor, but unlike SnoopP
its cycle counts represent the indicated function plus descendants. As seen in Figure 2.3, each
function to be profiled corresponds to a tracer block which contains two counters: a 32-bit hit
counter that counts the number of times a function is called, and a 64-bit time counter that
counts the number of clock cycles spent in that function and its descendants. During execution,
the PC of the processor is monitored by all tracer blocks, each comparing the PC with the start
and end addresses of the corresponding function. If the PC equals the function’s start address,
the hit counter is incremented and the time counter is enabled. If the PC equals the function’s
end address, the time counter is disabled. It is not specified in [42] how the values for the start
and end addresses are input, but since SnoopP requires these addresses pre-synthesis, it is likely
that AddressTracer functions similarly. The area of the profiler is also not specified but can be
assumed to be comparable to that of SnoopP.

Figure 2.2: Architecture of the Airwolf Profiler.
Figure 2.3: Architecture of the AddressTracer Profiler.
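A tracer block's behavior can likewise be sketched in C. The structure fields are illustrative, and the choice not to count the cycle at the exit address is an assumption, since [42] leaves such ordering details unspecified.

```c
#include <stdint.h>

typedef struct {
    uint32_t start_addr;  /* function entry address  */
    uint32_t end_addr;    /* function exit address   */
    uint32_t hits;        /* 32-bit hit counter      */
    uint64_t time;        /* 64-bit time counter     */
    int      count_en;    /* time counter enable     */
} tracer_t;

/* One clock cycle of a tracer block: compare the PC against the
 * function's start and end addresses, then count while enabled. */
void tracer_cycle(tracer_t *t, uint32_t pc) {
    if (pc == t->start_addr) {
        t->hits++;
        t->count_en = 1;
    } else if (pc == t->end_addr) {
        t->count_en = 0;
    }
    if (t->count_en)
        t->time++;
}
```

Because the enable stays set between start and end addresses, cycles spent in called descendants are attributed to the enclosing function, matching AddressTracer's inclusive counts.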
Comet [24] is an FPGA-based profiler for the Altera Nios [6] soft processor. It requires the
user to modify the application to be profiled with pragma-like labels to specify the start and
end of the regions of interest. The ranges denoted by these pragmas are used to initialize
the profiler at run-time. The profiler counts the total number of clock cycles spent in each
region and the number of times the region iterates. In the reported experiment, which
examined a JPEG decoder, the profiler was configured with 25-bit counters to profile 6 regions;
this consumed 486 LEs on an Altera APEX20K200E [5] FPGA. The APEX20K200E device
contains 8,320 LEs, meaning the profiler consumes 5.8% of the chip capacity.
2.1.2 Energy and Power-Based Profiling
Minimizing the energy consumption of mobile devices has become an important focus for soft-
ware developers in order to maximize battery life. In order to optimize applications for low
power, the energy consumption must be monitored and improved iteratively. Common methods
to measure an application’s power are either invasive, such as directly measuring the current
draw of a system, or very slow, such as simulation. Invasive methods are undesirable because
they require specialized experimental setups and, if a printed board is used, soldering or trace
cutting may be required. Simulation is often prohibitive because the time required to simulate
only seconds of execution can be hours or days for large circuits. Therefore there is a need for
rapid, non-invasive tools to aid in the process of energy consumption profiling and minimization.
Three previous approaches in this direction are described below.
The work in [44] describes a profiling-based methodology for power reduction within non-
FPGA embedded processors. The authors developed and verified instruction-level power models
for the Intel 486DX2 and Fujitsu SPARClite 934 processors by measuring the current drawn
as the processor repeatedly executed certain instructions or instruction sequences. A standard
digital ammeter was used to determine the steady-state current drawn by the processor when
running each instruction; this value became that instruction’s “base cost.” The authors also
describe inter-instruction effects, whereby the mix of the instructions in the processor’s pipeline
affects the current draw. These effects include circuit state, pipeline stalls, and cache misses,
all of which add to the overall energy consumption of the system. Accuracy within 3% error
was achieved for applications containing no stalls or cache misses, but no results were presented
for those containing stalls and cache misses.
A power estimation methodology to predict the power consumption of non-FPGA embedded
processors is presented in [41]. A power data bank, which stores the energy consumption of
built-in library functions, user-defined functions, and basic instructions, is created through the
use of a power simulator and carefully created test code. The authors note that the energy
cost for two instructions is usually different from their average due to inter-instruction effects;
measuring 2-3 instruction sequences increases estimation accuracy. The power data bank is
then used in conjunction with tracing tools to produce a function-level power estimate. The
experimental results reveal a 3.5% average error; however, the tests performed were on simplistic
benchmarks that use 6 library functions and integer/floating point additions and subtractions.
A method to estimate the energy consumption of soft processors on FPGAs, specifically the
MicroBlaze processor, is described in [40]. ModelSim [36] was used to perform timing simu-
lations of software programs representing each instruction in the MicroBlaze instruction set.
These simulations produce Value Change Dump (VCD) files, which are used by XPower [52]
to estimate the energy consumption of the simulated system. VCD files are an IEEE standard
format that use ASCII-based dumpfiles to record a series of time-ordered value changes for sig-
nals in a given simulation model. This data was collected for each instruction, thus creating an
instruction energy look-up table. A maximum difference of 5-6× was found between the energy
dissipations of different instructions in the instruction set. Using the Xilinx Microprocessor
Debugger (XMD) [52] TCL interface, as opposed to on-chip profiling, an instruction trace was
combined with the lookup table to estimate the overall energy consumption of an application.
Through four test programs, an average estimation error of 5.9% was achieved.
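The trace-plus-lookup-table estimation can be illustrated with a small C sketch. The opcodes, energy values, and trace below are invented for illustration; the actual table in [40] holds one entry per MIPS instruction, derived from XPower simulations.

```c
#include <stddef.h>

/* Hypothetical per-instruction energy costs (arbitrary units); a real
 * table would hold one entry per opcode, populated by simulation. */
typedef enum { OP_ADD, OP_MUL, OP_LW, OP_SW, NUM_OPS } opcode_t;

const double energy_table[NUM_OPS] = { 10.0, 35.0, 50.0, 45.0 };

/* Estimate total energy by walking an instruction trace and summing
 * each instruction's look-up-table entry. */
double estimate_energy(const opcode_t *trace, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += energy_table[trace[i]];
    return total;
}
```

The 5-6x spread between the cheapest and most expensive instructions reported above is what makes such a table worthwhile; a flat per-instruction cost would miss it entirely.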
2.2 MIPS Processor
In order to develop and test the hardware profiling architecture, a soft processor was required to
produce real-time profiling data. This processor must be able to interface with the prospective
profiler, so it needed to provide direct access to important signals required by a profiler, such
as the PC and the instruction bus. In addition, since the profiler will be an integral part of
the LegUp framework, the code for the processor must also be open-source. The choice of a
MIPS-based processor was not set as a stringent requirement, but rather a desirable trait due
to the popularity and availability of the MIPS architecture in mobile environments. ARM-
based processors were also considered, but due to the company’s prior issues with proprietary
rights ([21], [47]), open-source versions are not available. Other desirable traits in a prospective
processor were code readability, documentation, a working compilation tool-chain, and high
performance.
2.2.1 Processor Selection
Three open-source soft processors were found to satisfy the above requirements. These proces-
sors were tested, compared and contrasted to select the best candidate. The three processors
were:

- YACC [4], a processor found on OpenCores [2], which provides implementations for use with both Altera and Xilinx tools;
- SPREE [53], developed at the University of Toronto, which contains a C++ processor-generation mechanism to enable custom-processor creation;
- Tiger MIPS [39], developed at Cambridge University for the Altera DE-2 board [10], which utilizes Altera's SOPC Builder [15] to connect the processor to memory and other peripherals on the board.
Table 2.1 compares these processors based on a range of features, including maximum clock
frequency, area, system features, and implementation details. As is evident in the table, each
processor has its own advantages. YACC has the highest clock frequency of the three processors
and provides a clean implementation in Verilog. SPREE (using pipe5 forwardAB mulshift stall load
configuration) is the smallest processor, consuming only 4% of the LEs available on the Cyclone
II EP2C35 FPGA, and contains a C++ processor generator to create a tailored soft processor
implementation in Verilog. The exclusive use of on-chip memory in these processors limits their
use for two reasons: first, the processor’s netlist must be reassembled with memory initialization
files and reprogrammed to the FPGA each time an application is to be run. Second, the limited
on-chip resources set a much tighter bound on the possible application size than an off-chip
memory would. Tiger is the largest and slowest of the three processors, but these downfalls
can be partly attributed to the system features it implements; hardware division alone, imple-
mented with one signed and one unsigned divider, consumes over 2000 LEs. In addition, the
use of off-chip memory is supported through the use of JTAG UARTs, which communicate with
the host computer to enable the dynamic programming of applications.
Tiger was chosen as the target processor for this research, and consequently for use in the
LegUp system. The driving factor for this decision was largely based on the communication in-
terface it provides, which allows an application to be programmed to the processor dynamically,
and also enables real-time debugging using GDB [25]. In addition, it provides a GCC-based
tool-chain, comprehensive Verilog code, full coverage of the MIPS1 [37] instruction set, and a
working SOPC system implementation for the DE-2 board.
2.2.2 Processor Details
Tiger is a 32-bit 5-stage processor that supports the MIPS1 instruction set [37]. The RTL
schematic of the processor can be seen in Figure 2.4. This schematic shows the five pipeline
stages, which are standard for reduced instruction set computing (RISC) processors:

- The Instruction Fetch stage retrieves the instruction to execute from memory,
- The Decode stage determines the operation to perform based on the binary operation code (opcode) fetched,
- The Execute stage performs the computation required by the decoded instruction,
- The Memory Access stage communicates with memory to load or store data,
- The Writeback stage writes instruction results back to the register file.
Table 2.1: Comparison of three candidate soft processors. Note: Area percentages refer to the percent utilization on the Cyclone II EP2C35 FPGA.

Criteria               | YACC                           | SPREE                                  | Tiger
Fmax                   | 96.9 MHz                       | 81.22 MHz                              | 56.29 MHz
Area - total LEs       | 3,543 (11%)                    | 1,235 (4%)                             | 10,073 (30%)
Area - total regs      | 1,191                          | 349                                    | 4,391
Area - total mem bits  | 137,216 (28%)                  | 395,364 (82%)                          | 193,017 (40%)
Area - embed. mult.    | 0                              | 8                                      | 16
Memory types & sizes   | Single on-chip memory: 4096x32 | Instruction memory: 8192x32; data memory: 4096x32 | Instruction & data caches: 512x148; application memory: 8 MB SDRAM
Pipeline depth         | 5 stages                       | Variable (3, 5, 7); 5 used for results | 5 stages
Features               | UART                           | CPU generator, ModelSim benchmarking tools | SDRAM memory, instr./data cache, Avalon bus connections, JTAG debugger, HW division
Compilation tool chain | GCC-based batch file creates 4 hex files for memory init. | GCC-based makefile creates MIF/RIF files for memory init. | GCC-based batch file with SignalTap programmer
Benchmarks included    | UART calculator, int-to-text counter, Dhrystone, pi calculator, Reed-Solomon | 20 benchmarks (incl. crc32, fft, dijkstra, qsort, sha, des, fir, vlc, Dhrystone) | "game of life" hardware via SOPC
MIPS coverage          | Does NOT support: bgezal, bltzal, break, syscall, mthi, mtli | Does NOT support: lwl, lr, swl, swr, div, divu, bgezal, blzal, syscall, break | Full coverage of MIPS1
Documentation          | HTML overview page, structural diagrams | ReadMEs throughout directory structure, thesis, overview PPT | HTML overview page, structural diagrams, descriptions in comments
Code quality           | Well-structured, slightly commented | Well-structured                    | Well-structured, very well-commented
Table 2.2: Synthesis results for Tiger processor on Altera Cyclone II (2C35) FPGA. Values in parentheses indicate chip capacity.

Total logic elements     | 13,422 (33,216)
Total registers          | 6,455 (33,216)
Total memory bits        | 225,280 (483,840)
Embedded multipliers     | 16 (70)
Maximum frequency (MHz)  | 75.59 (402.5)
To facilitate the correct data flow among these pipeline stages, a stall logic module (bottom
of Figure 2.4) processes stall requests and control signals so that it may stall certain pipeline
stages as required. When implemented on an Altera Cyclone II [9] FPGA, Tiger consumes
13,422 logic elements (LEs) running at 75.59 MHz. More details are given in Table 2.2.
Tiger uses two types of memory for instruction and data storage: on-chip block RAM and
off-chip SDRAM on the DE-2 board. The communication between these memories and the
processor is achieved through the use of Altera’s SOPC Builder, which automatically generates
the arbitration logic to connect master and slave peripherals. At startup, reset, or after the
completion of an application, Tiger defaults to the code stored on-chip; this code acts as its boot
loader and is independent of the applications being run. In order to execute a given application,
the application must first be loaded into SDRAM.
In order to reduce the latency required to read and write from SDRAM, on-chip caches are
used. Tiger contains both a data cache and an instruction cache, each implemented in on-chip
block RAM. These caches are direct-mapped, meaning there is only one location in the cache a
given address can map to. Therefore, conflicts arising from the mapping of multiple addresses
to the same cache line are always resolved by replacement. They are also write-through caches,
meaning every write to the cache causes a synchronous write to the SDRAM. In order to
alleviate the inherent penalty of cache misses, these caches aim to reduce misses by performing
bursts when reading from the SDRAM. Every cache miss triggers a read of four consecutive
32-bit data words, stored as a single cache line. Tag and offset bits consume 20 bits, making
each cache line 148 bits wide.
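The address decomposition implied by this organization can be sketched in C. The field widths below are assumptions consistent with 512 lines of four 32-bit words (16 bytes per line); the real Tiger cache may partition the address bits differently.

```c
#include <stdint.h>

/* Direct-mapped cache geometry: 512 lines of four 32-bit words. */
#define LINE_BYTES   16u
#define NUM_LINES    512u
#define OFFSET_BITS  4u   /* log2(16)  */
#define INDEX_BITS   9u   /* log2(512) */

uint32_t cache_offset(uint32_t addr) { return addr & (LINE_BYTES - 1); }
uint32_t cache_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & (NUM_LINES - 1); }
uint32_t cache_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

/* Two addresses conflict (evict one another) when they share an index
 * but differ in tag -- the direct-mapped replacement case noted above. */
int conflicts(uint32_t a, uint32_t b) {
    return cache_index(a) == cache_index(b) && cache_tag(a) != cache_tag(b);
}
```

On a miss, the burst read fills all four words whose addresses share the same tag and index, so subsequent accesses within the 16-byte line hit.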
Multiple modifications were made to the original processor system for integration into the
LegUp framework. First, a UART was added to enable the use of printf in the applications being
run. A more significant modification was to alter the data cache so that it communicates with
the processor over the Avalon interconnect fabric, as opposed to the original implementation
in which it communicated directly with the processor. Due to the high speeds achievable
using AltSyncRams and the Avalon fabric, the single-cycle access latency of the data cache is
maintained in this configuration. This change allowed a shared-memory model to be adopted
among the processor and variable numbers of accelerators for communication with main memory
via the cache. In addition, critical path delays were analyzed and optimized to boost the
maximum frequency of the processor from its initial speed of 43.86 MHz to its current maximum
of 75.59 MHz.
2.2.3 Communication Interface
In order to dynamically program an application to the processor, or perform debugging with
GDB, a communication interface between Tiger and a host computer was created. This interface
enables messages to be passed back and forth between the processor and host computer. To
program an application to the FPGA, the length of the application’s binary data (machine
code), the memory address at which the application should be stored, and the binary data
itself must be sent.
Three software tools were provided with the original Tiger processor to handle all communi-
cation between the processor and the host computer. This communication is performed on the
FPGA through the USB ByteBlaster [8], which allows UART communication via a standard
USB cable. The host computer handles the communication through sockets. First, a server is
needed to connect these two types of communication protocols by intercepting all messages and
relaying them appropriately. Clients must then be created on the FPGA and host computer
in order to send and receive messages. This flow permits programming an application to the
board, as well as debugging the application using GDB. In addition, this flow is used to initialize
and retrieve data from the profiling framework, which will be discussed in Section 3.1.4.

Figure 2.4: RTL schematic for original version of the Tiger MIPS processor.
The server required to enable communication between the FPGA and the host computer is
a GUI-based application developed in C# called “MIPS Communication Server.” It connects
to the FPGA through the ByteBlaster using the SignalTap II Logic Analyzer [17], a tool that
captures internal device signals for debugging, but also provides a TCL interface for JTAG
communication. The server communicates with the processor through blocking sends/receives
over a virtual JTAG UART. In addition, it creates a local server on the host computer that
can be connected to via the computer’s command line, enabling communication through
socket-based sends and receives.
The command-line interface on the host computer, called “MIPSLoad,” is written in C and
is used to disassemble compiled Executable and Linkable Format (ELF) files into readable
sections, send all data to the SDRAM, and verify the contents once programmed. ELF is a
standard binary file format for executables, object code and shared libraries, and is not bound
to any specific computer architecture. This data transfer is initiated by first sending a single
character to indicate the task to be performed; each task is associated with a unique character
and this encoding is known to the server and FPGA client. After the task is established,
task-specific parameters are sent consecutively. For example, a section of the ELF file can be
programmed by first sending “M” to specify the task, and then the required parameters: the
SDRAM address to write to, the length of the section, and the data itself, as seen in Figure 2.5.
An ELF section is verified by sending “m” as the task, followed by the SDRAM address to read
from and the length of data to be read, and then continuously reading transmitted data until
all is received. This received data is then compared to the data previously sent to ensure that
no errors were encountered and the application is correctly programmed.
int sendSection(char *sectionData, unsigned int addr, unsigned int length) {
    if (serverSend("M", 1)) return 1;
    if (serverSend(&addr, 4)) return 1;
    if (serverSend(&length, 4)) return 1;
    if (serverSend(sectionData, length)) return 1;
    return 0;
}
Figure 2.5: Host computer source code to send ELF section to processor.
The communication mechanism on the Tiger processor is called the “debug stub,” which
uses blocking sends and receives through a virtual JTAG UART to communicate with the host
server. The debug stub is contained in the boot loader code that is run both at startup and after
the completion of an application’s execution. When called, the debug stub performs a blocking
UART read to wait on a single-character task identifier to be sent from the host computer. A
case statement is then used to receive data relevant to the required task. For example, an ELF
section is programmed to the SDRAM with the task identifier “M.” The debug stub receives
the data from the host computer and then writes that data to memory mapped addresses using
pointer arithmetic, as seen in Figure 2.6. In addition to the debug stub, the boot loading code
handles tasks such as flushing the caches and resetting registers.
// Receive task identifier
char id = uart_readc(DEBUG_UART);

switch (id) {
    ...
    case 'M':
        // Receive section length
        unsigned int Length = uart_readUInt(DEBUG_UART);

        // Receive and create pointer to desired section address
        volatile unsigned char *Addr =
            (volatile unsigned char *) uart_readUInt(DEBUG_UART);

        // Receive section one byte at a time, store each byte to SDRAM
        int i;
        for (i = 0; i < Length; ++i) {
            *Addr = uart_readc(DEBUG_UART);
            Addr++;
        }

        // Return value of 0 indicates proper execution
        uart_writeUInt(DEBUG_UART, 0);

        // Flush the cache of stale data
        flushICache();
        break;
    ...
}
Figure 2.6: FPGA-side code to receive section from host computer.
2.3 LegUp
Hardware profiling with LEAP is a key component in a larger project at the University of
Toronto, called LegUp. LegUp is an open-source system that consists of an HLS compilation
flow, a hybrid processor/accelerator architecture generator, and LEAP-based hardware/soft-
ware partitioning. These components result in a system that provides a semi-automated self-
acceleration flow for software code running on the MIPS processor previously described.
2.3.1 Design Flow
The LegUp design flow consists of multiple steps, many of them automated, which generate a
hybrid architecture that accelerates a particular software application. Figure 2.7 illustrates
the detailed flow.
Referring to the labels in the figure, at step ➀ the user compiles a standard C program to
a software-only binary executable using the LLVM compiler [30]. At step ➁, the executable is
run on the Tiger MIPS processor described in Section 2.2.2.
Using LEAP, sections of program code that would most benefit from hardware implementa-
tion to improve program throughput and power are selected. Specifically, the profiling results
drive hardware/software partitioning, which is the selection of program code segments to be
re-targeted to custom hardware from the C source.
Having chosen program segments to target to custom hardware, at step ➂ the LegUp HLS tool
reads a configuration file representing the hardware/software partition and compiles the required
segments to synthesizable Verilog RTL. LegUp’s HLS operates at the function level; entire
functions are synthesized to hardware from the C source. Moreover, if a hardware function calls
other functions, these called functions are also synthesized to hardware; hardware-accelerated
functions may not call software functions. The RTL produced by LegUp is synthesized to an
FPGA implementation using standard commercial tools [14] at step ➃. As illustrated in the
figure, LegUp’s hardware synthesis and software compilation are part of the same LLVM-based
compiler framework.
In step ➄, the C source is modified such that the functions implemented as hardware accel-
erators are replaced by wrapper functions that communicate with the accelerators (instead of
performing computations in software). This new modified source is compiled to a MIPS binary
executable. Finally, in step ➅ the hybrid processor/accelerator system executes on the FPGA.
Step ➆ (the dashed line) shows a software-only optimization flow, in which a developer uses
the profiling data to target optimization efforts.
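The wrapper substitution of step ➄ can be sketched as follows. Everything here is hypothetical: the register layout, the accelerator operation, and the names are invented, and the accelerator is simulated in software so the sketch runs on a host machine. On the real system the registers would sit at fixed Avalon addresses and the fabric would compute concurrently with the polling loop.

```c
#include <stdint.h>

/* Hypothetical memory-mapped register file of one accelerator; on the
 * FPGA these would be fixed Avalon addresses, here plain variables. */
typedef struct {
    volatile int32_t arg_a, arg_b;  /* input registers     */
    volatile int32_t start;         /* write 1 to launch   */
    volatile int32_t done;          /* polls to 1 when done */
    volatile int32_t result;        /* output register     */
} accel_regs_t;

accel_regs_t accel;

/* Software stand-in for the hardware accelerator (illustrative op). */
void simulate_accelerator(void) {
    accel.result = accel.arg_a + accel.arg_b;
    accel.done = 1;
}

/* Wrapper that replaces the original software function: it forwards
 * the arguments to the accelerator, starts it, polls for completion,
 * and returns the result instead of computing in software. */
int32_t add_wrapper(int32_t a, int32_t b) {
    accel.arg_a = a;
    accel.arg_b = b;
    accel.done  = 0;
    accel.start = 1;
    simulate_accelerator();  /* real hardware runs concurrently */
    while (!accel.done) { /* spin until the accelerator finishes */ }
    return accel.result;
}
```

The caller's code is unchanged; only the function body is swapped, which is what lets LegUp retarget individual functions without touching the rest of the program.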
Figure 2.7: Design flow with LegUp.
2.3.2 System Architecture
The target system architecture of the LegUp framework is shown in Figure 2.8. The processor
connects to one or more custom hardware accelerators through a standard on-chip interface.
As the initial hardware platform is the Altera DE-2 Development and Education board (con-
taining a 90nm Cyclone II FPGA), the Altera Avalon interface is used for processor/accelerator
communication [7]. Synthesizable RTL code for the Avalon interface is generated automatically
using Altera’s SOPC builder tool. The DE-2 was chosen because of its widespread availability.
As shown in Figure 2.8, a shared memory architecture is adopted. The processor and ac-
celerators share an on-FPGA data cache and off-chip main memory, which are arbitrated by a
memory controller. Such an architecture allows processor/accelerator communication across the
Avalon interface or through memory. The shared single cache obviates the need to implement
cache coherency or automatic cache line invalidation.

Figure 2.8: High-level architecture overview of LegUp.
2.4 Summary
Multiple efforts have been made towards the goal of accurate time-based profiling. Software
profiling methods, such as gprof, suffer both from the overhead caused by inserted instructions
and from accuracy loss due to sampling. FPGA-based profiling tools have been designed
to overcome these issues. Power profiling has been researched on both embedded systems and
FPGAs; however, the results presented in all cases are for small benchmark sets. In addition, no
on-chip power profiling schemes have been presented. Tiger MIPS is the target soft processor
used in this research, as well as the processor used within the hybrid systems produced by
the LegUp HLS framework. LegUp is used to compile an application to software, profile its
execution, choose function candidates for acceleration, and synthesize the software code to
create hardware accelerators which communicate with the processor.
Chapter 3
Cycle-Accurate Profiling
3.1 Profiler Design
This chapter describes the creation of a low-overhead hardware profiler that monitors a proces-
sor as it executes, through the development of an area- and power-efficient architecture called
LEAP. The design goals for LEAP are to compactly gather and store all necessary data while
minimizing the energy expended in doing so. In addition, the profiler architecture has been
designed with extensibility in mind so that the measurement of different dynamic attributes of
a processor, such as run-time, cache stall rates, and energy consumption, may be implemented
easily by replacing a single module. Although LEAP targets a MIPS-based soft processor, there
are very few MIPS-specific aspects in the architecture. The profiler could be easily adapted to
a different processor, given that required signals, such as the program counter (PC), instruction
bus, and cache stall status, are provided.
3.1.1 Method of Operation
Throughout the execution of a program, the processor’s PC contains the address of the currently
executing instruction. LEAP profiles the execution of the program by monitoring the PC and
the instruction bus. During execution, the profiler maintains a single counter, called the Data
Counter, that tracks the number of times an event has occurred. In the case of instruction count
profiling, this involves incrementing the Data Counter whenever the PC changes. For cycle
count profiling, the counter is incremented every clock cycle. LEAP organizes the collected data
on a per-function basis by allocating a storage counter for each software function. Recursion
support is currently limited by the depth of the Call Stack (described below); however, details
on providing full recursion support are discussed in Section 6.2.5.
Function boundaries are identified by decoding the executing instruction to check for function
calls or returns. If a function call is detected, the Data Counter is added to any previously
stored values associated with this function (from previous executions of the function). The
Data Counter is then reset to 0 so that it can start counting the events that have occurred in
this newly called function. If a function return is detected, the Data Counter value is added to
the counter associated with the current function, and once again the Data Counter is reset.
A key feature of LEAP is that data is only stored during function context switches, which
are triggered by calls to a new function or returns from the previous function. Minimal logic
is used to dynamically update the Data Counter (as opposed to updating counters for each
function each clock cycle), only updating the storage corresponding to the relevant function at
these context switches. It is shown in Section 3.4.4.2 that using a single counter value is more
power efficient than techniques described in Section 2.1.1. Another important feature of LEAP
is that the system (profiler and processor) only requires resynthesis if any of the configurable
parameters are modified (discussed in Section 3.4.1); different applications can be run and
profiled without resynthesis or reprogramming the board. A novel aspect of the design is the
use of perfect hashing in hardware to associate code segments with hardware counters that
track dynamic run-time behavior. This use of hashing leads to considerably smaller overhead
versus previously-published hardware profiler designs.
The logical flow used in LEAP can be seen in Figure 3.1 for instruction-count profiling. In
the left branch, the PC is monitored to detect the occurrence of each newly-issued instruction.
The Data Counter is incremented each time a different instruction executes. The middle branch
detects function calls; when a call is made to a new function, the Data Counter is added to
the counter corresponding to the caller function. The Data Counter is then reset so that it
now represents only instructions executed in the callee function. At the same time, the caller
function number is stored to a stack so that when the callee function returns, the profiler is able
to keep track of function context. The right branch is used to detect function returns; when a
function executes a return statement, the Data Counter is added to the counter corresponding
to this function. Popping from the aforementioned stack yields the number of the function that
will be returned to, setting the current function context. The Data Counter is then reset again.
Figure 3.1: High-level flow chart for instruction-count profiling.
3.1.2 Components
To realize the flow in Figure 3.1 efficiently, five key modules were created, organized as shown
in Figure 3.2. The Operation Decoder (labeled “Op Decoder”) monitors the instruction bus
to determine whether a call or return has been executed. The Call Stack is used to keep track
of the currently executing function, since a return instruction does not indicate which function
will be returned to. The Data Counter is used to store scheme-specific profiling data for the
currently-executing function. The Counter Storage module takes data collected from the Data
Counter and updates the relevant function’s stored result. Lastly, the Address Hash module is
used to map the address decoded from a call instruction to a unique function number.
Figure 3.2: Modular view of profiling architecture.
3.1.2.1 Operation Decoder
In order to correctly identify when a function context switch occurs, call and return instructions
must be identified and the target of a call must be decoded.
A function call may be invoked by two MIPS instructions: JAL (jump-and-link), which jumps
to a predefined address and places the return address in register 31, and JALR (jump-and-link
register), which jumps to the address contained in a register, and places the return address in
register “rd.” As seen in Figure 3.3, the JAL instruction can be identified by comparing the
top six bits of the instruction to 000011, while the target can be decoded by extracting the
lower 26 bits of the instruction, shifted left by two. The JALR instruction can be identified by
comparing the top six bits of the instruction to 000000 and the lower six bits to 001001. Since
the target is stored in one of the 32 registers in the standard MIPS register file, it is not as easy
to decode because the profiler does not have access to the processor’s register file; due to the
open-source nature of the processor, these signals could be monitored, but this would restrict
the portability of the profiler. Special logic in the profiler delays use of this unknown jump
target until the actual jump occurs, at which time this new PC is used as the target address.
It should be noted that this does not delay the processor itself, only the profiler’s consumption
of data.
A function return can be invoked only by one instruction, JR (jump register), which jumps
to the address stored in a register. However, this instruction is also commonly used for simple
branching within a function, so the two uses must be differentiated. Since function calls always
store the return address to register 31, returns can be identified by any JR instruction using
register 31; this is determined by comparing the entire instruction to
0000 0011 1110 0000 0000 0000 0000 1000.
JAL: 0000 11ii iiii iiii iiii iiii iiii iiii
JALR: 0000 00ss sss0 0000 dddd d000 0000 1001
JR: 0000 00ss sss0 0000 0000 0000 0000 1000
Figure 3.3: Opcodes used to determine function context switches. Note: i represents bits of an immediate, s represents bits of the source register number, and d represents bits of the destination register number.
3.1.2.2 Call Stack
Since function names are not used at the machine-code level, they cannot be used to represent
functions in the profiler. Therefore a function is uniquely represented by the address of its first
instruction. This is quite useful, since the target of a function call also uses this address; this
target can be used to easily identify the function to which the processor is jumping. When a
function completes, the profiler must also determine which function the processor is returning
to. Two issues exist in regards to function returns: first, the target address is not encoded into
the return instruction, and second, the return address corresponds to the instruction located
just after the original function call, not to the first instruction of that function. This means
that the return address cannot be associated with a function without checking all functions to
see if the address lies between the start and end of the function.
A solution to this issue is to maintain a stack in which function indices (described below in
Section 3.1.2.5) are pushed during function calls and popped during function returns. In this
way, the current function can be tracked easily. To compactly represent function indices, the
Call Stack stores the hashes of the function addresses, which will be described in Section 3.1.2.5.
The stack is implemented using one Altera synchronous RAM (AltSyncRam), which is an
Altera-specific on-chip memory module [11]. The number of entries in the stack represents the
maximum nested call depth the profiler can support; for all experiments in this research, the
number of entries is 32. Since the maximum depth found in all benchmarks considered was only
8, seen in Table 3.1, this value is more than sufficient. This limit on the call depth impedes the
use of recursion in a target application; recursion support is discussed in Section 6.2.5.
Table 3.1: Maximum depth of nested function calls.
Benchmark    Max. Depth
ADPCM        4
AES          4
BLOWFISH     3
DFADD        5
DFDIV        5
DFMUL        3
DFSIN        6
GSM          3
JPEG         8
MIPS         1
MOTION       6
SHA          3
DHRYSTONE    3
3.1.2.3 Data Counter
The Data Counter module performs the actual measurement of any required metrics; the re-
mainder of the architecture is used to properly associate this data with the function it corre-
sponds to. In this way, the implementation of a new profiling task only requires the modification
of this module. For cycle-accurate profiling, the Data Counter increments every time the PC
changes (instruction count profiling), every cycle (cycle count profiling), or every cycle while
stalling (stall cycle profiling). These profiling schemes and the Data Counter are described in
more detail in Section 3.2.
3.1.2.4 Counter Storage
The Counter Storage module consists of one AltSyncRam instance and update logic. At any
function context switch, the RAM entry for the current function is read, added to the current
value of the Data Counter, and written back to the same address in the RAM. The function
address is hashed into a unique number that is used to index into the RAM; hashing details are
described in the following subsection. Through this design, all data for a particular function is
accrued correctly, regardless of the number of calls and returns.
3.1.2.5 Address Hash
To store data for each function in an efficient manner, the indices used to represent the functions
must be continuous over a compact range. The address of the first instruction in a function
is generally used to identify the function, but if these addresses were used to index into the
counter storage RAM, the address space of the RAM would need to be as large as that of the
processor. For a 32-bit PC, this would require 2^32 entries. This large address space is impossible
to realize using memory bits on an FPGA; therefore, the address space must be translated to
a more compact range. This translation is achieved through hashing.
Figure 3.4 illustrates the above issue and how it is resolved. A sample program written in
C is shown in part (a), alongside the corresponding MIPS assembly in part (b), condensed
for brevity. As indicated in part (b), only three function addresses are required, but their
values span the range 0x00800000 to 0x00800094. Part (c) shows the desired mapping of the
function addresses to a continuous set of indices.
(a) C Code Example:

int sqr(int a) {
    return a * a;
}

int sqr_add(int a, int b) {
    return sqr(a) + sqr(b);
}

int main() {
    int x = 5;
    int y = 6;
    int z = sqr_add(x, y);
    return z;
}

(b) Condensed MIPS Assembly:

00800000 <sqr>:
  800004:  mult a0, a0
  ...
  800024:  jr ra
  800028:  nop

0080002c <sqr_add>:
  ...
  800048:  jal 800000 <sqr>
  ...
  80005c:  jal 800000 <sqr>
  800060:  nop
  800064:  addu v0, s0, v0
  ...
  80008c:  jr ra
  800090:  nop

00800094 <main>:
  ...
  80009c:  li v0, 5
  ...
  8000a4:  li a1, 6
  ...
  8000b4:  jal 80002c <sqr_add>
  ...
  8000dc:  jr ra
  8000e0:  nop

(c) Example Function Address Hashing:

00800000 ⇐⇒ Function 0
0080002c ⇐⇒ Function 1
00800094 ⇐⇒ Function 2
Figure 3.4: Code example to illustrate the need for function address hashing.
A hash function is a mathematical function that maps a large set of data to a smaller set. A
perfect hash function is one which performs a unique mapping so that no two inputs map to
the same output. Perfect hashing is used to convert each function address in the program into
an index in the range 0 to N-1, where N is the number of functions in the program rounded to
the next power of two.
The perfect hash generator described in [32] is used to create an application-specific perfect
hash. This generator is based on a static hashing algorithm, shown in Figure 3.5, which uses
6 parameters that can be tailored to the specific application to guarantee collision-free hashes.
During compilation of the application to be profiled, all function addresses are extracted from
the executable’s source and passed through the perfect hash generator, outputting the cus-
tomized set of parameters for use with the algorithm in Figure 3.5. This algorithm is efficient
in hardware because it largely consists of simple logical operations, plus one memory access.
The perfect hash generator first uses normal hashing to determine (A,B) pairs for each input
key (function address) such that (A,B) are distinct for all keys. Parameters A1 and A2 are
determined from the number of bits in A required to find distinct (A,B) pairs; B1 and B2
are found similarly from the number of bits in B. Parameter V is determined from the seed
value that was used to find the distinct pairs. After the distinct pairs are found,
the tab array is populated to produce a mapping such that the hash is perfect. The parameter
sizes are as follows: tab (N-entry, 1-byte array), V (4 bytes), A1, A2, B1, and B2 (each 1 byte).
These parameters, totaling only (8+N) bytes, must be generated for each application that is to
be profiled to perform the unique mapping. At execution time of the program being profiled,
the profiler initializes the Address Hash module with these parameters. This initialization flow
is discussed further in Section 3.1.4.2.
3.1.3 Compilation Flow
Multiple compilation phases are required to generate a profilable executable, as shown in Fig-
ure 3.6. The first phase uses the LLVM [30] compiler infrastructure to compile the C source
code to MIPS assembly. The front-end of the compiler parses the C code into LLVM’s internal
// Input: val, output: rsl
// Parameters: tab[], V, A1, A2, B1, B2
int doHash(unsigned int val) {
    unsigned int a, b, rsl;

    val += V;
    val += (val << 8);
    val ^= (val >> 4);
    b = (val >> B1) & B2;
    a = (val + (val << A1)) >> A2;
    rsl = (a ^ tab[b]);

    return rsl;
}

Figure 3.5: Parameterized hashing algorithm used by the perfect hash generator (implemented in C).
bytecode representation. At this stage, hardware/software partitioning may be performed to
accelerate certain functions using LegUp’s high-level synthesis, as described in Section 2.3. The
partitioned bytecode is then linked with library code, and the back-end compiler generates a
MIPS assembly file. Due to LLVM’s limited support for the MIPS backend, LLVM cannot
create executable files; it can only be used to generate MIPS assembly code.
[Figure 3.6: flow diagram — C Code → LLVM → .s → GNU Binutils (assemble → .o, link → .elf) → Function Extraction → Perfect Hash Generator → .hash]

Figure 3.6: Compilation flow to generate programming files.
The second phase uses a cross-compiled version of GNU Binutils, a collection of binary tools
such as a linker and an assembler. Cross-compilation is the act of compiling an executable for a
computer architecture other than the one the compiler runs on. The cross-compiler used in this
work runs on an x86 processor but targets the MIPS architecture. It is used to first assemble
the MIPS assembly file produced by LLVM into an object file, then link this object file with
library code to produce machine code capable of running on the Tiger processor.
The third phase enables profiling of the executable by analyzing all declared functions and
determining the appropriate values with which the profiler’s perfect hashing algorithm should
be initialized. First, the name and address of each function in the executable are extracted and
stored to two files, with data for each function on a separate line. Then, the list of all function
addresses is input into the perfect hash generator. The generator outputs a “.hash” file, which
is an initialization file containing all parameters required in the perfect hashing algorithm. An
example hash initialization file is shown in Figure 3.7.
tab[] = {17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
V  = 0x62c41909
A1 = 14
A2 = 27
B1 = 2
B2 = 0x1

Figure 3.7: Example hash initialization file for a 32-function program.
3.1.4 Data Transfer
After the processor system has been synthesized and programmed to the FPGA, the target
MIPS application must also be programmed. This involves running the boot loader code on the
Tiger processor to load the entire application into program memory on the FPGA board being
used. To make the discussion in the rest of this section more concrete, this memory is assumed
to be 8MB of off-chip SDRAM. After the application is loaded, a run signal is sent to tell the
processor where to start execution. The application code is always loaded into the beginning
(lowest addresses) of the SDRAM, which is designated as address 0x800000 in the system. The
profiling initialization data, and later the profiling results, are allocated near the end of the
SDRAM, starting at address 0xFFE000. This region dedicated to profiling data consumes less
than 0.05% of the 8MB SDRAM and therefore will not interfere with the application. The
layout of the SDRAM is seen in Figure 3.8.
0x00800000   Application Data
    ...      Stack
0x00FFDFFC   Status Register
0x00FFE000   Profiler Data
0x00FFFFFF   (end of SDRAM)

Figure 3.8: An example memory layout of program memory, assuming 8MB off-chip SDRAM. Note: address ranges are not to scale, and the profiler data region is used for both initialization data and retrieval data.
3.1.4.1 Wrapper Function
Three important steps are required to properly execute and profile an application. First, the
profiler must be initialized with the hashing parameters specific to the program that will be
run. Next, the program must commence execution while the profiler, which is now initialized,
monitors the run-time behavior of the processor. Finally, after the program has completed, the
results of the profiler must be returned to the user. To this end, a wrapper function was created
whose sole purpose is to perform these tasks in order. The C code for the wrapper can be seen
in Figure 3.9.
The initialization data required by the address hash block, discussed in Section 3.1.2.5, is
specific to the application being run and must be reinitialized for every program. The wrapper
allows the processor to delay execution of the program until the profiler is ready, and provides
application-independent instruction addresses to inform the profiler when to begin initialization
and retrieval. Once the application begins, however, the wrapper does not interfere with the
program flow.
First, the processor clears address 0xFFDFFC, which is used as a status register to indicate
the state of execution. Two constants are used to indicate the completion of the initialization
and retrieval routines. The wrapper polls the status register until it equals 1, indicating that
the profiler has completed its initialization. The flushCache function is called before every read
of the status register to flush both the data and instruction caches. This is necessary because
the status register, located in main memory, is modified directly without the use of caching.
Since these updates are not reflected in the caches, a stale cached value would always be
read if the caches were not flushed.
Upon completion of the initialization routine, the details of which will be discussed in the
following subsection, the application is run by performing a call to main. Upon completion
of the program, execution returns to the wrapper and the profile results are gathered. The
retrieval routine, which will be discussed in Section 3.1.4.3, is commenced and again the status
register is polled until it equals 2, indicating the completion of data retrieval. At this point,
the application has completed and the profiling results are available on the SDRAM.
3.1.4.2 Initialization
The wrapper function provides application-independent instruction addresses to indicate when
initialization should be performed. The profiler monitors the processor’s PC until it is equal
to 0x800018, which corresponds to an instruction that precedes the initialization loop which
polls the status register in Figure 3.9. Since the wrapper is independent of the application to
execute, it is precompiled and linked in at compile-time; this guarantees that the monitored
address will always correspond to the same instruction.
By monitoring the PC, the profiler can recognize when the processor is executing the wrap-
per; it can then initialize its hashing constants V, A1, A2, B1, and B2, by directly reading
predetermined addresses from SDRAM, as shown in Figure 3.8. The tab array, consisting of
#define STACK_ADDR 0x00FFDFFC

void wrap() {
    int ret;
    unsigned long int ProfilerStatus;

    // Reset ProfilerStatus
    *(volatile unsigned long int *)(STACK_ADDR) = 0;

    // Wait for profiler to be initialized
    do {
        flushCache();
        ProfilerStatus = *(volatile unsigned long int *)(STACK_ADDR);
    } while (ProfilerStatus != 1);

    // Run actual program
    ret = main();

    // Wait for profiler retrieval to be finished
    do {
        flushCache();
        ProfilerStatus = *(volatile unsigned long int *)(STACK_ADDR);
    } while (ProfilerStatus != 2);
}

Figure 3.9: Application-independent wrapper to enable initialization and data retrieval.
N bytes (where N is the maximum number of functions to profile), is read next and stored
to a designated RAM block within the profiler. The details of these hashing parameters were
discussed previously in Section 3.1.2.5.
Next, all modules must be reset, a step which is especially important for the Call Stack and
Counter Storage modules. The Call Stack is reset by setting the stack pointer to 0, while the
Counter Storage must explicitly write a 0 into each entry.
The final task is to initialize the Call Stack to set the wrapper function as the top level of
hierarchy. This is required to ensure correct behaviour after the application’s main function
returns; if the Call Stack was not initialized properly, main would be located in entry 0 of the
stack when the return instruction executes. This return causes a pop from an empty stack,
which is incorrect behavior. By pushing the wrapper onto the stack first, the return from main
will no longer cause a pop from an empty stack as there is still one element remaining.
Upon completion of the initialization routine, the profiler must instruct the processor to
continue execution. This is accomplished by setting the status register, located at address
0xFFDFFC, equal to 1. When this value has been written successfully, the profiler begins
monitoring the processor to profile its execution.
3.1.4.3 Data Retrieval
Data retrieval begins when execution returns to the wrapper function after completing its call
to main, corresponding to a PC value of 0x800074. The first step in data retrieval is to write
all profiling results, which are stored in RAM blocks within the profiler, to a predetermined
location in SDRAM. This region begins at address 0x00FFE000, which is also the location to
which the hash initialization data is written. These results, read and written in order of hashed
function number, are stored in contiguous locations in memory. Once the writes to SDRAM
have finished, the profiler instructs the processor to complete execution by setting the status
register equal to 2. The profiling results are passed to the host computer using the MIPSLoad
program discussed previously in Section 2.2.3.
3.2 Cycle and Stall Cycle Profiling
As described in Section 1.3, two goals of the proposed hardware/software partitioning scheme
are to maximize throughput and to exploit data locality in a prospective accelerator. Through-
put can be maximized by accelerating functions that consume the largest number of cycles
during execution. Temporal data locality is exploited through the use of a cache, so that
recently-used data can be accessed more quickly. Spatial locality is exploited through the use
of cache bursts, which involve reading four consecutive cache entries for every read request.
Therefore, it is logical to associate high data locality with a high cache-hit rate. Thus, data
locality can be taken advantage of by accelerating those functions that spend the fewest cycles
stalling on cache misses relative to the total number of cycles they execute.
3.2.1 Hardware Implementation
The proposed profiling architecture is easily adapted to measure the data required by the above
schemes, namely clock cycles and cycles spent stalling. In fact, the only component that requires
modification is the Data Counter. The initial framework was designed to measure the number
of instructions executed per function, which is measured by incrementing the counter whenever
the PC changes, as shown in Figure 3.10a. The total number of cycles executed per function
is collected by incrementing the counter every clock cycle, as shown in Figure 3.10b. The total
number of cycles consumed by cache stalls is found by incrementing the counter every clock
cycle in which either cache has its stall signal asserted, as shown in Figure 3.10c.
(a) Instruction count configuration: increment enabled on pc_diff. (b) Cycle count configuration: increment enabled every clock cycle. (c) Stall cycle count configuration: increment enabled on icache_stall or dcache_stall.

Figure 3.10: Architecture diagrams of Data Counter module for cycle-accurate profiling.
3.3 Hierarchical Profiling
The previously described profiling schemes measure the instructions, cycles, or stall cycles
spent within each function of the profile. This flat profile is very useful for determining specific
functions that consume much of the execution time, but does not provide any information about
the flow or structure of an application. Hierarchical cycle profiling, on the other hand, measures
the number of cycles executed within the function plus all of its descendants. This is especially
applicable to hardware/software partitioning within LegUp since any function that is chosen
for acceleration also causes any descendant functions to be accelerated. Since the acceleration
is inherently hierarchical, the profiling tool used to partition should also be able to provide
hierarchical results.
To implement hierarchical profiling, data counts must be bubbled upwards at each function
return to count towards its parent function. This is achieved through the creation of a Hierarchy
Stack module. At every function call, instead of storing the Data Counter for the calling function
immediately into the Counter Storage module, this value is written into the top of the Hierarchy
Stack, indicated by a stack pointer (SP), and SP is incremented. At each function return, the
top element of the Hierarchy Stack is read, added to the current value of the Data Counter,
and SP is decremented. This sum is stored into the Counter Storage module, and the Hierarchy
Stack is updated by adding the sum to the entry pointed to by SP.
3.4 Experiments
3.4.1 Configurable Parameters
This profiling framework contains many parameters that can be modified pre-synthesis to
tailor its operation to the requirements of a given application. These parameters are
implemented with Verilog define statements in a single configuration file for the profiler.
The parameters are:

- Stack depth (S): the maximum call depth the profiler is to support
- Number of functions (N): the maximum number of functions the profiler is to support,
  rounded to the next power of two
- Counter width (CW): the total number of bits used to store each function's results
- Profiling scheme (PROF_SCHEME): a single character to describe one of the four
  available profiling schemes: i for Instruction Count, c for Cycle Count, s for Stall Cycle
  Count (all discussed in Section 3.2), and p for Power Profiling (discussed in Chapter 4)
- Hierarchical profiling (DO_HIER): a binary value indicating whether the gathered
  profile will be flat (0) or hierarchical (1)

For the experiments performed in this section, S is set to 32 entries, N varies from 16 to 256
(at powers of 2), CW varies from 16 to 64 (at powers of 2), PROF_SCHEME is set to match
the scheme being tested, and unless otherwise stated, DO_HIER is set to 0.
3.4.2 Experimental Benchmarks
To verify the accuracy of the profiling framework, an extensive benchmark suite was tested.
These 13 programs are written in C, and each includes self-contained test vectors to drive the
application's execution and ensure correct behavior. The benchmark suite, summarized in
Table 3.2, includes the twelve CHStone [29] benchmarks, which were developed for testing
HLS tools, plus Dhrystone [49], a classic integer computing benchmark. A wide array of
applications is covered, including encoding and decoding, double-precision mathematics,
encryption, image decompression, and a MIPS processor.
3.4.3 Comparison with Software Profiler
This section describes the comparison of the proposed profiling framework, LEAP, to a widely-
used software profiling tool called gprof. Gprof employs sampling to minimize its performance
overhead, which consequently reduces its accuracy. Function-level profiles of each benchmark
described in Section 3.4.2 are generated with LEAP and gprof, and these results are compared
to show the accuracy and performance gains of non-invasive profiling over sampling-based
profiling. Both flat and hierarchical profiling methods are compared.
Table 3.2: Description of the benchmarks used in experiments.

                                                                          Lines of
Benchmark    Description                                                  C Code

CHStone:
ADPCM        Adaptive differential pulse code modulation decoder and
             encoder                                                         541
AES          Advanced encryption standard                                    716
BLOWFISH     Data encryption standard                                      1,406
DFADD        Double-precision floating-point addition                        526
DFDIV        Double-precision floating-point division                        436
DFMUL        Double-precision floating-point multiplication                  376
DFSIN        Sine function for double-precision floating-point numbers       755
GSM          Linear predictive coding analysis of global system for
             mobile communications                                           393
JPEG         JPEG image decompression                                      1,692
MIPS         Simplified MIPS processor                                       232
MOTION       Motion vector decoding of MPEG-2                                583
SHA          Secure hash algorithm                                         1,284

DHRYSTONE    Synthetic integer programming benchmark                         301
Methodology
Using the compilation flow described in Section 3.1.3, the required files to initialize and perform
profiling with LEAP are generated. To profile an application with gprof, GCC is used with the
flag "-pg" to compile the source C code into an executable with inserted profiling commands.
Due to the sampling employed in gprof, which defaults to intervals of 10 ms, each benchmark
was modified slightly so that it executed 10,000,000 times, thus reducing the sampling error
but greatly increasing the execution time. This executable is then run to completion, which
produces an output file “gmon.out.” Finally, gprof is called, which analyzes the executable
and output files to create a call graph profile. The benchmarks were compiled and run on an
x86-based computer for use with gprof, as no MIPS-based computer was available.
Results
The profiling results of gprof and LEAP can be seen in Tables 3.3 and 3.4 for the benchmarks
"ADPCM" and "DFMUL," respectively. The full set of results for all 13 benchmarks is
available in Appendix A.1.1. The results for gprof are shown as the self time that each function
executed, along with the function’s percentage of the total benchmark execution time. The
hierarchical time, which includes the self time plus all time spend in descendant functions, is
also shown along with the function’s hierarchical percentage of total execution time. LEAP
results are given as the number of clock cycles taken to execute each function, along with
the function's percentage of execution time, the hierarchical clock cycles per function, and the
hierarchical percentage of total execution time. The last two columns show the differences in
percent execution time for the self time and hierarchical time.
It can be seen that although discrepancies arise between the percent execution times deter-
mined from gprof and LEAP, most functions are ranked similarly by both profilers for both flat
(self time) and hierarchical profiling. Two sources of error account for these discrepancies:
first, the sampling inaccuracies incurred by gprof produce estimated profiles as opposed
to exact ones. Second, the target architectures of the two profilers differ; LEAP is designed to
profile a MIPS processor, while the gprof used in these experiments targets an x86 processor.
Processor characteristics such as cache properties and organization, in addition to the obvious
micro-architectural differences, therefore result in slightly different benchmark execution.
To verify the accuracy of these results, LEAP is compared with another non-intrusive hardware
profiler in the following section.
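The "ranked similarly" observation can be made concrete by ordering functions by self time under each profiler and comparing the orderings. The sketch below uses the four hottest ADPCM functions from Table 3.3 (gprof self time in seconds, LEAP self time in cycles):

```python
# Self-time results for the four hottest ADPCM functions (Table 3.3).
gprof_self = {"encode": 145.12, "decode": 117.31, "upzero": 116.23, "filtez": 77.93}
leap_self  = {"encode": 62193,  "decode": 63863,  "upzero": 32614,  "filtez": 14708}

def ranking(profile):
    """Return function names ordered from most to least self time."""
    return sorted(profile, key=profile.get, reverse=True)

gprof_rank = ranking(gprof_self)  # ['encode', 'decode', 'upzero', 'filtez']
leap_rank = ranking(leap_self)    # ['decode', 'encode', 'upzero', 'filtez']

# The orderings agree except that the top two functions swap places,
# reflecting the small percentage discrepancies between the profilers.
```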
3.4.4 Comparison with FPGA-Based Profiler
To demonstrate the utility of this profiling framework, experiments were designed to compare
LEAP to another FPGA-based hardware profiler with similar goals. SnoopP, described in Sec-
tion 2.1, was implemented in Verilog with the same initialization and data retrieval mechanisms
as LEAP to allow a fair comparison. SnoopP was chosen as a comparative baseline for several
reasons: it is a popular FPGA-based profiler that has been cited many times and included
in a recent profiler comparison [45]. In addition, its source code is very general, which makes
the profiler easy to use with Tiger. Finally, due to its non-intrusive nature, the cycle-accurate
Table 3.3: Comparative results of gprof vs. LEAP for the benchmark “ADPCM.”

                      --------- (a) gprof ---------   ---------- (b) LEAP ----------   (c) Abs. Difference
Function              Self (s)        Hier. (s)       Self Cycles     Hier. Cycles     Self     Hier.
encode                145.12 22.4%    344.51 53.2%    62193 26.63%    118422  50.7%    4.23%    2.47%
decode                117.31 18.1%    281.81 43.5%    63863 27.34%    102578  43.9%    9.24%    0.42%
upzero                116.23 17.9%    116.23 17.9%    32614 13.96%     32589  14.0%    3.97%    3.99%
filtez                 77.93 12.0%     77.93 12.0%    14708  6.30%     14683   6.3%    5.73%    5.74%
uppol2                 40.22  6.2%     40.22  6.2%    11209  4.80%     11212   4.8%    1.41%    1.41%
quantl                 34.89  5.4%     34.89  5.4%    14177  6.07%     14743   6.3%    0.69%    0.93%
filtep                 28.86  4.5%     28.86  4.5%     3655  1.56%      3655   1.6%    2.89%    2.89%
scalel                 21.48  3.3%     21.48  3.3%     2902  1.24%      2888   1.2%    2.07%    2.08%
uppol1                 21.10  3.3%     21.10  3.3%     8097  3.47%      8097   3.5%    0.21%    0.21%
logscl                 13.92  2.1%     13.92  2.1%     3425  1.47%      3440   1.5%    0.68%    0.68%
adpcm_main             10.06  1.6%    641.72 99.0%     4588  1.96%    228355  97.8%    0.41%    1.27%
logsch                  9.24  1.4%      9.24  1.4%     3100  1.33%      3097   1.3%    0.10%    0.10%
main                    5.74  0.9%    647.46 99.9%     5122  2.19%    233486 100.0%    1.31%    0.04%
reset                   5.34  0.8%      5.34  0.8%     2755  1.18%      2764   1.2%    0.36%    0.36%
abs                     0.49  0.1%      0.49  0.1%     1155  0.49%      1155   0.5%    0.42%    0.42%
Table 3.4: Comparative results of gprof vs. LEAP for the benchmark “DFMUL.”

                            --------- (a) gprof ---------   ---------- (b) LEAP ----------   (c) Abs. Difference
Function                    Self (s)        Hier. (s)       Self Cycles     Hier. Cycles     Self     Hier.
float64_mul                  5.34 33.9%     13.66 86.8%     5051 46.2%       9138  83.5%     12.25%   3.25%
mul64To128                   1.86 11.8%      1.86 11.8%      788  7.2%        788   7.2%      4.61%   4.61%
main                         1.71 10.9%     15.37 97.6%     1815 16.6%      10965 100.2%      5.73%   2.59%
extractFloat64Frac           1.42  9.0%      1.42  9.0%      289  2.6%        289   2.6%      6.38%   6.38%
extractFloat64Exp            1.17  7.4%      1.17  7.4%      287  2.6%        287   2.6%      4.81%   4.81%
roundAndPackFloat64          1.05  6.7%      1.35  8.6%     1212 11.1%       1263  11.5%      4.41%   2.97%
propagateFloat64NaN          0.95  6.0%      1.58 10.0%      722  6.6%       1015   9.3%      0.56%   0.76%
extractFloat64Sign           0.58  3.7%      0.58  3.7%      279  2.6%        279   2.6%      1.13%   1.13%
packFloat64                  0.57  3.6%      0.57  3.6%      129  1.2%        129   1.2%      2.44%   2.44%
float64_is_signaling_nan     0.33  2.1%      0.33  2.1%      171  1.6%        171   1.6%      0.53%   0.53%
float64_is_nan               0.31  2.0%      0.31  2.0%      120  1.1%        120   1.1%      0.87%   0.87%
normalizeFloat64Subnormal    0.24  1.5%      0.24  1.5%        0  0.0%          0   0.0%      1.52%   1.52%
shift64RightJamming          0.10  0.6%      0.10  0.6%        0  0.0%          0   0.0%      0.64%   0.64%
float_raise                  0.09  0.6%      0.09  0.6%       76  0.7%         76   0.7%      0.12%   0.12%
countLeadingZeros64          0.02  0.1%      0.02  0.1%        0  0.0%          0   0.0%      0.13%   0.13%
results are exact, providing a reliable baseline for comparison.
3.4.4.1 Profiling Results Comparison
The first set of experiments between LEAP and SnoopP was performed to verify the cycle-accurate
results produced by LEAP.
Methodology
To perform comparative experiments to verify the results of LEAP, four systems needed to be
created and synthesized: 1) LEAP measuring clock cycles, 2) LEAP measuring stall cycles, 3)
SnoopP measuring clock cycles, and 4) SnoopP measuring stall cycles. Each of these systems
was programmed to the target FPGA board in turn to generate data for the corresponding
configuration. For these experiments, the maximum number of functions to be profiled (N)
was set to 64 and the counter width (CW ) was set to 32 bits.
Using the compilation flow described in Section 3.1.3, the 13 benchmarks were compiled to
ELF format and the hash initialization files for LEAP were created. Although [43] describes
the design of SnoopP to use pre-synthesis parameters for the upper and lower bounds of the
profiling ranges, it was decided that these parameters should be programmable at run-time
for two reasons: 1) to compare the two profilers as fairly as possible, and 2) to eliminate the
need to resynthesize the profiler for each benchmark to be profiled. Therefore, the proper
initialization of SnoopP requires setting values for the upper and lower bounds of each range;
for these experiments, each range corresponds to a function in the benchmark. These ranges
are determined through analysis of each benchmark, and are stored in “.range” files. This
initialization data is loaded into SDRAM in the same way as the hash data (discussed in
Section 3.1.4), and later used to initialize the profiler by reading each range from SDRAM and
writing it to a corresponding register within the profiler.
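The range initialization described above can be sketched in software. The record layout below (one little-endian 32-bit lower/upper bound pair per profiled function) is an illustrative assumption, not the actual “.range” file format used in the thesis:

```python
import struct

# Hypothetical ".range" record: one (lower, upper) PC bound pair per
# profiled function, packed as two 32-bit words for loading into SDRAM.
def pack_ranges(func_bounds):
    """func_bounds: list of (lower_pc, upper_pc) tuples, one per function."""
    data = b""
    for lower, upper in func_bounds:
        assert lower <= upper, "range bounds must be ordered"
        data += struct.pack("<II", lower, upper)
    return data

# Example: three functions occupying contiguous code regions (invented PCs).
ranges = [(0x0040_0000, 0x0040_00FF),
          (0x0040_0100, 0x0040_02FF),
          (0x0040_0300, 0x0040_03FF)]
blob = pack_ranges(ranges)  # 8 bytes per profiled function
assert len(blob) == 8 * len(ranges)
```

At run time, each packed pair would be read back from SDRAM and written to the corresponding bound registers inside the profiler.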
Results
The profiling results of LEAP and SnoopP are shown in Tables 3.5 and 3.6 for the benchmarks
“ADPCM” and “DFMUL,” respectively. The full set of results for all 13 benchmarks can be
found in Appendix A.1.2. These tables show, for each function in the benchmark, the total
number of cycles executed and the total number of cycles stalled, as measured by each profiler.
The difference between the profilers, measured in clock cycles, is shown.
The difference between the results of LEAP and SnoopP, although very small, warrants
explanation. The experiments were performed using only one profiler on-chip at a time, meaning
the results are for different runs of the same benchmark, with all other configurations the same.
These differences can be attributed to small variations in SDRAM memory access times on
the DE2 board. To show that the discrepancies between LEAP and SnoopP are not accuracy
errors, but rather the result of run variations of the system, Table 3.7 shows the cycle count
results for four runs of the benchmark “DFMUL,” for each profiler. The standard deviation of
the number of cycles executed in each function was calculated for LEAP, SnoopP, as was the
deviation among all eight runs. It can be seen that the standard deviation of the eight runs
is less than the individual profiler deviations, indicating that the results of each profiler only
differ by the variations between runs.
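This repeatability argument can be reproduced directly from the float64_mul row of Table 3.7, using sample standard deviations (n−1 denominator):

```python
from statistics import stdev

# Cycle counts for float64_mul over four runs per profiler (Table 3.7).
leap_runs = [5050, 5083, 5074, 5055]
snoopp_runs = [5075, 5088, 5029, 5068]

leap_dev = round(stdev(leap_runs), 2)                   # 15.59
snoopp_dev = round(stdev(snoopp_runs), 2)               # 25.39
overall_dev = round(stdev(leap_runs + snoopp_runs), 2)  # 19.51

# The pooled eight-run deviation is of the same magnitude as the
# per-profiler deviations, so the LEAP/SnoopP differences are within
# normal run-to-run variation rather than measurement error.
```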
3.4.4.2 Area and Power Comparison
The second set of experiments between LEAP and SnoopP was performed to measure
the overhead incurred by the use of each profiler. For these experiments, overhead refers to the
area requirements and power consumption of each profiler.
Methodology
The area of each profiler was found by synthesizing the design with Altera’s Quartus II. As
discussed in Section 3.4.1, the number of functions to be profiled (N) was varied from 16 to
256 in powers of 2, while the counter width (CW) was set to 16, 32, and 64 bits. In addition to these
Table 3.5: Comparative results of LEAP vs. SnoopP for the benchmark “ADPCM.”

              ------- (a) Cycles -------   ---- (b) Stall Cycles ----
Function        LEAP    SnoopP   Diff.      LEAP   SnoopP   Diff.
decode         63863     63863      0       3656     3652      4
encode         62193     62204    -11       4245     4271    -26
upzero         32614     32620     -6        948      942      6
filtez         14708     14708      0        695      691      4
quantl         14177     14175      2        737      723     14
uppol2         11209     11209      0        230      233     -3
uppol1          8097      8097      0        127      126      1
main            5122      5122      0       1814     1829    -15
adpcm_main      4588      4588      0       1641     1626     15
filtep          3655      3655      0         56       55      1
logscl          3425      3425      0        234      234      0
logsch          3100      3112    -12        134      133      1
scalel          2902      2902      0        288      292     -4
reset           2755      2768    -13       2332     2341     -9
abs             1155      1155      0         56       55      1
Total         233563    233603    -40      17193    17203    -10
two parameters, the profiling scheme is also varied to cover instruction count profiling, cycle
profiling, and stall cycle profiling. From the compilation reports generated for each configuration,
the total number of logic elements, registers, and memory bits are obtained. The maximum
frequencies are also gathered.
The power overhead of each profiler in the configurations mentioned above was measured
using ModelSim [36] and the Quartus PowerPlay Analyzer [13]. Quartus is able to produce
timing-based simulation files for use with ModelSim. The system was simulated using timing
simulation for the profiler and functional simulation for the rest of the system, for each of
the benchmarks previously described. The system was simulated in this way so that only the
power consumption of the profiler was measured, not that of the rest of the system. Upon completion of the
simulation, a Value Change Dump (VCD) file is produced containing all signal transitions found
in the timing portion of simulation. The VCD file is read by Quartus PowerPlay to determine
the average dynamic power consumption of the simulation.
Table 3.6: Comparative results of LEAP vs. SnoopP for the benchmark “DFMUL.”

                           ------- (a) Cycles -------   ---- (b) Stall Cycles ----
Function                     LEAP    SnoopP   Error      LEAP   SnoopP   Error
float64_mul                  5051      5077     -26      1855     1830     25
main                         1815      1827     -12      1074     1082     -8
roundAndPackFloat64          1212      1227     -15       479      465     14
mul64To128                    788       788       0       273      288    -15
propagateFloat64NaN           722       724      -2       488      488      0
extractFloat64Frac            289       289       0        49       49      0
extractFloat64Exp             287       287       0        47       47      0
extractFloat64Sign            279       279       0        40       39      1
packFloat64                   129       129       0        39       39      0
float64_is_nan                120       120       0        71       71      0
float64_is_signaling_nan      171       171       0        87       87      0
float_raise                    76        76       0        62       62      0
Total                       10939     10994     -55      4564     4547     17
Table 3.7: Repeatability testing for cycle-accurate profiling of benchmark “DFMUL.”

                           ----------- (a) LEAP -----------    ---------- (b) SnoopP ----------   (c) Overall
Run Number                    1     2     3     4     Dev.       1     2     3     4     Dev.        Dev.
float64_mul                5050  5083  5074  5055    15.59    5075  5088  5029  5068    25.39       19.51
main                       1826  1819  1823  1828     3.92    1819  1839  1822  1827     8.81        6.48
roundAndPackFloat64        1212  1212  1226  1212     7.00    1212  1221  1220  1227     6.16        6.56
mul64To128                  794   788   788   797     4.50     791   788   788   788     1.50        3.49
propagateFloat64NaN         751   724   739   753    13.35     724   724   733   724     4.50       12.40
extractFloat64Frac          289   289   289   289     0.00     289   289   289   289     0.00        0.00
extractFloat64Exp           287   287   287   287     0.00     287   287   287   287     0.00        0.00
extractFloat64Sign          279   279   279   279     0.00     279   279   282   279     1.50        1.06
packFloat64                 129   129   129   129     0.00     129   129   129   129     0.00        0.00
float64_is_nan              120   120   120   120     0.00     120   129   120   120     4.50        3.18
float64_is_signaling_nan    171   171   171   171     0.00     185   171   171   171     7.00        4.95
float_raise                  76    76    76    76     0.00      76    76    76    76     0.00        0.00
Total                     10984 10977 11001 10996    10.97   10986 11020 10946 10985    30.25       21.25
Results
The area overhead and maximum frequency for the LEAP profiler can be seen in Table 3.8 for
three profiling schemes: Part (a) shows results for instruction profiling, part (b) shows results
for cycle profiling, and part (c) shows results for stall cycle profiling. It can be seen that these
three schemes have a very small area variation between them, which is because only the Data
Counter module changes among the configurations, requiring extra comparators to check if the
PC has changed (instruction profiling), or if the processor is stalled (stall cycle profiling).
Table 3.8: Area overhead of LEAP for three profiling schemes.

           ----- (a) Instruction Profiling -----    -------- (b) Cycle Profiling --------    ----- (c) Stall Cycle Profiling -----
CW   N     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)
16   16    1,402   615    4,352       167.3         1,374   608    4,352       169.1         1,394   619    4,352       172.0
16   32    1,444   622    4,608       157.0         1,503   686    4,608       160.6         1,451   645    4,608       157.3
16   64    1,520   671    5,120       158.8         1,481   664    5,120       166.9         1,438   617    5,120       149.2
16   128   1,536   647    6,144       160.3         1,517   655    6,144       165.3         1,508   643    6,144       165.2
16   256   1,553   677    8,192       146.8         1,531   656    8,192       144.9         1,498   648    8,192       161.5
32   16    1,556   700    4,608       163.2         1,494   675    4,608       167.3         1,530   748    4,608       174.7
32   32    1,632   742    5,120       169.2         1,594   714    5,120       152.2         1,559   694    5,120       161.1
32   64    1,660   737    6,144       165.8         1,641   751    6,144       157.3         1,658   783    6,144       161.0
32   128   1,709   757    8,192       151.4         1,656   710    8,192       152.3         1,659   717    8,192       154.1
32   256   1,747   788   12,288       159.4         1,660   747   12,288       148.1         1,719   796   12,288       158.2
64   16    1,648   776    5,120       151.2         1,618   749    5,120       151.1         1,583   734    5,120       150.8
64   32    1,732   801    6,144       147.9         1,649   748    6,144       150.9         1,689   787    6,144       150.9
64   64    1,779   819    8,192       147.3         1,698   770    8,192       150.8         1,678   770    8,192       150.7
64   128   1,801   811   12,288       151.2         1,743   771   12,288       151.2         1,781   803   12,288       150.4
64   256   1,754   810   20,480       150.8         1,793   840   20,480       151.2         1,783   794   20,480       150.2
The area of LEAP can be compared to that of Tiger, which consumes 13,422 LEs. For the
parameter configurations used for the experiments in Section 3.4.4.1, CW=32 and N=64, the
area overhead of LEAP is 12.37%, 12.23%, and 12.35% for instruction, cycle, and stall cycle
profiling, respectively. The maximum frequency of the profiler ranges from 144.9 to 174.7 MHz,
depending on the configuration of the profiler. Although this is much greater than 75.59 MHz,
the maximum frequency of Tiger, it may not be fast enough to use with a soft processor such as
Nios II when that processor is set up to run at its peak speed (Nios II-fast can achieve frequencies
of up to 200 MHz). Upon investigation, it was observed that the critical paths of the profiler
are contained in the Address Hash module and, with larger counter widths, the RAM modules
used to store the Data Counter. If required, the Data Storage module could be pipelined to
alleviate this portion of the critical path.
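The overhead percentages quoted above follow directly from the Table 3.8 LE counts for the CW=32, N=64 configuration, relative to Tiger's 13,422 LEs:

```python
TIGER_LES = 13422  # logic elements consumed by the Tiger processor

# LEAP logic elements for CW=32, N=64 (Table 3.8).
leap_les = {"instruction": 1660, "cycle": 1641, "stall": 1658}

# Area overhead of each profiling scheme as a percentage of Tiger's area.
overhead = {scheme: round(100 * les / TIGER_LES, 2)
            for scheme, les in leap_les.items()}
# {'instruction': 12.37, 'cycle': 12.23, 'stall': 12.35}
```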
Table 3.9 shows the area overhead and maximum frequency of LEAP when configured for
hierarchical profiling. These results are shown for instruction, cycle, and stall cycle profiling.
It can be seen that the extra logic required to implement hierarchical profiling incurs penalties
in logic elements, memory bits, and Fmax. Logic elements, compared to the results of the flat
profiler in Table 3.8, increase by 6.6% in the smallest case (CW=16, N=16) to 18.3% in the
largest case (CW=64, N=256). Extra memory bits are consumed solely by the Hierarchy Stack,
which is a 32×CW AltSyncRam. Fmax decreases by 2-20% for most configurations as a
result of the increased combinational logic complexity. However, in a few cases the maximum
frequency increases despite the expected downward trend; these cases are believed to arise
from random variations in placement and routing.
Table 3.9: Area overhead of LEAP for three profiling schemes with Hierarchical Profiling.

           ----- (a) Instruction Profiling -----    -------- (b) Cycle Profiling --------    ----- (c) Stall Cycle Profiling -----
CW   N     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)     LEs    Regs   Mem. Bits   Fmax (MHz)
16   16    1,495   672    4,864       162.6         1,440   634    4,864       174.0         1,506   651    4,864       170.6
16   32    1,537   658    5,120       147.7         1,565   691    5,120       155.0         1,546   651    5,120       150.1
16   64    1,656   759    5,632       165.7         1,552   660    5,632       154.8         1,566   672    5,632       160.0
16   128   1,649   700    6,656       157.5         1,587   683    6,656       162.3         1,626   705    6,656       161.7
16   256   1,661   705    8,704       149.4         1,645   715    8,704       153.4         1,670   730    8,704       158.4
32   16    1,770   829    5,632       157.7         1,697   762    5,632       174.4         1,736   775    5,632       174.0
32   32    1,840   873    6,144       159.4         1,797   800    6,144       154.1         1,821   811    6,144       166.9
32   64    1,884   858    7,168       167.3         1,881   843    7,168       153.5         1,890   898    7,168       167.5
32   128   1,975   955    9,216       168.0         1,875   814    9,216       159.1         1,887   852    9,216       159.1
32   256   1,914   867   13,312       165.5         1,887   826   13,312       150.7         1,927   859   13,312       158.0
64   16    1,882   896    7,168       122.2         1,963   934    7,168       124.9         1,857   886    7,168       127.2
64   32    1,954   921    8,192       125.4         1,918   892    8,192       126.0         1,906   885    8,192       124.4
64   64    1,984   911   10,240       127.4         1,941   895   10,240       124.1         1,973   915   10,240       125.6
64   128   2,010   942   14,336       124.6         2,100   977   14,336       124.4         2,083   985   14,336       121.9
64   256   2,075   961   22,528       122.7         2,031   932   22,528       124.7         2,034   934   22,528       124.7
Table 3.10 shows the area overhead for two profilers, LEAP and SnoopP, for flat profiling. It
should be noted that since SnoopP does not require memory bits, the column was omitted. In
part (c) of the table, the profilers are compared by taking the ratio of LEAP over SnoopP in two
ways: first, only logic elements are compared, and second, the “total area” is estimated. When
considering only logic elements, it can be seen that LEAP is at least 11% smaller than SnoopP
in all configurations, and achieves up to 97% area reduction for CW=64 and N=256. This
comparison incurs some inaccuracies, as it does not take memory bits into account; memory
bits are much smaller than LEs, and are used in LEAP but not SnoopP. Memory bits correspond
to one SRAM cell on the FPGA, plus decode logic. Logic elements, however, contain a four-input
look-up table (LUT), a dedicated register, and muxing logic. A four-input LUT alone contains
16 SRAM cells, so we pessimistically assume that each SRAM bit is equivalent to 1/16 of a LUT
in terms of silicon area. It is in fact much less than this, as the memory bit decoding circuitry
is amortized over thousands of bits, whereas multiplexor decoding circuitry is needed for each
LE. Therefore the final column in Table 3.10 shows the ratio of “Total Area,” an approximation
represented by the equation A = L + M/16, where A is the total area, L is the number of logic
elements, and M is the number of memory bits.
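Applying this total-area approximation to the smallest and largest cycle-profiling configurations reproduces the ratios reported in part (c) of Table 3.10:

```python
def total_area(logic_elements, memory_bits):
    """Estimated total area A = L + M/16, counting each memory bit as
    (pessimistically) 1/16 of a logic element."""
    return logic_elements + memory_bits / 16

# Smallest and largest cycle-profiling configurations from Table 3.10.
leap_small = total_area(1402, 4352)    # CW=16, N=16 -> 1674.0 equivalent LEs
snoopp_small = total_area(1572, 0)     # SnoopP uses no memory bits
leap_large = total_area(1754, 20480)   # CW=64, N=256
snoopp_large = total_area(54790, 0)

small_ratio = round(leap_small / snoopp_small, 2)  # 1.06
large_ratio = round(leap_large / snoopp_large, 2)  # 0.06
```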
Table 3.10: Area overhead comparison of LEAP and SnoopP, both configured for cycle profiling.

           ------------ (a) LEAP ------------    ------ (b) SnoopP ------    (c) LEAP/SnoopP
CW   N     LEs    Regs   Mem. Bits  Fmax (MHz)   LEs     Regs    Fmax (MHz)  LEs    Total Area
16   16    1,402   615    4,352      167.3        1,572   1,226   138.3      0.89   1.06
16   32    1,444   622    4,608      157.0        2,865   2,320   130.7      0.50   0.60
16   64    1,520   671    5,120      158.8        5,458   4,502   135.8      0.28   0.34
16   128   1,536   647    6,144      160.3       10,630   8,860   122.8      0.14   0.18
16   256   1,553   677    8,192      146.8       20,984  17,566   116.2      0.07   0.10
32   16    1,556   700    4,608      163.2        2,008   1,498   114.0      0.77   0.92
32   32    1,632   742    5,120      169.2        3,725   2,847   112.8      0.44   0.52
32   64    1,660   737    6,144      165.8        7,179   5,540   109.0      0.23   0.28
32   128   1,709   757    8,192      151.4       14,047  10,921   103.2      0.12   0.16
32   256   1,747   788   12,288      159.4       27,807  21,678    97.6      0.06   0.09
64   16    1,648   776    5,120      151.2        2,855   2,012    86.8      0.58   0.69
64   32    1,732   801    6,144      147.9        5,424   3,873    85.3      0.32   0.39
64   64    1,779   819    8,192      147.3       10,587   7,590    83.6      0.17   0.22
64   128   1,801   811   12,288      151.2       20,872  15,019    79.3      0.09   0.12
64   256   1,754   810   20,480      150.8       54,790  29,872    89.3      0.03   0.06
The comparison in part (c) of Table 3.10 aims to provide a fairer area comparison than
considering LEs alone. It shows that LEAP consumes less area than SnoopP in all but one
configuration, with a maximum reduction of 94%. The single configuration in which LEAP
consumes more resources than SnoopP is the smallest configuration possible, with N=CW=16.
These two comparisons form bounds on the ratio of the true area consumptions of the two
profilers. The first comparison assumes memory bits are of zero area and thus forms an upper
[Figure 3.11 (bar chart): Y-axis: Area Utilization (LEs); X-axis: configurations of Number of Functions (N) over Counter Width (CW); series: LEAP (LEs), LEAP (LEs+Mem.), SnoopP (LEs).]
Figure 3.11: Area overhead comparison of LEAP and SnoopP, both configured for cycle profiling.
bound on the area ratio LEAP/SnoopP. The second comparison assumes that one memory
bit is equal to 1/16th of the area consumed by a logic element, which is greater than its true
area and thus forms a lower bound on the area ratio. Figure 3.11 graphically shows these area
comparisons. The Y-axis represents the total area utilization in equivalent LEs, and the X-axis
represents the configurations of CW and N. It is apparent that as either N or CW increases, the
area requirements of SnoopP grow much more than those of LEAP. This is because the comparators
and counters required by SnoopP for each function increase with N, and the size of each counter
increases with CW. In contrast, LEAP only increases in area to accommodate the transfer of
this data. As the counter width increases, only a small increase in the number of registers is
required to store this data. For increases in N, more bits are needed by the Address Hash and
Call Stack. The actual storage overhead, however, is compactly stored in memory bits.
The implementation of SnoopP used registers to store the count values. It is possible for
SnoopP to use memory bits for this storage, but two drawbacks limit the practicality of such an
approach: 1) instead of simply performing a comparison and incrementing a registered counter
each cycle, the profiler would need to determine which function is currently executing, read
from an AltSyncRam, increment the read value, and write the data back. This process would
require multiple cycles, as each memory read requires one cycle. 2) Table 3.11 shows that the
area requirement of the 2N 32-bit comparators alone dominates the area requirements of the
profiler, consuming over 1000 LEs in the smallest case (column SnoopP Comparators Only).
The area to be saved by converting the counter registers into memory bits is less than CW ×N ,
which when subtracted from the LE requirements of SnoopP is still larger than LEAP in all but
the smallest case (column SnoopP Mem. Bit Counters). This benefit is lessened if the extra
logic that would be required to handle the reads and writes to/from memory is considered.
For these reasons, it is apparent that not only would SnoopP still be larger than LEAP if
the counters were implemented with memory bits, but it is also not obvious how the profiler's
multi-cycle memory accesses could be accommodated.
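The CW × N saving bound used above can be checked against Table 3.11: the “Mem. Bit Counters” column is exactly the register-counter LE count minus CW × N.

```python
def mem_bit_counter_les(register_counter_les, cw, n):
    """Lower bound on SnoopP's LE count if its N CW-bit counters were
    moved into memory bits, saving at most CW*N LEs."""
    return register_counter_les - cw * n

# Rows from Table 3.11 (SnoopP with register counters).
assert mem_bit_counter_les(1572, 16, 16) == 1316    # smallest configuration
assert mem_bit_counter_les(7179, 32, 64) == 5131    # CW=32, N=64
assert mem_bit_counter_les(54790, 64, 256) == 38406 # largest configuration

# Even with this optimistic saving, SnoopP remains larger than LEAP
# (e.g. 1,660 LEs for CW=32, N=64) in all but the smallest configuration.
```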
In addition to comparing the area consumption of the two profilers, the maximum operating
frequencies of each are shown. The Fmax of LEAP ranges from 146.8 MHz to 169.2 MHz,
which is well above the maximum operating frequency of the Tiger processor. SnoopP, on the
other hand, ranges from 79.3 MHz to 138.3 MHz. This results in a 17-91% increase in Fmax
for LEAP over SnoopP, another compelling advantage for this framework.
The power overhead of LEAP and SnoopP are shown in Table 3.12, where each power value
is the average of the power consumed when running each of the 13 benchmarks. The individual
Table 3.11: Area overhead investigation for SnoopP, measured in LEs, if counters were implemented with memory bits.

                     SnoopP      SnoopP      SnoopP
                     Register    Mem. Bit    Comparators
CW   N     LEAP      Counters    Counters    Only
16   16    1,402      1,572       1,316       1,024
16   32    1,444      2,865       2,353       2,048
16   64    1,520      5,458       4,434       4,096
16   128   1,536     10,630       8,582       8,192
16   256   1,553     20,984      16,888      16,384
32   16    1,556      2,008       1,496       1,024
32   32    1,632      3,725       2,701       2,048
32   64    1,660      7,179       5,131       4,096
32   128   1,709     14,047       9,951       8,192
32   256   1,747     27,807      19,615      16,384
64   16    1,648      2,855       1,831       1,024
64   32    1,732      5,424       3,376       2,048
64   64    1,779     10,587       6,491       4,096
64   128   1,801     20,872      12,680       8,192
64   256   1,754     54,790      38,406      16,384
benchmark powers are shown in Appendix A.1.3.¹ Part (a) of Table 3.12 shows the results for
LEAP in both the cycle profiling and stall-cycle profiling schemes. It can be seen that there
is very little variation between configurations; different values of N result in almost no change
in the power overhead, and variations in CW result in a difference of only a few milliwatts. The
range of dynamic power overhead for LEAP is 4.55-6.94 mW and 4.78-7.05 mW for cycle and
stall-cycle profiling, respectively. Part (b) shows the results for SnoopP for the same schemes.
Unlike LEAP, SnoopP’s power overhead drastically increases with both N and CW, resulting
in an overhead range of 3.00-52.32 mW and 2.77-40.69 mW for cycle profiling and stall-cycle
profiling, respectively.
¹For the N=128 and N=256 cases, only three benchmarks were profiled for power overhead due to the large increase in simulation time with increasing system size. Due to the highly consistent results measured among benchmarks in the same configuration for other values of N, the average power overhead from these three benchmarks is sufficient.

The results of parts (a) and (b) are graphically shown in Figure 3.12 for cycle-accurate profil-
Table 3.12: Power overhead comparison of LEAP and SnoopP, configured for both cycle and stall-cycle profiling. Note: “Cycles” implies cycle profiling, “Stalls” implies stall-cycle profiling, all values are measured in milliwatts, “N” is the maximum number of functions profiled, and “CW” is the counter width.

           -- (a) LEAP ---    -- (b) SnoopP --    (c) LEAP/SnoopP
CW   N     Cycles   Stalls    Cycles   Stalls     Cycles   Stalls
16   16     4.55     4.84       3.00     2.77      1.52     1.75
16   32     4.97     4.78       6.33     6.06      0.79     0.79
16   64     5.28     5.10       9.78     8.68      0.54     0.59
16   128    4.99     4.82      16.77    14.91      0.30     0.32
16   256    4.99     5.16      28.55    25.19      0.18     0.21
32   16     6.94     5.63       4.03     4.41      1.72     1.28
32   32     5.85     6.25       7.17     6.10      0.82     1.03
32   64     6.13     6.12      11.94     9.76      0.51     0.63
32   128    6.67     6.53      20.54    16.99      0.33     0.38
32   256    5.87     5.59      35.39    28.88      0.17     0.19
64   16     6.52     6.84       6.07     4.15      1.07     1.65
64   32     6.20     6.35       9.40     6.02      0.66     1.06
64   64     6.23     6.70      16.54    11.22      0.38     0.60
64   128    6.24     6.11      28.01    21.07      0.22     0.29
64   256    6.10     7.05      52.32    40.69      0.12     0.17
ing, in which the Y-axis represents the power overhead measured in milliwatts, and the X-axis
represents the configurations of CW and N. When only 16 functions are profiled, SnoopP
actually consumes slightly less power than LEAP; however, the rapid increase in comparators
required by SnoopP as N increases, and the extra power consumed by larger counters as CW
increases, causes the power consumption of SnoopP to increase much faster and more significantly
than that of LEAP. In fact, LEAP's power shows no correlation with increasing N.
This is due to the efficient use of AltSyncRams to store data at function context switches;
they use dedicated multiplexing logic to read data from these RAMs, so increasing the required
RAM size costs very little in terms of overhead. Part (c) of Table 3.12 shows the ratio
LEAP/SnoopP. As previously mentioned, the smallest configurations (N=16) result in a lower
power overhead for SnoopP; LEAP consumes 51.7-72.2% (1.55-2.91 mW) and 27.6-74.5%
(1.22-2.07 mW) more power for cycle profiling and stall-cycle profiling, respectively. Stall-cycle
profiling also consumes more power in LEAP for N=32 (CW=32, 64), by 2.5-5.6% (0.15-0.33 mW).
However, LEAP is significantly more power-efficient in all other cases, achieving reductions of
up to 88.3% (46.22 mW) and 82.7% (33.64 mW) for cycle profiling and stall-cycle profiling,
respectively.

[Figure 3.12 (bar chart): Y-axis: Power Consumption (mW); X-axis: configurations of Number of Functions (N) over Counter Width (CW); series: LEAP, SnoopP.]
Figure 3.12: Power overhead comparison of LEAP and SnoopP, both configured for cycle profiling.
3.5 Summary
LEAP is a non-intrusive profiler that is both area- and power-efficient. The LLVM compiler
framework is used to create a MIPS assembly file from C source code, after which the GNU
Binutils assemble and link the MIPS assembly to create an ELF file capable of running on the
Tiger processor. Initialization files specific to the profiler are also created, and are programmed
along with the ELF to the system running on a target FPGA through a command-line inter-
face; the profiling results are also retrieved through this interface. Three profiling schemes are
investigated: instruction profiling records the function-level instruction counts, cycle profiling
attributes clock cycles to individual functions during execution, and stall cycle profiling mea-
sures the number of cycles the processor spends stalled per function while waiting on data from
either the instruction or data cache. In addition, each of these configurations can be run with
either a flat or hierarchical profile. LEAP is compared to two profilers, gprof and SnoopP,
to verify the accuracy of the results presented as well as to assess the area, speed and power
characteristics of the architecture.
Chapter 4
Energy Consumption Profiling
As mobile systems increase in complexity, the need to maximize their energy efficiency is be-
coming a priority in order to extend battery life, and more generally to conserve electricity.
Methods to determine the energy consumed in individual regions of an application are important
to aid in this effort. Although two common methods exist to gather such data, direct
measurement and simulation, neither is ideal for rapid development. Direct measurement
requires a specialized experimental setup to monitor the current drawn by the system, whereas
simulation is very slow, taking hours or even days to gather data for only seconds' worth of
execution time. This chapter discusses a profiling tool built within LEAP that aims to quickly
estimate the energy consumption of each function within an application. First, an instruction-
level power database is created offline to characterize the power consumed by instructions in the
MIPS1 instruction set. This database is then combined with run-time cycle counts of individual
instructions to produce a function-level energy profile, shown to be accurate within 5.95%, on
average.
4.1 Instruction Power Database
An instruction-level power database is created to store the average dynamic power each in-
struction consumes while executing on the processor. Using this database, it is then possible
to calculate the total energy consumed by each instruction if the profiler measures the amount
of time that instruction executes. Simulation is used to measure the average dynamic power
but, as previously mentioned, is very slow which prohibits its use in rapid development cycles.
Therefore, the database is created offline once (for the specific processor and device in use) and
is used as a lookup table for profiling.
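Once built, the database serves as a simple lookup table at profile time. The sketch below illustrates the E = P × t calculation; the per-instruction power values and the clock rate are invented placeholders, since the real numbers come from the simulations summarized in Tables 4.1 and 4.2.

```python
# Hypothetical instruction-level power database (average dynamic power, mW).
POWER_DB_MW = {"addu": 12.5, "lw": 18.0, "sw": 17.2, "mult": 21.4}

CLOCK_HZ = 75_000_000  # illustrative soft-processor clock rate

def instruction_energy_nj(opcode, cycles):
    """Energy (nJ) attributed to one instruction type: E = P * t, where
    t is the measured cycle count divided by the clock rate."""
    power_w = POWER_DB_MW[opcode] * 1e-3  # mW -> W
    seconds = cycles / CLOCK_HZ
    return power_w * seconds * 1e9        # J -> nJ

# One full second (75,000,000 cycles) of "addu" at 12.5 mW is ~12.5 mJ.
energy = instruction_energy_nj("addu", CLOCK_HZ)
```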
ModelSim and Quartus are used to determine the average dynamic power consumption of each
instruction in the MIPS1 instruction set. First, Quartus is used to produce the files required
by ModelSim for performing timing simulations of the Tiger system (not including LEAP) for
each of the benchmarks previously described. Individual VCD files for each instruction type
are created to record the switching activity of the system while executing that instruction.
VCD recording is enabled when a given instruction enters the decode stage of the Tiger MIPS
pipeline and is disabled when the instruction exits the pipeline. In addition, VCD recording is
paused during stalls; the dynamic power consumed while stalling is recorded in a separate VCD
file. Two approaches were investigated when considering stalls: 1) cache stalls from either the
instruction or data cache, and 2) pipeline stalls, which can occur due to data hazards, memory
latency, and multi-cycle instructions.
Upon completing the simulation of a benchmark, each VCD file is run through the Quartus
PowerPlay Analyzer to determine the average dynamic power for the corresponding instruction.
The database required many days of computation to simulate the 13 benchmarks, with each
benchmark requiring multiple VCD files totaling over 20 GB. As a result, the four longest-executing
benchmarks, Blowfish, DFSIN, JPEG, and SHA, could not be run to completion
due to both storage and time limitations, as weeks of computation and hundreds of gigabytes
would be required. These partial runs are acceptable, however, because the benchmarks are
used to measure the power consumption of individual instructions, and measurements near the
beginning of execution are no less valuable than those at the end.
The average dynamic power of each instruction, as measured in each benchmark, can be seen
in Tables 4.1 and 4.2, for cache-related stalls and pipeline stalls, respectively. It is important to
4 Energy Consumption Profiling
note that the power attributed to each instruction varies from benchmark to benchmark. These
deviations are due to inter-instruction effects: the power measured for a given instruction also
depends on the other instructions currently in the pipeline. By averaging these power values
across many benchmarks, the average inter-instruction effect is accounted for. To increase the
quality of results, any instruction that is run less than 100 times in a benchmark is “pruned”
from the data set as the dynamic power for such instructions may be inaccurate due to the
limited data that was collected. The power of any instruction that is not actually exercised
by the benchmark suite is estimated, based on the dynamic power of similar instructions, for
use with the power database; these instructions have an average dynamic power of 0 mW in
Tables 4.1 and 4.2.
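The averaging and pruning steps above can be sketched as follows. The sample tuples, benchmark names, and power values here are hypothetical illustrations, not the thesis's measured data:

```python
from collections import defaultdict
from statistics import mean

MIN_EXECUTIONS = 100  # instructions executed fewer times than this are pruned

def build_power_database(samples):
    """Average each instruction's dynamic power across benchmarks,
    discarding measurements backed by too few executions.

    samples: iterable of (benchmark, instruction, avg_power_mw, exec_count).
    """
    per_instr = defaultdict(list)
    for _benchmark, instr, power_mw, count in samples:
        if count >= MIN_EXECUTIONS:  # prune unreliable measurements
            per_instr[instr].append(power_mw)
    # averaging across benchmarks absorbs the average inter-instruction effect
    return {instr: mean(powers) for instr, powers in per_instr.items()}

# hypothetical measurements from two benchmark runs
samples = [
    ("adpcm", "addiu", 61.39, 5000),
    ("aes",   "addiu", 61.88, 3200),
    ("adpcm", "mfhi",  40.00, 150),
    ("aes",   "mfhi",  41.00, 12),   # fewer than 100 executions: pruned
]
db = build_power_database(samples)
# an instruction never exercised by the suite borrows a similar one's power
db.setdefault("add", db["addiu"])
```

The `setdefault` line mirrors the estimation step for unused instructions: the power of a similar instruction stands in for the missing measurement.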
4.2 Implementation in Existing Framework
In order to estimate the function-level energy consumption of an application, run-time data must
be collected. Energy can be calculated by taking the product of power and time, E = P × T .
Since the dynamic power of each instruction is determined offline and stored in the instruction
power database, only the execution time of each instruction per function is required from the
profiler. This measurement is performed by modifying the Data Counter module, as seen
in Figure 4.2. As each instruction enters the execute stage, it must be decoded so that the
correct counters are incremented. In order to reduce the area of the profiler, instructions are
combined into six groups; a single 22-bit counter is used for each group to record the cycle
counts of all instructions in that group. The instruction power database uses these same groups
to provide dynamic power values for the energy calculation; the power for a group is simply
the average dynamic power of all instructions in the group. The groupings were chosen to
minimize (by inspection) the standard deviation within each group, which in turn reduces
the error introduced by averaging. The database, including the instruction groupings, their
average power and standard deviations, is shown in Table 4.3; part (a) shows the database
when considering only cache stalls, while part (b) considers all pipeline stalls. Figure 4.1 shows
Table 4.1: Pruned instruction-level power database for cache stalls, measured in mW. (One
row per MIPS1 instruction, plus a stall row, with one column per benchmark (ADPCM, AES,
Blowfish, DFADD, DFDIV, DFMUL, DFSIN, GSM, JPEG, MIPS, Motion, SHA, and
Dhrystone) and a final across-benchmark Average column. The stall row averages 31.59 mW;
pruned or unused instructions such as add and addi are listed at 0.00 mW.)
Table 4.2: Pruned instruction-level power database for pipeline stalls, measured in mW. (Same
layout as Table 4.1: one row per MIPS1 instruction, plus a stall row, with one column per
benchmark and a final across-benchmark Average column. The stall row averages 51.09 mW;
pruned or unused instructions such as add and sub are listed at 0.00 mW.)
these two database configurations graphically; the vertical axis represents the average dynamic
power consumption of each instruction, the horizontal axis shows each instruction (plus stalls),
and the horizontal lines represent the instruction groupings.
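The quality of a candidate grouping can be checked numerically, even though the thesis's groupings were chosen by inspection. A sketch, using made-up power values rather than the database's:

```python
from statistics import mean, pstdev

def grouping_stats(groups):
    """For each group of per-instruction average powers (mW), return the
    group average and the standard deviation within the group; a good
    grouping keeps every standard deviation small, so that substituting
    the group average for an instruction's power introduces little error."""
    return {label: (mean(p), pstdev(p)) for label, p in groups.items()}

# hypothetical grouping of per-instruction dynamic powers
candidate = {
    "A": [40.0, 46.0, 48.0, 50.0],
    "F": [79.0, 83.0, 84.0],
}
stats = grouping_stats(candidate)
avg_a, dev_a = stats["A"]  # group A average and spread
```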
In addition to the six 22-bit counters used for instructions in the MIPS1 instruction set,
a seventh 28-bit counter is kept to count the number of cycles spent stalling. As will be
further discussed in the following section, both cache stalls and pipeline stalls were investigated;
Figure 4.2 shows the Data Counter module configured for power profiling when cache stalls are
considered, but the input to the module can also be the logical OR of all pipeline stall signals.
These seven counters are concatenated into one 160-bit output from the module. In this way,
the rest of the profiler can treat the output as a single 160-bit counter when storing to RAM
blocks, thus requiring no changes to the storage mechanisms. However, the delay of an adder
grows proportionally with the size of its inputs, limiting a 160-bit adder (and therefore the
profiler) to a maximum frequency of 70 MHz. To remove these adders from the critical path of
the profiler, 160-bit adders throughout the system were replaced with seven smaller additions.
This modification is not required for functionality; it is implemented purely as an optimization.
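The counter concatenation can be mimicked in software. The widths follow the text (six 22-bit group counters plus one 28-bit stall counter, 6 × 22 + 28 = 160 bits); the function names and bit ordering are illustrative assumptions:

```python
GROUP_WIDTH = 22   # per-group cycle counters
STALL_WIDTH = 28   # stall-cycle counter
GROUP_MASK = (1 << GROUP_WIDTH) - 1
STALL_MASK = (1 << STALL_WIDTH) - 1

def pack_counters(groups, stall):
    """Concatenate six 22-bit group counts and a 28-bit stall count into
    one 160-bit word, as the Data Counter module presents its output."""
    assert len(groups) == 6
    word = stall & STALL_MASK
    for count in reversed(groups):          # group 0 lands in the low bits
        word = (word << GROUP_WIDTH) | (count & GROUP_MASK)
    return word

def unpack_counters(word):
    """Split the 160-bit word back into the seven independent counters,
    the same split that lets the optimized profiler perform seven narrow
    additions instead of one slow 160-bit addition."""
    groups = []
    for _ in range(6):
        groups.append(word & GROUP_MASK)
        word >>= GROUP_WIDTH
    return groups, word & STALL_MASK

packed = pack_counters([1, 2, 3, 4, 5, 6], 7)
```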
4.2.1 Overhead
Table 4.4 shows the area overhead of LEAP when configured for power profiling. This configu-
ration is much larger than the cycle-accurate profiling schemes described in Chapter 3, requiring
nearly double the LEs and registers, and more than double the memory bits as the number of
functions increase. This increase in area can be attributed to two aspects of the profiler: 1) To
store the seven counters, the effective counter width as seen by the remainder of the profiler is
160 bits. This is over double the maximum CW used in cycle-accurate profiling, which was 64
bits. This especially increases the number of memory bits required to store the function-level
data. 2) Instruction decoding logic is needed to determine which group the currently-executing
instruction belongs to. This logic requires many comparisons, consuming over 350 LEs and 150
registers.
Table 4.3: Instruction power database for the MIPS1 instruction set, measured in mW. Part
(a) shows the database configured for measuring cache stalls only; part (b) shows the database
configured for measuring all pipeline stalls.

(a) Cache stalls only:
stall 31.59
Group A (average 48.08, std. dev. 3.31): mfhi 40.00, mflo 45.96, slti 48.30, div 49.07, divu 49.07, bltz 50.06, bltzal 50.06, sltiu 50.10, mult 50.13
Group B (average 57.65, std. dev. 2.83): multu 53.10, beq 54.23, slt 54.44, sltu 59.08, xor 59.23, xori 59.30, lb 59.50, bgtz 59.97, blez 59.97
Group C (average 62.57, std. dev. 1.11): bne 60.15, lw 61.26, add 61.68, addu 61.68, or 62.35, jal 62.43, andi 63.07, sra 63.18, bgez 63.21, bgezal 63.21, lh 63.61, sw 63.69, srl 63.93
Group D (average 65.45, std. dev. 1.58): lui 64.19, j 64.44, addi 64.54, addiu 64.54, sll 67.34, srav 67.62
Group E (average 70.70, std. dev. 2.33): and 68.05, ori 68.79, nop 69.05, sub 70.02, subu 70.02, jr 71.78, lbu 73.01, sllv 74.84
Group F (average 82.82, std. dev. 2.83): lhu 78.67, srlv 83.31, sb 84.64, sh 84.64

(b) All pipeline stalls:
stalls 44.34
Group A (average 38.45, std. dev. 1.20): bltz 37.76, bltzal 37.76, mfhi 39.83
Group B (average 57.53, std. dev. 2.62): sltiu 51.87, beq 55.60, slti 57.93, bne 58.24, bgez 58.67, bgezal 58.67, srlv 59.59, lw 59.70
Group C (average 62.82, std. dev. 1.43): xor 60.23, jal 60.99, xori 61.27, lb 61.32, add 61.82, addu 61.82, sra 62.66, lui 62.93, mflo 63.35, srav 63.43, andi 63.71, j 63.75, blez 63.81, srl 64.24, addi 64.88, addiu 64.88
Group D (average 68.03, std. dev. 2.47): sll 65.00, bgtz 65.01, ori 65.91, or 65.98, nop 66.33, slt 66.65, sw 66.96, sb 67.85, sh 67.85, and 68.15, sltu 68.99, jr 69.78, lh 71.31, sub 72.35, subu 72.35
Group E (average 75.88, std. dev. 1.08): lhu 75.03, lbu 75.16, sllv 75.96, mult 77.37
Group F (average 101.60, std. dev. 3.61): div 99.52, divu 99.52, multu 105.77
Figure 4.1: Instruction-level power database groupings. (Each plot shows the average dynamic
power consumption, in mW, of each MIPS instruction plus stalls, with horizontal lines marking
the group power levels; part (a) considers cache stalls only, part (b) all pipeline stalls.)

Figure 4.2: Data Counter module configured for power profiling. (Block diagram: an
Instruction Decoding block drives the enables of six instruction-group counters, countA
through countF, plus a stall counter, countSTALL, fed by the icache_stall and dcache_stall
signals; the counter outputs are concatenated into count_out.)

Despite the large increase in area, the Fmax of the power profiler is relatively high, with
values comparable to those found in many of the cycle-accurate configurations. This is due
to the optimization mentioned in the previous section, which replaces the 160-bit additions
with seven smaller additions. This removes the power-specific modifications from the critical
path, keeping the Address Hash's slowest path as the critical path of the profiler.
Table 4.4: Area overhead of LEAP when configured for energy profiling.
N | Total LEs | Total Registers | Total Mem. Bits | Fmax (MHz)
16 | 2,559 | 1,427 | 6,656 | 161.1
32 | 2,620 | 1,430 | 9,216 | 158.0
64 | 2,703 | 1,478 | 14,336 | 150.6
128 | 2,667 | 1,397 | 24,576 | 157.0
256 | 2,758 | 1,483 | 45,056 | 163.2
4.3 Experimental Results
Through the combined use of the instruction power database and the on-chip profiler, energy
estimation was performed for the set of 13 benchmarks described in Section 3.4.2. These
experiments demonstrate the utility of this approach, as well as highlight the difficulties inherent
in energy estimation.
4.3.1 Methodology
To perform function-level energy estimation, the profiler and Tiger system are synthesized
and programmed to the target FPGA. Each benchmark is run with the corresponding hash
initialization parameters to produce a 160-bit result for each function. This value is the result of
concatenating the seven counters used within the profiler to store the counts of each instruction
grouping and of stalls. All benchmarks were run in two configurations: in the first, a stall
was said to have occurred whenever either the instruction or data cache stalled. In the second
configuration, a stall was said to occur at any time that a pipeline stage was stalled.
Using these results, the energy estimation of each function within a benchmark is calculated
with the formula E = Σ Pi × Ti (summing over the seven groups i = 1 to 7), where E is the
energy estimate, Pi is the average dynamic power for group i in the instruction power database,
and Ti is the time spent executing instructions of group i. Time can be further expressed as
Ti = Ci/f, where Ci is the cycle count reported for group i, and f is the frequency at which
the system was run. Since the instruction database was created based on a system running
at 25 MHz, f is set to this value. This may appear to be a limitation on the frequency the
system can run at, but energy estimation is independent of frequency. In the equation
E = P × T, power is proportional to frequency, so as frequency changes by a factor df, so
does the power. However, the time required for completion will change by 1/df, since the
system is running faster; these two factors of df cancel each other out, so the frequency of
the system does not affect its energy consumption.
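The calculation follows directly from the two formulas. A sketch, with hypothetical group powers and cycle counts (the function name and unit handling are illustrative assumptions):

```python
F_HZ = 25e6  # the power database was characterized at 25 MHz

def estimate_energy_nj(group_cycles, group_powers_mw, stall_cycles, stall_power_mw):
    """E = sum of Pi * Ti over the seven counters, with Ti = Ci / f.

    Powers are in mW, and mW * s = mJ, so the sum is scaled by 1e6
    to report nanojoules, matching the tables in this chapter.
    """
    energy_mj = stall_power_mw * (stall_cycles / F_HZ)
    for cycles, power_mw in zip(group_cycles, group_powers_mw):
        energy_mj += power_mw * (cycles / F_HZ)
    return energy_mj * 1e6  # mJ -> nJ

# hypothetical per-group cycle counts (groups A-F) and group powers
cycles = [1000, 1000, 1000, 1000, 1000, 1000]
powers = [60.0, 60.0, 60.0, 60.0, 60.0, 60.0]
energy = estimate_energy_nj(cycles, powers, stall_cycles=500, stall_power_mw=50.0)
```

Note that scaling `F_HZ` alone would change the result; the frequency independence argued above holds only because the power values would scale by the same factor.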
To test the validity of these energy estimations, timing simulations in ModelSim are performed
on the system (not including the profiler) to produce a VCD for each benchmark. The average
dynamic power, calculated by the Quartus PowerPlay Analyzer, and the total execution time
are recorded and used to calculate the overall energy consumption. This value was taken to be
the golden result, as timing simulation coupled with Quartus PowerPlay is the recommended
flow for power estimation with Altera devices. The function-level energy estimates were summed
to produce the overall energy estimate for the benchmark, which was compared against these
golden results to gauge the accuracy of the profiling method.
To confirm the accuracy of the estimates at a function level, timing simulations in ModelSim
were performed to create a VCD for each function in the benchmark. The PC of the processor
was monitored throughout execution to indicate which function was currently executing; VCD
recording is enabled for the currently-executing function and disabled for all others. These
results were then compared to the function-level energy estimates produced through profiling
to further verify the estimates.
A final experiment was performed to compare energy consumption profiles with cycle count
profiles to determine the degree of correlation between the two. The percentage energy con-
sumption of each function, as compared to the overall energy consumed throughout the entire
application, and the percentage cycle count are calculated; the values are then compared to see
whether a) the numbers correlate, and b) the ordering of functions based on these percentages
is the same for energy profiling and cycle profiling.
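This comparison reduces to normalizing each profile to percentages and checking the sorted orders. A sketch, using invented per-function counts rather than measured data:

```python
def percent_profile(per_function):
    """Convert absolute per-function values to percentages of the total."""
    total = sum(per_function.values())
    return {name: 100.0 * value / total for name, value in per_function.items()}

def same_hot_spot_ranking(profile_a, profile_b):
    """True when sorting functions from largest to smallest share gives
    the same order under both profiles."""
    rank = lambda p: sorted(p, key=p.get, reverse=True)
    return rank(profile_a) == rank(profile_b)

# invented cycle counts and energy estimates for three functions
cycles = {"encode": 100_000, "decode": 60_000, "abs": 5_000}
energy = {"encode": 580_000.0, "decode": 340_000.0, "abs": 31_000.0}

cycle_pct = percent_profile(cycles)
energy_pct = percent_profile(energy)
agree = same_hot_spot_ranking(cycle_pct, energy_pct)
```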
4.3.2 Results
4.3.2.1 Overall Energy Estimation
Table 4.5 shows the overall energy estimates, each of which is the sum of all function-level energy
estimates within a benchmark, of each of the 13 benchmarks in the test suite when considering
only cache stalls. Detailed results for all 13 benchmarks can be found in Appendix A.2.1. Part
(a) shows the results of energy profiling when run on the FPGA, including the percentage of
time the benchmark spent in cache stalls, the total number of cycles required to execute the
benchmark, the average dynamic power estimate (calculated as P = E/T ), and the energy esti-
mate calculated with the aid of the instruction-level power database. Part (b) shows the results
of timing simulation with ModelSim followed by analysis with Quartus PowerPlay, including
total execution time in nanoseconds, average dynamic power, and total energy consumption
(calculated as E = P × T ). Part (c) shows the energy estimation error for each benchmark,
calculated using the standard estimation error formula: error = (estimate − actual) / actual.
The absolute value of the estimation error ranges from 1.8% for the benchmark “DFADD”
to 22.0% for “SHA.” The average estimation error is -0.10%; however, the average of the
absolute values is more meaningful, as under-estimates are just as wrong as over-estimates,
and this absolute average is 8.27%. Two benchmarks, “Blowfish” and “SHA,” have much
higher errors than any others, the next-largest error being only 12.6%. If these two
problematic benchmarks are ignored, the average error drops to only 6.49%. In addition
to having the largest errors, they are also among the least-stalling benchmarks. These results
prompted further investigation, leading to the experiments being repeated with all pipeline
stalls included in the energy profiling.
Table 4.5: Energy profiling results when only cache stalls are considered.
Benchmark | Stalls | Cycles | Power (mW) | Energy (nJ) | Total Time (ns) | Power (mW) | Energy (nJ) | Energy Error
adpcm 7.4% 216,242 60.59 565,890.55 9,312,640 53.95 502,416.93 12.6%
aes 35.8% 47,427 52.36 154,817.45 2,881,320 49.03 141,271.12 9.6%
blowfish 9.5% 916,497 61.76 2,501,841.74 40,507,640 71.91 2,912,904.39 -14.1%
dfadd 29.8% 18,233 54.18 56,266.19 1,011,720 54.61 55,250.03 1.8%
dfdiv 9.9% 72,057 59.88 191,590.02 3,226,360 55.51 179,095.24 7.0%
dfmul 41.8% 6,426 50.25 22,199.31 431,200 49.04 21,146.05 5.0%
dfsin 21.9% 3,003,897 56.22 8,652,870.50 153,907,120 54.80 8,434,110.18 2.6%
gsm 17.9% 41,662 60.42 122,567.17 2,018,160 64.05 129,263.15 -5.2%
jpeg 14.2% 5,277,413 59.26 14,588,694.58 246,168,000 63.06 15,523,354.08 -6.0%
mips 7.1% 40,780 62.42 109,635.61 1,751,480 56.23 98,485.72 11.3%
motion 30.7% 18,251 56.46 59,462.06 1,057,160 60.52 63,979.32 -7.1%
sha 2.5% 1,113,309 63.98 2,923,453.63 45,690,040 82.05 3,748,867.78 -22.0%
dhrystone 16.1% 27,213 59.76 77,537.53 1,280,440 58.68 75,136.22 3.2%
Average -0.10%
Abs. Average 8.27%
(a) Energy Profiling Results (b) Simulation Results (c) Error
Pipeline stalls can consume a considerable chunk of an application’s execution time. Table 4.6
shows the overall energy estimates when all pipeline stalls are considered for both the instruction
database and the profiler counters. This change in stall measurement is apparent from the
increase in stall percentages versus Table 4.5, as all cache stalls cause pipeline stalls. For these
experiments, the absolute estimation error ranges from 1.1% for “DFSIN” to 23.2% for “SHA.”
The absolute average error across all benchmarks is 8.38%, which is very similar to
the result of the cache-only experiments. Detailed results for all 13 benchmarks can be found
in Appendix A.2.2.
Table 4.6: Energy profiling results when all pipeline stalls are considered.
Benchmark | Stalls | Cycles | Power (mW) | Energy (nJ) | Total Time (ns) | Power (mW) | Energy (nJ) | Energy Error
adpcm 37.3% 146,468 57.02 532,740.94 9,312,640 53.95 502,416.93 6.0%
aes 46.0% 39,924 55.28 163,500.43 2,881,320 49.03 141,271.12 15.7%
blowfish 12.3% 888,221 62.48 2,530,356.30 40,497,080 71.91 2,912,145.02 -13.1%
dfadd 44.7% 14,318 55.65 57,556.45 1,011,720 54.61 55,250.03 4.2%
dfdiv 33.9% 52,898 57.26 183,025.38 3,226,360 55.51 179,095.24 2.2%
dfmul 55.0% 4,969 53.57 23,716.31 431,200 49.04 21,146.05 12.2%
dfsin 43.3% 2,180,140 55.40 8,526,386.48 153,920,040 54.80 8,434,818.19 1.1%
gsm 42.2% 29,559 57.17 116,914.53 2,018,160 64.05 129,263.15 -9.6%
jpeg 34.6% 4,034,329 57.57 14,195,382.50 246,571,880 63.06 15,548,822.75 -8.7%
mips 21.1% 34,660 60.04 105,358.20 1,751,480 56.23 98,485.72 7.0%
motion 33.0% 17,647 58.22 61,332.44 1,057,160 60.52 63,979.32 -4.1%
sha 9.5% 1,033,449 63.03 2,880,061.19 45,688,840 82.05 3,748,769.32 -23.2%
dhrystone 27.6% 23,394 59.20 76,520.37 1,280,440 58.68 75,136.22 1.8%
Average -0.65%
Abs. Average 8.38%
(a) Energy Profiling Results (b) Simulation Results (c) Error
The benchmarks “Blowfish” and “SHA” still result in relatively high error in this stall con-
figuration, both under-estimates, but it is important to note that the stall percentages of these
two are now significantly lower than those of any other benchmark. This observation led to an
important realization about the dynamic power of the processor: in an application with few
pipeline stalls the processor is, on average, performing work more often than in one with a
higher stall rate, so it consumes more power. During a pipeline stall caused by pipeline
Table 4.7: Energy profiling results when all pipeline stalls are considered and the correction
factor is applied.

Benchmark | Stalls | Cycles | Power (mW) | Stall Factor | Energy (nJ) | Total Time (ns) | Power (mW) | Energy (nJ) | Energy Error
adpcm 37.3% 146,468 56.88 0.99 531,430.45 9,312,640 53.95 502,416.93 5.8%
aes 46.0% 39,924 54.96 0.97 162,562.87 2,881,320 49.03 141,271.12 15.1%
blowfish 12.3% 888,221 71.45 1.18 2,893,418.97 40,507,640 71.91 2,912,145.02 -0.6%
dfadd 44.7% 14,318 55.35 0.98 57,245.57 1,011,720 54.61 55,250.03 3.6%
dfdiv 33.9% 52,898 57.26 1.00 183,036.18 3,226,360 55.51 179,095.24 2.2%
dfmul 55.0% 4,969 53.20 0.96 23,550.46 431,200 49.04 21,146.05 11.4%
dfsin 43.3% 2,180,140 55.12 0.98 8,483,392.37 153,907,120 54.80 8,434,818.19 0.6%
gsm 42.2% 29,559 56.90 0.98 116,368.08 2,018,160 64.05 129,263.15 -10.0%
jpeg 34.6% 4,034,329 57.53 1.00 14,187,192.49 246,168,000 63.06 15,548,822.75 -8.8%
mips 21.1% 34,660 61.79 1.06 108,434.21 1,751,480 56.23 98,485.72 10.1%
motion 33.0% 17,647 58.27 1.00 61,382.00 1,057,160 60.52 63,979.32 -4.1%
sha 9.5% 1,033,449 79.95 1.26 3,652,738.91 45,690,040 82.05 3,748,769.32 -2.6%
dhrystone 27.9% 23,394 59.69 1.02 77,149.23 1,280,440 58.68 75,136.22 2.7%
Average 1.95%
Abs. Average 5.95%
(a) Energy Profiling Results (b) Simulation Results (c) Error
stage X, all stages before X must also stall, but all stages after may finish execution; this
results in a partial pipeline stall. Therefore, higher stall rates also cause emptier pipelines and
take longer to complete the computation; less work is performed per unit time, resulting in a
lower overall dynamic power consumption. The instruction-level power database, which was
created by averaging the dynamic powers for each instruction across all benchmarks, inherently
assumes a pipeline stall rate equal to the average of the stall rates of Table 4.6,
which is 33.9%, and assumes all pipeline stalls are full stalls. This approach will naturally
incur error because benchmarks such as “Blowfish” and “SHA” only stall for 12.3% and 9.5%
of the time, respectively. This means they have higher dynamic power consumptions than
indicated by the power database; the rest of the benchmarks range between 21.1-55.0% stall
rate, much closer to the database’s average. As a first attempt to compensate for the
assumed stall rate, we scale the estimated energy values by an empirically derived correction
factor that is based on a program’s stall rate; future investigation may lead to a modified
methodology for creating the instruction-level power database. The correction factor is defined
to be: factor = 1 + ((Avg.Stall% − Stall%) / Stall%) × (1 / Stall%), where Avg.Stall% is the
average stall rate across all programs (33.9%) and Stall% is the fraction of time spent stalling
in the program being profiled. Intuitively, this factor increases the power estimate when the
stall rate of the program being profiled is below the average stall rate; likewise, the power
estimate is reduced when the stall rate of the program is above the average stall rate. This
factor is multiplied with
the estimated energy consumptions of Table 4.6 to achieve lower estimation errors, as seen in
Table 4.7. These factors range from 0.96, which corresponds to the highest-stalling benchmark
“DFMUL,” to 1.26, which corresponds to the lowest-stalling benchmark “SHA.” It can be seen
in this table that the absolute average estimation error is now only 5.95%, and is at worst
15.1%, confirming that real-time energy profiling is accurate.
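A literal reading of that definition, with stall rates expressed as percentages, can be sketched as follows. The exact arithmetic behind the published factors is not fully recoverable from the text, so this sketch reproduces only the qualitative behavior, not Table 4.7's values:

```python
AVG_STALL_PCT = 33.9  # average stall rate across the 13 benchmarks

def correction_factor(stall_pct):
    """factor = 1 + ((Avg.Stall% - Stall%) / Stall%) * (1 / Stall%).

    Programs that stall less than average get a factor above 1 (their
    dynamic power is higher than the database assumes); programs that
    stall more than average get a factor below 1."""
    return 1.0 + ((AVG_STALL_PCT - stall_pct) / stall_pct) * (1.0 / stall_pct)
```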
4.3.2.2 Function-level Energy Estimation
The previous section demonstrated the overall accuracy of energy consumption profiling by
summing the energy estimates of each function and comparing this total for each benchmark
against the “golden” result from simulation. Tables 4.8 and 4.9 show the function-level results
of energy profiling for the benchmarks “ADPCM” and “DFMUL”, respectively, versus function-
level “golden” results. Part (a) shows the cycle counts of each instruction group and of pipeline
stalls (discussed in Section 4.2). Due to the composition of the instruction-level power database
in Table 4.3(b), groups A and F tend to have low cycle counts; “ADPCM” executes none of the
instructions in either of these groups. Part (b) shows the estimated energy consumption using
the instruction-level database shown in Table 4.3(b), and the percent energy consumption of
each function as compared to the total estimated energy consumption of the application. Part
(c) shows the average dynamic power as measured from timing simulation, the calculated energy
consumption based on this power, and the percentage energy consumption. Finally, part (d)
shows the estimation error at a function level, as well as the difference between the estimated
Table 4.8: Function-level energy estimation results for benchmark “ADPCM”. Note: “Stalls”
indicates pipeline stalls, and “Energy %” indicates percentage energy compared to the total
energy.

Function | A | B | C | D | E | F | Stalls | Est. Energy (nJ) | Energy % | VCD Power (mW) | VCD Energy (nJ) | Energy % | Error | Energy % Diff.
encode | 0 | 7782 | 20665 | 12490 | 2642 | 0 | 18669 | 144,953.36 | 27.2% | 58.10 | 144,664.35 | 28.2% | 0.2% | -1.0%
decode | 0 | 6872 | 14744 | 13062 | 2547 | 0 | 26632 | 143,372.68 | 26.9% | 57.80 | 147,637.38 | 28.8% | -2.9% | -1.9%
upzero | 0 | 4137 | 5376 | 7153 | 1848 | 0 | 14038 | 73,001.07 | 13.7% | 51.63 | 67,226.39 | 13.1% | 8.6% | 0.6%
filtez | 0 | 2395 | 2195 | 3384 | 1196 | 0 | 5513 | 33,643.84 | 6.3% | 52.99 | 31,122.09 | 6.1% | 8.1% | 0.3%
quantl | 0 | 2183 | 3364 | 2471 | 991 | 0 | 5120 | 32,289.59 | 6.1% | 47.22 | 26,686.86 | 5.2% | 21.0% | 0.9%
uppol2 | 0 | 1597 | 3392 | 1783 | 799 | 0 | 3652 | 25,952.73 | 4.9% | 52.31 | 23,483.01 | 4.6% | 10.5% | 0.3%
uppol1 | 0 | 798 | 1984 | 2184 | 398 | 0 | 2733 | 18,820.20 | 3.5% | 51.32 | 16,621.52 | 3.2% | 13.2% | 0.3%
main | 0 | 713 | 1049 | 875 | 149 | 0 | 2338 | 11,256.73 | 2.1% | 56.25 | 11,529.00 | 2.2% | -2.4% | -0.1%
adpcm_main | 0 | 354 | 1383 | 773 | 200 | 0 | 1871 | 10,318.75 | 1.9% | 50.93 | 9,332.41 | 1.8% | 10.6% | 0.1%
filtep | 0 | 0 | 598 | 799 | 600 | 0 | 1658 | 8,438.64 | 1.6% | 50.92 | 7,444.50 | 1.5% | 13.4% | 0.1%
logscl | 0 | 399 | 1191 | 795 | 99 | 0 | 948 | 8,056.15 | 1.5% | 52.92 | 7,264.86 | 1.4% | 10.9% | 0.1%
logsch | 0 | 400 | 1030 | 630 | 200 | 0 | 837 | 7,314.58 | 1.4% | 53.02 | 6,568.12 | 1.3% | 11.4% | 0.1%
scalel | 0 | 200 | 1591 | 798 | 0 | 0 | 298 | 7,158.11 | 1.3% | 63.72 | 7,358.39 | 1.4% | -2.7% | -0.1%
reset | 0 | 0 | 65 | 96 | 0 | 0 | 2594 | 5,025.29 | 0.9% | 30.64 | 3,376.53 | 0.7% | 48.8% | 0.3%
abs | 0 | 0 | 299 | 748 | 0 | 0 | 108 | 2,978.35 | 0.6% | 60.43 | 2,791.87 | 0.5% | 6.7% | 0.0%
Total | 0 | 27,830 | 58,926 | 48,041 | 11,669 | 0 | 87,009 | 532,580 | | | 513,107 | | 3.8% |

(a) Profiling Results, (b) Energy Estimation, (c) Simulation Results, (d) Error
and actual percentage energy consumptions.
It can be seen that although many functions have seemingly high estimation errors, the
overall energy estimation errors are still quite small, measuring only 3.8% and 2.0% for
“ADPCM” and “DFMUL”, respectively. This result can be explained by noting that the
majority of each program’s execution is spent in a relatively small number of functions; for
“ADPCM”, 74% of all cycles executed are spent in only 4 of 15 functions. Likewise, for
“DFMUL”, 81% of all cycles are spent in only 4 of the 12 functions. Although some
function-level error percentages are high, the actual differences between the energy estimates
and the actual energies are quite small. Therefore, these errors are less critical since they
correspond to very small portions of the code. The right-most column in Tables 4.8 and 4.9
shows that the difference in percentage energy between the estimate and the value calculated
from simulation is at most 1.9% for either benchmark. This confirms that the estimates are
accurate at the function level.
Table 4.9: Function-level energy estimation results for benchmark “DFMUL”. Note: “Stalls”
indicates pipeline stalls, and “Energy %” indicates percentage energy compared to the total
energy.

Function | A | B | C | D | E | F | Stalls | Est. Energy (nJ) | Energy % | VCD Power (mW) | VCD Energy (nJ) | Energy % | Error | Energy % Diff.
float64_mul | 0 | 374 | 789 | 1173 | 32 | 0 | 2674 | 10,875.01 | 46.2% | 52.01 | 10,489.38 | 48.1% | 3.7% | -1.9%
main | 0 | 142 | 260 | 232 | 20 | 0 | 1199 | 3,798.68 | 16.1% | 47.97 | 3,555.54 | 16.3% | 6.8% | -0.2%
roundAndPackFloat64 | 0 | 116 | 198 | 241 | 8 | 0 | 641 | 2,581.46 | 11.0% | 47.82 | 2,303.01 | 10.6% | 12.1% | 0.4%
mul64To128 | 31 | 30 | 123 | 93 | 0 | 31 | 495 | 1,682.78 | 7.1% | 42.49 | 1,364.78 | 6.3% | 23.3% | 0.9%
propagateFloat64NaN | 0 | 41 | 60 | 74 | 0 | 0 | 549 | 1,420.20 | 6.0% | 39.20 | 1,135.23 | 5.2% | 25.1% | 0.8%
extractFloat64Exp | 0 | 0 | 79 | 159 | 0 | 0 | 49 | 718.09 | 3.0% | 60.91 | 699.25 | 3.2% | 2.7% | -0.2%
extractFloat64Sign | 0 | 0 | 39 | 198 | 0 | 0 | 42 | 711.30 | 3.0% | 54.72 | 610.68 | 2.8% | 16.5% | 0.2%
extractFloat64Frac | 0 | 0 | 79 | 140 | 0 | 0 | 70 | 703.63 | 3.0% | 66.32 | 766.66 | 3.5% | -8.2% | -0.5%
float64_is_signaling_nan | 0 | 6 | 17 | 44 | 0 | 0 | 104 | 360.71 | 1.5% | 42.98 | 293.98 | 1.3% | 22.7% | 0.2%
packFloat64 | 0 | 0 | 29 | 30 | 29 | 0 | 41 | 315.25 | 1.3% | 51.78 | 267.18 | 1.2% | 18.0% | 0.1%
float64_is_nan | 0 | 5 | 16 | 18 | 0 | 0 | 81 | 244.35 | 1.0% | 44.69 | 214.51 | 1.0% | 13.9% | 0.1%
float_raise | 0 | 2 | 2 | 7 | 0 | 0 | 65 | 143.96 | 0.6% | 37.35 | 113.54 | 0.5% | 26.8% | 0.1%
Total | 31 | 716 | 1,691 | 2,409 | 89 | 31 | 6,010 | 23,555 | | | 21,813 | | 2.0% |

(a) Profiling Results, (b) Energy Estimation, (c) Simulation Results, (d) Error
4.3.2.3 Energy and Cycle Profile Correlation
A final investigation into energy profiling was to determine how closely the amount of time a
program spends in different parts of its code relates to the amount of energy it consumes. Ta-
bles 4.10 and 4.11 show the function-level results for cycle count profiling and energy profiling for
the benchmarks “ADPCM” and “DFMUL.” The number of clock cycles spent executing within
each function, along with the percentage execution time as compared with the total number of
clock cycles required to execute the program, are labeled Cycles. The energy consumption per
function, along with the percentage energy consumption, are labeled cStall Energy and pStall
Energy; cStall Energy refers to the energy estimate based on cache stalls only, and pStall Energy
refers to the estimate based on all pipeline stalls. By comparing the percentage execution time
to each of the percentage energy consumptions, it can be seen that the maximum difference
between time and cache stall energy for “ADPCM” is 0.3%, and 0.6% for time versus pipeline
stall energy. For “DFMUL,” the maximum difference between time and cache stall energy is
1.8%, and 0.6% for time versus pipeline stall energy. The full set of results
for all 13 benchmarks can be found in Appendix A.2.3. This shows that there is a very strong
Table 4.10: Comparative results of percentage energy consumption versus percentage execution time for benchmark “ADPCM”.

Function | Cycles, % | cStall Energy, % | pStall Energy, %
decode | 63818 27.3% | 155,220.31 27.4% | 143,372.68 26.9%
encode | 62188 26.6% | 151,221.70 26.7% | 144,953.36 27.2%
upzero | 32578 14.0% | 79,712.37 14.1% | 73,001.07 13.7%
filtez | 14707 6.3% | 35,573.37 6.3% | 33,643.84 6.3%
quantl | 14137 6.1% | 34,992.36 6.2% | 32,289.59 6.1%
uppol2 | 11209 4.8% | 27,070.45 4.8% | 25,952.73 4.9%
uppol1 | 8097 3.5% | 20,692.39 3.7% | 18,820.20 3.5%
main | 5115 2.2% | 10,846.07 1.9% | 11,256.73 2.1%
adpcm main | 4596 2.0% | 9,799.36 1.7% | 10,318.75 1.9%
filtep | 3655 1.6% | 9,986.15 1.8% | 8,438.64 1.6%
logscl | 3425 1.5% | 8,382.52 1.5% | 8,056.15 1.5%
logsch | 3097 1.3% | 8,083.42 1.4% | 7,314.58 1.4%
scalel | 2887 1.2% | 7,190.45 1.3% | 7,158.11 1.3%
reset | 2753 1.2% | 4,035.83 0.7% | 5,025.29 0.9%
abs | 1155 0.5% | 2,960.50 0.5% | 2,978.35 0.6%
wrap | 91 0.0% | 123.28 0.0% | 160.89 0.0%
Total | 233508 100.0% | 565,891 100.0% | 532,741 100.0%
correlation between the function-level energy estimates and the function-level execution time.
It is also important to note that, except for a few small differences (e.g., encode and decode in
“ADPCM”), the sorted order of percentage execution time, percentage cache stall energy, and
percentage pipeline stall energy is the same. For hardware/software partitioning purposes,
this means that a partition optimized for throughput is also optimized for energy consumption,
and vice-versa.
4.4 Summary
LEAP has been shown to be an extensible profiling framework through the implementation of
energy profiling. An instruction-level power database was created using timing-based simulation
to characterize the individual power consumption of instructions in the MIPS1 instruction set.
LEAP’s Data Counter module was modified to count, per function, the number of cycles the
Table 4.11: Comparative results of percentage energy consumption versus percentage execution time for benchmark “DFMUL”.

Function | Cycles, % | cStall Energy, % | pStall Energy, %
float64 mul | 5038 45.6% | 10,462.91 47.1% | 10,875.01 45.9%
main | 1834 16.6% | 3,290.03 14.8% | 3,798.68 16.0%
roundAndPackFloat64 | 1213 11.0% | 2,463.48 11.1% | 2,581.46 10.9%
mul64To128 | 788 7.1% | 1,627.69 7.3% | 1,682.78 7.1%
propagateFloat64NaN | 727 6.6% | 1,237.55 5.6% | 1,420.20 6.0%
extractFloat64Frac | 289 2.6% | 677.64 3.1% | 703.63 3.0%
extractFloat64Exp | 287 2.6% | 681.30 3.1% | 718.09 3.0%
extractFloat64Sign | 282 2.6% | 681.14 3.1% | 711.30 3.0%
float64 is signaling nan | 171 1.5% | 321.52 1.4% | 360.71 1.5%
packFloat64 | 129 1.2% | 308.63 1.4% | 315.25 1.3%
float64 is nan | 120 1.1% | 209.88 0.9% | 244.35 1.0%
wrap | 91 0.8% | 123.28 0.6% | 160.89 0.7%
float raise | 76 0.7% | 114.26 0.5% | 143.96 0.6%
Total | 11045 100.0% | 22,199 100.0% | 23,716 100.0%
processor spent executing each instruction and the number of cycles spent stalling. Using the
equation E = P × T , the energy consumption of thirteen benchmarks was estimated within
5.95%, on average. The results of cycle profiling and energy profiling were compared to deter-
mine the relationship between percent execution time and percent energy consumption for each
function in the benchmark suite. It was found that for a given function, the two rarely differ
by more than 1-2%.
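The estimate summarized above can be made concrete with a short sketch. The function below is an illustrative software model, not LEAP's implementation: the array names, the per-class cycle breakdown, and the units (mW, ns, nJ) are assumptions chosen to mirror the tables in this chapter.

```c
#include <stddef.h>

/* Sketch of the per-function estimate E = P x T. `cycles[c]` stands for
 * the cycles a function spent executing instruction class c (as counted
 * by the modified Data Counter), and `power_mw[c]` for that class's
 * average dynamic power from the instruction-level power database.
 * All names and units are illustrative. */
double estimate_energy_nj(const unsigned long cycles[],
                          const double power_mw[],
                          size_t num_classes,
                          double clk_period_ns)
{
    double energy_pj = 0.0;
    for (size_t c = 0; c < num_classes; ++c) {
        /* power (mW) x time (ns) = energy (pJ) for this class's cycles */
        energy_pj += power_mw[c] * (double)cycles[c] * clk_period_ns;
    }
    return energy_pj / 1000.0; /* pJ -> nJ */
}
```

Summing such per-function estimates yields a whole-program figure comparable to the totals reported in this chapter.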
Chapter 5
Hardware/Software Partitioning
This research aims to enable the partitioning of software code into hardware and software
portions. In Chapter 3, a low-overhead profiling framework was described that returned the
cycle count of each function that was called in an application. This chapter demonstrates
the utility of this framework by using the results of Chapter 3 to drive hardware/software
partitioning with the LegUp framework.
5.1 System Creation with LegUp
In order to demonstrate the effectiveness of hardware/software partitioning using LEAP, LegUp
was used to create hybrid systems for comparison. These hybrid systems consist of the Tiger
processor plus hardware accelerators communicating with the processor over the Altera Avalon
Interconnect fabric. Each accelerator is the hardware implementation of a software function
and all descendant functions.
In order to create a hybrid system, the software-only version of the system is first run using
LEAP to gather profiling data. Using these results, functions are chosen to maximize through-
put or minimize energy consumption by placing the name of each function in a configuration
file. LegUp reads this file and, using its HLS tool, compiles each function and all descendants
into Verilog modules.
In order to communicate with these accelerated functions, a software wrapper function is cre-
ated to: 1) pass any parameters required by the function, 2) retrieve the function’s return value,
3) start the hardware accelerator at the correct times in program flow, and 4) determine when
the accelerator has completed and the results are available. These wrappers are automatically
generated by LegUp for each function that is accelerated. Using the LLVM compiler infras-
tructure, all function calls to accelerated functions are replaced with calls to the corresponding
wrapper, which can then pass parameters over the Avalon Interface, send a start signal to the
accelerator, wait until the accelerator completes, and read the return value, if any.
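The four steps above can be sketched in C. Everything in this sketch is an assumption for illustration: the register layout, the names, and the software stand-in that plays the accelerator's role so the handshake can be exercised off-hardware. The real wrappers are generated automatically by LegUp and poll the accelerator's status over the Avalon interface.

```c
#include <stdint.h>

/* Illustrative register file of one accelerator. On hardware this would
 * be a memory-mapped Avalon slave; here it is a plain struct. */
typedef struct {
    uint32_t arg0;   /* first (and only) parameter */
    uint32_t start;  /* write 1 to start; cleared on completion */
    uint32_t retval; /* result, valid once start reads back 0 */
} accel_regs;

/* Software stand-in for the accelerator: squares its argument.
 * On hardware this logic is the HLS-generated circuit. */
static void fake_accel_step(accel_regs *r)
{
    if (r->start) {
        r->retval = r->arg0 * r->arg0;
        r->start = 0;                 /* signal completion */
    }
}

static int32_t accel_wrapper(accel_regs *r, int32_t arg0)
{
    r->arg0 = (uint32_t)arg0;    /* 1) pass parameters             */
    r->start = 1;                /* 3) assert the start signal     */
    while (r->start)             /* 4) poll until completion       */
        fake_accel_step(r);      /*    (hardware needs no call)    */
    return (int32_t)r->retval;   /* 2) retrieve the return value   */
}
```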
Once the accelerators have been compiled and the software wrappers have been generated to
communicate with the accelerators, the hybrid system (not containing LEAP) is automatically
generated. The Tiger processor and all accelerators are placed on the Avalon Interconnect
fabric, and SOPC Builder is used to generate arbitration code for the communication between
them. Quartus is used to synthesize the system and program it to the target FPGA. Users
run applications in the same way they would run software on the Tiger system, except that
the application-specific hybrid system must be used instead of the generic processor, and the
modified software binary containing the software wrapper functions must be run.
5.2 Experimental Results
Multiple configurations of the hybrid systems were generated and compared to investigate the
trends that arise as computation is gradually moved from software to hardware. Data was
collected that measured the area consumption, throughput, and energy consumption of these
configurations.
5.2.1 Methodology
The experiments performed to investigate trends in hardware/software partitioning consist of
four system configurations created with LegUp:
1. Software Only: this system consists of the base Tiger system, running pure software
through the processor.
2. Hardware Only: this system consists of a pure-hardware implementation of the bench-
mark, as created by the LegUp HLS tools. The Tiger processor is not part of this system.
3. Hybrid 1: this hybrid system accelerates the most compute-intensive software function
(and descendants).
4. Hybrid 2: this hybrid system accelerates the second-most compute-intensive software
function (and descendants).
All four configurations were created for each of the 13 benchmarks, using the results of cycle
profiling (Section 3.4.3) to assist in partitioning, where applicable. Using the post-synthesis
results from Quartus, the area consumption of each system was determined, measured in logic
elements (LEs), memory bits, and embedded hardware multipliers. Using the Fmax of the
system (as determined by Quartus), in conjunction with the number of cycles required to execute
the benchmark in the given configuration, the total execution time was determined. Finally,
by using timing simulation with ModelSim followed by analysis with the Quartus PowerPlay
Analyzer, the average dynamic power and total energy consumption of each configuration were
determined.
5.2.2 Results
The area consumption of each benchmark in the four system configurations is shown in Table 5.1.
Each configuration, labeled (a)-(d), shows the number of logic elements (LEs), memory bits
(Mem. Bits), and embedded multipliers (Mults) for that benchmark. The geometric mean,
geomean, is also shown, averaging the results of each metric among all benchmarks to show the
trends among configurations. The ratio indicates the geomean of the given metric divided by
that of software-only. It can be seen that as more computation is performed in hardware (i.e.
from left to right in the table), all metrics increase in area consumption as accelerators must be
added to the system. This does not hold for the hardware-only configuration, as the average
number of LEs is only slightly more than that of software only, and the memory and multiplier
requirements are the lowest of any configuration. This can be explained by noting that the
Tiger processor is no longer needed in this configuration, as no computation is performed by
software. The trends in area consumption, specifically LE count, are shown in Figure 5.1.
Table 5.1: Area results for hybrid systems in four configurations: (a) SW-Only, (b) Hybrid 2, (c) Hybrid 1, (d) HW-Only. Each configuration lists LEs, Mem. Bits, and Mults.

Benchmark | (a) LEs, Mem. Bits, Mults | (b) LEs, Mem. Bits, Mults | (c) LEs, Mem. Bits, Mults | (d) LEs, Mem. Bits, Mults
adpcm | 12,243 226,009 16 | 25,628 242,944 152 | 46,301 242,944 300 | 22,605 29,120 300
aes | 12,243 226,009 16 | 56,042 244,800 32 | 68,031 245,824 40 | 28,490 38,336 0
blowfish | 12,243 226,009 16 | 25,030 341,888 16 | 31,020 342,752 16 | 15,064 150,816 0
dfadd | 12,243 226,009 16 | 22,544 233,664 16 | 26,148 233,472 16 | 8,881 17,120 0
dfdiv | 12,243 226,009 16 | 28,583 225,600 46 | 36,946 233,472 78 | 20,159 12,416 62
dfmul | 12,243 226,009 16 | 16,149 225,280 48 | 20,284 233,472 48 | 4,861 12,032 32
dfsin | 12,243 226,009 16 | 34,695 233,472 78 | 54,450 233,632 116 | 38,933 12,864 100
gsm | 12,243 226,009 16 | 15,220 225,280 16 | 30,808 233,296 142 | 19,131 11,168 70
jpeg | 12,243 226,009 16 | 25,148 232,576 114 | 64,441 354,544 254 | 46,224 253,936 172
mips | 12,243 226,009 16 | 46,432 338,096 252 | 18,857 230,304 24 | 4,479 4,480 8
motion | 12,243 226,009 16 | 18,857 230,304 24 | 18,013 242,880 16 | 13,238 34,752 0
sha | 12,243 226,009 16 | 28,761 243,104 16 | 29,754 359,136 16 | 12,483 134,368 0
dhrystone | 12,243 226,009 16 | 20,382 359,136 16 | 16,310 225,280 16 | 4,985 82,008 0
Geomean: | 12,243 226,009 16 | 26,593 248,671 43 | 33,629 261,260 51 | 15,646 28,822 12
Ratio: | 1.00 1.00 1.00 | 2.17 1.10 2.68 | 2.75 1.16 3.18 | 1.28 0.13 0.72
The throughput of each system is shown in Table 5.2. Each configuration shows the number
of cycles required to run the benchmark to completion (Cycles), the maximum frequency at
which the system can be run (Freq.), and the time required to run the benchmark to completion,
measured in microseconds (Time (µs)). It can be seen that, on average, the execution time is
cut in half as more computation is performed in hardware, up to an 88% reduction in execution
time for the hardware-only configuration. The maximum frequency of each system remains
fairly constant, as the LegUp HLS tool orders the execution of operations so as to minimize
its critical path. As one might expect, the dual-precision floating point benchmarks achieved
the greatest speedups as more hardware acceleration was used. The benchmark “DFDIV,”
for example, requires 962.9µs, 190.7µs, 68.8µs, and 29.0µs for the software-only, hybrid 2,
Figure 5.1: Area comparison for hybrid systems in four configurations, averaged across 13 benchmarks. (Bar chart: y-axis “Number of LEs Ratio”; x-axis SW-Only, Hybrid 2, Hybrid 1, HW-Only.)
hybrid 1, and hardware-only configurations, respectively. When comparing to the software-
only configuration for this benchmark, these execution times correspond to reductions of 80.2%,
92.9%, and 97.0% for the hybrid 2, hybrid 1, and hardware-only configurations, respectively.
These results show that partition choices, based on profiling results from LEAP, greatly reduce
the execution time requirements of a system. Moreover, by making incorrect partitioning
choices, such as choosing the second most time-consuming function for acceleration as opposed
to the most time-consuming, it is shown that the generated hybrid system will not be as
beneficial as one that was based on good profiling results. There exists a cost to using the hybrid
systems due to the hardware-software communication overhead. Data must be transferred
over the Avalon fabric each time an accelerator is called. Although this overhead is usually
more than compensated for by the gains in throughput from hardware acceleration, incorrect
partitioning can actually increase the total execution time if ineffective accelerators are chosen.
The benchmark “DFMUL,” for example, actually has a longer execution time for the hybrid
2 system than for pure software, due to the communication overhead of each function call.
The trends in throughput, specifically execution time, are shown in Figure 5.2. For the two
hybrid cases, the bottom portion of each bar represents the ideal speedup, using Amdahl’s Law,
assuming a speedup equal to that achieved for the hardware-only case.
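The ideal portions of those bars follow Amdahl's Law. A minimal sketch of the calculation, with an illustrative function name, assuming the hardware-only speedup s applies to the accelerated fraction f of the software-only time:

```c
/* Ideal hybrid execution time under Amdahl's Law: a fraction f of the
 * software-only time t_sw is accelerated by factor s (here taken to be
 * the speedup observed for the hardware-only configuration). */
double amdahl_ideal_time(double t_sw, double f, double s)
{
    /* unaccelerated portion + accelerated portion */
    return t_sw * ((1.0 - f) + f / s);
}
```

For example, if half of a 100 µs program is accelerated by 2x (f = 0.5, s = 2), the ideal execution time is 75 µs.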
Table 5.2: Time results for hybrid systems in four configurations: (a) SW-Only, (b) Hybrid 2, (c) Hybrid 1, (d) HW-Only. Each configuration lists Cycles, Freq., and Time (µs).

Benchmark | (a) Cycles, Freq., Time | (b) Cycles, Freq., Time | (c) Cycles, Freq., Time | (d) Cycles, Freq., Time
adpcm | 193,607 74.26 2,607.2 | 159,883 61.61 2,595.1 | 96,948 57.19 1,695.2 | 36,795 45.79 804.0
aes | 73,777 74.26 993.5 | 55,014 54.97 1,000.8 | 26,878 49.52 542.8 | 14,022 60.72 231.0
blowfish | 954,563 74.26 12,854.3 | 680,343 63.21 10,763.2 | 319,931 63.70 5,022.5 | 209,866 65.41 3,208.0
dfadd | 16,496 74.26 222.1 | 14,672 83.14 176.5 | 5,649 83.65 67.5 | 2,330 124.05 19.0
dfdiv | 71,507 74.26 962.9 | 15,973 83.78 190.7 | 4,538 65.92 68.8 | 2,144 74.72 29.0
dfmul | 6,796 74.26 91.5 | 10,784 85.46 126.2 | 2,471 83.53 29.6 | 347 85.62 4.0
dfsin | 2,993,369 74.26 40,309.3 | 293,031 65.66 4,462.9 | 80,678 68.23 1,182.4 | 67,466 62.64 1,077.0
gsm | 39,108 74.26 526.6 | 29,500 61.46 480.0 | 18,505 61.14 302.7 | 6,656 58.93 113.0
jpeg | 29,802,639 74.26 401,328.3 | 16,072,954 51.20 313,924.9 | 15,978,127 46.65 342,510.8 | 5,861,516 47.09 124,475.0
mips | 43,384 74.26 584.2 | 6,463 84.50 76.5 | 6,463 84.50 76.5 | 6,443 90.09 72.0
motion | 36,753 74.26 494.9 | 34,859 73.34 475.3 | 17,017 83.98 202.6 | 8,578 91.79 93.0
sha | 1,209,523 74.26 16,287.7 | 358,405 84.52 4,240.5 | 265,221 81.89 3,238.8 | 247,738 86.93 2,850.0
dhrystone | 28,855 74.26 388.6 | 25,599 82.26 311.2 | 25,509 83.58 305.2 | 10,202 85.38 119.0
Geomean: | 173,332.0 74.26 2,334.1 | 86,258.3 69.98 1,232.6 | 42,700.5 67.78 630.0 | 20,853.8 71.56 291.7
Ratio: | 1.00 1.00 1.00 | 0.50 0.94 0.53 | 0.25 0.91 0.27 | 0.12 0.96 0.12
The energy consumption of each system is shown in Table 5.3. Each configuration shows the
average dynamic power consumption of the system, measured in milliwatts (Power (mW)), and
the total energy consumed by the system while running the indicated benchmark, measured
in nanojoules (Energy (nJ)). It can be seen that the dynamic powers of the software-only and
hybrid systems differ by less than 12% on average, whereas the hardware-only case consumes
less than half of this power. This extra power is consumed by the processor; even though much
of the execution is performed by the accelerators, the processor is configured to poll until the
accelerator has completed, and thus is always running. The design decision for the processor to poll
until the accelerator completes, rather than stall, is so that future work may enable concurrent
processing of the accelerators and processor using threads. Although the dynamic power results
do not look promising for the hybrid systems, the energy consumptions of each configuration
show much better results: as more computation is performed in hardware, each configuration
Figure 5.2: Execution time comparison for hybrid systems in four configurations, averaged across 13 benchmarks. (Bar chart: y-axis “Execution Time Ratio”; x-axis SW-Only, Hybrid 2, Hybrid 1, HW-Only; series “Exec. Time” and “Exec. Time Ideal”.)
halves in energy consumption. This result can be explained through the equation E = P × T ,
where E is energy, P is dynamic power, and T is execution time. As just discussed, P does not
change much for software-only or the hybrid systems. T , however, drastically improves from
software to hardware, as shown in Table 5.2, leading to excellent energy reductions. The trends
in energy consumption are shown in Figure 5.3.
Table 5.3: Power and energy results for hybrid systems in four configurations: (a) SW-Only, (b) Hybrid 2, (c) Hybrid 1, (d) HW-Only. Each configuration lists Power (mW) and Energy (nJ).

Benchmark | (a) Power, Energy | (b) Power, Energy | (c) Power, Energy | (d) Power, Energy
adpcm | 264.57 689,783.0 | 205.80 534,073.1 | 175.80 298,018.1 | 157.83 126,894.4
aes | 231.63 230,125.2 | 221.84 222,014.5 | 226.23 122,789.4 | 123.09 28,434.1
blowfish | 127.77 1,642,398.5 | 274.46 2,954,049.3 | 214.16 1,075,608.0 | 121.30 389,118.5
dfadd | 157.61 35,011.1 | 221.09 39,015.7 | 178.98 12,086.6 | 85.84 1,631.0
dfdiv | 214.70 206,741.0 | 257.54 49,101.0 | 193.20 13,299.9 | 83.66 2,426.0
dfmul | 122.50 11,210.6 | 171.13 21,593.8 | 202.21 5,981.8 | 40.00 160.0
dfsin | 240.75 9,704,502.3 | 274.49 1,224,986.7 | 240.06 283,857.4 | 129.14 139,082.3
gsm | 253.35 133,420.8 | 167.91 80,594.0 | 178.90 54,145.6 | 102.70 11,605.4
jpeg | 281.92 113,142,738.7 | 207.3 65,082,605.3 | 204.74 70,124,803.7 | 261.37 32,533,823.6
mips | 239.86 140,130.3 | 166.74 12,752.7 | 166.74 12,752.7 | 41.44 2,983.7
motion | 274.44 135,824.3 | 316.77 150,562.9 | 158.22 32,060.0 | 73.91 6,873.5
sha | 383.21 6,241,622.4 | 227.70 965,543.0 | 211.64 685,437.1 | 152.23 433,860.6
dhrystone | 260.15 101,084.8 | 230.82 71,830.7 | 243.18 74,220.9 | 116.19 13,826.0
Geomean: | 221.67 517,413.28 | 221.61 273,162.62 | 194.46 122,511.37 | 100.75 29,390.23
Ratio: | 1.00 1.00 | 1.00 0.53 | 0.88 0.24 | 0.45 0.06
Figure 5.3: Energy consumption comparison for hybrid systems in four configurations. (Bar chart: y-axis “Energy Consumption Ratio”; x-axis SW-Only, Hybrid 2, Hybrid 1, HW-Only.)
Chapter 6
Conclusions
6.1 Summary and Contributions
In recent years, SoCs implemented on reconfigurable FPGAs have become popular due to the
relative ease of development versus traditional ASIC designs, and the performance benefits
versus pure software. These systems combine multiple computational modules, including cus-
tom logic, communication interfaces, vendor intellectual property code, and soft processors. In
order to broaden the user-base of SoCs, and FPGAs in general, software developers need to
be able to create these systems using software development principles and limited hardware
knowledge. To this end, hardware/software partitioning is a process in which certain software
code segments are chosen to be converted into a hardware implementation to gain from the
performance benefits of hardware. Once partitioned, a framework such as LegUp can be used
to create the hardware accelerators for hybrid systems in which software code runs on a soft
processor and calls the accelerators. In order for this partitioning to provide performance ben-
efits, the choices of which code to convert to hardware must be based on real-time data about
the program’s execution.
Profiling is a form of dynamic code analysis in which real-time data is gathered about a
processor’s execution. Software approaches incur time overhead, resulting in very slow execution
of the application being profiled. To alleviate this overhead, sampling is often employed but
this leads to less accurate results. FPGA-based approaches are generally non-intrusive, meaning
they do not slow down the application’s execution. However, they add area and power overhead,
and may require resynthesis when targeting a new application.
This thesis has contributed a low-overhead and extensible architecture for profiling, called
LEAP, to aid in hardware/software partitioning for FPGA-based systems:

• Chapter 3 showed the extensibility of LEAP through the implementation of three profiling schemes: instruction profiling, cycle profiling, and stall cycle profiling. The accuracy of the profiling results was verified through comparisons with existing profiling tools, and the overhead was shown to be minimal as compared to another FPGA-based profiler; up to a 94% reduction in area and an 88% reduction in dynamic power consumption were achieved.

• Chapter 4 introduced an instruction-level power database that characterizes the dynamic power consumption of instructions in the MIPS1 instruction set. LEAP was shown to be versatile by implementing instruction-specific profiling to measure the time spent executing different instructions per function. The instruction power database was combined with the real-time profiling results to estimate the function-level energy consumption of 13 complex C benchmarks, achieving an accuracy of 5.95%, on average, when compared to simulation-based energy estimates. Cycle profiling and energy profiling results were used to compare the percent execution time of each function within a benchmark to the percent energy consumption. It was found that the two are closely related, with the ordering from high to low percentages being nearly identical for both profiles for all benchmarks.

• Chapter 5 demonstrated the utility of LEAP through the use of its profiling results to drive hardware/software partitioning and the creation of hybrid processor/accelerator systems. The throughput, area, and power consumption of four systems created with LegUp were presented to show the effectiveness of partitioning for reducing execution time and energy consumption.
6.2 Future Work
The research proposed in this thesis aims to enable efficient hardware/software partitioning for
optimizing throughput, energy consumption, and locality. LEAP was created to facilitate these
optimizations and shown to be extensible to capture the desired metrics. Three additional
metrics that could provide further insight into enhancing the partitioned system are left as
future work, as are two possible improvements to the profiling framework.
6.2.1 Detailed Function Profiling
The current implementation of the framework performs all measurements at the function level.
This is useful when determining which functions should be accelerated, but the technique gives
no indication of how they should be accelerated. This question of “how” can be answered
by performing detailed analyses of a function’s execution, especially by focusing on branch
conditions.
Branch conditions dictate the flow of execution and can be represented with conditional
statements, loops, and gotos. Dynamic optimizations such as loop unrolling and branch predic-
tion can be very effective when applied to hardware circuitry to leverage available parallelism,
but they require knowledge of branch outcomes to drive the optimization. Loop unrolling, for
example, replicates the body of a loop a number of times, attempting to exploit intra-iteration
parallelism. This replication obviously increases the area required to implement the loop, so
the optimization must be applied strategically such that long-running loops are unrolled many
times, while infrequently iterating loops are not unrolled.
A new profiling metric to enable hardware-oriented optimizations such as loop unrolling
could therefore be the measurement of the total or average number of iterations of a given loop.
All natural loops can be determined by short-backwards-branches (sbb), which are statically-
detectable branches from the tail to the head of a loop. By monitoring the outcomes of these
particular branches and identifying a given loop by its sbb instruction’s address, loop iterations
can be counted.
The current framework can be extended to implement this scheme by noting the similarities
between it and cycle-accurate profiling: each uses an instruction’s address as an index to store
data, and each simply increments a counter when an event occurs. Cycle-accurate profiling uses
the address of the first instruction in the function as its index into the storage RAM to store the
count of cycles executed. Iteration measurement would use the address of the sbb instruction
as its index into a storage RAM to store the number of times that instruction occurs.
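A software model of this counting scheme might look as follows. The table size, the word-aligned hash of the branch address, and all names are illustrative assumptions; the real scheme would be a hardware counter RAM indexed by the sbb instruction's address.

```c
#include <stdint.h>

/* Counter RAM indexed by (a hash of) the sbb's address; size is an
 * illustrative assumption. */
#define LOOP_TABLE_SIZE 256

static uint32_t loop_counts[LOOP_TABLE_SIZE];

/* A taken branch whose target precedes its own address closes a natural
 * loop (tail -> head), so each such outcome counts one loop iteration. */
static void on_branch(uint32_t branch_pc, uint32_t target_pc, int taken)
{
    if (taken && target_pc < branch_pc) {                   /* sbb */
        uint32_t idx = (branch_pc >> 2) % LOOP_TABLE_SIZE;  /* word index */
        loop_counts[idx]++;                                 /* one more iteration */
    }
}
```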
6.2.2 Value Profiling
Standard compiler optimizations include evaluating and propagating constant variables and
expressions to reduce the amount of computation required by the compiled binary. In order
to enable these optimizations, however, the compiler must be able to determine that a given
variable or expression is indeed constant. A beneficial use of dynamic profiling would be to
enable the extension of this optimization to variables and expressions that are likely to be
constant.
A semi-static parameter is an argument to a function that is highly probable to have the
same value whenever that function is called. If it could be determined that a given parameter
is semi-static, a source-code transformation could be performed to enable further optimization.
The transformation would involve creating a second implementation of the applicable function
in which the semi-static parameter is removed as a variable and replaced as a constant with the
value it is highly likely to hold. Any call to this function would be redirected to one of the two
implementations available. If the parameter equals the highly-probable case, the newly-created
constant version is called; otherwise, the original version is used. In this way, the execution of
the code is potentially accelerated while maintaining correctness.
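As an invented example, suppose profiling found that the `scale` parameter of a hypothetical function `shade` is 8 in nearly every call; the transformation described above would produce a specialized copy plus a dispatch:

```c
/* Original, general version. */
static int shade(int x, int scale)
{
    return x * scale + scale / 2;
}

/* Specialized copy for the semi-static value scale == 8; the compiler
 * can fold the constants (x * 8 + 4) and strength-reduce. */
static int shade_scale8(int x)
{
    return x * 8 + 4;
}

/* Replaces the original call sites: take the fast path for the
 * highly-probable value, fall back to the general version otherwise. */
static int shade_dispatch(int x, int scale)
{
    return (scale == 8) ? shade_scale8(x) : shade(x, scale);
}
```

The dispatch preserves correctness for the rare other values while letting the common case benefit from constant folding.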
Dynamic profiling with the proposed framework can be used to determine semi-static pa-
rameters by using caches to store data for each function, as opposed to the currently-used data
counters. The caches could be indexed by the actual value of the desired parameter; if a param-
eter was in fact semi-static, only one of these blocks would be updated often. The associativity
and replacement policy of the caches would need to be investigated to reduce the chances of
thrashing, which would occur when other values that hash to the same block overwrite the
desired semi-static value.
6.2.3 Memory Access Pattern Profiling
Nearly all modern processors incorporate instruction and/or data caches to help reduce the
overhead involved in accessing memory. There are many parameters to set when implementing
a cache, including its size, associativity, coherency protocol, and its ability to pre-fetch data
through bursting. These parameters, once chosen, are generally fixed characteristics of the
processor, even though software applications vary widely in the ways they access memory. One
possible avenue of research is to profile applications to determine their memory access patterns,
thus enabling a customization of the processor’s caches to the required application to reduce
memory access time overhead.
Code that iterates through an array many times will have a fairly consistent memory access
pattern because the data is located in contiguous memory addresses. This type of access
pattern would benefit from bursting since it is highly probable that an address in memory will
be accessed soon after the one which precedes it. In addition, high associativity is not necessary
due to the use of modulus in mapping an address to a cache line. For example, in a direct-
mapped cache containing M cache lines, the line that a given address A will be mapped to is
A mod M ; any set of addresses that differ by less than M will never be mapped to the same
line.
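A one-line helper makes this mapping concrete (the line count M is an illustrative parameter):

```c
/* Direct-mapped line index for address addr in a cache of num_lines (M)
 * lines, per the A mod M mapping described above. */
static unsigned map_line(unsigned addr, unsigned num_lines)
{
    return addr % num_lines;
}
```

For M = 64, addresses 100 and 130 differ by 30 < M and therefore map to distinct lines (36 and 2).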
Code that makes heavy use of dynamically allocated memory, such as linked list iteration,
may have a more random access pattern. This type of pattern would likely benefit little from
bursting, since there is low probability that subsequent reads or writes would be to nearby
addresses. Higher associativity, however, would be beneficial in reducing the likelihood of
replacement, in which a cache-line conflict causes new data to overwrite the data currently
stored in the cache.
One way to extend this work in order to create a profiling scheme capable of determining the
memory access patterns of an application is to measure the spatial characteristics of memory
accesses. This would require access to the data bus, which would be monitored to discover the
locality, or lack thereof, exhibited by the application. High locality would indicate a contiguous
access pattern, while a lower locality would imply a more random one.
6.2.4 Counter Saturation Detection
LEAP contains multiple parameters that can be set to tailor the profiler to the user’s needs.
One such parameter is the width of the data counters used to store the profiling results, which
ranges from 16 to 64 bits and defines the largest integer that can be stored. For long-executing
functions, larger counters are required to avoid counter overflow, also known as saturation; this
increases the area requirements of the profiler.
In the context of hardware/software partitioning, the relative values of the data counters
are as useful as the absolute values, since a percentage of an overall resource is often all that is
desired. This means that any scaling factor could be applied to the results without affecting
the partition. One way to help reduce the area usage attributed to large counter values is to
implement smaller counters that contain logic to detect saturation. If at any time a counter
saturates, all counters are divided by two through a simple right shift. In this way, a final scaling
factor of 1/2^S is applied to all counters, where S is the total number of saturations detected, and
a relative data set is maintained. This idea is presented in [27], but the authors do not address
a key issue with this approach. After a saturation has occurred, all values added to the counter
(which has been divided by 2) are worth twice as much as a value added prior to the saturation.
In this way, values measured near the beginning of execution are counted as less than those
profiled near the end of execution. This issue must be solved before implementation to achieve
correct counter results.
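A software sketch of the scheme follows; the counter width, counter count, and names are illustrative assumptions, and the comment marks the open issue described above.

```c
#include <stdint.h>

#define NUM_COUNTERS 4
#define COUNTER_MAX  0xFFFFu  /* illustrative 16-bit counters */

static uint32_t counters[NUM_COUNTERS];
static unsigned saturations;  /* S: total saturation events */

static void count_event(unsigned idx)
{
    if (counters[idx] == COUNTER_MAX) {
        /* Halve every counter: relative order (the partitioning input)
         * is preserved, with overall scale 1/2^S. */
        for (int i = 0; i < NUM_COUNTERS; i++)
            counters[i] >>= 1;
        saturations++;
        /* Open issue noted above: increments after this point weigh
         * twice as much as increments made before it. */
    }
    counters[idx]++;
}
```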
6.2.5 Recursion Support
Currently, LEAP provides limited support for the profiling of recursive applications. This
limitation is due to the static depth of the Call Stack; the depth of the recursive function calls
can grow very large, causing the Call Stack’s stack pointer to overflow and incorrectly maintain
function contexts. A limited solution to this problem is to simply make the stack extremely
large. If N ≤ 256, where N is the maximum number of functions to profile, the number of bits
required to store a function number to the stack is log₂(N) ≤ 8. Therefore, the number of stack
entries available in a single 4096-bit M4K AltSyncRam is 4096/8 = 512. This would provide
support for many recursive applications, but cannot be guaranteed to work under extreme uses
of recursion.
In order to provide full support for recursive programs, modifications to the Call Stack
module are required. First, it must be able to determine when an application begins recursion,
or equivalently, when a function already on the stack is called. This can be achieved through the
use of a bit vector of width N, where the position of each bit corresponds to a hashed function
number, and the value of the bit indicates its presence in the stack; a function currently in the
stack is represented with a one. During each function call, the bit corresponding to the called
function is checked to ensure it is not already in the stack, and the bit is set to a 1 to reflect the
function being pushed onto the stack. If a return occurs, the bit corresponding to the returning
function is cleared. If a checked bit is a 1 during a call, the function is being
called recursively and a “recursion mode” must be entered. Recursion mode differs from
normal operation in that the stack pointer is not changed; instead, a “recursion
depth counter” is incremented on calls and decremented on returns to track the current
depth of recursion. In addition, the bit vector is not modified in this mode. Upon entering
recursion mode, the depth counter is set to 1 to reflect the current function call.
If any return causes the depth counter to reach 0, the recursion has completed and normal
operation continues. This approach lumps all cycles accrued during recursion into the top-level
recursive function when returning results, as opposed to the simple approach of extending the stack,
which attributes the cycles to each function that executes.
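The proposed bit-vector scheme can be modelled in software as follows. This is a hypothetical sketch of the behaviour described above, not the hardware implementation; the class and method names are invented for illustration:

```python
class CallStack:
    """Software model of the proposed recursion-aware Call Stack (sketch)."""

    def __init__(self, n_funcs):
        self.stack = []                    # models the hardware stack + pointer
        self.on_stack = [False] * n_funcs  # bit vector: one bit per function
        self.recursion_depth = 0           # "recursion depth counter"

    def call(self, f):
        if self.recursion_depth > 0 or self.on_stack[f]:
            # Recursive call detected (or already in recursion mode):
            # the stack pointer and bit vector are frozen; only depth changes.
            self.recursion_depth += 1
        else:
            self.on_stack[f] = True
            self.stack.append(f)

    def ret(self):
        if self.recursion_depth > 0:
            # When the depth counter returns to 0, recursion has completed
            # and normal operation resumes.
            self.recursion_depth -= 1
        else:
            self.on_stack[self.stack.pop()] = False

    def current(self):
        # All cycles accrued during recursion are attributed to this function.
        return self.stack[-1] if self.stack else None

# main (function 0) calls fib (function 1), which then recurses twice.
cs = CallStack(n_funcs=4)
cs.call(0)
cs.call(1)
cs.call(1)           # recursive call: enters recursion mode, depth = 1
cs.call(1)           # depth = 2; stack pointer unchanged
print(cs.current(), cs.recursion_depth)  # → 1 2
cs.ret(); cs.ret()   # depth back to 0: recursion complete
cs.ret()             # normal return: pops fib
print(cs.current())  # → 0
```

As the trace shows, every cycle spent inside the recursive calls is charged to the top-level recursive function (function 1), matching the attribution policy described above.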
Appendix A
Complete Benchmark Results
This appendix contains the full set of results for the experiments described in Chapters 3, 4,
and 5.
A.1 Full Results for Chapter 3: Cycle-Accurate Profiling
A.1.1 Full Results for LEAP vs. gprof
ADPCM
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
abs 0.49 0.1% 0.49 0.1% 1155 0.49% 1155 0.5% 0.42% 0.42%
adpcm_main 10.06 1.6% 641.72 99.0% 4588 1.96% 228355 97.8% 0.41% 1.27%
decode 117.31 18.1% 281.81 43.5% 63863 27.34% 102578 43.9% 9.24% 0.42%
encode 145.12 22.4% 344.51 53.2% 62193 26.63% 118422 50.7% 4.23% 2.47%
filtep 28.86 4.5% 28.86 4.5% 3655 1.56% 3655 1.6% 2.89% 2.89%
filtez 77.93 12.0% 77.93 12.0% 14708 6.30% 14683 6.3% 5.73% 5.74%
logsch 9.24 1.4% 9.24 1.4% 3100 1.33% 3097 1.3% 0.10% 0.10%
logscl 13.92 2.1% 13.92 2.1% 3425 1.47% 3440 1.5% 0.68% 0.68%
main 5.74 0.9% 647.46 99.9% 5122 2.19% 233486 100.0% 1.31% 0.04%
quantl 34.89 5.4% 34.89 5.4% 14177 6.07% 14743 6.3% 0.69% 0.93%
reset 5.34 0.8% 5.34 0.8% 2755 1.18% 2764 1.2% 0.36% 0.36%
scalel 21.48 3.3% 21.48 3.3% 2902 1.24% 2888 1.2% 2.07% 2.08%
uppol1 21.1 3.3% 21.1 3.3% 8097 3.47% 8097 3.5% 0.21% 0.21%
uppol2 40.22 6.2% 40.22 6.2% 11209 4.80% 11212 4.8% 1.41% 1.41%
upzero 116.23 17.9% 116.23 17.9% 32614 13.96% 32589 14.0% 3.97% 3.99%
(a) gprof (b) LEAP (c) Difference
AES
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
AddRoundKey 3.74 1.1% 3.74 1.1% 1248 1.7% 1228 1.7% 0.58% 0.55%
AddRoundKey_InversMixColumn 226.2 67.2% 226.2 67.2% 26069 35.3% 26058 35.3% 31.94% 31.96%
aes_main 0.18 0.1% 336.33 100.0% 1032 1.4% 73620 99.7% 1.34% 0.29%
ByteSub_ShiftRow 7.46 2.2% 7.46 2.2% 8942 12.1% 8970 12.1% 9.89% 9.93%
decrypt 2.56 0.8% 259.13 77.0% 1514 2.1% 39028 52.9% 1.29% 24.18%
encrypt 1.97 0.6% 77.02 22.9% 1916 2.6% 33551 45.4% 2.01% 22.54%
frame_dummy 0.01 0.0% 0.01 0.0% 0.0% 0.0% 0.00% 0.00%
InversShiftRow_ByteSub 6.62 2.0% 6.62 2.0% 5010 6.8% 5020 6.8% 4.82% 4.83%
KeySchedule 38.28 11.4% 43.76 13.0% 14079 19.1% 15357 20.8% 7.69% 7.79%
main 0.05 0.0% 336.33 100.0% 219 0.3% 73839 100.0% 0.28% 0.01%
MixColumn_AddRoundKey 43.83 13.0% 43.83 13.0% 12544 17.0% 12521 17.0% 3.96% 3.93%
SubByte 5.48 1.6% 5.48 1.6% 1271 1.7% 1271 1.7% 0.09% 0.09%
(a) gprof (b) LEAP (c) Difference
BLOWFISH
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
BF_cfb64_encrypt 924.25 23.2% 2639.39 66.3% 166292 16.4% 546458 54.0% 6.78% 12.30%
BF_encrypt 2426.87 60.9% 2426.87 60.9% 674439 66.6% 674327 66.6% 5.68% 5.67%
BF_set_key 35.83 0.9% 793.89 19.9% 18402 1.8% 327706 32.4% 0.92% 12.44%
blowfish_main 548.51 13.8% 3981.79 100.0% 137899 13.6% 1012071 100.0% 0.15% 0.01%
frame_dummy 0.03 0.0% 0.03 0.0% 0.0% 0.0% 0.00% 0.00%
main 0.13 0.0% 3981.88 100.0% 142 0.0% 1012213 100.0% 0.01% 0.00%
memcpy 46.32 1.2% 46.32 1.2% 15085 1.5% 15085 1.5% 0.33% 0.33%
(a) gprof (b) LEAP (c) Difference
DFADD
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
addFloat64Sigs 7.79 16.9% 16.31 35.4% 4714 18.3% 9583 37.1% 1.33% 1.67%
countLeadingZeros32 0.72 1.6% 0.72 1.6% 388 1.5% 388 1.5% 0.06% 0.06%
countLeadingZeros64 0.49 1.1% 1.21 2.6% 453 1.8% 841 3.3% 0.69% 0.63%
extractFloat64Exp 2.23 4.8% 2.23 4.8% 575 2.2% 575 2.2% 2.62% 2.62%
extractFloat64Frac 1.93 4.2% 1.93 4.2% 591 2.3% 591 2.3% 1.90% 1.90%
extractFloat64Sign 2.3 5.0% 2.3 5.0% 591 2.3% 591 2.3% 2.71% 2.71%
float_raise 0.21 0.5% 0.21 0.5% 53 0.2% 53 0.2% 0.25% 0.25%
float64_add 4.86 10.6% 40.88 88.8% 3122 12.1% 22123 85.7% 1.53% 3.15%
float64_is_nan 1.39 3.0% 1.39 3.0% 262 1.0% 272 1.1% 2.01% 1.97%
float64_is_signaling_nan 1.34 2.9% 1.34 2.9% 397 1.5% 397 1.5% 1.37% 1.37%
frame_dummy 0.05 0.1% 0.05 0.1% 0.0% 0.0% 0.11% 0.11%
main 5.07 11.0% 45.96 99.9% 3666 14.2% 25775 99.8% 3.18% 0.04%
normalizeRoundAndPackFloat64 0.6 1.3% 3.55 7.7% 665 2.6% 2450 9.5% 1.27% 1.78%
packFloat64 1.29 2.8% 1.29 2.8% 195 0.8% 195 0.8% 2.05% 2.05%
propagateFloat64NaN 3.96 8.6% 6.69 14.5% 1587 6.1% 2248 8.7% 2.46% 5.83%
roundAndPackFloat64 3.01 6.5% 3.91 8.5% 2227 8.6% 2384 9.2% 2.08% 0.74%
shift64RightJamming 1.29 2.8% 1.29 2.8% 1500 5.8% 1884 7.3% 3.01% 4.49%
subFloat64Sigs 7.49 16.3% 17.42 37.9% 4324 16.7% 8820 34.2% 0.47% 3.69%
__lshrdi3 0.0% 0.0% 191 0.7% 191 0.7% 0.74% 0.74%
__ashldi3 0.0% 0.0% 318 1.2% 303 1.2% 1.23% 1.17%
(a) gprof (b) LEAP (c) Difference
DFDIV
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
add128 0.26 0.8% 0.26 0.8% 0 0.0% 0 0.0% 0.78% 0.78%
countLeadingZeros64 0.01 0.0% 0.01 0.0% 0 0.0% 0 0.0% 0.03% 0.03%
estimateDiv128To64 8.43 25.4% 12.28 37.1% 2789 3.5% 66972 83.8% 21.95% 46.73%
extractFloat64Exp 1.16 3.5% 1.16 3.5% 287 0.4% 287 0.4% 3.14% 3.14%
extractFloat64Frac 0.98 3.0% 0.98 3.0% 329 0.4% 329 0.4% 2.55% 2.55%
extractFloat64Sign 0.95 2.9% 0.95 2.9% 303 0.4% 303 0.4% 2.49% 2.49%
float_raise 0.2 0.6% 0.2 0.6% 138 0.2% 131 0.2% 0.43% 0.44%
float64_div 6.94 20.9% 30.06 90.7% 6251 7.8% 77933 97.5% 13.12% 6.79%
float64_is_nan 0.17 0.5% 0.17 0.5% 115 0.1% 115 0.1% 0.37% 0.37%
float64_is_signaling_nan 0.73 2.2% 0.73 2.2% 165 0.2% 165 0.2% 2.00% 2.00%
main 2.56 7.7% 32.62 98.4% 1961 2.5% 79909 100.0% 5.27% 1.54%
mul64To128 5.1 15.4% 5.1 15.4% 1574 2.0% 1577 2.0% 13.42% 13.42%
normalizeFloat64Subnormal 0.26 0.8% 0.26 0.8% 0 0.0% 0 0.0% 0.78% 0.78%
packFloat64 1 3.0% 1 3.0% 141 0.2% 141 0.2% 2.84% 2.84%
propagateFloat64NaN 1.34 4.0% 2.23 6.7% 719 0.9% 1005 1.3% 3.14% 5.47%
roundAndPackFloat64 1.74 5.3% 2.45 7.4% 1603 2.0% 1690 2.1% 3.25% 5.28%
shift64RightJamming 0.0% 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
sub128 1.31 4.0% 1.31 4.0% 1110 1.4% 1110 1.4% 2.56% 2.56%
__udivdi3 0.0% 0.0% 4174 5.2% 62435 78.1% 5.22% 78.11%
__udivsi3 0.0% 0.0% 1438 1.8% 29464 36.9% 1.80% 36.86%
__umodsi3 0.0% 0.0% 1368 1.7% 28819 36.1% 1.71% 36.05%
__udivmodsi4 0.0% 0.0% 55470 69.4% 55470 69.4% 69.39% 69.39%
(a) gprof (b) LEAP (c) Difference
DFMUL
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
countLeadingZeros64 0.02 0.1% 0.02 0.1% 0 0.0% 0 0.0% 0.13% 0.13%
extractFloat64Exp 1.17 7.4% 1.17 7.4% 287 2.6% 287 2.6% 4.81% 4.81%
extractFloat64Frac 1.42 9.0% 1.42 9.0% 289 2.6% 289 2.6% 6.38% 6.38%
extractFloat64Sign 0.58 3.7% 0.58 3.7% 279 2.6% 279 2.6% 1.13% 1.13%
float_raise 0.09 0.6% 0.09 0.6% 76 0.7% 76 0.7% 0.12% 0.12%
float64_is_nan 0.31 2.0% 0.31 2.0% 120 1.1% 120 1.1% 0.87% 0.87%
float64_is_signaling_nan 0.33 2.1% 0.33 2.1% 171 1.6% 171 1.6% 0.53% 0.53%
float64_mul 5.34 33.9% 13.66 86.8% 5051 46.2% 9138 83.5% 12.25% 3.25%
main 1.71 10.9% 15.37 97.6% 1815 16.6% 10965 100.2% 5.73% 2.59%
mul64To128 1.86 11.8% 1.86 11.8% 788 7.2% 788 7.2% 4.61% 4.61%
normalizeFloat64Subnormal 0.24 1.5% 0.24 1.5% 0 0.0% 0 0.0% 1.52% 1.52%
packFloat64 0.57 3.6% 0.57 3.6% 129 1.2% 129 1.2% 2.44% 2.44%
propagateFloat64NaN 0.95 6.0% 1.58 10.0% 722 6.6% 1015 9.3% 0.56% 0.76%
roundAndPackFloat64 1.05 6.7% 1.35 8.6% 1212 11.1% 1263 11.5% 4.41% 2.97%
shift64RightJamming 0.1 0.6% 0.1 0.6% 0 0.0% 0 0.0% 0.64% 0.64%
(a) gprof (b) LEAP (c) Difference
DFSIN
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
add128 2.83 0.2% 2.83 0.2% 0 0.0% 0 0.0% 0.17% 0.17%
addFloat64Sigs 33.57 2.0% 78.45 4.6% 14567 0.4% 74173 1.9% 1.61% 2.71%
countLeadingZeros32 43.03 2.5% 43.03 2.5% 52029 1.4% 52036 1.4% 1.19% 1.19%
countLeadingZeros64 16.52 1.0% 33.98 2.0% 27119 0.7% 58368 1.5% 0.27% 0.49%
estimateDiv128To64 257.07 15.2% 345.05 20.4% 77938 2.0% 2985616 77.5% 13.18% 57.15%
extractFloat64Exp 99.13 5.9% 99.13 5.9% 13335 0.3% 13335 0.3% 5.51% 5.51%
extractFloat64Frac 60.68 3.6% 60.68 3.6% 10128 0.3% 10128 0.3% 3.32% 3.32%
extractFloat64Sign 70.2 4.2% 70.2 4.2% 13342 0.3% 13342 0.3% 3.80% 3.80%
float_raise 0.0% 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
float64_abs 6.89 0.4% 6.89 0.4% 12033 0.3% 12059 0.3% 0.09% 0.09%
float64_add 50.8 3.0% 372.37 22.0% 16488 0.4% 353666 9.2% 2.58% 12.83%
float64_div 145.1 8.6% 652.04 38.6% 56192 1.5% 3158964 82.0% 7.12% 43.50%
float64_ge 8.63 0.5% 88.51 5.2% 18409 0.5% 79901 2.1% 0.03% 3.16%
float64_le 50.39 3.0% 79.88 4.7% 55222 1.4% 61634 1.6% 1.55% 3.12%
float64_mul 139.22 8.2% 373.68 22.1% 58336 1.5% 177966 4.6% 6.72% 17.47%
float64_neg 5.17 0.3% 5.17 0.3% 1046 0.0% 1054 0.0% 0.28% 0.28%
frame_dummy 1.22 0.1% 1.22 0.1% 0.0% 0.0% 0.07% 0.07%
int32_to_float64 39.83 2.4% 88.89 5.3% 13962 0.4% 44197 1.1% 1.99% 4.11%
main 5.62 0.3% 1659.96 98.1% 2365 0.1% 3849898 100.0% 0.27% 1.85%
mul64To128 118.72 7.0% 118.72 7.0% 133633 3.5% 133843 3.5% 3.55% 3.54%
normalizeFloat64Subnormal 5.79 0.3% 5.79 0.3% 0 0.0% 0 0.0% 0.34% 0.34%
normalizeRoundAndPackFloat64 14.75 0.9% 105.27 6.2% 51587 1.3% 138573 3.6% 0.47% 2.62%
packFloat64 59.47 3.5% 59.47 3.5% 6687 0.2% 6687 0.2% 3.34% 3.34%
propagateFloat64NaN 21.51 1.3% 21.51 1.3% 0 0.0% 0 0.0% 1.27% 1.27%
roundAndPackFloat64 222.69 13.2% 258.32 15.3% 158701 4.1% 163792 4.3% 9.04% 11.02%
shift64ExtraRightJamming 0.0% 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
shift64RightJamming 14.69 0.9% 14.69 0.9% 94591 2.5% 134740 3.5% 1.59% 2.63%
sin 66.79 3.9% 1654.34 97.8% 19729 0.5% 3847539 99.9% 3.44% 2.12%
sub128 37.86 2.2% 37.86 2.2% 30995 0.8% 30870 0.8% 1.43% 1.44%
subFloat64Sigs 93.15 5.5% 237.61 14.0% 30683 0.8% 259812 6.7% 4.71% 7.30%
__lshrdi3 0.0% 0.0% 20098 0.5% 20005 0.5% 0.52% 0.52%
__ashldi3 0.0% 0.0% 36042 0.9% 36113 0.9% 0.94% 0.94%
__udivdi3 0.0% 0.0% 280963 7.3% 2824428 73.4% 7.30% 73.36%
__udivsi3 0.0% 0.0% 33760 0.9% 1325113 34.4% 0.88% 34.42%
__umodsi3 0.0% 0.0% 38029 1.0% 1218137 31.6% 0.99% 31.64%
__udivmodsi4 0.0% 0.0% 2472067 64.2% 2471591 64.2% 64.21% 64.20%
(a) gprof (b) LEAP (c) Difference
GSM
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
Autocorrelation 76.18 44.3% 118.73 69.0% 28406 56.0% 33783 66.6% 11.68% 2.46%
frame_dummy 0.45 0.3% 0.45 0.3% 0.0% 0.0% 0.26% 0.26%
gsm_abs 23.67 13.8% 23.67 13.8% 2596 5.1% 2596 5.1% 8.65% 8.65%
gsm_add 4.67 2.7% 4.67 2.7% 1744 3.4% 1744 3.4% 0.72% 0.72%
gsm_div 9.72 5.7% 9.72 5.7% 1531 3.0% 1531 3.0% 2.63% 2.63%
Gsm_LPC_Analysis 0.11 0.1% 156.45 91.0% 212 0.4% 44089 86.9% 0.35% 4.09%
gsm_mult 2.96 1.7% 2.96 1.7% 187 0.4% 187 0.4% 1.35% 1.35%
gsm_mult_r 28.96 16.8% 28.96 16.8% 3639 7.2% 3639 7.2% 9.67% 9.67%
gsm_norm 0.52 0.3% 0.52 0.3% 245 0.5% 245 0.5% 0.18% 0.18%
main 15.07 8.8% 171.52 99.7% 6684 13.2% 50746 100.0% 4.41% 0.27%
Quantization_and_coding 1.56 0.9% 5.46 3.2% 1250 2.5% 1771 3.5% 1.56% 0.32%
Reflection_coefficients 6.85 4.0% 29.81 17.3% 3476 6.8% 7608 15.0% 2.87% 2.34%
Transformation_to_Log_Area_Ratios 1.26 0.7% 2.34 1.4% 615 1.2% 715 1.4% 0.48% 0.05%
legup_memset_4 0.0% 0.0% 161 0.3% 161 0.3% 0.32% 0.32%
(a) gprof (b) LEAP (c) Difference
JPEG
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
BoundIDctMatrix 459.66 2.6% 459.66 2.6% 203370 3.3% 203370 3.3% 0.70% 0.70%
buf_getb 1869.29 10.6% 2091.38 11.9% 1437809 23.4% 1479621 24.0% 12.75% 12.17%
buf_getv 769.36 4.4% 914.62 5.2% 641506 10.4% 673810 10.9% 6.06% 5.76%
ChenIDct 2895.91 16.4% 2895.91 16.4% 619944 10.1% 619944 10.1% 6.36% 6.36%
decode_block 42.21 0.2% 13081.3 74.2% 12492 0.2% 4752166 77.2% 0.04% 2.96%
decode_start 89.4 0.5% 16341.6 92.7% 5719 0.1% 5650316 91.8% 0.41% 0.95%
DecodeHuffman 2919.39 16.6% 5010.76 28.4% 748738 12.2% 2228359 36.2% 4.40% 7.77%
DecodeHuffMCU 2352.81 13.4% 8278.19 47.0% 547850 8.9% 3450019 56.1% 4.45% 9.07%
first_marker 0.15 0.0% 0.21 0.0% 343 0.0% 567 0.0% 0.00% 0.01%
frame_dummy 5.66 0.0% 5.66 0.0% 0.0% 0.0% 0.03% 0.03%
get_dht 27.7 0.2% 41.79 0.2% 5260 0.1% 11872 0.2% 0.07% 0.04%
get_dqt 13.83 0.1% 18.37 0.1% 4578 0.1% 6730 0.1% 0.00% 0.01%
get_sof 2.35 0.0% 3.26 0.0% 2248 0.0% 2505 0.0% 0.02% 0.02%
get_sos 1.4 0.0% 1.91 0.0% 1207 0.0% 1371 0.0% 0.01% 0.01%
huff_make_dhuff_tb 115.04 0.7% 115.04 0.7% 15761 0.3% 15761 0.3% 0.40% 0.40%
IQuantize 502.41 2.9% 502.41 2.9% 199028 3.2% 199028 3.2% 0.38% 0.38%
IZigzagMatrix 497.81 2.8% 497.81 2.8% 165262 2.7% 165262 2.7% 0.14% 0.14%
jpeg_init_decompress 0.71 0.0% 115.75 0.7% 1075 0.0% 16836 0.3% 0.01% 0.38%
jpeg_read 0.14 0.0% 16530.2 93.8% 183 0.0% 5692492 92.5% 0.00% 1.33%
jpeg2bmp_main 1081.02 6.1% 17611.2 100.0% 462095 7.5% 6154750 100.0% 1.37% 0.04%
main 1.11 0.0% 17612.4 100.0% 249 0.0% 6155008 100.0% 0.00% 0.04%
next_marker 1.4 0.0% 2.49 0.0% 448 0.0% 947 0.0% 0.00% 0.00%
pgetc 367.35 2.1% 367.35 2.1% 74116 1.2% 74116 1.2% 0.88% 0.88%
PostshiftIDctMatrix 405.1 2.3% 405.1 2.3% 102051 1.7% 102051 1.7% 0.64% 0.64%
read_byte 19.35 0.1% 19.35 0.1% 9580 0.2% 9580 0.2% 0.05% 0.05%
read_dword 0.04 0.0% 0.04 0.0% 0 0.0% 0 0.0% 0.00% 0.00%
read_markers 4.69 0.0% 72.73 0.4% 1165 0.0% 25157 0.4% 0.01% 0.00%
read_word 1.87 0.0% 1.87 0.0% 328 0.0% 328 0.0% 0.01% 0.01%
Write4Blocks 28.7 0.2% 1244.3 7.1% 15925 0.3% 222711 3.6% 0.10% 3.44%
WriteBlock 0.92 0.0% 0.92 0.0% 0 0.0% 0 0.0% 0.01% 0.01%
WriteOneBlock 1215.6 6.9% 1215.6 6.9% 206786 3.4% 206786 3.4% 3.54% 3.54%
YuvToRgb 1926.62 10.9% 1926.62 10.9% 669720 10.9% 669720 10.9% 0.05% 0.05%
(a) gprof (b) LEAP (c) Difference
MIPS
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
main 115.82 100.0% 115.82 100.0% 43417 99.0% 43891 100.1% 1.01% 0.07%
legup_memset_4 0.0% 443 1.0% 435 1.0% 1.01% 0.99%
(a) gprof (b) LEAP (c) Difference
MOTION
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
decode_motion_vector 0.3 0.4% 0.3 0.4% 312 1.2% 312 1.2% 0.84% 0.84%
Fill_Buffer 0.11 0.1% 80 93.8% 406 1.5% 19952 75.9% 1.42% 17.88%
Flush_Buffer 2 2.3% 82.01 96.2% 1921 7.3% 21883 83.3% 4.97% 12.88%
Get_Bits 0.15 0.2% 59 69.2% 365 1.4% 1303 5.0% 1.21% 64.23%
Get_Bits1 0.1 0.1% 35.5 41.6% 109 0.4% 440 1.7% 0.30% 39.96%
Get_dmvector 0.02 0.0% 0.02 0.0% 0 0.0% 0 0.0% 0.02% 0.02%
Get_motion_code 0.28 0.3% 47.55 55.8% 430 1.6% 1076 4.1% 1.31% 51.67%
Initialize_Buffer 0.04 0.0% 11.76 13.8% 263 1.0% 21082 80.2% 0.95% 66.45%
main 1.68 2.0% 85.24 100.0% 1222 4.7% 26273 100.0% 2.68% 0.04%
motion_vector 0.17 0.2% 59.81 70.1% 740 2.8% 2757 10.5% 2.62% 59.65%
motion_vectors 0.2 0.2% 71.81 84.2% 866 3.3% 3970 15.1% 3.06% 69.10%
read 79.89 93.7% 79.89 93.7% 19561 74.5% 19546 74.4% 19.23% 19.29%
Show_Bits 0.33 0.4% 0.33 0.4% 77 0.3% 77 0.3% 0.09% 0.09%
(a) gprof (b) LEAP (c) Difference
SHA
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
frame_dummy 0.01 0.0% 0.01 0.0% 0.0% 0.0% 0.00% 0.00%
main 0.34 0.0% 4262.22 100.0% 362 0.0% 1147523 100.0% 0.02% 0.00%
memcpy 721.09 16.9% 721.09 16.9% 120372 10.5% 120393 10.5% 6.43% 6.43%
memset 1.85 0.0% 1.85 0.0% 248 0.0% 248 0.0% 0.02% 0.02%
sha_final 0.15 0.0% 15.65 0.4% 402 0.0% 4598 0.4% 0.03% 0.03%
sha_init 7.78 0.2% 7.78 0.2% 229 0.0% 229 0.0% 0.16% 0.16%
sha_stream 0.26 0.0% 4261.88 100.0% 255 0.0% 1147163 100.0% 0.02% 0.02%
sha_transform 3508.55 82.3% 3508.55 82.3% 1020909 89.0% 1020907 89.0% 6.65% 6.65%
sha_update 22.2 0.5% 4238.19 99.4% 4729 0.4% 1142081 99.5% 0.11% 0.09%
(a) gprof (b) LEAP (c) Difference
DHRYSTONE
Function | Self Time (s), % | Hier. Time (s), % | Self Cycles, % | Hier. Cycles, % | Self Time Diff. | Hier. Time Diff.
frame_dummy 0.14 0.1% 0.14 0.1% 0.0% 0.0% 0.14% 0.14%
Func_1 5.14 5.0% 5.14 5.0% 535 1.7% 535 1.7% 3.38% 3.38%
Func_2 2.18 2.1% 30.26 29.6% 1530 4.7% 9240 28.6% 2.60% 1.04%
Func_3 1.14 1.1% 1.14 1.1% 136 0.4% 136 0.4% 0.70% 0.70%
main 8.57 8.4% 102.02 99.9% 5493 17.0% 32262 99.8% 8.60% 0.08%
Proc_1 6 5.9% 12.15 11.9% 2377 7.4% 7093 21.9% 1.48% 10.05%
Proc_2 0.95 0.9% 0.95 0.9% 477 1.5% 477 1.5% 0.55% 0.55%
Proc_3 1.01 1.0% 1.95 1.9% 858 2.7% 978 3.0% 1.67% 1.12%
Proc_4 1.99 1.9% 1.99 1.9% 468 1.4% 468 1.4% 0.50% 0.50%
Proc_5 0.31 0.3% 0.31 0.3% 208 0.6% 208 0.6% 0.34% 0.34%
Proc_6 2.13 2.1% 3.27 3.2% 1234 3.8% 1379 4.3% 1.73% 1.06%
Proc_7 2.81 2.8% 2.81 2.8% 488 1.5% 460 1.4% 1.24% 1.33%
Proc_8 6.86 6.7% 6.86 6.7% 1984 6.1% 1984 6.1% 0.58% 0.58%
strcmp 30.32 29.7% 30.32 29.7% 9103 28.2% 9095 28.1% 1.52% 1.55%
strcpy 32.6 31.9% 32.6 31.9% 5176 16.0% 5176 16.0% 15.90% 15.90%
legup_memcpy_4 0.0% 0.0% 2261 7.0% 2261 7.0% 6.99% 6.99%
(a) gprof (b) LEAP (c) Difference
A.1.2 Full Results for LEAP vs. SnoopP
ADPCM
Function LEAP SnoopP Error LEAP SnoopP Error
main 5122 5122 0 1814 1829 -15
encode 62193 62204 -11 4245 4271 -26
decode 63863 63863 0 3656 3652 4
upzero 32614 32620 -6 948 942 6
filtez 14708 14708 0 695 691 4
uppol2 11209 11209 0 230 233 -3
quantl 14177 14175 2 737 723 14
filtep 3655 3655 0 56 55 1
uppol1 8097 8097 0 127 126 1
scalel 2902 2902 0 288 292 -4
logscl 3425 3425 0 234 234 0
logsch 3100 3112 -12 134 133 1
reset 2755 2768 -13 2332 2341 -9
abs 1155 1155 0 56 55 1
adpcm_main 4588 4588 0 1641 1626 15
Total 233563 233603 -40 17193 17203 -10
(a) Cycles (b) Stall Cycles
AES
Function LEAP SnoopP Error LEAP SnoopP Error
main 219 219 0 203 202 1
decrypt 1514 1500 14 955 941 14
AddRoundKey_InversMixColumn 26069 26058 11 1531 1508 23
encrypt 1916 1925 -9 1281 1294 -13
MixColumn_AddRoundKey 12544 12521 23 6420 6460 -40
KeySchedule 14079 14065 14 5405 5384 21
ByteSub_ShiftRow 8942 8970 -28 6593 6547 46
InversShiftRow_ByteSub 5010 5020 -10 2625 2634 -9
SubByte 1271 1271 0 78 71 7
AddRoundKey 1248 1228 20 467 481 -14
aes_main 1032 1035 -3 919 918 1
Total 73844 73812 32 26477 26440 37
(a) Cycles (b) Stall Cycles
BLOWFISH
Function LEAP SnoopP Error LEAP SnoopP Error
main 142 142 0 134 133 1
BF_cfb64_encrypt 166292 166354 -62 21059 21059 0
BF_encrypt 674439 674394 45 42075 42034 41
BF_set_key 18402 18397 5 3631 3628 3
memcpy 15085 15091 -6 6175 6172 3
blowfish_main 137899 137911 -12 22657 22656 1
Total 1012259 1012289 -30 95731 95682 49
(a) Cycles (b) Stall Cycles
DFADD
Function LEAP SnoopP Error LEAP SnoopP Error
main 3666 3649 17 1964 1986 -22
float64_add 3122 3118 4 369 368 1
subFloat64Sigs 4324 4335 -11 1327 1291 36
addFloat64Sigs 4714 4729 -15 1346 1356 -10
propagateFloat64NaN 1587 1594 -7 572 555 17
roundAndPackFloat64 2227 2233 -6 523 519 4
normalizeRoundAndPackFloat64 665 665 0 248 248 0
extractFloat64Exp 575 575 0 23 23 0
extractFloat64Sign 591 591 0 39 39 0
extractFloat64Frac 591 591 0 41 39 2
float64_is_signaling_nan 397 397 0 87 87 0
float64_is_nan 262 262 0 64 64 0
shift64RightJamming 1500 1485 15 390 403 -13
countLeadingZeros64 453 453 0 215 215 0
packFloat64 195 195 0 41 39 2
countLeadingZeros32 388 388 0 172 187 -15
float_raise 53 53 0 39 39 0
__lshrdi3 191 191 0 71 71 0
__ashldi3 318 303 15 104 112 -8
Total 25819 25807 12 7635 7641 -6
(a) Cycles (b) Stall Cycles
DFDIV
Function LEAP SnoopP Error LEAP SnoopP Error
main 1961 1981 -20 1136 1139 -3
float64_div 6251 6210 41 2050 2021 29
estimateDiv128To64 2789 2804 -15 953 952 1
mul64To128 1574 1582 -8 284 274 10
roundAndPackFloat64 1603 1603 0 443 449 -6
sub128 1110 1132 -22 239 238 1
propagateFloat64NaN 719 713 6 467 451 16
extractFloat64Sign 303 303 0 39 39 0
extractFloat64Exp 287 287 0 24 23 1
packFloat64 141 141 0 39 39 0
extractFloat64Frac 329 329 0 65 65 0
float64_is_signaling_nan 165 165 0 87 87 0
float64_is_nan 115 115 0 65 64 1
add128 0 0 0 0 0 0
normalizeFloat64Subnormal 0 0 0 0 0 0
float_raise 138 131 7 111 110 1
countLeadingZeros64 0 0 0 0 0 0
shift64RightJamming 0 0 0 0 0 0
countLeadingZeros32 0 0 0 0 0 0
__lshrdi3 0 0 0 0 0 0
__ashldi3 0 0 0 0 0 0
__udivdi3 4174 4190 -16 917 915 2
__udivsi3 1438 1438 0 187 171 16
__umodsi3 1368 1365 3 101 101 0
__udivmodsi4 55470 55453 17 595 600 -5
Total 79935 79942 -7 7802 7738 64
(a) Cycles (b) Stall Cycles
DFMUL
Function LEAP SnoopP Error LEAP SnoopP Error
main 1815 1827 -12 1074 1082 -8
float64_mul 5051 5077 -26 1855 1830 25
mul64To128 788 788 0 273 288 -15
propagateFloat64NaN 722 724 -2 488 488 0
roundAndPackFloat64 1212 1227 -15 479 465 14
extractFloat64Frac 289 289 0 49 49 0
extractFloat64Exp 287 287 0 47 47 0
extractFloat64Sign 279 279 0 40 39 1
packFloat64 129 129 0 39 39 0
float64_is_nan 120 120 0 71 71 0
float64_is_signaling_nan 171 171 0 87 87 0
normalizeFloat64Subnormal 0 0 0 0 0 0
float_raise 76 76 0 62 62 0
shift64RightJamming 0 0 0 0 0 0
countLeadingZeros64 0 0 0 0 0 0
countLeadingZeros32 0 0 0 0 0 0
__lshrdi3 0 0 0 0 0 0
__ashldi3 0 0 0 0 0 0
Total 10939 10994 -55 4564 4547 17
(a) Cycles (b) Stall Cycles
DFSIN
Function LEAP SnoopP Error LEAP SnoopP Error
main 2365 2379 -14 1291 1291 0
sin 19729 19718 11 1578 1578 0
float64_div 56192 56189 3 1557 1557 0
float64_add 16488 16482 6 304 322 -18
float64_mul 58336 58353 -17 1199 1201 -2
estimateDiv128To64 77938 77970 -32 35321 35321 0
subFloat64Sigs 30683 30696 -13 1176 1176 0
roundAndPackFloat64 158701 158717 -16 73201 73201 0
mul64To128 133633 133652 -19 94579 94579 0
extractFloat64Exp 13335 13335 0 39 39 0
int32_to_float64 13962 13950 12 225 225 0
normalizeRoundAndPackFloat64 51587 51582 5 41515 41515 0
float64_ge 18409 18405 4 14389 14389 0
extractFloat64Sign 13342 13342 0 46 46 0
float64_le 55222 55241 -19 31891 31891 0
addFloat64Sigs 14567 14569 -2 969 969 0
extractFloat64Frac 10128 10128 0 48 48 0
countLeadingZeros32 52029 52053 -24 38452 38452 0
packFloat64 6687 6687 0 39 39 0
countLeadingZeros64 27119 27105 14 22727 22727 0
sub128 30995 31001 -6 16755 16755 0
shift64RightJamming 94591 94609 -18 78640 78640 0
propagateFloat64NaN 0 0 0 0 0 0
float64_neg 1046 1046 0 830 830 0
float64_abs 12033 12041 -8 10425 10425 0
normalizeFloat64Subnormal 0 0 0 0 0 0
add128 0 0 0 0 0 0
shift64ExtraRightJamming 0 0 0 0 0 0
float_raise 0 0 0 0 0 0
float64_is_nan 0 0 0 0 0 0
float64_is_signaling_nan 0 0 0 0 0 0
__lshrdi3 20098 20114 -16 16633 16633 0
__ashldi3 36042 36016 26 26879 26879 0
__udivdi3 280963 280979 -16 204026 204026 0
__udivsi3 33760 33753 7 204 204 0
__umodsi3 38029 38027 2 6257 6257 0
__udivmodsi4 2472067 2472113 -46 122484 122484 0
Total 3850076 3850252 -176 843679 843699 -20
(a) Cycles (b) Stall Cycles
GSM
Function LEAP SnoopP Error LEAP SnoopP Error
main 6684 6668 16 1327 1322 5
Gsm_LPC_Analysis 212 212 0 198 182 16
Autocorrelation 28406 28416 -10 3415 3397 18
gsm_mult_r 3639 3639 0 71 71 0
Reflection_coefficients 3476 3474 2 1845 1835 10
gsm_abs 2596 2596 0 110 106 4
gsm_div 1531 1534 -3 119 132 -13
Quantization_and_coding 1250 1240 10 1017 1015 2
gsm_add 1744 1744 0 97 87 10
gsm_mult 187 187 0 75 71 4
Transformation_to_Log_Area_Ratios 615 615 0 364 359 5
gsm_norm 245 245 0 198 204 -6
legup_memset_2 0 0 0 0 0 0
legup_memset_4 161 176 -15 105 105 0
Total 50746 50746 0 8941 8886 55
(a) Cycles (b) Stall Cycles
JPEG
Function LEAP SnoopP Error LEAP SnoopP Error
buf_getb 1437809 1437809 0 527188 527188 0
DecodeHuffman 748738 748738 0 22722 22722 0
buf_getv 641506 641506 0 67232 67232 0
pgetc 74116 74116 0 10064 10064 0
read_byte 9580 9580 0 1117 1119 -2
WriteOneBlock 206786 206786 0 29098 29098 0
DecodeHuffMCU 547850 547850 0 109599 109598 1
decode_block 12492 12492 0 482 482 0
BoundIDctMatrix 203370 203370 0 126 126 0
ChenIDct 619944 619944 0 7203 7203 0
IQuantize 199028 199028 0 23484 23484 0
IZigzagMatrix 165262 165262 0 8158 8158 0
PostshiftIDctMatrix 102051 102051 0 62 62 0
YuvToRgb 669720 669720 0 2347 2347 0
Write4Blocks 15925 15925 0 1951 1951 0
read_word 328 328 0 135 120 15
next_marker 448 448 0 113 113 0
get_dht 5260 5260 0 993 993 0
huff_make_dhuff_tb 15761 15761 0 2794 2816 -22
get_dqt 4578 4578 0 1073 1079 -6
decode_start 5719 5719 0 1514 1514 0
first_marker 343 343 0 315 315 0
get_sof 2248 2248 0 1801 1800 1
get_sos 1207 1207 0 832 832 0
jpeg_init_decompress 1075 1075 0 932 932 0
jpeg_read 183 183 0 156 156 0
main 249 249 0 232 247 -15
read_markers 1165 1165 0 788 783 5
read_dword 0 0 0 0 0 0
WriteBlock 0 0 0 0 0 0
jpeg2bmp_main 462095 462098 -3 54716 54819 -103
Total 6,154,836 6,154,839 -3 877,227 877,353 -126
(a) Cycles (b) Stall Cycles
MIPS
Function LEAP SnoopP Error LEAP SnoopP Error
main 43404 43410 -6 2918 2945 -27
legup_memset_4 443 443 0 97 97 0
Total 43847 43853 -6 3015 3042 -27
(a) Cycles (b) Stall Cycles
MOTION
Function LEAP SnoopP Error LEAP SnoopP Error
main 1222 1227 -5 1021 1024 -3
Flush_Buffer 1921 1933 -12 911 898 13
Fill_Buffer 406 404 2 364 366 -2
read 19561 19561 0 3140 3131 9
motion_vectors 866 866 0 748 748 0
motion_vector 740 740 0 650 635 15
Get_Bits 365 362 3 222 222 0
Get_motion_code 430 436 -6 346 346 0
Get_Bits1 109 109 0 76 76 0
Initialize_Buffer 263 254 9 228 227 1
decode_motion_vector 312 312 0 248 264 -16
Show_Bits 77 77 0 29 44 -15
Get_dmvector 0 0 0 0 0 0
Total 26272 26281 -9 7983 7981 2
(a) Cycles (b) Stall Cycles
SHA
Function LEAP SnoopP Error LEAP SnoopP Error
main 362 362 0 330 291 39
sha_stream 255 255 0 223 223 0
sha_update 4729 4729 0 512 512 0
sha_transform 1020909 1020956 -47 3256 3244 12
memcpy 120372 120387 -15 23859 23839 20
sha_final 402 402 0 317 319 -2
memset 248 248 0 144 158 -14
sha_init 229 229 0 206 189 17
Total 1147506 1147568 -62 28847 28775 72
(a) Cycles (b) Stall Cycles
DHRYSTONE
Function LEAP SnoopP Error LEAP SnoopP Error
main 5493 5519 -26 2074 2112 -38
strcpy 5176 5177 -1 249 263 -14
strcmp 9103 9107 -4 267 252 15
Func_2 1530 1530 0 522 505 17
Proc_1 2377 2364 13 612 603 9
Proc_8 1984 1993 -9 400 399 1
Func_1 535 535 0 56 54 2
Proc_7 488 451 37 87 72 15
Proc_6 1234 1234 0 276 276 0
Proc_3 858 858 0 241 250 -9
Proc_4 468 468 0 108 108 0
Proc_2 477 477 0 79 79 0
Func_3 136 136 0 17 16 1
Proc_5 208 208 0 49 48 1
legup_memcpy_4 2261 2261 0 136 135 1
Total 32328 32318 10 5173 5172 1
(a) Cycles (b) Stall Cycles
A.1.3 Full Results for Power Overhead
Table A.1: Power overhead results for LEAP, measured in milliwatts, for 16, 32, and 64 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 16 32 64
CW 16 32 64 16 32 64 16 32 64
Scheme C S C S C S C S C S C S C S C S C S
adpcm 4.48 4.72 6.85 5.55 6.43 6.82 4.89 4.69 5.76 6.16 6.15 6.25 5.23 5.01 6.06 6.02 6.17 6.62
aes 4.31 4.61 6.67 5.41 6.29 6.69 4.74 4.60 5.57 6.05 5.94 6.17 5.02 4.87 5.88 5.90 5.97 6.46
blowfish 4.81 5.00 7.18 5.88 6.69 7.10 5.16 4.94 6.02 6.39 6.41 6.50 5.47 5.23 6.33 6.27 6.38 6.83
dfadd 4.55 4.84 6.98 5.66 6.55 6.95 5.00 4.82 5.86 6.30 6.25 6.39 5.29 5.13 6.15 6.18 6.25 6.75
dfdiv 4.52 4.76 6.92 5.59 6.48 6.85 4.95 4.73 5.85 6.21 6.22 6.29 5.31 5.06 6.11 6.06 6.24 6.74
dfmul 4.40 4.70 6.80 5.51 6.41 6.81 4.85 4.71 5.70 6.17 6.09 6.27 5.12 4.99 6.00 6.04 6.09 6.60
dfsin 4.39 4.65 6.78 5.47 6.36 6.74 4.83 4.64 5.71 6.11 6.07 6.20 5.16 4.97 5.98 5.96 6.12 6.62
gsm 4.44 4.74 6.81 5.54 6.45 6.85 4.89 4.72 5.72 6.19 6.08 6.30 5.14 5.03 6.03 6.06 6.12 6.58
jpeg 4.58 4.85 6.99 5.53 6.55 6.95 5.00 4.78 5.95 6.27 6.25 6.37 5.32 5.11 6.13 6.14 6.26 6.72
mips 4.66 5.01 7.03 5.73 6.62 7.02 5.08 4.86 5.95 6.35 6.30 6.44 5.42 5.20 6.26 6.20 6.35 6.78
motion 4.59 4.91 6.97 5.71 6.58 7.01 5.02 4.87 5.89 6.31 6.23 6.43 5.31 5.21 6.18 6.19 6.30 6.78
sha 4.79 5.20 7.16 5.86 6.70 7.08 5.15 4.94 6.11 6.39 6.40 6.49 5.52 5.27 6.34 6.28 6.44 6.87
dhrystone 4.63 4.91 7.03 5.72 6.62 6.03 5.08 4.87 5.93 6.36 6.27 6.46 5.34 5.20 6.20 6.25 6.32 6.78
average 4.55 4.84 6.94 5.63 6.52 6.84 4.97 4.78 5.85 6.25 6.20 6.35 5.28 5.10 6.13 6.12 6.23 6.70
Table A.2: Power overhead results for LEAP, measured in milliwatts, for 128 and 256 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 128 256
CW 16 32 64 16 32 64
Scheme C S C S C S C S C S C S
dfadd 5.04 4.85 6.72 6.60 6.30 6.16 5.04 5.21 5.89 5.93 6.16 7.09
dfmul 4.90 4.72 6.55 6.36 6.16 6.02 4.86 5.06 5.75 5.83 6.02 6.91
motion 5.03 4.88 6.73 6.64 6.27 6.16 5.06 5.22 5.98 5.00 6.11 7.15
average 4.99 4.82 6.67 6.53 6.24 6.11 4.99 5.16 5.87 5.59 6.10 7.05
Table A.3: Power overhead results for SnoopP, measured in milliwatts, for 16, 32, and 64 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 16 32 64
CW 16 32 64 16 32 64 16 32 64
Scheme C S C S C S C S C S C S C S C S C S
adpcm 2.90 2.61 3.94 4.20 5.99 3.86 6.46 5.89 6.97 5.71 9.38 5.47 9.37 9.00 11.54 9.01 16.46 10.93
aes 2.75 2.58 3.80 4.26 5.83 4.13 5.92 5.56 6.66 5.85 8.84 6.03 8.83 7.88 10.99 9.30 15.38 11.32
blowfish 2.98 2.71 4.01 4.30 6.04 3.96 6.21 5.69 7.13 5.89 9.20 5.72 9.77 8.43 11.91 9.43 16.12 10.55
dfadd 2.95 2.75 3.98 4.44 6.03 4.27 6.48 6.07 7.07 6.16 9.37 6.25 9.53 8.53 11.72 9.86 16.41 11.68
dfdiv 3.02 2.79 4.04 4.40 6.09 4.07 6.37 6.26 7.28 6.06 9.72 5.84 9.97 8.66 12.11 9.70 17.13 10.91
dfmul 2.85 2.69 3.89 4.40 5.92 4.30 5.96 5.94 6.86 6.11 8.87 6.35 9.11 8.27 11.31 9.75 16.02 11.88
dfsin 2.91 2.74 3.94 4.39 5.99 4.15 6.15 6.11 7.05 6.07 9.05 6.04 9.61 8.49 11.74 9.71 15.83 11.34
gsm 2.93 2.69 3.98 4.32 6.02 4.05 6.13 5.87 7.02 5.92 9.28 5.82 9.50 8.29 11.66 9.43 16.26 10.89
jpeg 2.95 2.74 3.97 4.37 6.02 4.11 6.26 5.75 7.16 6.07 9.18 5.97 9.72 8.53 11.88 9.67 15.98 11.05
mips 3.30 3.00 4.29 4.61 6.34 4.27 6.80 6.75 7.70 6.46 10.20 6.19 10.86 9.52 13.02 10.51 18.22 10.59
motion 3.08 2.90 4.11 4.59 6.15 4.39 6.41 6.53 7.29 6.38 9.80 6.44 10.00 9.02 12.18 10.33 17.31 12.09
sha 3.05 2.76 4.07 4.33 6.11 3.96 6.31 5.73 7.26 5.93 9.25 5.62 10.01 8.54 12.10 9.47 16.25 10.49
dhrystone 3.33 3.09 4.33 4.73 6.40 4.43 6.87 6.62 7.76 6.63 10.00 6.46 10.87 9.68 13.02 10.76 17.71 12.12
average 3.00 2.77 4.03 4.41 6.07 4.15 6.33 6.06 7.17 6.10 9.40 6.02 9.78 8.68 11.94 9.76 16.54 11.22
Table A.4: Power overhead results for SnoopP, measured in milliwatts, for 128 and 256 functions. Note: “C” refers to cycle profiling, “S” refers to stall-cycle profiling, “N” is the max number of functions profiled, and “CW” is the counter width.
N 128 256
CW 16 32 64 16 32 64
Scheme C S C S C S C S C S C S
dfadd 16.71 14.80 20.51 16.76 27.98 20.64 28.63 25.04 35.53 28.56 52.62 40.08
dfmul 15.87 14.24 19.64 16.51 27.12 21.01 26.63 23.73 33.36 27.78 50.09 40.09
motion 17.72 15.70 21.47 17.71 28.92 21.57 30.40 26.80 37.27 30.31 54.25 41.91
average 16.77 14.91 20.54 16.99 28.01 21.07 28.55 25.19 35.39 28.88 52.32 40.69
A.2 Full Results for Chapter 4: Energy Consumption Profiling
A.2.1 Full Results for Cache Stall Energy Estimates
ADPCM
Function | A | B | C | D | E | F | Cache Stalls | Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
abs 0 100 0 900 100 0 55 2,960.50
reset 0 0 382 48 0 0 2323 4,035.83
filtez 1200 3992 0 8800 0 0 715 35,573.37
filtep 400 0 0 2200 0 1000 55 9,986.15
quantl 993 1664 3126 6632 471 521 730 34,992.36
logscl 200 300 700 1891 100 0 234 8,382.52
scalel 0 200 600 1200 600 0 287 7,190.45
upzero 2496 6722 6839 14375 1200 0 946 79,712.37
uppol2 2792 1000 200 6198 586 200 233 27,070.45
uppol1 600 600 400 5371 1000 0 126 20,692.39
logsch 200 300 200 1668 100 496 133 8,083.42
decode 3746 10328 20271 22541 2050 1200 3682 155,220.31
encode 3992 9227 13358 29370 700 1300 4241 151,221.70
adpcm_main 0 356 958 1467 0 200 1615 9,799.36
main 0 901 302 1656 300 150 1806 10,846.07
Total 16,619 35,692 47,337 121,586 7,207 5,067 17,266 565,891
AES
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
encrypt 1 128 127 359 0 16 1279 3,242.21
ByteSub ShiftRow 0 440 780 880 0 320 6521 14,651.42
SubByte 0 80 240 560 160 160 86 3,359.29
KeySchedule 204 2861 1740 3668 2 190 5370 28,441.28
InversShiftRow ByteSub 0 451 715 892 0 320 2640 9,602.37
MixColumn AddRoundKey 43 2413 1396 1733 198 333 6441 23,544.89
AddRoundKey InversMixColumn 52 8356 5429 7257 1026 2421 1513 64,871.36
AddRoundKey 8 326 118 238 36 36 465 2,499.79
decrypt 1 119 108 327 0 16 981 2,710.19
aes main 0 5 64 51 0 0 915 1,470.95
main 0 3 5 9 0 0 202 300.42
Total 309 15,184 10,723 15,977 1,422 3,812 26,498 154,817
BLOWFISH
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
memcpy 2 2086 1042 5781 0 0 6173 30,508.65
BF encrypt 0 158114 196728 221319 0 56208 42034 1,677,869.71
BF cfb64 encrypt 0 24452 20486 60749 18868 20819 21194 417,796.34
BF set key 0 1822 2656 9479 147 593 3686 42,913.71
blowfish main 130 21455 796 71654 15860 5236 22889 332,437.50
main 0 1 2 6 0 0 133 192.55
Total 132 207,932 221,711 368,991 34,875 82,856 96,194 2,501,842
DFADD
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
shift64RightJamming 0 224 283 585 8 0 388 3,279.09
countLeadingZeros32 0 56 64 96 0 0 172 760.15
countLeadingZeros64 0 70 32 136 0 0 215 873.12
float raise 0 4 2 8 0 0 39 84.96
float64 is nan 0 105 24 69 0 0 64 564.17
float64 is signaling nan 0 56 132 122 0 0 94 898.47
propagateFloat64NaN 0 220 312 506 0 0 555 3,323.38
extractFloat64Frac 0 0 276 276 0 0 39 1,464.67
extractFloat64Exp 0 0 184 368 0 0 23 1,458.57
extractFloat64Sign 0 0 92 460 0 0 39 1,493.23
packFloat64 0 0 0 104 0 52 39 498.53
roundAndPackFloat64 0 414 522 750 0 18 534 4,968.14
normalizeRoundAndPackFloat64 0 48 116 245 8 0 263 1,403.25
addFloat64Sigs 24 610 1120 1504 28 98 1342 10,317.65
subFloat64Sigs 0 582 1028 1305 42 88 1294 9,398.98
float64 add 0 322 1065 1390 0 0 377 7,536.89
main 0 556 188 758 138 46 1992 6,828.17
lshrdi3 0 24 0 56 16 16 71 392.45
ashldi3 0 32 0 104 28 12 103 599.04
Total 24 3,325 5,441 8,845 268 330 7,728 56,266
DFDIV
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
sub128 0 240 118 394 120 0 238 2,528.47
mul64To128 160 180 120 861 0 0 268 3,628.84
estimateDiv128To64 0 420 488 914 0 0 987 5,847.33
float raise 0 6 3 12 0 0 110 193.05
float64 is nan 0 27 6 18 0 0 64 205.89
float64 is signaling nan 0 14 33 31 0 0 87 306.84
propagateFloat64NaN 0 55 84 120 0 0 451 1,226.40
extractFloat64Frac 0 0 132 132 0 0 65 759.54
extractFloat64Exp 0 0 88 176 0 0 23 712.86
extractFloat64Sign 0 0 44 220 0 0 39 740.08
packFloat64 0 0 0 68 0 34 39 343.16
roundAndPackFloat64 0 280 337 519 0 12 458 3,474.05
float64 div 0 761 1366 1989 14 88 2025 13,307.71
main 0 268 92 374 66 22 1185 3,601.89
udivdi3 48 786 1119 1116 72 96 915 9,313.76
udivsi3 0 603 428 240 0 0 171 3,298.16
umodsi3 0 585 439 240 0 0 102 3,196.33
udivmodsi4 0 20749 9751 21145 280 2848 621 138,782.39
Total 208 24,976 14,649 28,572 552 3,100 7,933 191,590
DFMUL
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
mul64To128 64 72 46 333 0 0 273 1,627.69
float raise 0 4 2 8 0 0 62 114.26
float64 is nan 0 26 6 17 0 0 71 209.88
float64 is signaling nan 0 16 36 32 0 0 87 321.52
propagateFloat64NaN 0 56 79 116 0 0 476 1,237.55
extractFloat64Frac 0 0 120 120 0 0 49 677.64
extractFloat64Exp 0 0 80 160 0 0 47 681.30
extractFloat64Sign 0 0 40 200 0 0 42 681.14
packFloat64 0 0 0 60 0 30 39 308.63
roundAndPackFloat64 0 184 217 330 0 8 474 2,463.48
float64 mul 0 575 1057 1540 4 32 1830 10,462.91
main 0 244 84 342 60 20 1084 3,290.03
Total 64 1,179 1,768 3,261 64 90 4,619 22,199
DFSIN
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
shift64RightJamming 0 2772 3234 9722 231 0 78701 140,991.99
sub128 0 3513 1752 7197 1752 0 16656 57,616.95
mul64To128 4752 5648 2496 26193 0 0 94754 218,031.13
estimateDiv128To64 0 9345 12636 20595 0 0 35302 152,202.19
countLeadingZeros32 0 3157 3824 6112 0 484 38459 83,503.61
countLeadingZeros64 0 1281 732 2379 0 0 22721 39,986.11
extractFloat64Frac 0 0 5040 5040 0 0 48 25,899.99
extractFloat64Exp 0 0 4432 8864 0 0 39 34,476.29
extractFloat64Sign 0 0 2216 11080 0 0 46 34,829.15
packFloat64 0 0 0 4432 0 2216 39 19,177.23
roundAndPackFloat64 0 19956 27847 36131 0 836 73170 306,416.55
normalizeRoundAndPackFloat64 0 1098 2013 6798 183 0 41501 78,867.23
int32 to float64 0 3553 2678 6988 536 0 225 35,068.72
addFloat64Sigs 84 2491 4500 6036 89 336 972 35,613.92
subFloat64Sigs 0 6,364 9,951 11,877 603 736 1,191 76,391.01
float64 add 0 1876 6128 8169 0 0 292 41,480.29
float64 mul 0 11795 18012 26330 0 1208 1194 146,921.82
float64 div 0 10381 17804 24297 267 1890 1547 141,286.79
float64 le 0 5360 5941 12079 0 0 31818 99,493.93
float64 ge 0 268 536 3216 0 0 14247 28,590.92
float64 neg 0 36 72 108 0 0 838 1,614.35
float64 abs 0 0 804 804 0 0 10451 17,436.48
sin 338 1736 4520 11560 0 0 1578 48,395.20
main 0 364 114 488 72 36 1285 4,369.11
lshrdi3 0 693 0 1617 462 462 16540 29,792.17
ashldi3 0 1405 0 4632 1293 611 26950 55,510.91
udivdi3 1068 21365 22975 27813 1602 2136 204219 453,453.01
udivsi3 0 18476 9709 5339 0 0 204 80,856.23
umodsi3 0 16697 9652 5340 0 0 6242 84,330.70
udivmodsi4 0 885250 420277 899521 16786 127290 122467 6,080,143.23
Total 6,242 1,034,882 599,896 1,200,760 23,876 138,241 843,781 8,652,870
GSM
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
gsm add 79 158 158 1104 79 79 87 4,420.06
gsm mult 8 8 16 76 0 8 71 391.21
gsm mult r 223 223 446 1561 0 1115 71 9,989.43
gsm abs 0 446 356 1504 90 90 110 6,577.01
gsm norm 0 10 12 24 2 0 197 372.80
gsm div 0 323 154 612 105 218 119 3,919.96
Autocorrelation 1405 812 6980 11279 1199 3175 3434 70,074.92
Reflection coefficients 39 132 235 836 118 263 1855 6,748.06
Transformation to Log Area Ratios 8 29 21 143 27 24 363 1,131.08
Quantization and coding 16 18 26 123 16 18 1023 1,870.07
Gsm LPC Analysis 0 3 7 20 0 0 182 308.97
main 0 657 322 2754 640 968 1351 16,362.96
legup memset 4 0 12 10 32 0 2 105 277.35
Total 1,778 2,833 8,744 20,071 2,276 5,960 9,053 122,567
JPEG
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
read byte 0 603 1206 6030 631 0 1117 23,522.31
read word 0 20 30 80 25 38 120 682.94
first marker 0 5 5 18 0 0 315 472.74
next marker 0 52 43 240 0 0 122 1,015.38
get sof 1 56 106 218 25 41 1790 3,458.37
get sos 1 50 93 200 16 15 828 2,026.50
get dht 16 516 1008 2644 8 76 987 12,236.50
get dqt 2 546 694 1873 2 388 1087 10,617.50
read markers 0 64 62 240 0 11 774 1,957.61
ChenIDct 129011 48960 44057 266226 112868 11520 7203 1,537,415.41
IZigzagMatrix 0 27648 36864 74160 0 18441 8149 423,033.82
IQuantize 55292 27652 27648 55728 0 9216 23511 444,672.79
PostshiftIDctMatrix 9216 18432 27648 37454 0 9216 62 258,277.35
BoundIDctMatrix 9216 36790 18447 129486 0 9305 122 520,926.38
WriteOneBlock 288 36864 1152 120178 1152 18054 29078 506,111.68
Write4Blocks 0 2146 4462 6718 0 720 1943 38,641.07
YuvToRgb 24672 60613 72995 447561 55292 6144 2344 1,729,816.63
pgetc 0 9148 18296 32026 4582 0 10079 176,900.12
buf getb 0 293289 194476 402265 0 20756 526837 2,958,965.09
buf getv 1,995 87,890 204,327 241,263 26,190 12,600 67,165 1,552,590.86
huff make dhuff tb 303 1971 1206 8412 0 1075 2804 37,478.92
DecodeHuffman 0 133512 175044 379172 5750 32256 22702 1,895,934.53
DecodeHuffMCU 15880 61639 99121 229894 144 31615 109551 1,270,764.75
decode block 0 864 4592 6199 144 144 482 31,272.85
decode start 0 555 917 2731 0 3 1504 12,690.62
jpeg init decompress 1 11 58 67 4 2 946 1,571.51
jpeg read 0 2 9 16 0 0 156 267.96
jpeg2bmp main 0 68936 42285 211695 79341 5207 54686 1,134,902.81
main 0 3 5 9 0 0 238 346.29
Total 245,894 918,839 976,857 2,662,806 286,174 186,843 876,787 14,588,695
MIPS
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
main 0 6316 7672 24387 172 1889 2947 108,528.60
legup memset 4 0 68 66 202 0 2 97 983.73
Total 0 6,386 7,739 24,592 172 1,891 3,129 109,636
MOTION
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
read 1 2049 0 10253 2048 2048 3147 48,468.90
Fill Buffer 0 5 20 17 0 0 364 569.81
Show Bits 0 6 12 24 6 0 29 160.97
Flush Buffer 30 149 277 490 57 0 895 3,683.25
Get Bits 0 15 42 78 0 5 222 644.37
Get Bits1 0 3 6 24 0 0 76 182.00
Get motion code 2 13 14 50 3 2 346 656.47
decode motion vector 2 12 7 28 6 0 252 460.75
motion vector 0 20 31 52 1 1 635 1,075.43
motion vectors 0 22 44 41 0 2 776 1,263.42
Initialize Buffer 0 1 13 13 0 0 232 364.51
main 0 53 39 96 8 4 1021 1,808.89
Total 35 2,350 506 11,169 2,129 2,062 8,080 59,462.06
SHA
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 94 134.75
memset 0 16 16 68 3 1 144 451.38
memcpy 258 16898 12798 24584 25344 16384 23892 293,333.64
sha transform 0 255210 135389 539703 20560 61680 3243 2,616,217.93
sha init 0 0 19 5 0 0 205 321.60
sha update 0 274 1061 2880 0 2 512 11,530.65
sha final 0 10 29 29 1 1 317 581.65
sha stream 0 3 11 18 0 0 223 365.86
main 0 17 14 17 0 0 312 516.16
Total 258 272,430 149,338 567,307 45,908 78,068 28,942 2,923,454
DHRYSTONE
Function A B C D E F Cache Stalls Energy (nJ)
wrap 0 2 1 3 0 0 85 123.28
strcpy 0 682 0 2882 682 682 260 13,729.38
strcmp 0 1519 23 6189 1069 40 248 23,373.74
Func 3 0 40 0 80 0 0 16 323.37
Proc 6 0 220 140 578 0 20 276 2,797.53
Proc 7 0 0 60 369 0 0 72 1,215.40
Proc 8 60 119 260 950 0 196 399 4,706.37
Func 1 0 60 120 300 0 0 55 1,298.23
Func 2 40 160 178 607 40 0 504 3,243.83
Proc 3 0 120 158 339 0 0 241 1,870.21
Proc 1 0 260 463 1029 0 0 623 5,258.31
Proc 2 0 60 40 258 40 0 79 1,132.73
Proc 4 0 80 80 180 20 0 108 1,052.11
Proc 5 0 0 40 100 0 20 48 491.67
main 81 484 1128 1572 104 60 2073 11,356.06
legup memcpy 4 0 520 260 1346 0 0 135 5,565.31
Total 181 4,326 2,951 16,782 1,955 1,018 5,222 77,538
A.2.2 Full Results for Pipeline Stall Energy Estimates
ADPCM
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
abs 0 0 299 748 0 0 108 2,978.35
reset 0 0 65 96 0 0 2594 5,025.29
filtez 0 2395 2195 3384 1196 0 5513 33,643.84
filtep 0 0 598 799 600 0 1658 8,438.64
quantl 0 2183 3364 2471 991 0 5120 32,289.59
logscl 0 399 1191 795 99 0 948 8,056.15
scalel 0 200 1591 798 0 0 298 7,158.11
upzero 0 4137 5376 7153 1848 0 14038 73,001.07
uppol2 0 1597 3392 1783 799 0 3652 25,952.73
uppol1 0 798 1984 2184 398 0 2733 18,820.20
logsch 0 400 1030 630 200 0 837 7,314.58
decode 0 6872 14744 13062 2547 0 26632 143,372.68
encode 0 7782 20665 12490 2642 0 18669 144,953.36
adpcm main 0 354 1383 773 200 0 1871 10,318.75
main 0 713 1049 875 149 0 2338 11,256.73
Total 0 27,831 58,927 135,138 11,669 0 87,097 532,741
AES
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
encrypt 0 93 279 162 16 0 1364 3,823.67
ByteSub ShiftRow 0 337 444 536 296 0 7334 17,255.80
SubByte 0 80 638 312 159 0 82 3,264.29
KeySchedule 80 1296 1120 2028 192 99 9264 28,854.28
InversShiftRow ByteSub 0 316 530 674 311 0 3181 10,478.86
MixColumn AddRoundKey 0 1262 1706 1917 259 0 7434 26,378.71
AddRoundKey InversMixColumn 0 4805 9934 6479 2424 0 2413 65,287.42
AddRoundKey 0 158 113 256 38 0 666 2,640.75
decrypt 0 84 255 135 16 0 1018 3,055.52
aes main 0 5 31 35 0 0 968 1,901.49
main 0 1 3 8 0 0 207 398.74
Total 80 8,438 15,054 12,542 3,711 99 34,019 163,500
BLOWFISH
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
memcpy 0 2086 3906 2100 0 0 6976 32,702.59
BF encrypt 0 99515 294703 180006 56207 0 44132 1,708,240.37
BF cfb64 encrypt 0 13254 42732 50993 19495 0 39836 406,465.08
BF set key 0 1672 4337 3365 663 0 8341 40,708.50
blowfish main 0 10923 49435 37223 15599 0 24712 341,822.71
main 0 1 0 4 0 0 137 256.17
Total 0 127,452 395,114 273,691 91,964 0 124,222 2,530,356
DFADD
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
shift64RightJamming 0 128 233 437 0 0 687 3,287.69
countLeadingZeros32 0 38 54 83 0 0 217 833.87
countLeadingZeros64 0 37 61 85 0 0 270 948.60
float raise 0 2 3 7 0 0 41 103.91
float64 is nan 0 24 69 77 0 0 92 601.32
float64 is signaling nan 0 24 79 156 0 0 138 923.01
propagateFloat64NaN 0 164 263 329 0 0 823 3,393.23
extractFloat64Frac 0 0 183 365 0 0 43 1,529.35
extractFloat64Exp 0 0 183 368 0 0 24 1,503.82
extractFloat64Sign 0 0 91 459 0 0 41 1,550.43
packFloat64 0 0 51 50 51 0 43 495.27
roundAndPackFloat64 0 265 458 550 18 0 939 4,977.42
normalizeRoundAndPackFloat64 0 48 139 140 0 0 338 1,440.19
addFloat64Sigs 8 462 724 1233 96 0 2190 10,425.61
subFloat64Sigs 14 392 635 1154 87 0 2028 9,520.49
float64 add 0 321 753 887 0 0 1161 7,103.72
main 0 324 598 531 46 0 2176 7,692.21
lshrdi3 0 23 48 37 0 0 75 407.25
ashldi3 0 28 82 61 0 0 125 658.18
Total 22 2,281 4,708 7,009 298 0 11,539 57,556
DFDIV
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
sub128 0 156 138 433 0 0 383 2,563.36
mul64To128 79 79 313 258 0 79 778 3,492.80
estimateDiv128To64 0 293 482 596 0 0 1413 6,013.39
float raise 0 3 5 10 0 0 113 247.10
float64 is nan 0 5 17 18 0 0 75 236.23
float64 is signaling nan 0 6 19 36 0 0 104 343.97
propagateFloat64NaN 0 36 64 76 0 0 534 1,397.58
extractFloat64Frac 0 0 87 154 0 0 88 793.76
extractFloat64Exp 0 0 88 174 0 0 25 738.96
extractFloat64Sign 0 0 43 219 0 0 41 776.72
packFloat64 0 0 34 34 32 0 41 347.80
roundAndPackFloat64 0 176 315 376 12 0 728 3,547.34
float64 div 0 500 1114 1470 85 0 3055 13,626.43
main 0 156 286 255 22 0 1274 4,097.91
udivdi3 0 474 894 688 143 0 1953 9,107.31
udivsi3 0 237 238 431 0 0 536 3,266.96
umodsi3 0 240 194 429 0 0 502 3,097.56
udivmodsi4 0 16556 2566 19549 1424 0 15280 129,169.32
Total 79 18,918 6,898 25,206 1,718 79 27,011 183,025
DFMUL
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
mul64To128 31 30 123 93 0 31 495 1,682.78
float raise 0 2 2 7 0 0 65 143.96
float64 is nan 0 5 16 18 0 0 81 244.35
float64 is signaling nan 0 6 17 44 0 0 104 360.71
propagateFloat64NaN 0 41 60 74 0 0 549 1,420.20
extractFloat64Frac 0 0 79 140 0 0 70 703.63
extractFloat64Exp 0 0 79 159 0 0 49 718.09
extractFloat64Sign 0 0 39 198 0 0 42 711.30
packFloat64 0 0 29 30 29 0 41 315.25
roundAndPackFloat64 0 116 198 241 8 0 641 2,581.46
float64 mul 0 374 789 1173 32 0 2674 10,875.01
main 0 142 260 232 20 0 1199 3,798.68
Total 31 717 1,692 2,409 89 31 6,098 23,716
DFSIN
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
shift64RightJamming 0 1848 3446 5011 0 0 84327 176,110.31
sub128 0 2333 1811 5730 0 0 21025 62,802.32
mul64To128 2074 1308 8715 6059 0 2074 113611 254,515.93
estimateDiv128To64 0 6669 10405 12002 0 0 48831 160,759.75
countLeadingZeros32 0 1987 2889 4615 484 0 42039 100,420.04
countLeadingZeros64 0 915 1097 1828 0 0 23249 51,071.07
extractFloat64Frac 0 0 3358 6147 0 0 623 26,270.23
extractFloat64Exp 0 0 4431 8862 0 0 42 35,324.12
extractFloat64Sign 0 0 2216 11077 0 0 49 35,798.43
packFloat64 0 0 2216 2214 2215 0 42 18,390.51
roundAndPackFloat64 0 11913 21526 27004 836 0 96710 329,051.91
normalizeRoundAndPackFloat64 0 731 2562 2745 0 0 45534 96,348.83
int32 to float64 0 1338 5847 3215 0 0 3581 32,871.20
addFloat64Sigs 84 1795 3122 5078 334 0 4130 34,262.09
subFloat64Sigs 90 3,800 6,000 11,385 735 0 8,706 72,613.20
float64 add 0 1872 4467 5273 0 0 4853 38,488.92
float64 mul 0 7275 13803 22779 1207 0 13490 141,002.24
float64 div 0 6614 14975 19401 1888 0 13309 134,979.52
float64 le 0 2752 5960 6572 0 0 39895 109,951.01
float64 ge 0 268 1716 1340 0 0 14952 35,093.94
float64 neg 0 0 107 73 0 0 863 1,998.13
float64 abs 0 0 305 840 0 0 10918 22,416.40
sin 0 1734 8159 5283 268 0 4299 47,306.39
main 0 183 396 328 36 0 1416 4,929.44
lshrdi3 0 462 693 1155 0 0 17432 36,864.98
ashldi3 0 990 3161 2697 0 0 28340 67,824.10
udivdi3 0 10129 16019 12364 2937 0 239711 531,273.10
udivsi3 0 5340 5338 9609 0 0 13443 75,693.16
umodsi3 0 5072 4272 9611 0 0 18964 82,195.32
udivmodsi4 0 715228 105213 833886 63912 0 753291 5,709,598.97
Total 2,248 792,557 264,226 1,044,183 74,852 2,074 1,667,763 8,526,386
GSM
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
gsm add 0 237 156 856 79 0 416 4,244.40
gsm mult 0 8 15 26 15 0 123 390.54
gsm mult r 0 222 668 891 445 0 1413 8,470.80
gsm abs 0 355 444 1062 90 0 645 6,239.74
gsm norm 0 6 9 18 2 0 210 463.93
gsm div 0 323 168 578 217 0 245 3,831.53
Autocorrelation 0 812 5320 7374 2361 0 12398 64,458.12
Reflection coefficients 0 139 650 562 47 0 2073 7,301.83
Transformation to Log Area Ratios 0 36 69 84 7 0 419 1,249.20
Quantization and coding 0 17 83 53 0 0 1094 2,332.22
Gsm LPC Analysis 0 37 72 89 7 0 477 1,375.51
main 0 320 1923 1752 805 0 1857 16,072.95
legup memset 4 0 9 24 13 2 0 113 322.88
Total 0 2,522 9,602 13,358 4,077 0 21,571 116,915
JPEG
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
read byte 0 603 1808 5383 602 0 1198 24,531.19
read word 0 10 39 75 39 0 150 709.52
first marker 0 3 4 11 0 0 339 648.14
next marker 0 50 86 127 0 0 185 1,004.88
get sof 0 42 183 98 18 0 1922 4,286.66
get sos 0 604 1808 5377 600 0 1317 24,722.15
get dht 0 515 1999 1437 82 0 1236 12,559.63
get dqt 0 544 1441 681 372 0 1545 10,595.25
read markers 0 72 240 160 23 0 2119 5,032.22
ChenIDct 0 57912 214544 107384 48376 0 191699 1,451,413.10
IZigzagMatrix 0 27647 64308 27934 18432 0 26938 404,951.28
IQuantize 0 36863 36244 27719 18432 0 79753 448,731.02
PostshiftIDctMatrix 0 27648 27790 18718 9216 0 18670 245,478.29
BoundIDctMatrix 0 46003 37169 55611 9305 0 55282 476,886.79
WriteOneBlock 0 36287 78610 55669 2411 0 33805 499,798.27
Write4Blocks 0 1583 1446 4837 144 0 7906 34,898.29
YuvToRgb 0 60610 173445 159234 30815 0 245628 1,537,795.97
pgetc 0 9147 17881 22876 4580 0 19641 176,968.37
buf getb 0 124534 287817 252246 20756 0 752268 3,093,447.70
buf getv 0 84,462 137,450 196,352 7,652 0 215,473 1,479,462.47
huff make dhuff tb 0 2156 4295 3290 1072 0 4965 36,766.45
DecodeHuffman 0 133244 151886 255078 32120 0 176365 1,792,712.27
DecodeHuffMCU 0 73468 152048 116265 32188 0 173975 1,273,775.99
decode block 0 864 4744 3013 144 0 3741 29,179.98
decode start 0 413 1634 919 0 0 2729 12,397.24
jpeg init decompress 0 7 50 36 1 0 988 1,995.06
jpeg read 0 1 8 9 0 0 165 339.54
jpeg2bmp main 0 52007 142434 120623 37064 0 110147 1,113,682.89
main 0 1 3 7 0 0 238 451.01
Total 0 777,301 1,541,415 1,441,169 274,444 0 2,130,475 14,195,382
MIPS
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
main 0 5800 16235 10416 1879 0 9019 104,185.73
legup memset 4 0 65 193 69 1 0 105 1,011.57
Total 0 5,866 16,429 10,485 1,880 0 9,212 105,358
MOTION
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
read 0 2049 8064 4104 2048 0 3284 48,186.70
Fill Buffer 0 4 9 7 0 0 386 735.48
Show Bits 0 6 18 23 0 0 30 174.83
Flush Buffer 0 125 340 315 30 0 1085 4,014.60
Get Bits 0 20 38 40 0 0 264 718.59
Get Bits1 0 3 11 14 0 0 81 216.30
Get motion code 0 13 16 24 0 0 377 804.08
decode motion vector 0 9 18 18 0 0 264 583.15
motion vector 0 15 18 27 1 0 692 1,383.59
motion vectors 0 12 7 38 2 0 812 1,594.85
Initialize Buffer 0 1 11 6 0 0 236 464.84
main 0 24 55 59 3 0 1089 2,294.54
Total 0 2,282 8,606 4,675 2,084 0 8,688 61,332.44
SHA
Function A B C D E F Pipeline Stalls Energy (nJ)
wrap 0 1 1 0 0 0 88 160.89
memset 0 14 46 25 1 0 162 506.19
memcpy 0 4866 6143 42240 32767 0 34159 301,618.58
sha transform 0 128736 464374 288781 61678 0 72190 2,564,198.26
sha init 0 0 7 9 0 0 213 419.86
sha update 0 269 2339 1052 2 0 1081 11,282.42
sha final 0 7 15 19 1 0 354 736.39
sha stream 0 3 10 8 0 0 234 468.82
main 0 6 14 15 0 0 327 669.77
Total 0 133,902 472,949 332,149 94,449 0 108,808 2,880,061
A.2.3 Full Results for Energy/Time Correlation
ADPCM
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
abs 1155 0.5% 2,960.50 0.5% 2,978.35 0.6%
reset 2753 1.2% 4,035.83 0.7% 5,025.29 0.9%
filtez 14707 6.3% 35,573.37 6.3% 33,643.84 6.3%
filtep 3655 1.6% 9,986.15 1.8% 8,438.64 1.6%
quantl 14137 6.1% 34,992.36 6.2% 32,289.59 6.1%
logscl 3425 1.5% 8,382.52 1.5% 8,056.15 1.5%
scalel 2887 1.2% 7,190.45 1.3% 7,158.11 1.3%
upzero 32578 14.0% 79,712.37 14.1% 73,001.07 13.7%
uppol2 11209 4.8% 27,070.45 4.8% 25,952.73 4.9%
uppol1 8097 3.5% 20,692.39 3.7% 18,820.20 3.5%
logsch 3097 1.3% 8,083.42 1.4% 7,314.58 1.4%
decode 63818 27.3% 155,220.31 27.4% 143,372.68 26.9%
encode 62188 26.6% 151,221.70 26.7% 144,953.36 27.2%
adpcm main 4596 2.0% 9,799.36 1.7% 10,318.75 1.9%
main 5115 2.2% 10,846.07 1.9% 11,256.73 2.1%
Total 233508 100.0% 565,891 100.0% 532,741 100.0%
AES
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.1% 123.28 0.1% 160.89 0.1%
encrypt 1910 2.6% 3,242.21 2.1% 3,823.67 2.3%
ByteSub ShiftRow 8941 12.1% 14,651.42 9.5% 17,255.80 10.6%
SubByte 1286 1.7% 3,359.29 2.2% 3,264.29 2.0%
KeySchedule 14035 19.0% 28,441.28 18.4% 28,854.28 17.6%
InversShiftRow ByteSub 5018 6.8% 9,602.37 6.2% 10,478.86 6.4%
MixColumn AddRoundKey 12557 17.0% 23,544.89 15.2% 26,378.71 16.1%
AddRoundKey InversMixColumn 26054 35.2% 64,871.36 41.9% 65,287.42 39.9%
AddRoundKey 1227 1.7% 2,499.79 1.6% 2,640.75 1.6%
decrypt 1552 2.1% 2,710.19 1.8% 3,055.52 1.9%
aes main 1035 1.4% 1,470.95 1.0% 1,901.49 1.2%
main 219 0.3% 300.42 0.2% 398.74 0.2%
Total 73925 100.0% 154,817 100.0% 163,500 100.0%
BLOWFISH
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
memcpy 15084 1.5% 30,508.65 1.2% 32,702.59 1.3%
BF encrypt 674403 66.6% 1,677,869.71 67.1% 1,708,240.37 67.5%
BF cfb64 encrypt 166568 16.4% 417,796.34 16.7% 406,465.08 16.1%
BF set key 18383 1.8% 42,913.71 1.7% 40,708.50 1.6%
blowfish main 138020 13.6% 332,437.50 13.3% 341,822.71 13.5%
main 142 0.0% 192.55 0.0% 256.17 0.0%
Total 1012691 100.0% 2,501,842 100.0% 2,530,356 100.0%
DFADD
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.4% 123.28 0.2% 160.89 0.3%
shift64RightJamming 1488 5.7% 3,279.09 5.8% 3,287.69 5.7%
countLeadingZeros32 388 1.5% 760.15 1.4% 833.87 1.4%
countLeadingZeros64 453 1.7% 873.12 1.6% 948.60 1.6%
float raise 53 0.2% 84.96 0.2% 103.91 0.2%
float64 is nan 262 1.0% 564.17 1.0% 601.32 1.0%
float64 is signaling nan 404 1.6% 898.47 1.6% 923.01 1.6%
propagateFloat64NaN 1593 6.1% 3,323.38 5.9% 3,393.23 5.9%
extractFloat64Frac 591 2.3% 1,464.67 2.6% 1,529.35 2.7%
extractFloat64Exp 575 2.2% 1,458.57 2.6% 1,503.82 2.6%
extractFloat64Sign 591 2.3% 1,493.23 2.7% 1,550.43 2.7%
packFloat64 195 0.8% 498.53 0.9% 495.27 0.9%
roundAndPackFloat64 2238 8.6% 4,968.14 8.8% 4,977.42 8.6%
normalizeRoundAndPackFloat64 680 2.6% 1,403.25 2.5% 1,440.19 2.5%
addFloat64Sigs 4726 18.2% 10,317.65 18.3% 10,425.61 18.1%
subFloat64Sigs 4339 16.7% 9,398.98 16.7% 9,520.49 16.5%
float64 add 3154 12.1% 7,536.89 13.4% 7,103.72 12.3%
main 3678 14.2% 6,828.17 12.1% 7,692.21 13.4%
lshrdi3 183 0.7% 392.45 0.7% 407.25 0.7%
ashldi3 279 1.1% 599.04 1.1% 658.18 1.1%
Total 25961 100.0% 56,266 100.0% 57,556 100.0%
DFDIV
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.1% 123.28 0.1% 160.89 0.1%
sub128 1110 1.4% 2,528.47 1.3% 2,563.36 1.4%
mul64To128 1589 2.0% 3,628.84 1.9% 3,492.80 1.9%
estimateDiv128To64 2809 3.5% 5,847.33 3.1% 6,013.39 3.3%
float raise 131 0.2% 193.05 0.1% 247.10 0.1%
float64 is nan 115 0.1% 205.89 0.1% 236.23 0.1%
float64 is signaling nan 165 0.2% 306.84 0.2% 343.97 0.2%
propagateFloat64NaN 710 0.9% 1,226.40 0.6% 1,397.58 0.8%
extractFloat64Frac 329 0.4% 759.54 0.4% 793.76 0.4%
extractFloat64Exp 287 0.4% 712.86 0.4% 738.96 0.4%
extractFloat64Sign 303 0.4% 740.08 0.4% 776.72 0.4%
packFloat64 141 0.2% 343.16 0.2% 347.80 0.2%
roundAndPackFloat64 1606 2.0% 3,474.05 1.8% 3,547.34 1.9%
float64 div 6243 7.8% 13,307.71 6.9% 13,626.43 7.4%
main 2007 2.5% 3,601.89 1.9% 4,097.91 2.2%
udivdi3 4152 5.2% 9,313.76 4.9% 9,107.31 5.0%
udivsi3 1442 1.8% 3,298.16 1.7% 3,266.96 1.8%
umodsi3 1366 1.7% 3,196.33 1.7% 3,097.56 1.7%
udivmodsi4 55394 69.3% 138,782.39 72.4% 129,169.32 70.6%
Total 79990 100.0% 191,590 100.0% 183,025 100.0%
DFMUL
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.8% 123.28 0.6% 160.89 0.7%
mul64To128 788 7.1% 1,627.69 7.3% 1,682.78 7.1%
float raise 76 0.7% 114.26 0.5% 143.96 0.6%
float64 is nan 120 1.1% 209.88 0.9% 244.35 1.0%
float64 is signaling nan 171 1.5% 321.52 1.4% 360.71 1.5%
propagateFloat64NaN 727 6.6% 1,237.55 5.6% 1,420.20 6.0%
extractFloat64Frac 289 2.6% 677.64 3.1% 703.63 3.0%
extractFloat64Exp 287 2.6% 681.30 3.1% 718.09 3.0%
extractFloat64Sign 282 2.6% 681.14 3.1% 711.30 3.0%
packFloat64 129 1.2% 308.63 1.4% 315.25 1.3%
roundAndPackFloat64 1213 11.0% 2,463.48 11.1% 2,581.46 10.9%
float64 mul 5038 45.6% 10,462.91 47.1% 10,875.01 45.9%
main 1834 16.6% 3,290.03 14.8% 3,798.68 16.0%
Total 11045 100.0% 22,199 100.0% 23,716 100.0%
DFSIN
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
shift64RightJamming 94660 2.5% 140,991.99 1.6% 176,110.31 2.1%
sub128 30870 0.8% 57,616.95 0.7% 62,802.32 0.7%
mul64To128 133843 3.5% 218,031.13 2.5% 254,515.93 3.0%
estimateDiv128To64 77878 2.0% 152,202.19 1.8% 160,759.75 1.9%
countLeadingZeros32 52036 1.4% 83,503.61 1.0% 100,420.04 1.2%
countLeadingZeros64 27113 0.7% 39,986.11 0.5% 51,071.07 0.6%
extractFloat64Frac 10128 0.3% 25,899.99 0.3% 26,270.23 0.3%
extractFloat64Exp 13335 0.3% 34,476.29 0.4% 35,324.12 0.4%
extractFloat64Sign 13342 0.3% 34,829.15 0.4% 35,798.43 0.4%
packFloat64 6687 0.2% 19,177.23 0.2% 18,390.51 0.2%
roundAndPackFloat64 157940 4.1% 306,416.55 3.5% 329,051.91 3.9%
normalizeRoundAndPackFloat64 51593 1.3% 78,867.23 0.9% 96,348.83 1.1%
int32 to float64 13980 0.4% 35,068.72 0.4% 32,871.20 0.4%
addFloat64Sigs 14508 0.4% 35,613.92 0.4% 34,262.09 0.4%
subFloat64Sigs 30722 0.8% 76,391.01 0.9% 72,613.20 0.9%
float64 add 16465 0.4% 41,480.29 0.5% 38,488.92 0.5%
float64 mul 58539 1.5% 146,921.82 1.7% 141,002.24 1.7%
float64 div 56186 1.5% 141,286.79 1.6% 134,979.52 1.6%
float64 le 55198 1.4% 99,493.93 1.1% 109,951.01 1.3%
float64 ge 18267 0.5% 28,590.92 0.3% 35,093.94 0.4%
float64 neg 1054 0.0% 1,614.35 0.0% 1,998.13 0.0%
float64 abs 12059 0.3% 17,436.48 0.2% 22,416.40 0.3%
sin 19732 0.5% 48,395.20 0.6% 47,306.39 0.6%
main 2359 0.1% 4,369.11 0.1% 4,929.44 0.1%
lshrdi3 19774 0.5% 29,792.17 0.3% 36,864.98 0.4%
ashldi3 34891 0.9% 55,510.91 0.6% 67,824.10 0.8%
udivdi3 281178 7.3% 453,453.01 5.2% 531,273.10 6.2%
udivsi3 33728 0.9% 80,856.23 0.9% 75,693.16 0.9%
umodsi3 37931 1.0% 84,330.70 1.0% 82,195.32 1.0%
udivmodsi4 2471591 64.2% 6,080,143.23 70.3% 5,709,598.97 67.0%
Total 3847678 100.0% 8,652,870 100.0% 8,526,386 100.0%
GSM
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.2% 123.28 0.1% 160.89 0.1%
gsm add 1744 3.4% 4,420.06 3.6% 4,244.40 3.6%
gsm mult 187 0.4% 391.21 0.3% 390.54 0.3%
gsm mult r 3639 7.2% 9,989.43 8.2% 8,470.80 7.2%
gsm abs 2596 5.1% 6,577.01 5.4% 6,239.74 5.3%
gsm norm 245 0.5% 372.80 0.3% 463.93 0.4%
gsm div 1531 3.0% 3,919.96 3.2% 3,831.53 3.3%
Autocorrelation 28284 55.8% 70,074.92 57.2% 64,458.12 55.1%
Reflection coefficients 3478 6.9% 6,748.06 5.5% 7,301.83 6.2%
Transformation to Log Area Ratios 615 1.2% 1,131.08 0.9% 1,249.20 1.1%
Quantization and coding 1240 2.4% 1,870.07 1.5% 2,332.22 2.0%
Gsm LPC Analysis 212 0.4% 308.97 0.3% 1,375.51 1.2%
main 6692 13.2% 16,362.96 13.4% 16,072.95 13.7%
legup memset 4 161 0.3% 277.35 0.2% 322.88 0.3%
Total 50715 100.0% 122,567 100.0% 116,915 100.0%
JPEG
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.0% 123.28 0.0% 160.89 0.0%
read byte 9587 0.2% 23,522.31 0.2% 24,531.19 0.2%
read word 313 0.0% 682.94 0.0% 709.52 0.0%
first marker 343 0.0% 472.74 0.0% 648.14 0.0%
next marker 457 0.0% 1,015.38 0.0% 1,004.88 0.0%
get sof 2237 0.0% 3,458.37 0.0% 4,286.66 0.0%
get sos 1203 0.0% 2,026.50 0.0% 24,722.15 0.2%
get dht 5255 0.1% 12,236.50 0.1% 12,559.63 0.1%
get dqt 4592 0.1% 10,617.50 0.1% 10,595.25 0.1%
read markers 1151 0.0% 1,957.61 0.0% 5,032.22 0.0%
ChenIDct 619845 10.1% 1,537,415.41 10.5% 1,451,413.10 10.2%
IZigzagMatrix 165262 2.7% 423,033.82 2.9% 404,951.28 2.9%
IQuantize 199047 3.2% 444,672.79 3.0% 448,731.02 3.2%
PostshiftIDctMatrix 102028 1.7% 258,277.35 1.8% 245,478.29 1.7%
BoundIDctMatrix 203366 3.3% 520,926.38 3.6% 476,886.79 3.4%
WriteOneBlock 206766 3.4% 506,111.68 3.5% 499,798.27 3.5%
Write4Blocks 15989 0.3% 38,641.07 0.3% 34,898.29 0.2%
YuvToRgb 669621 10.9% 1,729,816.63 11.9% 1,537,795.97 10.8%
pgetc 74131 1.2% 176,900.12 1.2% 176,968.37 1.2%
buf getb 1437623 23.4% 2,958,965.09 20.3% 3,093,447.70 21.8%
buf getv 641430 10.4% 1,552,590.86 10.6% 1,479,462.47 10.4%
huff make dhuff tb 15771 0.3% 37,478.92 0.3% 36,766.45 0.3%
DecodeHuffman 748436 12.2% 1,895,934.53 13.0% 1,792,712.27 12.6%
DecodeHuffMCU 547844 8.9% 1,270,764.75 8.7% 1,273,775.99 9.0%
decode block 12425 0.2% 31,272.85 0.2% 29,179.98 0.2%
decode start 5710 0.1% 12,690.62 0.1% 12,397.24 0.1%
jpeg init decompress 1089 0.0% 1,571.51 0.0% 1,995.06 0.0%
jpeg read 183 0.0% 267.96 0.0% 339.54 0.0%
jpeg2bmp main 462150 7.5% 1,134,902.81 7.8% 1,113,682.89 7.8%
main 255 0.0% 346.29 0.0% 451.01 0.0%
Total 6154200 100.0% 14,588,695 100.0% 14,195,382 100.0%
MIPS
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.2% 123.28 0.1% 160.89 0.2%
main 43383 98.8% 108,528.60 99.0% 104,185.73 98.9%
legup memset 4 435 1.0% 983.73 0.9% 1,011.57 1.0%
Total 43909 100.0% 109,636 100.0% 105,358 100.0%
MOTION
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.3% 123.28 0.2% 160.89 0.3%
read 19546 74.2% 48,468.90 81.5% 48,186.70 78.6%
Fill Buffer 406 1.5% 569.81 1.0% 735.48 1.2%
Show Bits 77 0.3% 160.97 0.3% 174.83 0.3%
Flush Buffer 1898 7.2% 3,683.25 6.2% 4,014.60 6.5%
Get Bits 362 1.4% 644.37 1.1% 718.59 1.2%
Get Bits1 109 0.4% 182.00 0.3% 216.30 0.4%
Get motion code 430 1.6% 656.47 1.1% 804.08 1.3%
decode motion vector 307 1.2% 460.75 0.8% 583.15 1.0%
motion vector 740 2.8% 1,075.43 1.8% 1,383.59 2.3%
motion vectors 885 3.4% 1,263.42 2.1% 1,594.85 2.6%
Initialize Buffer 259 1.0% 364.51 0.6% 464.84 0.8%
main 1221 4.6% 1,808.89 3.0% 2,294.54 3.7%
Total 26331 100.0% 59,462.06 100.0% 61,332.44 100.0%
SHA
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 100 0.0% 134.75 0.0% 160.89 0.0%
memset 248 0.0% 451.38 0.0% 506.19 0.0%
memcpy 120158 10.5% 293,333.64 10.0% 301,618.58 10.5%
sha transform 1015785 88.9% 2,616,217.93 89.5% 2,564,198.26 89.0%
sha init 229 0.0% 321.60 0.0% 419.86 0.0%
sha update 4729 0.4% 11,530.65 0.4% 11,282.42 0.4%
sha final 387 0.0% 581.65 0.0% 736.39 0.0%
sha stream 255 0.0% 365.86 0.0% 468.82 0.0%
main 360 0.0% 516.16 0.0% 669.77 0.0%
Total 1142251 100.0% 2,923,454 100.0% 2,880,061 100.0%
DHRYSTONE
Function Cycles (%) cStall Energy (%) pStall Energy (%)
wrap 91 0.3% 123.28 0.2% 176.85 0.2%
strcpy 5188 16.0% 13,729.38 17.7% 13,341.05 17.4%
strcmp 9088 28.0% 23,373.74 30.1% 22,661.35 29.6%
Func 3 136 0.4% 323.37 0.4% 336.71 0.4%
Proc 6 1234 3.8% 2,797.53 3.6% 2,854.12 3.7%
Proc 7 501 1.5% 1,215.40 1.6% 1,080.85 1.4%
Proc 8 1984 6.1% 4,706.37 6.1% 4,503.72 5.9%
Func 1 535 1.6% 1,298.23 1.7% 1,236.44 1.6%
Func 2 1529 4.7% 3,243.83 4.2% 3,334.40 4.4%
Proc 3 858 2.6% 1,870.21 2.4% 1,950.79 2.5%
Proc 1 2375 7.3% 5,258.31 6.8% 5,160.73 6.7%
Proc 2 477 1.5% 1,132.73 1.5% 1,139.33 1.5%
Proc 4 468 1.4% 1,052.11 1.4% 1,145.54 1.5%
Proc 5 208 0.6% 491.67 0.6% 496.31 0.6%
main 5502 17.0% 11,356.06 14.6% 11,548.83 15.1%
legup memcpy 4 2261 7.0% 5,565.31 7.2% 5,553.35 7.3%
Total 32435 100.0% 77,538 100.0% 76,520 100.0%
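Each percentage column in these tables is simply a function's count divided by the benchmark total, rounded to one decimal place. As a worked check against two Dhrystone rows above (an illustrative Python sketch, not part of the thesis tooling):

```python
# Reproduce the percentage columns from raw counts (two Dhrystone cycle rows).
cycles = {"strcmp": 9088, "main": 5502}
total = 32435  # Dhrystone cycle total from the table

def pct(count, total):
    """Share of the benchmark total, rounded to one decimal as in the tables."""
    return round(100.0 * count / total, 1)

print(pct(cycles["strcmp"], total), pct(cycles["main"], total))  # 28.0 17.0
```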
Appendix B
Source Code
B.1 Profiling Definitions

`define STACK_ADDR 'hFFDFFC
`define PROF_ADDR  'hFFE000
`define S   32 // Address Stack Size
`define S2   5 // log2(stack depth)
`define BUFF_DEPTH 4 // Buffer Depth
`define N   64 // Number of Functions
`define N2   4 // log2(`N)
`define ICW 32 // instruction counter width
`define CCW 32 // cycle counter width
`define SCW 32 // stall cycle counter width
`define PW  22 // power counter individual width
`define PSW 28 // power stall counter width
`define PGROUPS 6 // number of groupings for power
`define PCW (`PW*`PGROUPS+`PSW) // power counter [total] width
`define PROF_TYPE "L"   // Choose profiler (L -- LEAP, S -- SnoopP)
`define PROF_METHOD "c" // Choose profiling method (i -- instruction count,
                        //   c -- cycle count, s -- stall cycle count,
                        //   p -- power count)
`define CW ((`PROF_METHOD=="i") ? `ICW : (`PROF_METHOD=="c") ? `CCW : (`PROF_METHOD=="s") ? `SCW : (`PROF_METHOD=="p") ? `PCW : 0)
// `define CW64 // uncomment this line if ICW = 64
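The `CW macro above selects the active counter width from `PROF_METHOD; in the power configuration the packed word is six `PW-bit group lanes plus one `PSW-bit stall lane. A quick sketch of that selection (illustrative Python, not part of the thesis tooling):

```python
# Mirror of the `CW / `PCW macro logic from B.1 (illustrative sketch only).
ICW, CCW, SCW = 32, 32, 32    # instruction / cycle / stall-cycle counter widths
PW, PSW, PGROUPS = 22, 28, 6  # per-group power width, stall width, group count
PCW = PW * PGROUPS + PSW      # packed power-counter width

def counter_width(prof_method):
    """Width chosen by `PROF_METHOD, falling back to 0 as the macro does."""
    return {"i": ICW, "c": CCW, "s": SCW, "p": PCW}.get(prof_method, 0)

print(PCW, counter_width("c"), counter_width("p"))  # 160 32 160
```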
`define START_ADDR 32'b0000_11_0000_1000_0000_0000_0000_0000_00
// 'h00800000 = 'b0000_0000|1000_0000_0000_0000_0000_0000|00
`define DO_HIER 1'b1

// addresses corresponding to the jal and ra of wrap.c
`define WRAP_MAIN_BEGIN 'h800018
`define WRAP_MAIN_END   'h800074

// power profiling constants
`define A 1
`define B 2
`define C 3
`define D 4
`define E 5
`define F 6
`define STALL 7

// pStalls (BRAVO)
`define P_ADD    `C
`define P_ADDI   `C
`define P_ADDIU  `C
`define P_ADDU   `C
`define P_AND    `D
`define P_ANDI   `C
`define P_BEQ    `B
`define P_BGEZ   `B
`define P_BGEZAL `B
`define P_BGTZ   `D
`define P_BLEZ   `C
`define P_BLTZ   `A
`define P_BLTZAL `A
`define P_BNE    `B
`define P_DIV    `F
`define P_DIVU   `F
`define P_J      `C
`define P_JAL    `C
`define P_JR     `D
`define P_LB     `C
`define P_LBU    `E
`define P_LH     `D
`define P_LHU    `E
`define P_LUI    `C
`define P_LW     `B
`define P_MFHI   `A
`define P_MFLO   `C
`define P_MULT   `E
`define P_MULTU  `F
`define P_NOP    `D
`define P_OR     `D
`define P_ORI    `D
`define P_SB     `D
`define P_SH     `D
`define P_SLL    `D
`define P_SLLV   `E
`define P_SLT    `D
`define P_SLTI   `B
`define P_SLTIU  `B
`define P_SLTU   `D
`define P_SRA    `C
`define P_SRAV   `C
`define P_SRL    `C
`define P_SRLV   `B
`define P_SUB    `D
`define P_SUBU   `D
`define P_SW     `D
`define P_XOR    `C
`define P_XORI   `C
`define P_ADD_ADDU   `C
`define P_DIV_DIVU   `F
`define P_ADDI_ADDIU `C
`define P_SB_SH      `D
B.2 LEAP Source Code
// Top-level module for LEAP profiler
module LEAP (
    input clk,
    input glob_reset,
    input [25:0] pc_in,
    input [31:0] instr_in2,
    input [31:0] instrEx_in,
    input stall_in,
    input insValid_in,
    // Handshaking ports
    input init_start_in,
    output init_done,
    input retrieve_start_in,
    output retrieve_done,
    // Avalon Profile Master ports
    output avm_profileMaster_read,
    output avm_profileMaster_write,
    output [31:0] avm_profileMaster_address,
    output [31:0] avm_profileMaster_writedata,
    output [3:0]  avm_profileMaster_byteenable,
    input  [31:0] avm_profileMaster_readdata,
    input avm_profileMaster_waitrequest,
    input avm_profileMaster_readdatavalid
);

// Delay profiler's consumption of inputs to give extra cycle to Address Hash
reg [25:0] pc;
reg [25:0] pc_minus_4;
reg [31:0] instr_in;
reg [31:0] instrEx;
reg stall;
reg insValid;
reg init_start;
reg retrieve_start;

always @(posedge clk) begin
    if (glob_reset) begin
        pc <= 'b0;
        pc_minus_4 <= 'b0;
        instr_in <= 'b0;
        instrEx <= 'b0;
        stall <= 'b0;
        insValid <= 'b0;
        init_start <= 'b0;
        retrieve_start <= 'b0;
    end else begin
        pc <= pc_in;
        pc_minus_4 <= pc_in - 'h4;
        instr_in <= instr_in2;
        instrEx <= instrEx_in;
        stall <= stall_in;
        insValid <= insValid_in;
        init_start <= init_start_in;
        retrieve_start <= retrieve_start_in;
    end
end
// wires for OpDecode
wire [25:0] opTarget;
wire [25:0] opJALrTarget;
wire opCall;
wire opJALr;
wire opRet;
wire opJALr_done;
wire opPC8;

// wires for DataCounter
wire [`CW-1:0] dcCountOut;
reg dcPCdiff;

// wires for CounterStorage
wire [`N2-1:0] csFuncNum;
wire [`CW-1:0] csCountIn;
wire csRdReq;
wire csEmpty;
wire csSave;
wire [`CW-1:0] csSaveData;
wire [`N2-1:0] csNumFuncs;
wire csRamWren;
wire [`N2-1:0] csRamAddr;
wire [`CW-1:0] csRamWrData;
wire [`CW-1:0] csRamRdData;

// wires for CallStack
wire [`N2-1:0] asFuncNumOut;
wire asRamWren;
wire [`S2-1:0] asRamAddr;
wire [`N2-1:0] asRamWrData;
wire [`N2-1:0] asRamRdData;

// wires for HierarchyStack
wire hsStackRamWren;
wire [`S2-1:0] hsStackRamAddr;
wire [`CW-1:0] hsStackRamWrData;
wire [`CW-1:0] hsStackRamRdData;
wire [`CW-1:0] hsChildCount;

// wires for AddressHash
wire [25:0] ahTarget;
wire [`N2-1:0] ahFuncNum;
wire ahJALr;
wire [`N2-1:0] ahRamAddr;
wire [7:0] ahRamRdData;

// wires for DataStorage
wire dsHashRamWren;
wire [`N2-1:0] dsHashRamAddr;
wire [7:0] dsHashRamWrData;
wire [7:0] dsHashRamRdData;
wire dsStackRamWren;
wire [`S2-1:0] dsStackRamAddr;
wire [`N2-1:0] dsStackRamWrData;
wire [`N2-1:0] dsStackRamRdData;
wire dsStorageRamWren;
wire [`N2-1:0] dsStorageRamAddr;
wire [`CW-1:0] dsStorageRamWrData;
wire [`CW-1:0] dsStorageRamRdData;
// wires for Initializer
wire iInit;
wire iHashRamWren;
wire [`N2-1:0] iHashRamAddr;
wire [7:0] iHashRamWrData;
wire [7:0] iHashRamRdData;
wire init_hash_start;
wire init_hash_done;
wire hash_first_start;
wire init_avm_profileMaster_write;
wire [31:0] init_avm_profileMaster_address;
wire [31:0] init_avm_profileMaster_writedata;
wire [3:0] init_avm_profileMaster_byteenable;
wire init_reset;

// wires for Retriever
wire rRetrieve;
wire [`CW-1:0] rProfData;
wire [`N2-1:0] rProfIndex;
wire retrieve_avm_profileMaster_write;
wire [31:0] retrieve_avm_profileMaster_address;
wire [31:0] retrieve_avm_profileMaster_writedata;
wire [3:0] retrieve_avm_profileMaster_byteenable;

// general registers
reg [25:0] old_pc;
reg [31:0] old_instr;
reg pc_diff;

// Keep track of current function
reg [`N2-1:0] cur_func_reg;
reg cur_func_src;
wire [`N2-1:0] cur_func = (cur_func_src) ? ahFuncNum : cur_func_reg;
reg [1:0] state;
reg [2:0] delay_count;
reg jalr_prev;
reg saved_call;
reg hash_first_done;

// combine global & init resets
wire reset = glob_reset | init_reset;
reg instr_init;
reg store_first_call;
wire [31:0] instr = instr_in;

// Create lagged signal to determine when JALr target is found
always @(posedge clk) begin
    jalr_prev <= opJALr;
end

// update pc_diff, store first jal
always @(posedge clk) begin
    if (reset) begin
        cur_func_reg <= 'h0;
        cur_func_src <= 1'b0;
        state <= 2'b00;
        hash_first_done <= 1'b0;
        delay_count <= 'b0;
        old_pc <= 'b0;
        old_instr <= 'b0;
        store_first_call <= 1'b0;
        instr_init <= 'b0;
    // initialize with hash of `START_ADDR
    end else if (init_start) begin
        // State 0: Wait for init_hash_done
        if (state <= 2'b00) begin
            if (hash_first_start & !hash_first_done) begin
                state <= 2'b01;
                delay_count <= 'b0;
                instr_init <= 1'b1; // pretend we start with a jump to 0x8003000 (GXemul) or 0x800000 (Tiger)
            end else if (hash_first_start & hash_first_done) begin
                hash_first_done <= 1'b0;
            end
        // State 1: Delay until hash finished (2 cycles)
        end else if (state == 2'b01) begin
            pc_diff <= 1'b0;
            // stay in this state until delay finished
            if (delay_count < 3'b111) begin // 2 cycles
                delay_count <= delay_count + 1'b1;
            end else if (delay_count < 3'b011) begin // 2 cycles
                instr_init <= 1'b0;
                delay_count <= delay_count + 1'b1;
            // update cur_func
            end else begin
                instr_init <= 32'b0;
                hash_first_done <= 1'b1;
                store_first_call <= 1'b1;
                cur_func_reg <= ahFuncNum;
                state <= 2'b10;
            end
        end else if (state == 2'b10) begin
            store_first_call <= 1'b0;
            state <= 2'b00;
        end
    // set/clear pc_diff signal so modules know when PC changes
    end else begin
        pc_diff <= (pc != old_pc);
        old_pc <= pc;
        old_instr <= instr;
        // State 0: Wait for ret/call
        if (state <= 2'b00) begin
            if (opCall | opRet) begin
                state <= 2'b01;
                delay_count <= opCall ? 'b01 : 'b10;
                saved_call <= opCall;
            // Handle JALr differently
            end else if (!opJALr & jalr_prev) begin // use jalr_prev to indicate we WERE in jalr --
                                                    // once opJALr goes low, the target is ready
                state <= 2'b01;
                delay_count <= 1'b0;
                saved_call <= 1'b1;
            end
        // State 1: Delay until hash finished (2 cycles)
        end else if (state == 2'b01) begin
            // stay in this state until delay finished
            if (delay_count < 3'b10) begin // 2 cycles
                cur_func_src <= 1'b1; // make mux use ahFuncNum (for bypass of call-then-ret issue)
                delay_count <= delay_count + 1'b1;
            // update cur_func
            end else begin
                if (opRet) cur_func_reg <= asFuncNumOut;
                else cur_func_reg <= saved_call ? ahFuncNum : asFuncNumOut;
                cur_func_src <= 1'b0; // make mux use cur_func_reg, not directly ahFuncNum (for bypass of call-then-ret issue)
                delay_count <= 'b0;
                // this is a bypass condition to avoid the case where a hash is available in the same cycle that another call happens
                if (!opCall & !opRet) state <= 2'b00;
                else state <= 2'b10;
            end
        // State 2: Wait until call/ret is low
        end else if (state == 2'b10) begin
            if (!opCall & !opRet) state <= 2'b00;
        end
    end
end

// Make sure calls arrive at the right time to the counter & counter storage modules
// (for cases when it stalls between the jal instr and the actual jump)
wire call_jalr = opCall | opJALr;
wire call_jalr_pc;
reg call_jalr_pc_state;
reg [31:0] call_jalr_target;

always @(posedge clk) begin
    if (reset) begin
        call_jalr_pc_state <= 1'b0;
        call_jalr_target <= 'b0;
    end else begin
        if (call_jalr & hash_first_done) begin
            call_jalr_pc_state <= 1'b1;
            call_jalr_target <= opTarget;
        end
        if (call_jalr_pc_state & (pc == call_jalr_target | (opJALr & !opPC8))) begin
            call_jalr_pc_state <= 1'b0;
        end
    end
end
assign call_jalr_pc = (call_jalr_pc_state & (pc == call_jalr_target | (opJALr & !opPC8)));

// Make sure returns arrive at the right time to the counter & counter storage modules
// (for cases when it stalls between the ret instr and the actual jump)
wire ret_pc;
reg ret_pc_state;
reg [31:0] ret_pc_val;
reg [31:0] ret_instr_val;

always @(posedge clk) begin
    if (reset) begin
        ret_pc_state <= 1'b0;
        ret_pc_val <= 'b0;
        ret_instr_val <= 'b0;
    end else begin
        if (opRet & hash_first_done) begin
            ret_pc_state <= 1'b1;
            ret_pc_val <= pc;
        end
        if (ret_pc_state & (pc != ret_pc_val) & (pc_minus_4 != ret_pc_val)) begin
            ret_pc_state <= 1'b0;
        end
    end
end
assign ret_pc = (ret_pc_state & (pc != ret_pc_val) & (pc_minus_4 != ret_pc_val)); // +4 to make sure it's not just the branch delay slot

// Make sure counter storage uses correct "cur_func" by storing it
reg [`N2-1:0] cur_func_pc;
always @(posedge clk) begin
    if (reset) begin
        cur_func_pc <= 'b0;
    end else begin
        if (!ret_pc_state & !call_jalr_pc_state) begin // if neither call/ret is waiting for the actual jump, keep track of current function
            cur_func_pc <= (delay_count == 3'b010) ? ahFuncNum : cur_func; // a second bypass to have hash value go straight to counter storage so it can read the data a cycle earlier
        end
    end
end
// New way of doing "pc_diff" -- just send prev_pc to instr_counter and have it compare
reg [31:0] prev_pc;
always @(posedge clk) begin
    if (reset) begin
        prev_pc <= 'b0;
    end else begin
        prev_pc <= pc;
    end
end

// Connect modules
assign ahTarget = opTarget;
assign csCountIn = dcCountOut;
assign dsHashRamWren = iHashRamWren; // Hash RAM is only written by Initializer
assign dsHashRamAddr = (init_start & !init_done & !init_hash_start & !init_hash_done &
                        !hash_first_start & !hash_first_done) ? iHashRamAddr : ahRamAddr;
assign ahRamRdData = dsHashRamRdData;
assign dsHashRamWrData = iHashRamWrData;
assign dsStackRamWren = asRamWren; // Stack RAM is only accessed by CallStack
assign dsStackRamAddr = asRamAddr;
assign asRamRdData = dsStackRamRdData;
assign dsStackRamWrData = asRamWrData;
assign dsStorageRamWren = csRamWren; // Storage RAM is only written by CounterStorage
assign dsStorageRamAddr = (retrieve_start & !retrieve_done) ? rProfIndex : csRamAddr;
assign csRamRdData = dsStorageRamRdData;
assign rProfData = dsStorageRamRdData;
assign dsStorageRamWrData = csRamWrData;

// mux avalon signals
assign avm_profileMaster_write = (init_start & !init_done) ? init_avm_profileMaster_write :
                                 retrieve_avm_profileMaster_write;
assign avm_profileMaster_address = (init_start & !init_done) ? init_avm_profileMaster_address :
                                   retrieve_avm_profileMaster_address;
assign avm_profileMaster_writedata = (init_start & !init_done) ?
    init_avm_profileMaster_writedata : retrieve_avm_profileMaster_writedata;
assign avm_profileMaster_byteenable = (init_start & !init_done) ?
    init_avm_profileMaster_byteenable : retrieve_avm_profileMaster_byteenable;

// Instantiate the modules
OpDecode decoder (
    .clk(clk),
    .reset(reset),
    .pc(pc),
    .pc_nolag(pc_in),
    .instr(instr),
    .instr_nolag(instr_in2),
    .insValid(insValid),
    .ret(opRet),
    .call(opCall),
    .jalr(opJALr),
    .jalr_done(opJALr_done),
    .target(opTarget),
    .jalr_target(opJALrTarget),
    .pc_within_8(opPC8)
);

parameter PROF_METHOD = `PROF_METHOD;
generate
if (PROF_METHOD=="i") begin
    InstructionCounter instr_counter (
        .clk(clk),
        .reset(reset),
        .prev_pc(prev_pc),
        .pc(pc),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end else if (PROF_METHOD=="c") begin
    CycleCounter cycle_counter (
        .clk(clk),
        .reset(reset),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end else if (PROF_METHOD=="s") begin
    StallCounter stall_counter (
        .clk(clk),
        .reset(reset),
        .stall(stall),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end else if (PROF_METHOD=="p") begin
    PowerCounter power_counter (
        .clk(clk),
        .reset(reset),
        .pc_diff(pc_diff),
        .stall(stall),
        .instr_in(instr),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .count_out(dcCountOut),
        .init_start(init_start)
    );
end
endgenerate

CounterStorage counter_storage (
    .clk(clk),
    .reset(reset),
    .call(call_jalr_pc),
    .ret(ret_pc),
    .cur_func_num(cur_func_pc),
    .count(csCountIn),
    .child_count(hsChildCount),
    .init_start(init_start),
    .pc_diff(pc_diff),
    // For accessing RAM
    .ram_wren(csRamWren),
    .ram_addr(csRamAddr),
    .ram_wr_data(csRamWrData),
    .ram_rd_data(csRamRdData)
);

CallStack call_stack (
    .clk(clk),
    .reset(reset),
    .ret(opRet),
    .call(call_jalr),
    .func_num_in(cur_func),
    .func_num_out(asFuncNumOut),
    .pc_diff(pc_diff),
    .init_start(init_start),
    .store_first_call(store_first_call),
    // For accessing RAM
    .ram_wren(asRamWren),
    .ram_addr(asRamAddr),
    .ram_wr_data(asRamWrData),
    .ram_rd_data(asRamRdData)
);

parameter DO_HIER = `DO_HIER;
generate
if (DO_HIER) begin
    HierarchyStack hier_stack (
        .clk(clk),
        .reset(reset),
        .ret(ret_pc),
        .call(call_jalr_pc),
        .child_count(hsChildCount),
        .count_in(dcCountOut),
        .init_start(init_start),
        // For accessing RAM
        .ram_wren(hsStackRamWren),
        .ram_addr(hsStackRamAddr),
        .ram_wr_data(hsStackRamWrData),
        .ram_rd_data(hsStackRamRdData)
    );
end
endgenerate
AddressHash addr_hash (
    .clk(clk),
    .reset(reset),
    .instr_nolag(instr_in2),
    .instrValid_nolag(insValid_in),
    .addr_in(opJALrTarget),
    .funcNum(ahFuncNum),
    .jalr(ahJALr),
    .init_start(init_hash_start),
    .init_done(init_hash_done),
    .instr_init(instr_init),
    // For accessing RAM
    .ram_addr(ahRamAddr),
    .ram_rd_data(ahRamRdData)
);

DataStorage data_storage (
    .clk(clk),
    .reset(reset),
    .init_start(init_start),
    .retrieve_start(retrieve_start),
    .hash_wren(dsHashRamWren),
    .hash_addr(dsHashRamAddr),
    .hash_wr_data(dsHashRamWrData),
    .hash_rd_data(dsHashRamRdData),
    .callstack_wren(dsStackRamWren),
    .callstack_addr(dsStackRamAddr),
    .callstack_wr_data(dsStackRamWrData),
    .callstack_rd_data(dsStackRamRdData),
    .hierstack_wren(hsStackRamWren),
    .hierstack_addr(hsStackRamAddr),
    .hierstack_wr_data(hsStackRamWrData),
    .hierstack_rd_data(hsStackRamRdData),
    .storage_wren(dsStorageRamWren),
    .storage_addr(dsStorageRamAddr),
    .storage_wr_data(dsStorageRamWrData),
    .storage_rd_data(dsStorageRamRdData)
);

vInitializer init (
    .clk(clk),
    .reset(glob_reset),
    .reset_out(init_reset),
    .pc(pc),
    .init_start(init_start),
    .init_done(init_done),
    .init_hash_start(init_hash_start),
    .init_hash_done(init_hash_done),
    .hash_first_start(hash_first_start),
    .hash_first_done(hash_first_done),
    .hashWren(iHashRamWren),
    .hashAddr(iHashRamAddr),
    .hashWrData(iHashRamWrData),
    // Avalon Bus side signals
    .avm_profileMaster_read(avm_profileMaster_read),
    .avm_profileMaster_write(init_avm_profileMaster_write),
    .avm_profileMaster_address(init_avm_profileMaster_address),
    .avm_profileMaster_writedata(init_avm_profileMaster_writedata),
    .avm_profileMaster_byteenable(init_avm_profileMaster_byteenable),
    .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest),
    .avm_profileMaster_readdata(avm_profileMaster_readdata),
    .avm_profileMaster_readdatavalid(avm_profileMaster_readdatavalid)
);

generate
if (PROF_METHOD!="p") begin
    vRetriever retrieve (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .retrieve_start(retrieve_start),
        .retrieve_done(retrieve_done),
        .prof_data(rProfData),
        .prof_index(rProfIndex),
        // Avalon Bus side signals
        .avm_profileMaster_write(retrieve_avm_profileMaster_write),
        .avm_profileMaster_address(retrieve_avm_profileMaster_address),
        .avm_profileMaster_writedata(retrieve_avm_profileMaster_writedata),
        .avm_profileMaster_byteenable(retrieve_avm_profileMaster_byteenable),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest)
    );
end else begin
    vPowerRetriever retrieve (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .retrieve_start(retrieve_start),
        .retrieve_done(retrieve_done),
        .prof_data(rProfData),
        .prof_index(rProfIndex),
        // Avalon Bus side signals
        .avm_profileMaster_write(retrieve_avm_profileMaster_write),
        .avm_profileMaster_address(retrieve_avm_profileMaster_address),
        .avm_profileMaster_writedata(retrieve_avm_profileMaster_writedata),
        .avm_profileMaster_byteenable(retrieve_avm_profileMaster_byteenable),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest)
    );
end
endgenerate
endmodule
module OpDecode (
    input clk,
    input reset,
    input [25:0] pc,
    input [25:0] pc_nolag,
    input [31:0] instr,
    input [31:0] instr_nolag,
    input insValid,
    output ret,
    output call,
    output reg jalr,
    output reg jalr_done,
    output [25:0] target,
    output [25:0] jalr_target,
    output reg pc_within_8
);

reg [25:0] pc_jalr;
reg [2:0] pc_diff_count;

// Detect calls and returns
wire tCall = insValid & (instr[31:26] == 6'b000011);
wire tRet  = insValid & (instr[31:0] == 32'b0000_0011_1110_0000_0000_0000_0000_1000); // $rs = $ra = 31
wire tJALR = (instr_nolag[31:26] == 6'b0 & instr_nolag[20:16] == 5'b0 & instr_nolag[10:6] == 5'b0 & instr_nolag[5:0] == 6'b001001);
wire [25:0] tTarget = {instr[23:0], 2'b0};

assign call = tCall;
assign ret = tRet;
assign target = tTarget;
assign jalr_target = pc_jalr;

// On a JALr, find when the actual jump occurs, use this PC value as the target address
always @(posedge clk) begin
    if (reset) begin
        jalr_done <= 1'b0;
        pc_diff_count <= 'b0;
        jalr <= 1'b0;
        pc_jalr <= 'b0;
    end else begin
        if (jalr_done) begin
            if (pc_diff_count < 3'b010) pc_diff_count <= pc_diff_count + 1'b1;
            else begin
                pc_diff_count <= 'b0;
                jalr_done <= 1'b0;
            end
        end else if (tJALR) begin
            pc_diff_count <= 'b1;
            pc_jalr <= pc_nolag;
            jalr <= 1'b1;
            jalr_done <= 1'b0;
        end else if (jalr & !pc_within_8) begin
            pc_diff_count <= 'b0;
            pc_jalr <= pc;
            jalr <= 1'b0;
            jalr_done <= 1'b1;
        end
    end
    pc_within_8 <= (pc_nolag == pc) | (pc_nolag == pc + 'h4);
end
endmodule
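OpDecode classifies instructions purely from their MIPS-I encodings: jal carries opcode 000011, and a return is matched only as the exact word jr $ra (0x03E00008). A small Python sketch of that decode (illustrative model, not the thesis RTL):

```python
# Illustrative model of OpDecode's call/return detection (not thesis RTL).
JR_RA = 0x03E00008  # exact encoding of "jr $ra", the MIPS return idiom

def is_call(word):
    """jal: opcode field (bits 31:26) == 6'b000011."""
    return (word >> 26) == 0b000011

def is_ret(word):
    """A return is matched only as the exact 'jr $ra' word."""
    return word == JR_RA

def call_target(word):
    """Byte target of a jal; like OpDecode, keep instr[23:0] and shift left 2."""
    return (word & 0x00FFFFFF) << 2

# hypothetical jal whose target is `WRAP_MAIN_BEGIN ('h800018)
jal_main = (0b000011 << 26) | (0x800018 >> 2)
print(is_call(jal_main), is_ret(JR_RA), hex(call_target(jal_main)))  # True True 0x800018
```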
module InstructionCounter (
    input clk,
    input reset,
    input [31:0] prev_pc,
    input [31:0] pc,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

reg [`CW-1:0] counter;
assign count_out = counter;

wire pc_diff = (prev_pc != pc);
wire [31:0] counter_plus_one = counter + 1'b1;

always @(posedge clk) begin
    if (init_start | call | ret) begin
        counter <= 'b1;
    end else if (reset) begin
        counter <= 'b0;
    end else if (pc_diff) begin
        counter <= counter_plus_one;
    end
end
endmodule
module CycleCounter (
    input clk,
    input reset,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

reg [`CW-1:0] counter;
assign count_out = counter;
wire [31:0] counter_plus_one = counter + 1'b1;

always @(posedge clk) begin
    if (init_start | call | ret) begin
        counter <= 'b1;
    end else if (reset) begin
        counter <= 'b0;
    end else begin
        counter <= counter_plus_one;
    end
end
endmodule
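CycleCounter restarts at 1 rather than 0 on a call or return, so the cycle consumed by the call instruction itself is attributed to the newly entered function while the elapsed count is handed off to CounterStorage. A behavioural sketch of that restart rule (illustrative Python, not thesis code):

```python
# Behavioural model of CycleCounter's restart-on-event semantics (illustrative only).
def cycle_counter(events):
    """events: one flag per clock edge, each "tick", "call", or "ret".
    Returns the counter value observed after each edge; on call/ret the
    counter restarts at 1, mirroring the RTL (init_start behaves like call)."""
    counter = 0
    out = []
    for ev in events:
        if ev in ("call", "ret"):
            counter = 1   # restart; the elapsed count was captured by CounterStorage
        else:
            counter += 1  # plain cycle: increment
        out.append(counter)
    return out

print(cycle_counter(["tick", "tick", "call", "tick", "ret", "tick"]))  # [1, 2, 1, 2, 1, 2]
```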
module StallCounter (
    input clk,
    input reset,
    input stall,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

reg [`CW-1:0] counter;
assign count_out = counter;
wire [31:0] counter_plus_one = counter + 1'b1;

always @(posedge clk) begin
    if (init_start | call | ret) begin
        counter <= stall;
    end else if (reset) begin
        counter <= 'b0;
    end else if (stall) begin
        counter <= counter_plus_one;
end
end
endmodule
module PowerCounter (
    input clk,
    input reset,
    input pc_diff,
    input stall,
    input [31:0] instr_in,
    input ret,
    input call,
    output [`CW-1:0] count_out,
    input init_start
);

// This assumes 6 groups + stalls (not parameterized)
reg [`PW-1:0] A;
reg [`PW-1:0] B;
reg [`PW-1:0] C;
reg [`PW-1:0] D;
reg [`PW-1:0] E;
reg [`PW-1:0] F;
reg [`PSW-1:0] STALL;
reg [`PW-1:0] UNKNOWN;

assign count_out = {A, B, C, D, E, F, STALL};

wire [5:0] op = instr_in[31:26];
wire [5:0] funct = instr_in[5:0];
reg [3:0] cur_group;

always @(posedge clk) begin
    if (reset) begin
        A <= 'b0;
        B <= 'b0;
        C <= 'b0;
        D <= 'b0;
        E <= 'b0;
        F <= 'b0;
        STALL <= 'b0;
        cur_group <= 'b0;
    end else begin
        cur_group = 'b0;
        // PC_DIFF
        if (~stall & pc_diff) begin
            // NOP
            if (instr_in == 'b0) begin
                cur_group = `P_NOP; // NOP
            // BEQ, BNE, BLEZ, BGTZ
            end else if (op[5:2] == 4'b0001) begin
                if (op[1:0] == 2'b00) cur_group = `P_BEQ;  // BEQ
                if (op[1:0] == 2'b01) cur_group = `P_BNE;  // BNE
                if (op[1:0] == 2'b10) cur_group = `P_BLEZ; // BLEZ
                if (op[1:0] == 2'b11) cur_group = `P_BGTZ; // BGTZ
            // BLTZ, BGEZ, BGEZAL, BLTZAL
            end else if (op == 6'b000001) begin
                if (instr_in[16] == 1'b0) cur_group = `P_BLTZ; // BLTZ, BLTZAL
                else cur_group = `P_BGEZ; // BGEZ, BGEZAL
            // R-type instruction
            end else begin
                if (op == 'b000000) begin
                    // ADD, ADDU, SUB, SUBU
                    if (funct[5:2] == 4'b1000) begin
                        if (funct[1] == 1'b0) cur_group = `P_ADD_ADDU; // ADD, ADDU
                        else begin
                            if (funct[0] == 1'b0) cur_group = `P_SUB; // SUB
                            else cur_group = `P_SUBU; // SUBU
                        end
                    // MULT, MULTU, DIV, DIVU
                    end else if (funct[5:2] == 4'b0110) begin
                        if (funct[1:0] == 2'b00) cur_group = `P_MULT;  // MULT
                        if (funct[1:0] == 2'b01) cur_group = `P_MULTU; // MULTU
                        if (funct[1] == 1'b1) cur_group = `P_DIV_DIVU; // DIV, DIVU
                    // AND, OR, XOR
                    end else if (funct[5:2] == 4'b1001) begin
                        if (funct[1:0] == 2'b00) cur_group = `P_AND; // AND
                        if (funct[1:0] == 2'b01) cur_group = `P_OR;  // OR
                        if (funct[1:0] == 2'b10) cur_group = `P_XOR; // XOR
                    // MFHI, MFLO
                    end else if (funct[5:2] == 'b0100) begin
                        if (funct[1] == 1'b0) cur_group = `P_MFHI; // MFHI
                        else cur_group = `P_MFLO; // MFLO
                    // SLL, SRA, SRL, SLLV, SRLV
                    end else if (funct[5:3] == 3'b000 && (funct[1] || ~funct[0])) begin
                        if (funct[2:0] == 3'b000) cur_group = `P_SLL;  // SLL
                        if (funct[2:0] == 3'b100) cur_group = `P_SLLV; // SLLV
                        if (funct[1:0] == 2'b11)  cur_group = `P_SRA;  // SRA
                        if (funct[2:0] == 3'b010) cur_group = `P_SRL;  // SRL
                        if (funct[2:0] == 3'b110) cur_group = `P_SRLV; // SRLV
                    // SLT, SLTU
                    end else if (funct[5:1] == 5'b10101) begin
                        if (funct[0] == 1'b0) cur_group = `P_SLT; // SLT
                        else cur_group = `P_SLTU; // SLTU
                    // JR, JALR
                    end else if (funct[5:1] == 5'b00100) begin
                        cur_group = `P_JR; // JR, JALR
                    end else begin
                        UNKNOWN <= UNKNOWN + 1'b1;
                    end
                // ADDI, ADDIU, SLTI, SLTIU, ANDI, ORI, XORI, LUI
                end else if (op[5:3] == 3'b001) begin
                    if (op[2:0] == 3'b101) cur_group = `P_ORI; // ORI
                    else if (op[2:0] == 3'b111) cur_group = `P_LUI; // LUI
                    else if (op[2:1] == 2'b00) cur_group = `P_ADDI_ADDIU; // ADDI, ADDIU
                    else if (op[2:1] == 2'b10) cur_group = `P_ANDI; // ANDI
                    else if (op[2:0] == 3'b010) cur_group = `P_SLTI; // SLTI
                    else if (op[1:0] == 2'b11) cur_group = `P_SLTIU; // SLTIU
                    else if (op[1:0] == 2'b10) cur_group = `P_XORI; // XORI
                // LBU, LHU
                end else if (op[5:1] == 'b10010) begin
                    if (op[0] == 1'b0) cur_group = `P_LBU; // LBU
                    else cur_group = `P_LHU; // LHU
                // LH
                end else if (op == 'b100001) begin
                    cur_group = `P_LH; // LH
                // LB, LW, SB, SH, SW
                end else if (op[5:4] == 'b10 && ~op[2]) begin
                    if (op[1:0] == 2'b11) begin
                        if (op[3] == 1'b0) cur_group = `P_LW; // LW
                        else cur_group = `P_SW; // SW
                    end else begin
                        if (op[3] == 1'b0) cur_group = `P_LB; // LB
                        else cur_group = `P_SB_SH; // SB, SH
                    end
                // J, JAL
                end else if (op[5:1] == 'b00001) begin
                    if (op[0] == 1'b0) cur_group = `P_J; // J
                    else cur_group = `P_JAL; // JAL
                end
            end
        // Cache Stall
        end else if (stall) begin
            cur_group = `STALL;
        // Pipeline Stall
        end else begin
            cur_group = `STALL;
        end

        if (init_start | call | ret) begin
            A <= (cur_group == `A);
            B <= (cur_group == `B);
            C <= (cur_group == `C);
            D <= (cur_group == `D);
            E <= (cur_group == `E);
            F <= (cur_group == `F);
            STALL <= (cur_group == `STALL);
        end else begin
            case (cur_group)
                `A: A <= A + 1'b1;
                `B: B <= B + 1'b1;
                `C: C <= C + 1'b1;
                `D: D <= D + 1'b1;
                `E: E <= E + 1'b1;
                `F: F <= F + 1'b1;
                `STALL: STALL <= STALL + 1'b1;
            endcase
end
end
end
endmodule
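PowerCounter only bins retired instructions into groups A-F (plus a stall lane); energy is obtained afterwards by weighting each group's count with a per-group power figure from the instruction-level power database. The Python sketch below illustrates that tally-and-weight idea; the group assignments loosely follow the `P_* defines in B.1, but the weights are made-up placeholders, not the thesis's measured values:

```python
# Illustrative tally of power groups, mirroring PowerCounter's binning idea.
# Weights are invented placeholders, NOT the thesis's instruction-level power database.
GROUP_OF = {"beq": "B", "lw": "B", "add": "C", "and": "D", "mult": "E", "div": "F"}
WEIGHT = {"A": 1.0, "B": 1.2, "C": 1.1, "D": 1.3, "E": 1.8, "F": 2.5, "STALL": 0.6}

def tally(instrs):
    """Count executed instructions per power group; stalls fall into the STALL lane."""
    counts = {g: 0 for g in WEIGHT}
    for ins in instrs:
        counts[GROUP_OF.get(ins, "STALL")] += 1
    return counts

def energy(counts):
    """Weighted sum over group counts, as the profiler's post-processing would compute."""
    return sum(counts[g] * WEIGHT[g] for g in counts)

c = tally(["add", "lw", "beq", "stall", "mult"])
print(c["B"], c["STALL"], round(energy(c), 2))  # 2 1 5.9
```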
module CounterStorage (
    input clk,
    input reset,
    input call,
    input ret,
    input [`N2-1:0] cur_func_num,
    input [`CW-1:0] count,
    input [`CW-1:0] child_count,
    input init_start,
    input pc_diff,
    // For accessing RAM
    output reg ram_wren,
    output reg [`N2-1:0] ram_addr,
    output reg [`CW-1:0] ram_wr_data,
    input [`CW-1:0] ram_rd_data
);

reg main_func;
reg [1:0] state;
wire call_ret;
reg prev_call_ret;
reg [`CW-1:0] prev_child_count;
reg [`CW-1:0] rd_data;
reg [`CW-1:0] wr_data;

assign call_ret = (`DO_HIER) ? ret : (ret | call); // store only on ret if doing hierarchical

// perform additions separately to avoid critical path
165
B Source Code
always @(posedge ( c l k ) ) begin
ram addr <= cur func num ;
i f (`PROF METHOD != ``p ' ' )i f (`DO HIER) ram wr data <= ram rd data + wr data + p r ev ch i l d coun t ;
else ram wr data <= ram rd data + wr data ;
else begin
ram wr data <= { rd data [ 1 5 9 : 1 3 8 ] + wr data [ 1 5 9 : 1 3 8 ] , rd data [ 1 3 7 : 1 1 6 ] + wr data
[ 1 3 7 : 1 1 6 ] ,
rd data [ 1 1 5 : 9 4 ] + wr data [ 1 1 5 : 9 4 ] , rd data [ 9 3 : 7 2 ] + wr data [ 9 3 : 7 2 ] ,
rd data [ 7 1 : 5 0 ] + wr data [ 7 1 : 5 0 ] , rd data [ 4 9 : 2 8 ] + wr data [ 4 9 : 2 8 ] , rd data [ 2 7 : 0 ]
+ wr data [ 2 7 : 0 ] } ;
end
end
// whenever c a l l or r e t go high , load o l d va lue and add count to i t
always @(posedge ( c l k ) ) begin
i f ( r e s e t | i n i t s t a r t ) begin
ram wren <= 1 'b0 ;
p r e v c a l l r e t <= 1 ' b0 ;
s t a t e <= 2 ' b00 ;
p r ev ch i l d coun t <= ' b0 ;
// main func used to count f i r s t c y c l e in ``main ' ' s ince i t s u s u a l l y compensated f o r by
the l a s t i n s t r in prev i ous func t i on
main func <= (`DO HIER) ? 1 'b0 : 1 'b1 ;
end else begin
i f ( s t a t e == 2 ' b00 ) begin
p r e v c a l l r e t <= c a l l r e t ;
p r ev ch i l d coun t <= ch i l d coun t ;
ram wren <= 1 'b0 ;
// i f we have a *new* c a l l or ret , read from RAM @ cur func num ( shou l d be a l ready
there )
i f ( c a l l r e t & ! p r e v c a l l r e t ) begin
i f (`PROF METHOD != ``p ' ' ) wr data <= count + ( p c d i f f & main func ) ;
else wr data <= count ;
s t a t e <= 2 ' b01 ;
main func <= 1 ' b0 ;
end
166
B.2 LEAP Source Code
end else i f ( s t a t e == 2 ' b01 ) begin
ram wren <= 1 'b1 ;
s t a t e <= 2 ' b00 ;
end
end
end
endmodule
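In power-profiling mode, CounterStorage above treats the 160-bit counter word as seven independent fields and sums each field separately, so one field overflowing cannot carry into its neighbour. As a minimal software model of that packed addition (the field boundaries are read directly off the bit slices in the Verilog; this Python sketch is illustrative, not part of the LEAP sources):

```python
# Software model of CounterStorage's packed-field addition (power mode).
# The 160-bit word holds six 22-bit fields and one 28-bit field; each is
# summed independently so carries never cross field boundaries.
FIELDS = [(138, 22), (116, 22), (94, 22), (72, 22), (50, 22), (28, 22), (0, 28)]

def packed_add(a: int, b: int) -> int:
    """Add two 160-bit packed counter words field by field."""
    out = 0
    for lsb, width in FIELDS:
        mask = (1 << width) - 1
        field_sum = (((a >> lsb) & mask) + ((b >> lsb) & mask)) & mask
        out |= field_sum << lsb
    return out
```

Note that a field simply wraps modulo its own width, mirroring the truncating additions in the hardware.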
module CallStack (
  input clk,
  input reset,
  input ret,
  input call,
  input  [`N2-1:0] func_num_in,
  output [`N2-1:0] func_num_out,
  input pc_diff,
  input init_start,
  input store_first_call,
  // For accessing RAM
  output reg ram_wren,
  output reg [`S2-1:0] ram_addr,
  output [`N2-1:0] ram_wr_data,
  input  [`N2-1:0] ram_rd_data
);

reg [`S2-1:0] sp;   // stack pointer
reg init_call_done; // so we only do a call to that function once

// Assign wires for RAM
assign ram_wr_data = func_num_in;
assign func_num_out = ram_rd_data;

reg state;

always @(posedge (clk)) begin
  if (reset | (init_start & !store_first_call & !init_call_done)) begin
    sp <= 'b0;
    ram_wren <= 1'b0;
    init_call_done <= 1'b0;
    state <= 1'b0;
  end else if (state == 1'b0) begin
    // pop on a return -- decrement sp then return value @ new sp
    if (ret) begin
      sp <= sp - 1'b1;
      ram_addr <= sp - 1'b1;
      ram_wren <= 1'b0;
      state <= 1'b1;
    // push on call (when ready) -- store value @ current sp then increment sp
    end else if (call) begin
      sp <= sp + 1'b1;
      ram_addr <= sp;
      ram_wren <= 1'b1;
      init_call_done <= 1'b1;
      state <= 1'b1;
    // if not call or ret, stop write
    end else begin
      ram_wren <= 1'b0;
      ram_addr <= sp - 1'b1;
    end
  end else if (state == 1'b1) begin
    if (!call & !ret) state <= 1'b0;
    ram_wren <= 1'b0;
  end
end

endmodule
module HierarchyStack (
  input clk,
  input reset,
  input ret,
  input call,
  output reg [`CW-1:0] child_count,
  input      [`CW-1:0] count_in,
  input init_start,
  // For accessing RAM
  output reg ram_wren,
  output reg [`S2-1:0] ram_addr,
  output reg [`CW-1:0] ram_wr_data,
  input      [`CW-1:0] ram_rd_data
);

reg [`S2-1:0] sp; // stack pointer
reg [1:0] state;
reg prev_ret;
reg [`CW-1:0] prev_count;
reg [`CW-1:0] rd_data;

// grab the counter value before it's reset @ ret/call!
always @(posedge (clk)) begin
  if (reset | init_start) begin
    sp <= 'b0;
    ram_wren <= 1'b0;
    child_count <= 'b0;
    state <= 2'b00;
    prev_count <= 'b0;
  end else if (state == 2'b00) begin
    // pop on a return -- decrement sp then return value @ new sp
    if (ret) begin
      sp <= sp - 1'b1;
      ram_addr <= sp - 1'b1;
      ram_wren <= 1'b0;
      prev_ret <= 1'b1;
      state <= 2'b01;
      child_count <= child_count + count_in;
      rd_data <= ram_rd_data;
    // push on call (when ready) -- store value @ current sp then increment sp
    end else if (call) begin
      sp <= sp + 1'b1;
      ram_addr <= sp;
      ram_wren <= 1'b1;
      prev_ret <= 1'b0;
      state <= 2'b01;
      ram_wr_data <= child_count + count_in;
    // if not call or ret, stop write
    end else begin
      ram_wren <= 1'b0;
      ram_addr <= sp - 1'b1;
    end
  end else if (state == 2'b01) begin
    ram_wren <= 1'b0;
    state <= 2'b00;
    if (prev_ret) child_count <= child_count + rd_data;
    else          child_count <= 'b0;
  end
end

endmodule
module AddressHash (
  input clk,
  input reset,
  input [31:0] instr_nolag,
  input instrValid_nolag,
  input [25:0] addr_in,
  output [`N2-1:0] funcNum,
  input enbl_in,
  input init_start,
  output reg init_done,
  input instr_init,
  // For accessing RAM
  output [`N2-1:0] ram_addr,
  input [7:0] ram_rd_data
);

reg [31:0] V1;
reg [7:0] B1;
reg [7:0] B2;
reg [7:0] A1;
reg [7:0] A2;
reg [31:0] a;
wire [31:0] b;
wire [7:0] tab_b;
reg [31:0] val;
reg [`N2-1:0] rsl;
reg [3:0] state;
reg [3:0] iState;
reg [3:0] next_state;
reg [`N2-1:0] init_addr;
reg [1:0] rd_cnt;

assign funcNum = rsl[`N2-1:0];
assign ram_addr = (init_start & !init_done) ? init_addr : b[`N2-1:0];
assign tab_b = ram_rd_data;

// decode non-lagged instr
wire call = (instrValid_nolag & (instr_nolag[31:26] == 6'b000011));
wire [25:0] addr = call ? jalr ? addr_in : {6'b0, instr_nolag[23:0], 2'b0} : 32'h00800000;

// initialize hash parameters
always @(posedge (clk)) begin
  if (reset) begin
    iState <= 4'b0000;
    init_done <= 1'b0;
  end else if (init_start & !init_done) begin
    // State 0: Start reading V1
    if (iState == 4'b0000) begin
      init_addr <= `N;
      iState <= 4'b1111;
      next_state <= 4'b0001;
      rd_cnt <= 2'b00;
    // State 1: Read next byte of V1
    end else if (iState == 4'b0001) begin
      init_addr <= init_addr + 1'b1;
      rd_cnt <= rd_cnt + 1'b1;
      case (rd_cnt)
        2'b00: V1[31:24] <= ram_rd_data;
        2'b01: V1[23:16] <= ram_rd_data;
        2'b10: V1[15:8]  <= ram_rd_data;
        2'b11: V1[7:0]   <= ram_rd_data;
      endcase
      if (rd_cnt == 2'b11) next_state <= 4'b0010;
      else                 next_state <= 4'b0001;
      iState <= 4'b1111; // go to intermediate state to wait for read data
    // State 2: Read A/B
    end else if (iState == 4'b0010) begin
      init_addr <= init_addr + 1'b1;
      rd_cnt <= rd_cnt + 1'b1;
      case (rd_cnt)
        2'b00: A1 <= ram_rd_data;
        2'b01: A2 <= ram_rd_data;
        2'b10: B1 <= ram_rd_data;
        2'b11: B2 <= ram_rd_data;
      endcase
      if (rd_cnt == 2'b11) next_state <= 4'b0011;
      else                 next_state <= 4'b0010;
      iState <= 4'b1111; // go to intermediate state to wait for read data
    // State 3: Hash initialization is done
    end else if (iState == 4'b0011) begin
      init_done <= 1'b1;
      iState <= 4'b0000;
    // Delay one cycle for RAM to read data
    end else if (iState == 4'b1111) begin
      iState <= next_state;
    end
  end else if (!init_start & init_done) begin // clear init_done at end of handshake
    init_done <= 1'b0;
  end
end

wire enbl = instr_init | call | jalr;
assign b = (val >> B1) & B2;

// state machine for hashing
always @(posedge (clk)) begin
  if (reset) begin
    state <= 3'b000;
  end else begin
    if (state == 3'b000) begin
      if (enbl) begin
        val = addr + V1;
        val = val + (val << 8);
        val = val ^ (val >> 4);
        state <= 3'b001;
      end else begin
        state <= state;
      end
    end else if (state == 3'b001) begin
      a = (val + (val << A1));
      state <= 3'b010;
    end else if (state == 3'b010) begin
      rsl = ((a >> A2) ^ tab_b);
      state <= 3'b000;
    end
  end
end

endmodule
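The hashing state machine above computes a function index from a call target in three cycles: an add-shift-xor mix, a shift-add, then a combine with a table lookup. A software model of the same arithmetic, with explicit 32-bit masking standing in for Verilog's fixed-width truncation (the real V1/A1/A2/B1/B2 and tab[] values are loaded from SDRAM at init time, so the parameters in the usage below are illustrative only):

```python
# Software model of AddressHash's three-cycle hash pipeline.
MASK32 = (1 << 32) - 1

def hash_addr(addr, V1, A1, A2, B1, B2, tab):
    # Cycle 1 (state 000): initial mix of the target address
    val = (addr + V1) & MASK32
    val = (val + (val << 8)) & MASK32
    val = val ^ (val >> 4)
    # Cycle 2 (state 001): shift-add
    a = (val + (val << A1)) & MASK32
    # Cycle 3 (state 010): combine with table entry at b = (val >> B1) & B2
    b = (val >> B1) & B2
    return ((a >> A2) ^ tab[b]) & MASK32
```

With all parameters zero and a one-entry table, the result is just the mixed address xored with tab[0], which makes the data path easy to check by hand.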
module DataStorage (
  input clk,
  input reset,
  input init_start,
  input retrieve_start,
  // Hash I/Os
  input hash_wren,
  input [`N2-1:0] hash_addr,
  input [7:0] hash_wr_data,
  output [7:0] hash_rd_data,
  // Address Stack I/Os
  input callstack_wren,
  input [`S2-1:0] callstack_addr,
  input [`N2-1:0] callstack_wr_data,
  output [`N2-1:0] callstack_rd_data,
  // Hierarchy Stack I/Os
  input hierstack_wren,
  input [`S2-1:0] hierstack_addr,
  input [`CW-1:0] hierstack_wr_data,
  output [`CW-1:0] hierstack_rd_data,
  // Storage I/Os
  input storage_wren,
  input [`N2-1:0] storage_addr,
  input [`CW-1:0] storage_wr_data,
  output [`CW-1:0] storage_rd_data
);

// 9-bit address for indexing into M4K (2^9 = 512, 512*8 = 4096)
wire [8:0] hash_9bit_addr      = { 1'b0, {(8-`N2){1'b0}}, hash_addr[`N2-1:0] };
wire [8:0] callstack_9bit_addr = { 1'b1, {(8-`S2){1'b0}}, callstack_addr[`S2-1:0] };

// Extend stack data to 8 bits
wire [7:0] callstack_8bit_wr_data = { {(8-`N2){1'b0}}, callstack_wr_data[`N2-1:0] };
wire [7:0] callstack_8bit_rd_data;
assign callstack_rd_data = callstack_8bit_rd_data[`N2-1:0];

reg init_done;

// muxing wires for resetting storage RAM
wire storage_wren_mux;
wire [`N2-1:0] storage_addr_mux;
wire [`CW-1:0] storage_wr_data_mux;
reg [`N2-1:0] reset_cnt;

assign storage_wren_mux    = (init_start & !init_done) ? 1'b1 : (retrieve_start) ? 1'b0 : storage_wren;
assign storage_addr_mux    = (init_start & !init_done) ? reset_cnt : storage_addr;
assign storage_wr_data_mux = (init_start & !init_done) ? 'b0 : storage_wr_data;

// state machine to reset all storage RAM data
always @(posedge (clk)) begin
  if (reset) begin
    init_done <= 1'b0;
    reset_cnt <= 'b0;
  end else if (init_start & !init_done) begin
    if (reset_cnt < `N-1) reset_cnt <= reset_cnt + 1'b1;
    else                  init_done <= 1'b1;
  end else if (!init_start & init_done) begin
    init_done <= 1'b0;
    reset_cnt <= 'b0;
  end
end

altsyncram hier_stack (
  .wren_a (hierstack_wren),
  .clock0 (clk),
  .clocken0 (1'b1),
  .address_a (hierstack_addr),
  .data_a (hierstack_wr_data),
  .q_a (hierstack_rd_data),
  .aclr0 (1'b0),
  .aclr1 (1'b0),
  .address_b (1'b1),
  .addressstall_a (1'b0),
  .addressstall_b (1'b0),
  .byteena_a (1'b1),
  .byteena_b (1'b1),
  .clock1 (1'b1),
  .clocken1 (1'b1),
  .clocken2 (1'b1),
  .clocken3 (1'b1),
  .data_b (1'b1),
  .eccstatus (),
  .q_b (),
  .rden_a (1'b1),
  .rden_b (1'b1),
  .wren_b (1'b0)
);
defparam
  hier_stack.clock_enable_input_a = "BYPASS",
  hier_stack.clock_enable_output_a = "BYPASS",
  hier_stack.intended_device_family = "Cyclone II",
  hier_stack.lpm_hint = "ENABLE_RUNTIME_MOD=NO",
  hier_stack.lpm_type = "altsyncram",
  hier_stack.numwords_a = `S,
  hier_stack.operation_mode = "SINGLE_PORT",
  hier_stack.outdata_aclr_a = "NONE",
  hier_stack.outdata_reg_a = "UNREGISTERED",
  hier_stack.power_up_uninitialized = "FALSE",
  hier_stack.ram_block_type = "M4K",
  hier_stack.widthad_a = `S2,
  hier_stack.width_a = `CW,
  hier_stack.width_byteena_a = 1;

// Instantiate the Hash Array & Call Stack using AltSyncRam
altsyncram hash_stack (
  // Hash Signals (Port A)
  .wren_a (hash_wren),
  .address_a (hash_9bit_addr),
  .data_a (hash_wr_data),
  .q_a (hash_rd_data),
  .byteena_a (1'b1),
  // Stack Signals (Port B)
  .wren_b (callstack_wren),
  .address_b (callstack_9bit_addr),
  .data_b (callstack_8bit_wr_data),
  .q_b (callstack_8bit_rd_data),
  .byteena_b (1'b1),
  // Common Signals
  .clock0 (clk),
  .clocken0 (1'b1),
  .aclr0 (1'b0),
  .aclr1 (1'b0),
  .addressstall_a (1'b0),
  .addressstall_b (1'b0),
  .clock1 (1'b1),
  .clocken1 (1'b1),
  .clocken2 (1'b1),
  .clocken3 (1'b1),
  .eccstatus (),
  .rden_a (1'b1),
  .rden_b (1'b1)
);
defparam
  hash_stack.address_reg_b = "CLOCK0",
  hash_stack.clock_enable_input_a = "BYPASS",
  hash_stack.clock_enable_input_b = "BYPASS",
  hash_stack.clock_enable_output_a = "BYPASS",
  hash_stack.clock_enable_output_b = "BYPASS",
  hash_stack.indata_reg_b = "CLOCK0",
  hash_stack.intended_device_family = "Cyclone II",
  hash_stack.lpm_type = "altsyncram",
  hash_stack.numwords_a = 512,
  hash_stack.numwords_b = 512,
  hash_stack.operation_mode = "BIDIR_DUAL_PORT",
  hash_stack.outdata_aclr_a = "NONE",
  hash_stack.outdata_aclr_b = "NONE",
  hash_stack.outdata_reg_a = "UNREGISTERED",
  hash_stack.outdata_reg_b = "UNREGISTERED",
  hash_stack.power_up_uninitialized = "FALSE",
  hash_stack.read_during_write_mode_mixed_ports = "DONT_CARE",
  hash_stack.widthad_a = 9, // (2^9)*8 = 4096. Just use the whole M4K(s)
  hash_stack.widthad_b = 9, // (2^9)*8 = 4096. Just use the whole M4K(s)
  hash_stack.width_a = 8, // if N <= 256, widthad can be 8. if more will need separate widths for each port
  hash_stack.width_b = 8, // if N <= 256, widthad can be 8. if more will need separate widths for each port
  hash_stack.width_byteena_a = 1,
  hash_stack.width_byteena_b = 1,
  hash_stack.wrcontrol_wraddress_reg_b = "CLOCK0";

// Instantiate the Storage RAM
altsyncram storage (
  .wren_a (storage_wren_mux),
  .clock0 (clk),
  .clocken0 (1'b1),
  .address_a (storage_addr_mux),
  .data_a (storage_wr_data_mux),
  .q_a (storage_rd_data),
  .aclr0 (1'b0),
  .aclr1 (1'b0),
  .address_b (1'b1),
  .addressstall_a (1'b0),
  .addressstall_b (1'b0),
  .byteena_a (1'b1),
  .byteena_b (1'b1),
  .clock1 (1'b1),
  .clocken1 (1'b1),
  .clocken2 (1'b1),
  .clocken3 (1'b1),
  .data_b (1'b1),
  .eccstatus (),
  .q_b (),
  .rden_a (1'b1),
  .rden_b (1'b1),
  .wren_b (1'b0)
);
defparam
  storage.clock_enable_input_a = "BYPASS",
  storage.clock_enable_output_a = "BYPASS",
  storage.intended_device_family = "Cyclone II",
  storage.lpm_hint = "ENABLE_RUNTIME_MOD=NO",
  storage.lpm_type = "altsyncram",
  storage.numwords_a = `N,
  storage.operation_mode = "SINGLE_PORT",
  storage.outdata_aclr_a = "NONE",
  storage.outdata_reg_a = "UNREGISTERED",
  storage.power_up_uninitialized = "FALSE",
  storage.ram_block_type = "M4K",
  storage.widthad_a = `N2,
  storage.width_a = `CW,
  storage.width_byteena_a = 1;

endmodule
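DataStorage packs the hash table and the call stack into a single 512-entry, 8-bit-wide M4K, using the ninth address bit to select which structure a port touches. A sketch of that address map in Python, assuming the maximal case `N2 = `S2 = 8 so the zero-padding terms in hash_9bit_addr/callstack_9bit_addr vanish (smaller configurations just use fewer low bits):

```python
# Software model of DataStorage's shared-M4K address map.
def hash_9bit_addr(hash_addr: int) -> int:
    # MSB = 0 selects the hash half (entries 0..255)
    return hash_addr & 0xFF

def callstack_9bit_addr(callstack_addr: int) -> int:
    # MSB = 1 selects the call-stack half (entries 256..511)
    return 0x100 | (callstack_addr & 0xFF)
```

Because the two halves differ in the MSB, the hash and stack ranges can never collide, which is what lets both structures share one physical block RAM.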
module vInitializer (
  input clk,
  input reset,
  output reset_out,
  input [25:0] pc,
  input init_start,
  output reg init_done,
  output reg init_hash_start,
  input init_hash_done,
  output reg hash_first_start,
  input hash_first_done,
  output reg hashWren,
  output reg [`N2-1:0] hashAddr,
  output reg [7:0] hashWrData,
  // Avalon bus side signals
  output reg avm_profileMaster_read,
  output reg avm_profileMaster_write,
  output reg [31:0] avm_profileMaster_address,
  output reg [31:0] avm_profileMaster_writedata,
  output reg [3:0] avm_profileMaster_byteenable,
  input avm_profileMaster_waitrequest,
  input [31:0] avm_profileMaster_readdata,
  input avm_profileMaster_readdatavalid
);

reg [2:0] state;
reg [2:0] next_state;
reg [`N2:0] read_count;
reg [2:0] hashByte;
reg [31:0] hashWrData32;
reg needData;

assign reset_out = (init_start & !init_done & state == 3'b000);

// Read each value from counter storage's RAM, write to SDRAM
always @(posedge (clk)) begin
  if (reset) begin
    state <= 3'b000;
    avm_profileMaster_write <= 1'b0;
    avm_profileMaster_read <= 1'b0;
    init_done <= 1'b0;
    init_hash_start <= 1'b0;
    hash_first_start <= 1'b0;
    hashByte <= 3'b000;
    needData <= 1'b0; // so we can start the next sdram read immediately
  end else begin
    if (init_start & !init_done) begin
      // State 0: Reset profiler
      if (state == 3'b000) begin
        // get sdram read ready
        avm_profileMaster_byteenable <= 4'b1111;
        // get ready to read V1/A/B from SDRAM
        avm_profileMaster_address <= `PROF_ADDR + `N;
        avm_profileMaster_read <= 1'b1;
        read_count <= 'b0;
        needData <= 1'b1;
        state <= 3'b001;
      // State 1: Read V1/A1/A2/B1/B2 from SDRAM, write to hash RAM
      end else if (state == 3'b001) begin
        if (!avm_profileMaster_waitrequest) begin
          if (needData) begin
            avm_profileMaster_read <= 1'b1;
            avm_profileMaster_address <= avm_profileMaster_address + 'h4;
            needData <= 1'b0;
          end else begin
            avm_profileMaster_read <= 1'b0;
          end
        end
        if (avm_profileMaster_readdatavalid) begin
          // prepare write-out to hash BRAM
          state <= 3'b111;
          hashAddr <= read_count + `N;
          hashByte <= 3'b001; // since i'm passing in the first byte now, don't re-send it next cycle
          hashWrData32 <= avm_profileMaster_readdata;
          hashWrData <= avm_profileMaster_readdata[31:24];
          read_count <= read_count + 'h4;
          // stop after V1/A/B are fully read & written
          if (read_count >= 'h8) begin // `N-4 b/c when reading at 28, we read the 28th, 29th, 30th, and 31st bits so we're done
            read_count <= 'b0;
            hashWren <= 1'b0;
            next_state <= 3'b010;
          end else begin
            hashWren <= 1'b1;
            next_state <= 3'b001;
            avm_profileMaster_read <= 1'b1; // get next sdram read started
          end
          if (read_count >= 'h4) needData <= 1'b0;
          else                   needData <= 1'b1;
        end
      // State 2: Initialize hash (put V1/A/B into registers)
      end else if (state == 3'b010) begin
        if (!init_hash_done) begin
          init_hash_start <= 1'b1;
        end else if (!avm_profileMaster_waitrequest) begin
          init_hash_start <= 1'b0;
          avm_profileMaster_read <= 1'b1;
          avm_profileMaster_address <= `PROF_ADDR;
          state <= 3'b011;
        end
      // State 3: Read tab[] from sdram, put to hash RAM
      end else if (state == 3'b011) begin
        // if we're not done getting tab[], keep reading
        if (!avm_profileMaster_waitrequest) begin
          if (needData) begin
            avm_profileMaster_read <= 1'b1;
            avm_profileMaster_address <= avm_profileMaster_address + 'h4;
            needData <= 1'b0;
          end else begin
            avm_profileMaster_read <= 1'b0;
          end
        end
        if (avm_profileMaster_readdatavalid) begin
          // prepare write-out to hash BRAM
          state <= 3'b111;
          hashAddr <= read_count[`N2-1:0];
          hashByte <= 3'b001; // since i'm passing in the first byte now, don't re-send it next cycle
          hashWrData32 <= avm_profileMaster_readdata;
          hashWrData <= avm_profileMaster_readdata[31:24];
          // if we've written all 4 bytes AND we're done reading from sdram, start reading V1
          if (read_count >= (`N-4)) begin // `N-4 b/c when reading at 28, we read the 28th, 29th, 30th, and 31st bits so we're done
            read_count <= 'b0;
            hashWren <= 1'b0;
            next_state <= 3'b100;
          end else begin // otherwise, come back here when bram's done
            read_count <= read_count + 'h4;
            hashWren <= 1'b1;
            next_state <= 3'b011;
            needData <= 1'b1;
          end
        end
      // State 4: Add first function (main/wrapper) to stack
      end else if (state == 3'b100) begin
        hash_first_start <= 1'b1;
        if (hash_first_done) begin
          hash_first_start <= 1'b0;
          state <= 3'b101;
        end
      // State 5: Tell processor to continue
      end else if (state == 3'b101) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_read <= 1'b0;
          avm_profileMaster_write <= 1'b1;
          avm_profileMaster_address <= `STACK_ADDR;
          avm_profileMaster_writedata <= 32'hDEADBEEF;
          state <= 3'b110;
        end
      // State 6: Wait until write finished
      end else if (state == 3'b110) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_write <= 1'b0;
          if (pc >= `WRAP_MAIN_BEGIN & pc <= `WRAP_MAIN_END) begin
            init_done <= 1'b1;
            state <= 3'b000;
          end
        end
      // State 7: Write the 4 bytes to the hash RAM
      end else if (state == 3'b111) begin
        // if we haven't written all 4 bytes, keep going
        if (hashByte < 3'b100) begin
          hashByte <= hashByte + 1'b1;
          hashAddr <= hashAddr + 1'b1;
          // choose byte to send (3'b000 is already sent)
          case (hashByte)
            3'b001: hashWrData <= hashWrData32[23:16];
            3'b010: hashWrData <= hashWrData32[15:8];
            3'b011: hashWrData <= hashWrData32[7:0];
          endcase
        end else begin
          hashWren <= 1'b0;
          state <= next_state;
        end
      end
    end else if (!init_start & init_done) begin
      init_done <= 1'b0; // reset init_done so handshaking can occur again
    end
  end
end

endmodule
module vRetriever ( // cycles
  input clk,
  input reset,
  input [25:0] pc,
  input retrieve_start,
  output reg retrieve_done,
  input [`CW-1:0] prof_data,
  output reg [`N2-1:0] prof_index,
  // Avalon bus side signals
  output reg avm_profileMaster_write,
  output reg [31:0] avm_profileMaster_address,
  output reg [31:0] avm_profileMaster_writedata,
  output reg [3:0] avm_profileMaster_byteenable,
  input avm_profileMaster_waitrequest
);

reg [2:0] state;
reg [`N2:0] prof_count; // used b/c prof_index can only go up to `N-1, so is always < `N (circular)

// Read each value from counter storage's RAM, write to SDRAM
always @(posedge (clk)) begin
  if (reset) begin
    retrieve_done <= 1'b0;
    state <= 3'b000;
    prof_index <= 'b0;
    avm_profileMaster_write <= 1'b0;
  end else begin
    if (retrieve_start & !retrieve_done) begin
      // State 0: Setup muxes
      if (state == 3'b000) begin
        avm_profileMaster_byteenable <= 4'b1111;
        prof_index <= 'b0;
        prof_count <= 'b0;
        state <= 3'b001;
      end
      // State 1: Prepare/write data to sdram -- PROBLEM, 64 bits will require 2 writes/reads
      else if (state == 3'b001) begin
        if (!avm_profileMaster_waitrequest) begin
          if (prof_count < `N) begin
`ifdef CW64
            avm_profileMaster_address <= `PROF_ADDR + {prof_index, 3'b000};
`else
            avm_profileMaster_address <= `PROF_ADDR + {prof_index, 2'b00};
`endif
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= prof_data;
            prof_index <= prof_index + 1'b1;
            prof_count <= prof_count + 1'b1;
`ifdef CW64
            state <= 3'b011;
`endif
          end else begin
            avm_profileMaster_address <= `STACK_ADDR;
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= 32'hCADFEED;
            state <= 3'b010;
          end
        end
      end
      // State 2: Wait until unstall write is finished
      else if (state == 3'b010) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_write <= 0;
          state <= 3'b000;
          retrieve_done <= 1'b1;
        end
      end
      // State 3: Write out top 32 bits for a 64-bit counter
`ifdef CW64
      else if (`CW > 32 & state == 3'b011) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_address <= `PROF_ADDR + {prof_index, 3'b100};
          avm_profileMaster_write <= 1'b1;
          avm_profileMaster_writedata <= prof_data[63:32];
          state <= 3'b001;
        end
      end
`endif
    end else if (!retrieve_start & retrieve_done) begin
      retrieve_done <= 1'b0; // reset retrieve_done so handshaking can occur again
    end
  end
end

endmodule
// Retrieve power results (assuming entry size of 160 = 32x5 bits)
module vPowerRetriever ( // power
  input clk,
  input reset,
  input [25:0] pc,
  input retrieve_start,
  output reg retrieve_done,
  input [`CW-1:0] prof_data,
  output reg [`N2-1:0] prof_index,
  // Avalon bus side signals
  output reg avm_profileMaster_write,
  output reg [31:0] avm_profileMaster_address,
  output reg [31:0] avm_profileMaster_writedata,
  output reg [3:0] avm_profileMaster_byteenable,
  input avm_profileMaster_waitrequest
);

reg [2:0] state;
reg [`N2:0] prof_count; // used b/c prof_index can only go up to `N-1, so is always < `N (circular)
reg [31:0] prof_addr; // this is to just add 32 bits to the address instead of doing shifting/etc.
reg [2:0] repeat_count; // used to count out all 32-bit words to write to sdram
reg [159:0] prof_data_write_out; // so we can start the next RAM read and not lose the current data

// Read each value from counter storage's RAM, write to SDRAM
always @(posedge (clk)) begin
  if (reset) begin
    retrieve_done <= 1'b0;
    state <= 3'b000;
    prof_index <= 'b0;
    avm_profileMaster_write <= 1'b0;
  end else begin
    if (retrieve_start & !retrieve_done) begin
      // State 0: Setup muxes
      if (state == 3'b000) begin
        avm_profileMaster_byteenable <= 4'b1111;
        prof_index <= 'b0;
        prof_count <= 'b0;
        prof_addr <= `PROF_ADDR;
        state <= 3'b001;
      end
      // State 1: Prepare/write data to sdram -- PROBLEM, 64 bits will require 2 writes/reads
      else if (state == 3'b001) begin
        if (!avm_profileMaster_waitrequest) begin
          if (prof_count < `N) begin
            avm_profileMaster_address <= prof_addr;
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= prof_data[159:128]; // just at intervals of 32 for sdram block size
            prof_data_write_out <= prof_data;
            prof_addr <= prof_addr + 32'h4; // 4 bytes = 32 bits = one sdram entry
            prof_index <= prof_index + 1'b1; // start the next RAM read so its ready in time
            repeat_count <= 'b0;
            state <= 3'b011;
          end else begin
            avm_profileMaster_address <= `STACK_ADDR;
            avm_profileMaster_write <= 1'b1;
            avm_profileMaster_writedata <= 32'hCADFEED;
            state <= 3'b010;
          end
        end
      end
      // State 2: Wait until unstall write is finished
      else if (state == 3'b010) begin
        if (!avm_profileMaster_waitrequest) begin
          avm_profileMaster_write <= 0;
          state <= 3'b000;
          retrieve_done <= 1'b1;
        end
      end
      // State 3: Write all 160 bits to SDRAM
      else if (state == 3'b011) begin
        if (!avm_profileMaster_waitrequest) begin
          if (repeat_count < 3'b100) begin
            avm_profileMaster_address <= prof_addr;
            avm_profileMaster_write <= 1'b1;
            prof_addr <= prof_addr + 32'h4;
            // choose word to send
            case (repeat_count)
              3'b000: avm_profileMaster_writedata <= prof_data_write_out[127:96];
              3'b001: avm_profileMaster_writedata <= prof_data_write_out[95:64];
              3'b010: avm_profileMaster_writedata <= prof_data_write_out[63:32];
              3'b011: avm_profileMaster_writedata <= prof_data_write_out[31:0];
            endcase
            repeat_count <= repeat_count + 1'b1;
          end else begin
            prof_count <= prof_count + 1'b1;
            avm_profileMaster_write <= 1'b0;
            state <= 3'b001;
          end
        end
      end
    end else if (!retrieve_start & retrieve_done) begin
      retrieve_done <= 1'b0; // reset retrieve_done so handshaking can occur again
    end
  end
end

endmodule
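vPowerRetriever streams each 160-bit power-profile entry to SDRAM as five consecutive 32-bit words, most-significant word first: state 1 writes bits [159:128], and state 3 writes the remaining four words at successive 4-byte addresses. That split can be modelled as a one-liner (an illustrative Python sketch, not part of the LEAP sources):

```python
# Software model of vPowerRetriever's write-out order for one 160-bit entry.
def split_entry(entry_160: int) -> list[int]:
    # Words in the order written to SDRAM: [159:128], [127:96], ..., [31:0]
    mask = (1 << 32) - 1
    return [(entry_160 >> shift) & mask for shift in (128, 96, 64, 32, 0)]
```

The host-side software that post-processes the SDRAM image would reassemble each entry by concatenating the five words in the same order.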
B.3 SnoopP Source Code
// Top-level module for SnoopP profiler
module SnoopP (
    input clk,
    input reset,
    input [25:0] pc,
    input [31:0] instr_in,
    input stall,
    // Handshaking ports
    input init_start,
    output init_done,
    input retrieve_start,
    output retrieve_done,
    // Avalon Profile Master ports
    output avm_profileMaster_read,
    output avm_profileMaster_write,
    output [31:0] avm_profileMaster_address,
    output [31:0] avm_profileMaster_writedata,
    input [31:0] avm_profileMaster_readdata,
    input avm_profileMaster_waitrequest,
    input avm_profileMaster_readdatavalid
);
    wire [`N2-1:0] addr_a;
    wire [`N2-1:0] count_a;
    wire [25:0] addr_lo_data;
    wire [25:0] addr_hi_data;
    wire [`CW-1:0] count_data;
    wire prof_init_start;
    wire prof_init_done;

    // mux avalon signals
    wire init_avm_profileMaster_write;
    wire [31:0] init_avm_profileMaster_address;
    wire [31:0] init_avm_profileMaster_writedata;
    wire retrieve_avm_profileMaster_write;
    wire [31:0] retrieve_avm_profileMaster_address;
    wire [31:0] retrieve_avm_profileMaster_writedata;

    assign avm_profileMaster_write = (init_start & !init_done) ?
        init_avm_profileMaster_write : retrieve_avm_profileMaster_write;
    assign avm_profileMaster_address = (init_start & !init_done) ?
        init_avm_profileMaster_address : retrieve_avm_profileMaster_address;
    assign avm_profileMaster_writedata = (init_start & !init_done) ?
        init_avm_profileMaster_writedata : retrieve_avm_profileMaster_writedata;

    // Instantiate the sub-modules
    Profiler profile (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .stall(stall),
        .addr_a(addr_a),
        .count_a(count_a),
        .addr_lo_data(addr_lo_data),
        .addr_hi_data(addr_hi_data),
        .count_data(count_data),
        .global_init(init_start | retrieve_start),
        .init_start(prof_init_start),
        .init_done(prof_init_done)
    );

    sInitializer init (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .init_start(init_start),
        .init_done(init_done),
        .addr_a(addr_a),
        .addr_lo_data(addr_lo_data),
        .addr_hi_data(addr_hi_data),
        .prof_init_start(prof_init_start),
        .prof_init_done(prof_init_done),
        // Avalon Bus side signals
        .avm_profileMaster_read(avm_profileMaster_read),
        .avm_profileMaster_write(init_avm_profileMaster_write),
        .avm_profileMaster_address(init_avm_profileMaster_address),
        .avm_profileMaster_writedata(init_avm_profileMaster_writedata),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest),
        .avm_profileMaster_readdata(avm_profileMaster_readdata),
        .avm_profileMaster_readdatavalid(avm_profileMaster_readdatavalid)
    );

    sRetriever retrieve (
        .clk(clk),
        .reset(reset),
        .pc(pc),
        .retrieve_start(retrieve_start),
        .retrieve_done(retrieve_done),
        .count_a(count_a),
        .count_data(count_data),
        // Avalon Bus side signals
        .avm_profileMaster_write(retrieve_avm_profileMaster_write),
        .avm_profileMaster_address(retrieve_avm_profileMaster_address),
        .avm_profileMaster_writedata(retrieve_avm_profileMaster_writedata),
        .avm_profileMaster_waitrequest(avm_profileMaster_waitrequest)
    );
endmodule
module Profiler (
    input clk,
    input reset,
    input [25:0] pc,
    input stall,
    // addresses to index into register arrays
    input [`N2-1:0] addr_a,
    input [`N2-1:0] count_a,
    // values returned from register arrays
    input [25:0] addr_lo_data,
    input [25:0] addr_hi_data,
    output [`CW-1:0] count_data,
    // init handshaking signals
    input global_init,  // to tell if the system is initializing
    input init_start,   // tell if new data is ready to write into reg's
    output reg init_done
);
    // Register arrays to store addr_lo/addr_hi/counters
    reg [25:0] addr_lo [`N-1:0];
    reg [25:0] addr_hi [`N-1:0];
    reg [`CW-1:0] count [`N-1:0];

    assign count_data = count[count_a];

    // initialize addr ranges
    always @(posedge (clk)) begin
        if (reset) begin
            init_done <= 1'b0;
        end else begin
            if (init_start & !init_done) begin
                addr_lo[addr_a] <= addr_lo_data;
                addr_hi[addr_a] <= addr_hi_data;
                init_done <= 1'b1;
            end else if (!init_start & init_done) begin
                init_done <= 1'b0;
            end
        end
    end

    // setup each addr range profiler
    integer i;
    reg [`CW-1:0] count_val;
    reg [`N2-1:0] count_cur;
    always @(posedge (clk)) begin
        if (reset) begin
            // Reset counters
            for (i = 0; i < `N; i = i + 1)
                count[i] <= 'b0;
        end else begin
            if (!global_init) begin
                // only count on stalls (or always if not measuring stalls)
                if (stall | (`PROF_METHOD != "s")) begin
                    for (i = 0; i < `N; i = i + 1) begin
                        // Check which function range the PC lies in
                        if (pc >= addr_lo[i] & pc <= addr_hi[i]) begin
                            count[i] <= count[i] + 1'b1;
                            count_val <= count[i] + 1'b1;
                            count_cur <= i;
                        end
                    end
                end
            end
        end
    end
endmodule
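For readers without a Verilog simulator on hand, the counting rule implemented by the Profiler module above can be sketched behaviorally in Python. This is an illustrative model only, not part of the thesis source: the address ranges and PC trace below are made up, and the `count_stalls_only` flag stands in for building with `PROF_METHOD` set to "s".

```python
def profile(ranges, pc_trace, stalls=None, count_stalls_only=False):
    """Model of the Profiler counting loop.

    ranges:   list of (addr_lo, addr_hi) pairs, one per counter.
    pc_trace: program-counter value observed on each cycle.
    stalls:   optional per-cycle stall flags (models the `stall` input).
    """
    counts = [0] * len(ranges)
    for cycle, pc in enumerate(pc_trace):
        # When profiling stalls, only stalled cycles are counted;
        # otherwise every cycle is counted.
        if count_stalls_only and not (stalls and stalls[cycle]):
            continue
        for i, (lo, hi) in enumerate(ranges):
            # Same comparison as the Verilog: addr_lo[i] <= pc <= addr_hi[i]
            if lo <= pc <= hi:
                counts[i] += 1
    return counts

# Two hypothetical function ranges; the PC spends 3 cycles in the
# first range and 2 in the second.
ranges = [(0x100, 0x1FF), (0x200, 0x2FF)]
trace = [0x100, 0x104, 0x108, 0x200, 0x204]
print(profile(ranges, trace))  # [3, 2]
```

Note that, as in the hardware, overlapping ranges each increment their own counter for the same cycle, since every range is checked in parallel.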
module sInitializer (
    input clk,
    input reset,
    input [25:0] pc,
    input init_start,
    output reg init_done,
    output reg [`N2-1:0] addr_a,
    output reg [25:0] addr_lo_data,
    output reg [25:0] addr_hi_data,
    output reg prof_init_start,
    input prof_init_done,
    // Avalon Bus side signals
    output reg avm_profileMaster_read,
    output reg avm_profileMaster_write,
    output reg [31:0] avm_profileMaster_address,
    output reg [31:0] avm_profileMaster_writedata,
    input avm_profileMaster_waitrequest,
    input [31:0] avm_profileMaster_readdata,
    input avm_profileMaster_readdatavalid
);
    reg [2:0] state;
    reg [2:0] next_state;
    reg [25:0] sdram_addr;
    reg [`N2+1:0] read_count;
    // used to divide by 2 since lo and hi reg's alternate
    wire [`N2-1:0] reg_addr = read_count[`N2:1];

    // Read each address range from SDRAM, write into the profiler's registers
    always @(posedge (clk)) begin
        if (reset) begin
            state <= 3'b000;
            avm_profileMaster_write <= 1'b0;
            avm_profileMaster_read <= 1'b0;
            avm_profileMaster_writedata <= 'b0;
            init_done <= 1'b0;
            prof_init_start <= 1'b0;
        end else begin
            if (init_start & !init_done) begin
                // State 0: Reset profiler
                if (state == 3'b000) begin
                    // get ready to read ranges from SDRAM
                    avm_profileMaster_address <= `PROF_ADDR;
                    sdram_addr <= `PROF_ADDR;
                    avm_profileMaster_read <= 1'b1;
                    prof_init_start <= 1'b1;
                    read_count <= 'b0;
                    state <= 3'b001;
                end else if (state == 3'b001) begin
                    // Handshaking with profiler module
                    if (prof_init_start & prof_init_done)
                        prof_init_start <= 1'b0;
                    // If waitrequest is low we can give another address to read from
                    if (!avm_profileMaster_waitrequest) begin
                        // If we've given addresses for all the blocks we want, stop reading
                        if (avm_profileMaster_address == `PROF_ADDR + (2*4*`N))
                            avm_profileMaster_read <= 1'b0;
                        else
                            avm_profileMaster_address <= avm_profileMaster_address + 4'h4;
                    end
                    // If we have valid data
                    if (avm_profileMaster_readdatavalid) begin
                        // if we've read all addr_lo/addr_hi, tell processor to unstall
                        if (read_count == (2*`N)) begin
                            avm_profileMaster_read <= 1'b0;
                            avm_profileMaster_write <= 1'b1;
                            avm_profileMaster_address <= `STACK_ADDR;
                            avm_profileMaster_writedata <= 32'hDEADBEEF;
                            state <= 3'b011;
                        end else begin
                            read_count <= read_count + 1'b1;
                            // this variable is only used for debug output
                            sdram_addr <= sdram_addr + 4'h4;
                            // store data
                            if (read_count[0]) begin
                                addr_hi_data <= avm_profileMaster_readdata; // store to addr_hi registers
                                addr_a <= reg_addr;
                                prof_init_start <= 1'b1;
                            end else begin
                                addr_lo_data <= avm_profileMaster_readdata; // store to addr_lo registers
                            end
                        end
                    end
                // State 3: Wait until write finished
                end else if (state == 3'b011) begin
                    if (!avm_profileMaster_waitrequest) begin
                        avm_profileMaster_write <= 1'b0;
                        init_done <= 1'b1;
                        state <= 3'b000;
                    end
                end
            end else if (!init_start & init_done) begin
                init_done <= 1'b0; // reset init_done so handshaking can occur again
            end
        end
    end
endmodule
module sRetriever (
    input clk,
    input reset,
    input [25:0] pc,
    input retrieve_start,
    output reg retrieve_done,
    output reg [`N2-1:0] count_a,
    input [`CW-1:0] count_data,
    // Avalon Bus side signals
    output reg avm_profileMaster_write,
    output reg [31:0] avm_profileMaster_address,
    output reg [31:0] avm_profileMaster_writedata,
    input avm_profileMaster_waitrequest
);
    reg [2:0] state;
    // used b/c prof index can only go up to `N-1, so is always < `N (circular)
    reg [`N2:0] prof_count;

    // Read each value from counter storage's RAM, write to SDRAM
    always @(posedge (clk)) begin
        if (reset) begin
            retrieve_done <= 1'b0;
            state <= 3'b000;
            count_a <= 'b0;
            avm_profileMaster_write <= 1'b0;
        end else begin
            if (retrieve_start & !retrieve_done) begin
                // State 0: Setup muxes
                if (state == 3'b000) begin
                    count_a <= 'h0;
                    prof_count <= 'b0;
                    state <= 3'b001;
                end
                // State 1: Prepare/write data to sdram
                else if (state == 3'b001) begin
                    if (!avm_profileMaster_waitrequest) begin
                        if (prof_count < `N) begin
`ifdef CW64
                            avm_profileMaster_address <= `PROF_ADDR + {count_a, 3'b000};
`else
                            avm_profileMaster_address <= `PROF_ADDR + {count_a, 2'b00};
`endif
                            avm_profileMaster_write <= 1'b1;
                            avm_profileMaster_writedata <= count_data;
                            count_a <= count_a + 1'b1;
                            prof_count <= prof_count + 1'b1;
`ifdef CW64
                            state <= 3'b011;
`endif
                        end else begin
                            avm_profileMaster_address <= `STACK_ADDR;
                            avm_profileMaster_write <= 1'b1;
                            avm_profileMaster_writedata <= 32'hCADFEED;
                            state <= 3'b010;
                        end
                    end
                end
                // State 2: Wait until unstall write is finished
                else if (state == 3'b010) begin
                    if (!avm_profileMaster_waitrequest) begin
                        avm_profileMaster_write <= 0;
                        state <= 3'b000;
                        retrieve_done <= 1'b1;
                    end
                end
                // State 3: Write out top 32 bits for a 64-bit counter
`ifdef CW64
                else if (`CW > 32 & state == 3'b011) begin
                    if (!avm_profileMaster_waitrequest) begin
                        avm_profileMaster_address <= `PROF_ADDR + {count_a, 3'b100};
                        avm_profileMaster_write <= 1'b1;
                        avm_profileMaster_writedata <= count_data[63:32];
                        state <= 3'b001;
                    end
                end
`endif
            end else if (!retrieve_start & retrieve_done) begin
                retrieve_done <= 1'b0; // reset retrieve_done so handshaking can occur again
            end
        end
    end
endmodule
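The address concatenations in sRetriever ({count_a, 2'b00} for 32-bit counters, or byte offsets within an 8-byte slot when CW64 is defined) amount to a simple strided layout in SDRAM. The Python sketch below illustrates one plausible reading of that layout; it is not from the thesis, and the base address and counter values are placeholders.

```python
PROF_ADDR = 0x1000  # placeholder base; the real `PROF_ADDR is configured elsewhere

def counter_words(counts, cw64=False):
    """Map each counter to the 32-bit SDRAM word(s) holding it.

    32-bit counters: one word at PROF_ADDR + 4*i ({count_a, 2'b00}).
    64-bit counters: low word at PROF_ADDR + 8*i, high word 4 bytes above
    (the {count_a, 3'b000} / {count_a, 3'b100} addressing).
    """
    mem = {}
    for i, c in enumerate(counts):
        if cw64:
            mem[PROF_ADDR + (i << 3)] = c & 0xFFFFFFFF             # low 32 bits
            mem[PROF_ADDR + (i << 3) + 4] = (c >> 32) & 0xFFFFFFFF  # high 32 bits
        else:
            mem[PROF_ADDR + (i << 2)] = c & 0xFFFFFFFF
    return mem

print(counter_words([10, 20]))  # {4096: 10, 4100: 20}
```

After the last counter is written out, the hardware writes the sentinel value 0xCADFEED to `STACK_ADDR so the stalled processor can resume.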