DISSERTATION
An Energy Aware Framework for Mobile Computing
carried out for the purpose of obtaining the academic degree of Doctor of Technical Sciences

submitted to the Faculty of Electrical Engineering and Information Technology, Technische Universität Wien

by

Dipl.-Ing. Naeem Zafar Azeemi
Brigittenauer Lände 224/ 6643, 1200 Wien
born in Karachi, Pakistan, on August 14, 1968
Matriculation number: 0327346
October 6, 2007
Advisor
Univ.Prof. Dipl.-Ing. Dr.techn. Markus Rupp
Technische Universität Wien
Institut für Nachrichtentechnik und Hochfrequenztechnik
Examiner
Univ.Prof. Dr.phil.nat. Christoph Grimm
Technische Universität Wien
Institut für Computertechnik
To Amra, Mukashfa and Kunza
ABSTRACT
Since their inception, energy dissipation has been a critical issue for mobile computing systems. Although a large research investment in low-energy circuit design and hardware-level energy management has led to more energy-efficient architectures, there is a growing realization that the contribution to energy conservation should be considered more rigorously at higher levels of the system, such as the operating system and applications.
This dissertation puts forth the claim that energy-aware compilation, which improves application quality in terms of both execution time and energy consumption, is essential for high-performance mobile computing embedded system design. Our work is a design paradigm shift from the logic gate as the basic silicon computation unit to an instruction running on an embedded processor. Multimedia DSP processors are the most attractive choice for a mobile computing system design, for their optimal performance delivery at high data throughput and low energy. They use instruction-level parallelism (ILP) in programs to execute more than one primitive instruction at a time. In this work, we exploit the parallelism slacks unraveled by the native multimedia DSP compilers. We propose an iterative compilation environment to optimize a given 'C' source code. The contribution of our framework is the collaboration of an application profile monitor (APM) with an optimization engine in native multimedia DSP Software Development Environments (SDE). We propose to monitor application behavior at all levels (static, compilation, scheduling, linking, and during execution). These APMs are later used in an optimization engine to speculate optimal code transformation schemes, which are applied successively across the basic code blocks. We propose two methods for the selection of optimization schemes: Gradient Mode Iterative Compilation (GMIC) and Multicriteria Stochastic Iterative Compilation (MSIC). Both schemes are tested on several multimedia applications drawn from diversified domains such as video transcodecs (MPEG2, H-264L), audio transcodecs (G-723, MP3) and bioinformatics (Glimmer, Fgene), to name a few.
Finally, we propose the characterization of application-architecture correlations, which supports our claim that ideal performance of a mobile computing system demands a perfect match between hardware capability and program behavior. We present results for 20 multimedia applications evaluated on the TriMedia DSP 1300, the Blackfin DSP ADSP533, and the PIII-850 embedded processor.
Keywords: Energy Aware, Source-to-Source, Multimedia Processor, Workload Characterization.
ZUSAMMENFASSUNG
Energy consumption has been a decisive factor since the advent of mobile computing systems. Although numerous research results have already led to low-energy hardware solutions, it has meanwhile become clear that energy savings at higher levels, such as operating systems and applications, should be taken into consideration more strongly.

This dissertation demonstrates that energy-aware compilation reduces execution time and thus constitutes an essential criterion for an efficient embedded system for mobile computing. Our work deals with a new design paradigm that no longer concentrates on individual logic gates as the basic design elements, but instead addresses individual instructions on an embedded processor. Digital signal processors for multimedia applications are the most economical solution for a mobile computing system, guaranteeing optimal data throughput at low energy demand. They exploit instruction-level parallelism (ILP) in programs in order to execute several primitive instructions at the same time. In this dissertation, program parallelization is captured with a dedicated monitor. Furthermore, we propose an iterative compilation to optimize given program code in 'C'. A further contribution consists of a programming environment for the analysis and optimization of applications. Here the program behavior is monitored at several levels (statically, at compilation, scheduling, linking, and during execution). These analyses are subsequently used by an optimization program to determine an optimal compiler configuration. Two different methods for selecting the optimization options are presented, namely a gradient method and a stochastic method. Both methods are tested with various multimedia applications from different domains such as video coding (MPEG2, H-264L), audio coding (G-723, MP3) and bioinformatics (Glimmer, Fgene).

Finally, we propose metrics for capturing the correlation between application and hardware, which support our claim that the ideal performance of a mobile computing system can only be achieved when the hardware capability and the program behavior match perfectly. The performance of these metrics is demonstrated on the TriMedia DSP 1300, Blackfin DSP ADSP533 and PIII-850 processors.
Keywords: energy-aware, source code transformation, embedded systems, multimedia processors, mobile computing, workload characterization
ACKNOWLEDGEMENTS
I would like to thank my teacher Khwaja Shamsuddin Azeemi and my parents, who have had a positive effect on me personally, and to whom I owe a debt of gratitude for helping, in one way or another, to shape the person I am today.

First and foremost, I thank my supervisor Dr. Markus Rupp for his consistent efforts to invoke my inherent skills to accomplish this task successfully. I appreciate his bottomless patience in technical review and his substantive comments, which improved the readability of this dissertation.

Thanks to my sister Farhi, and brothers Waseem and Nadeem, who provide encouragement in the face of every seemingly impossible task that I face.

Thanks to Afsar, Sobia, Shams Sahib, Ana Eliza and Liana for their love, support and great understanding, especially during vulnerable moments.

Thanks to my friends, colleagues and acquaintances: Bastian and Martin at the Christian Doppler Laboratory; Sabine from Vienna; Naveed and Saima from Boston; Nadeem and family from San Francisco; and Amir Malik and family from Korea, for their kind assistance and facilitation during the last 45 months.

I would like to acknowledge valuable technical support from Dr. Arpad Scholtz at the Institute of Communications and Radio Frequency Engineering, Dr. Stefan Mahlknecht at the Institute of Computer Technology, and Aneesa Sultan at the Vienna Bio Center.

I am also grateful to Dr. Christoph Grimm for his time and patience in reviewing this manuscript.
CONTENTS
1 Introduction
   1.1 Motivation
      1.1.1 Mobile Embedded System Constraints
      1.1.2 IC Fabrication Technology Constraints
      1.1.3 Battery Technology Constraints
      1.1.4 Architecture-Application Correlation Slacks
   1.2 Design Space Exploration
   1.3 Thesis Outline

2 Energy-Cycle Aware Compilation Framework (ECACF)
   2.1 Energy Saving Techniques - A Review
      2.1.1 Fabrication level power reduction
      2.1.2 Processor level power reduction
      2.1.3 EDA tools level power reduction
      2.1.4 Compiler level power reduction
      2.1.5 Low power data structures
      2.1.6 Idle mode power reduction
      2.1.7 Power reduction in distributed computing systems
      2.1.8 Power reduction in communication systems
      2.1.9 Battery aware power reduction
   2.2 Multimedia DSP CPU Architecture
      2.2.1 Multimedia Processor Execution Model
      2.2.2 Multimedia Processor Operations Overview
   2.3 Workload Description
      2.3.1 Multimedia Applications
      2.3.2 Bioinformatics Workload
   2.4 Energy Cycle Aware Compilation Framework Methodology
      2.4.1 Application Expression Profile
   2.5 Experimental Setup
      2.5.1 Related Work for Energy Measurement
      2.5.2 Proposed Methodology
   2.6 Conclusions

3 Gradient Mode Iterative Compilation (GMIC)
   3.1 GMIC Architecture
      3.1.1 Performance Qualifier Measurement
      3.1.2 Code Block Queuing
      3.1.3 Code Block Expression Profile
      3.1.4 Transformation Scheme
   3.2 Implementation
   3.3 Example: Optimization of an MPEG-1 encoder
   3.4 Discussion
   3.5 Conclusions

4 Multicriteria Stochastic Iterative Compilation (MSIC)
   4.1 Introduction
   4.2 Model Development
      4.2.1 Objects and Constraints
      4.2.2 Case Study I - Arbitrary Application
      4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)
   4.3 Performance Comparison with GMIC
   4.4 Conclusions

5 Application-Architecture Characterization
   5.1 Terminologies
      5.1.1 Principal Component Analysis (PCA)
      5.1.2 Scree Plot
      5.1.3 Box Plot
      5.1.4 Scatter Plot
      5.1.5 Differential Application Expression Profile (dAEP)
   5.2 Application Characterization
      5.2.1 Case Study 1
      5.2.2 Case Study 2
      5.2.3 Case Study 3
   5.3 Architecture-Centric Application Characterization
   5.4 Conclusions

6 Conclusions

Appendices
A List of Application Expression Profile (AEP) Monitors
B VLIW Descriptor File (VDF) Format
C User Constraints Files (UCF) Format
   C.1 UCF for MPEG-1 encoder example in Section 3.3
   C.2 UCF for NLIVQ example in Section 4.2.3
D Application Attributes
E List of Acronyms
LIST OF FIGURES
1.1 Power consumption for Intel CPUs [1].
1.2 Thermal and power delivery cost in a desktop PC [2].
1.3 Battery technologies and their capacities [3].
1.4 Thesis structure.
2.1 TriMedia VLIW instruction [4].
2.2 TriMedia functional unit assignment [4].
2.3 Transformation methodology.
2.4 Vertical application profile layers.
2.5 Experimental setup for instruction/program current measurement [5].
2.6 Proposed experimental setup for application current measurement at processor and memory.
2.7 Current consumption for vector quantization (VQ) application execution life cycle.
2.8 CPU core current consumption versus address range for VQ application.
2.9 Memory current consumption versus address range for G-728 audio transcodec.
2.10 CPU core current consumption versus address range for G-728 audio transcodec.
2.11 CPU peripheral current consumption versus address range for G-728 audio transcodec.
3.1 Gradient Mode Iterative Compilation methodology (GMIC).
3.2 Fraction of JPMO per CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.
3.3 Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view for seven blocks).
3.4 GMIC algorithm.
3.5 Heuristic track of CTxy tuple for an MPEG-1 encoder application.
3.6 Heuristic track of CTxy tuple for FFT application.
3.7 Heuristic track of CTxy tuple for IDCT application.
3.8 Heuristic track of CTxy tuple for T64 application.
3.9 Heuristic track of CTxy tuple for M100 application.
3.10 Heuristic track of CTxy tuple for H-264L application.
4.1 A simplified view of the framework with the multicriteria methodology extension.
4.2 Simplified genetic algorithm model [6].
4.3 Development of fitness function for Case Study 1 in TS1 and TS2.
4.4 Fraction of IPC for Case Study 1 in TS1 and TS2.
4.5 Fraction of IPC and energy overlapping for Case Study 1 in TS1 and TS2.
4.6 Fraction of CPU cycles for CB life time (CBLT) in NLIVQ application (25 CB are numbered from F01 to F25).
4.7 Development of the fitness function for NLIVQ.
4.8 Fraction of IPC for NLIVQ.
4.9 Fraction of energy saving for NLIVQ.
4.10 Fraction of functional unit utilization for NLIVQ.
5.1 Scatter plot for 20 applications at the TriMedia processor.
5.2 PCA Scree plot for 20 applications at the TriMedia processor.
5.3 PCA box plot for 20 applications at the TriMedia processor.
5.4 PCA biplot for 20 applications at the TriMedia processor.
5.5 Scatter plot for 20 applications at the Blackfin processor.
5.6 PCA biplot for 20 applications at the Blackfin processor.
5.7 Scatter plot for 20 applications at the PIII 850 processor.
5.8 PCA biplot for 20 applications at the PIII 850 processor.
5.9 Differential AEP across three hardware platforms.
5.10 PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.
5.11 PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.
5.12 PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.
LIST OF TABLES
2.1 Energy reduction techniques for embedded system design.
2.2 Multimedia Benchmarks (Speech Transcodecs).
2.3 Multimedia Benchmarks (Video Transcodecs).
2.4 Multimedia Benchmarks (Audio Transcodecs).
2.5 Generic DSP application Benchmarks [7].
2.6 Test Vectors Characterization.
2.7 Bio-Computation Applications Benchmark.
3.1 Transformation Schemes.
3.2 Gradient Table.
4.1 CBLT in CPU cycles for NLIVQ.
4.2 Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.
4.3 Sum of absolute difference for TS04, TS07, TS09.
4.4 Performance comparison between GMIC and MSIC.
5.1 MPEGdec profile for successive transformations [8].
D.1 Pseudonyms for 20 applications.
D.2 AEP for optimized 20 applications at the TriMedia processor.
D.3 AEP for optimized 20 applications at the Blackfin processor.
D.4 AEP for optimized 20 applications at the PIII 850 processor.
D.5 dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.
D.6 dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.
D.7 dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.
1 INTRODUCTION
1.1 Motivation
The growing trend toward untethered, ubiquitous computing entails many performance-related issues. The ideal performance of a mobile computing system demands a perfect match between architecture capability and program behavior. Architecture performance can be enhanced with better hardware technology, innovative low-geometry Integrated Circuit (IC) features, and efficient resource management [9]. In the same vein, the demand for multimedia functions on handheld devices requires enormous computational power to handle large data and program sizes. Efficient architecture utilization, in terms of both energy dissipation and execution time, and optimal application firmware are two important performance metrics for these embedded systems. Optimal architecture utilization is hampered by various design limitations, such as high-level system design constraints, fabrication-level constraints, and battery technology constraints. They are discussed next in more detail.
1.1.1 Mobile Embedded System Constraints
Mobile embedded systems (MES) present unique challenges and opportunities for system-level low-energy design, for example:
• MES are usually severely energy-constrained. In particular, handheld devices, airborne systems, and spaceborne systems are typically battery-operated and therefore have a limited energy budget [10]. MES are also typically more time-constrained than portable embedded or general-purpose systems. The challenge, therefore, is to save energy while guaranteeing temporal constraints.
• Some MES applications, such as avionics, robotics, and deep-space missions, require systems with small form factors, which in turn mandate low heat dissipation. Since heat is a byproduct of energy dissipation, low-energy system design ensures a more reliable system by limiting the heat produced.
• MES are typically over-designed to ensure that temporal deadline guarantees are met even if all tasks take up their Worst-Case Execution Time (WCET). Since, in the average case, tasks do not require their WCET, this redundancy in hardware design makes MES energy-inefficient.

In short, system-level techniques can decrease this energy dissipation through the use of energy-aware task scheduling algorithms while preserving temporal constraints.
1.1.2 IC Fabrication Technology Constraints
Integrated circuits in their various incarnations consume electric power. This power is dissipated both by the switching devices contained in the IC (such as transistors) and as heat due to the resistivity of the electrical circuits. It is a major consideration in the design of microprocessors and the embedded systems they are used in [11]. Figure 1.1 shows the power consumption of the Intel series of processors produced over the last two decades [1]. The horizontal axis shows the advancement in IC fabrication technology in terms of chip geometry (i.e., nanometers), while power dissipation is plotted in Watts. Each point is marked with two numbers, giving chip geometry and power consumption, respectively. Points lying on the same vertical axis, such as (350, 43) and (350, 34.8), show processors in the same technology but with different performance; (350, 43) and (350, 34.8) correspond to the PII 300 MHz and PII 233 MHz, respectively. Similarly, the P4 3 GHz was fabricated at 130 nm and 81.9 W, while a later version at a smaller geometry, the P4 EE 3.40 GHz, was fabricated at 90 nm and 83.9 W; a further version with a higher operating frequency (P4 EE 3.73 GHz) uses the same geometry but at the penalty of increased power consumption, i.e., 115 W. The increasing trend towards special-purpose core processors has further reduced the geometry, down to 65 nm at a power consumption of 130 W (for the Intel Core 2 Extreme QX6700). Readers are encouraged to consult [1] [12] [13] for a detailed view of the power versus technology trends realized by various CPU manufacturers.
Attempts to shape the power-geometry envelope (shown as a 'shoe' in Figure 1.1) reach their limits at a fabrication technology of 50 nm, where leakage current starts to dominate the power consumption (discussed further in Chapter 2). Although special-purpose core processors have been implemented at 50 nm [14] [12] with a power consumption of 14.5 W (shown at the bottom of the heel in Figure 1.1), their operating frequency is limited to 130 MHz, which is not sufficient to meet the current demand for multimedia processing. The designer's goal of achieving a low-leakage 'heel' in the power-geometry shoe is associated with a high power cost. This cost has two components. The first is the thermal cost, associated with keeping the devices below the specified operating temperature limits; maintaining the integrity of the packaging at higher temperatures also requires expensive solutions. The second component is the on-board power delivery cost, related to the on-board decoupling capacitances and interconnects of the power distribution network. Moreover, the increasing trend towards driving the CPU at lower operating voltage and higher frequency increases the magnitude of the current drawn by the CPU. This exacerbates resistive and inductive noise problems and leads to a significant increase in system cost.
Fig. 1.1: Power consumption for Intel CPUs [1].
Figure 1.2 gives an idea of the range of dollar amounts associated with the above costs for different system components [2]. As can be seen, when the system power is in the 35-40 W range, the cost of each additional Watt grows above $1/W per chip. Designers have already pushed the fabrication limits to achieve low-energy design goals [15]; e.g., shrinking the integrated circuit geometry below 50 nm doubles the leakage current compared to 65 nm. Such issues exacerbate the need to consider low-energy design more rigorously at higher levels of the system hierarchy [5].
1.1.3 Battery Technology Constraints
The energy constraints on mobile devices are becoming increasingly tight as complexity and performance requirements continue to be pushed by user demand [16]. Processor speeds have doubled approximately every 18 months, as predicted by Moore's law [17]. While processor speed and energy consumption have increased rapidly, the corresponding improvement in battery technology has been slow. In fact, battery capacity has increased by a factor of less than four over the last three decades [3] [18].
Fig. 1.2: Thermal and power delivery cost in a desktop PC [2].
Figure 1.3 shows the current state of the art in battery technology. The slow increase in battery capacity is hampered by the limits of ionization chemistry [3] [19]. The design target of batteries with long life-span and small size is hard to achieve. E.g., though Ni-MH is lighter in weight than Ni-Cd, it requires a longer recharging time. In the same vein, Li-Ion batteries are more promising, offering higher energy density, a large number of charging cycles, little memory effect, and longer shelf life, but their higher cost and the need for external protection against deep discharge inhibit their low-cost, widespread use. In short, the technological constraints on the realization of high-capacity, small-size batteries highlight the importance of low-energy considerations.
1.1.4 Architecture-Application Correlation Slacks
Fig. 1.3: Battery technologies and their capacities [3].

Traditionally, optimal MES performance is sought by focusing on the underlying hardware architecture. This ignores the fact that it is the software executing on a CPU that determines its energy consumption. The execution time and energy consumption of a program on any parallel processor depend not only on the composition of the operations contained within the program, but also on the ability of users to express the parallelism at the correct granularity level for the processor. Therefore, to fairly compare the cycle-energy performance of two applications on a given processor, two different mappings are required, one for each application. An integrated approach that considers energy-cycle performance at the architecture as well as the application level is essential for energy-efficient application development.
1.2 Design Space Exploration
Program behavior is difficult to predict due to its heavy dependence on the application and on run-time conditions [20] [21]. For mobile computing, application performance can be optimized by using parallel hardware architectures, such as Very Long Instruction Word (VLIW) architectures [22] [23]. VLIW architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units. They fetch a Very Long Instruction Word containing several primitive instructions from the instruction cache, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers, which generate code in which independent primitive instructions executable in parallel are grouped together. The processors have relatively simple control logic because they perform neither dynamic scheduling nor reordering of operations (as most contemporary superscalar processors do). The instruction set of a VLIW architecture tends to consist of simple (RISC-like) instructions. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough instruction-level parallelism in a code sequence to fill the available operation slots.
In mobile computing software design, the conventional software development environment (for compilation and machine code generation) cannot be used as-is. These methods primarily consider execution time and code size, while the energy dissipation issue is piggy-backed onto the final design; this inevitably leads to an expensive cooling mechanism, and eventually increases the overall system cost while reducing reliability.
The software perspective on power consumption was the subject of the work in [24], where a detailed instruction-level power model of the Intel 486DX2 was built. The impact of software on CPU power and energy consumption, and software optimizations to reduce them, were studied. It is well known that the number of useful instructions always differs from the number of instructions in the static code: the code execution flow determines the number of useful instructions according to the input data. Therefore, computing the total energy merely by adding the energy consumption of individual instructions does not provide the actual energy consumption of the program, as claimed in [24].
In this thesis we propose a framework in which software applications optimally utilize the hardware architecture to deliver energy-cycle performance within user-defined constraints. Our energy-aware framework in [25] meets this demand by incorporating the following features in a native multimedia DSP compilation environment.
1) The framework transforms the legacy application source code into optimal 'C' source code, taking advantage of the different slacks appearing in the application-to-binary development hierarchy.

2) Unlike conventional techniques, the 'C' source code is iteratively compiled for different performance goals, in terms of execution time as well as energy dissipation.

3) We developed post-profiling techniques, published in [26], to evaluate application performance not only at the compilation layer (as a conventional compiler does) but also at the scheduling layer, the linker layer, the machine code generation layer, and finally at the loader layer.

4) We measure the real-time performance of applications running on actual hardware. These measured parameters are further used to tune the transformation scheme of the legacy software application.

5) We tested our framework on applications that belong to diversified industrial domains, such as audio transcodecs [27], video transcodecs [8], speech codecs, and bioinformatics applications [28] [29].

6) The work is further extended in [30] [27] to characterize application-architecture correlations, which are well suited for a pre-design assessment of an embedded system. It answers the question of whether a given hardware architecture is an appropriate choice for a given multimedia software application.
It may be noted that the terms power consumption and energy consumption are often
interchanged. It is important to distinguish between the two when we talk of either
in the context of programs running on mobile devices. Mobile systems run on the
limited energy available in a battery. Therefore, the energy consumed by the system,
or by the software running on it, determines the length of the battery life.
This thesis is based on the following publications.
• N. Zafar Azeemi, A. Sultan ”Characterization of Bioinformatics Applications on
Multimedia Processor”, in Proc. IEEE Cairo International Biomedical Engineering
Conference (CIBEC ’06), pages BI06-BI09, 195 - 200, Cairo, Egypt, December,
2006.
• N. Zafar Azeemi ”Handling Architecture-Application Dynamic Behavior in Set-
top Box Applications”, in Proc. IEEE International Conference on Information
and Automation (ICIA ’06), pages 195 - 200, Colombo, Sri Lanka, December,
2006.
• N. Zafar Azeemi, A. Sultan, A. Muhammad ”Parameterized Characterization of
Bioinformatics Workload on SIMD Architecture”, in Proc. IEEE International Con-
ference on Information and Automation (ICIA ’06), pages 189 - 194, Colombo,
Sri Lanka, December, 2006.
• N. Zafar Azeemi ”Multicriteria Energy Efficient Source Code Compilation for De-
pendable Embedded Applications”, in Proc. IEEE International Conference on
Information Technology (IIT ’06), Dubai, UAE, November, 2006.
• N. Zafar Azeemi ”Compiler Directed Battery-Aware Implementation of Mobile Ap-
plications”, in Proc. IEEE 2nd International Conference on Emerging Technologies
(ICET ’06), pages 151 - 156, Peshawar, Pakistan, November, 2006.
• N. Zafar Azeemi ”A Multiobjective Evolutionary Approach for Constrained Joint
Source Code Optimization”, in Proc. ISCA 19th International Conference on Com-
puter Application in Industry (CAINE ’06), pages 175 - 180, Las Vegas, Nevada,
USA, November, 2006.
• N. Zafar Azeemi ”Probabilistic Iterative Compilation for Source Optimization of
Embedded Programs”, in Proc. 2006 IEEE International SoC Design Conference
(ISOCC ’06), pages 323 - 328, Seoul, Korea, October, 2006.
• N. Zafar Azeemi, M. Rupp ”Multicriteria Low Energy Source Level Optimization of
Embedded Programs”, in Proc. Tagungsband zur Informationstagung Mikroelek-
tronik (ME ’06) IEEE Austria, pages 150 - 158, Vienna, Austria, October, 2006.
• N. Zafar Azeemi ”Architecture-Aware Hierarchical Probabilistic Source Optimiza-
tion”, in Proc. ISCA 19th International Conference on Parallel and Distributed
Computing Systems (PDCS ’06), pages 90 - 95, San Francisco, USA, September,
2006.
• N. Zafar Azeemi ”Power Aware Framework for Dense Matrix Operations in Mul-
timedia Processors”, in Proc. IEEE 9th International Multi-topic Conference (IN-
MIC ’05), Karachi, Pakistan, December, 2005.
• N. Zafar Azeemi, M. Rupp ”Energy-Aware Source-to-Source Transformations for
a VLIW DSP Processor”, in Proc. IEEE 17th International Conference on Micro-
electronics (ICM ’05), pages 133 - 138, Islamabad, Pakistan, December, 2005.
• N. Zafar Azeemi ”A Framework for Architecture Based Energy-Aware Code Trans-
formations in VLIW Processors”, in Proc. International Symposium on Telecom-
munication (IST ’05), pages 393 - 398, Shiraz, Iran, September, 2005.
1.3 Thesis Outline
This thesis is organized in five chapters, as shown in Figure 1.4. A brief description of
each chapter is given below.
Chapter 1: We discuss the different design limitations, such as high-level system design
constraints, fabrication-level constraints, and battery technology constraints. We explore
the design slacks that exist in contemporary work [31] [24] [5] for energy aware code
optimization. We explain the thesis structure and provide a detailed list of contributions.
Chapter 2: This chapter lays the necessary foundation for the development of our
energy cycle aware iterative compilation framework. Our methodology optimizes a soft-
ware application for energy consumption, execution time as well as efficient hardware
architecture utilization. As compared to [5] [32] [33] [34], we elaborate our method
for generic multimedia processors. Unlike [35] [36], we define a software application
in terms of its architectural behavior. We provide a simplified overview of typical
multimedia processors. Though various multimedia operation models are presented in
[37] [31] [38] [39] [40], their complexity prevents them from being readily usable in a
real-time optimization environment. We use a simplified multimedia operation model,
developed in [4], that views the instruction set in terms of load/store operations, compute
operations, special register operations and control flow operations. The measurement
of energy consumption made by an application on a real-time platform is the first
step in any energy-constrained embedded system and can be used to estimate the
battery lifetime of the system.
Fig. 1.4: Thesis Structure.
The experimental setup proposed in [5] [32] [41]
for instruction/program current measurement, addressing modes, immediate operands,
and exhaustive characterization is very time consuming. We present here a measure-
ment platform that is generic and applicable to most off-the-shelf available multimedia
processors. It is based on current measurement at both processor and memory input
lines. Unlike the instruction based energy model presented in [42] [24], we propose a
simplified energy consumption model based on code blocks. We describe a step-by-step
procedure for measuring software application energy consumption on a target
hardware architecture. As compared to [24] [32] [41], we apply our framework to two
major application domains, multimedia and bioinformatics. The multimedia application
set consists of encoders and decoders (transcodecs) encompassing three media types -
speech, video, and audio (music), whereas we categorize the basic functionality offered
by all bioinformatics tools into four groups: pattern recognition algorithms, rule-based
analysis, biological databases and biological taxonomy. The results published
in [28] [29] reveal the usefulness of our framework across diverse application domains.
Several energy reduction opportunities at the design level are also presented.
Chapter 3: Our energy cycle aware compilation framework is powered by a source
code transformation engine. Unlike [43] [42] [24], we implement our scheme by first
investigating the ’C’ source code of the application for cycle- and energy-taxing blocks,
based on trace data collected while profiling the application, as described in Chapter 2.
Here, we present a novel heuristic that searches the solution space for an optimal source
code transformation scheme. We demonstrate that the algorithm executes a solution
and evaluates the energy-time tradeoff based on a user-defined metric. Based on the
evaluation, it selects the next solution to be evaluated. The heuristic terminates when
desired objectives are achieved. Our gradient mode iterative compilation scheme has
two salient features. First, it queues code blocks such that blocks with similar
expression profiles are most likely to benefit from the same transformation scheme.
Second, it completes in a discrete number of steps based on the number of code blocks,
whereas schemes mentioned by Sinha et al. in [33] and Tiwari et al. in [5] offer searches
that grow exponentially as the number of code blocks increases. We also illustrate our
scheme by analyzing a video encoding application (MPEG-1 encoder). Further merits
and demerits of the scheme are also explained in different application scenarios.
Chapter 4: The gradient mode iterative compilation proposed in the previous chapter
belongs to a class of compilation termed feedback-directed compilation. It brings
relatively small improvements, as it effectively restricts itself to trying different back-end
optimizations. The major impediment to such an approach is the heuristic search
technique itself. Unlike [32] [41], in this chapter we consider the optimization problem
as a single task, where all desired aims have to be taken into account simultaneously.
We present a new method based on the optimization of a multicriteria objective function,
where the desired aims of architecture-based energy-cycle optimization are formulated as
penalty terms of that objective function. Further, we describe how the maximization of
the objective function can be achieved by using a Genetic Algorithm (GA). The interface
of the proposed methodology to our energy cycle aware compilation framework is also
explained. We also expose the minutiae of our methodology, e.g., selection of constraints,
development of the fitness function, and formation of the Hertz matrix. We discuss two multimedia
applications in depth to elaborate the advantage of the algorithm.
Chapter 5: In this chapter we introduce the concept of application-architecture char-
acterization with the help of our ECACF and multivariate statistics techniques. To our
knowledge this is a first attempt to obtain such characterization from the application
expression profiles.
The application-architecture correlation is a bidirectional process matching algorithmic
structure with hardware architecture and vice versa. The programmer will benefit from
this efficient mapping and produce better source code. Applications of similar function-
ality may yield similar Application Expression Profiles (AEP), and hence can be suitable
for similar hardware platforms. Despite the simplicity of our
methodology, the analysis of large matrices provided by an application expression profile
under different levels of transformation on different architectures is not trivial and re-
quires an advanced knowledge of discovery processes. To this end, we introduce a new
methodology to evaluate the application portability using multivariate statistics. We
demonstrate how box plot, scree plot, and PCA biplots can be used to characterize an
application on a given hardware architecture. We expose the minutiae of the methodology
by exploring the AEP across three different hardware platforms for a diverse set of
applications. Finally, we demonstrate how the dAEP can be used to assess legacy code
portability across platforms.
2. ENERGY-CYCLE AWARE COMPILATION
FRAMEWORK (ECACF)
Miniaturization of computing systems is finding applications in special areas such as
hand-held computation, tiny robots, and guidance systems in automated vehicles, to
name just a few. These systems, or their users, move from place to place. Because of
their small size and their mobility requirement, they are powered by batteries of low
rating. In order to avoid frequent recharging and/or replacement of the batteries, there
is significant interest in low-energy system design. Energy consumption is an area of
growing concern in system design. It leads to a variety of system-related issues, such as
battery life, thermal limits, packaging constraints, and cooling options [44]. Though
energy is actually consumed by the hardware, its consumption can be reduced not only
by using low-energy electronics but also by suitably manipulating the software, because
the hardware activities are controlled through the software. Let a program X run for
T seconds to achieve its goal, let VCC be the supply voltage of the system, and let I
be the average current in amperes drawn from the power source during those T seconds.
We can rewrite T as T = N × τ, where N is the number of clock cycles and τ is the clock
period. Then the amount of energy consumed by X to achieve its goal is given by
E = VCC × I × N × τ joules. Since, for a given hardware, both VCC and τ are fixed,
E ∝ I × N. However, at the application level it is more meaningful to talk about T than
N, and therefore we express energy as E ∝ I × T. This expression is the foundation of
our ECACF. It shows that the main idea in the design of energy-efficient software is to
reduce both T and I. From the running time (average case) of an algorithm we obtain
a measure of T. However, to compute I, one must consider the current drawn during
each clock cycle. This is illustrated in Section 2.5.
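As a minimal sketch (not the thesis's measurement code; the function names are ours), the energy relation can be computed directly:

```c
#include <assert.h>

/* E = VCC * I * N * tau (joules); with T = N * tau this is E = VCC * I * T.
 * Since VCC and tau are fixed for a given hardware, E is proportional to I*N. */
double program_energy(double vcc, double i_avg, double n_cycles, double tau)
{
    return vcc * i_avg * n_cycles * tau;
}

double program_energy_from_time(double vcc, double i_avg, double t)
{
    return vcc * i_avg * t; /* E proportional to I * T at the application level */
}
```

Both forms agree by construction; the second is the one the framework reasons with, since T is directly observable at the application level.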
Given the fact that power is the rate of energy consumption, in this thesis, we refer to
power and energy interchangeably. Low power design is a complex endeavor requiring
a broad range of strategies from floor planning on silicon substrate to the design of
application software. In Table 2.1, we list several strategies for achieving energy
efficiency in an energy-conscious system design. In the following section, we review some
of these strategies.
Power Reduction Strategies                    MES Design Levels
Fabrication Level Power Reduction             Low level
Processor Level Power Reduction               Intermediate level
EDA Tools Level Power Reduction               High level
Compiler Level Power Reduction                High level
Low Power Data Structures                     High level
Idle Mode Power Reduction                     Intermediate level
Power Reduction in Distributed Computing      High level
Power Reduction in Communication Systems      High level
Battery Aware Power Reduction                 High level
Tab. 2.1: Energy reduction techniques for embedded system design.
2.1 Energy Saving Techniques - A Review
We review a wide spectrum of strategies, shown in Table 2.1, ranging from the hardware
fabrication process to energy-efficient communication systems. Energy savings due to
different approaches are, in the best case, multiplicative. E.g., in an IDCT application
implemented in [44] [45] [46] [47], a 30% energy saving from low-energy electronics
together with a 23% saving from compiler techniques will yield a total energy saving of
(1 − ((1 − 0.30)(1 − 0.23))) × 100% = 46.1%.
However, the total energy saving is generally less, say 34% in this example, because the
various energy saving strategies may adversely affect each other.
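The best-case combination rule above can be sketched as a small helper (illustrative only; the function name is ours):

```c
#include <assert.h>

/* Best-case combined saving of independent techniques:
 * total = 1 - prod_k (1 - s_k). E.g., 30% and 23% give 46.1%. */
double combined_saving(const double s[], int n)
{
    double remaining = 1.0;
    for (int i = 0; i < n; i++)
        remaining *= 1.0 - s[i];
    return 1.0 - remaining;
}
```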
2.1.1 Fabrication level power reduction
The power consumption in a CMOS digital circuit is expressed as [48]
P = CL · VDD² · fp + ISC · VDD + Ileakage · VDD    (2.1)
where VDD is the supply voltage, fp is the output switching frequency, CL is the output
capacitance load, ISC is the short circuit current pulse, generated when both n- and
p-transistors are briefly turned on during the output switching, and Ileakage is the leakage
current. The first term on the righthand side of the power equation is the dominant
factor [48]. It is expected that power saving with two orders of magnitude can be
achieved using low-power electronics. About half of the power reduction will come from
architecture changes and management of switching activity (fp). The other half of
power reduction will come from using advanced materials technology to allow reduction
of VDD to 1 V or below from 5 or 3.5 V while also reducing CL [48] [49].
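Equation (2.1) can be evaluated term by term; the following sketch does exactly that (parameter names are ours):

```c
#include <assert.h>

/* P = CL*VDD^2*fp + ISC*VDD + Ileakage*VDD, Eq. (2.1).
 * The first (switching) term is the dominant factor [48]. */
double cmos_power(double cl, double vdd, double fp, double isc, double ileak)
{
    double p_switching = cl * vdd * vdd * fp;   /* dynamic switching power */
    double p_short_circuit = isc * vdd;         /* short-circuit power */
    double p_leakage = ileak * vdd;             /* static leakage power */
    return p_switching + p_short_circuit + p_leakage;
}
```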
2.1.2 Processor level power reduction
Mobile embedded systems require small form factors, and hence processors designed for
high-end desktops are not suitable for such applications. Havinga et al. in [50] show that
microprocessors can account for up to 33% of a typical notebook power budget, which
is around 15W. Therefore, processor designers include a number of features to reduce
power consumption. E.g., in the TriMedia processor TM130x [4] and the Blackfin pro-
cessor ADSP533S, some of the power reduction features are dynamic idle-time shutdown of
separate execution units, low-power cache design, and power considerations for standard
cells, data-path elements, and clocking. The processor also supports three static power
management modes: doze, nap, and sleep [51]. These modes reduce power at a global
level when the processor is idle for an extended period of time. Since CMOS circuits
consume power during the charging and discharging of capacitances, reducing switching
activity saves power. At the architecture-level, two strategies to reduce switching activi-
ties are Gray code addressing and cold scheduling of instructions [52] [53]. Experimental
results show that cold scheduling reduces switching by 20 ∼ 30%. The Gray code's
advantage over binary code is that each sequential memory access changes the address
by only one bit. Thus, a significant number of bit switches can be eliminated using Gray code
addressing. Also, by decomposing a finite-state machine into several submachines, [54]
suggest that it is possible to selectively turn off portions of a circuit, thereby reducing
the switching activities. Tiwari et al. [31] have studied the idea of shutting off parts of
a logic circuit that are not needed in a particular computation on a per-clock-cycle basis.
This saves the power used in all the useless transitions in those parts of the circuit. Burd
et al. in [55] and Govilak et al. in [56] have suggested that power consumption in a
CPU can be reduced by dynamically changing its operating frequency and voltage. Fur-
ther studies to expose the role of prediction and of smoothing in dynamic speed-setting
policies is discussed in [57]. Havinga and Smit [50] propose energy saving by exploiting
locality of reference with dedicated, optimized modules. The idea of locality of reference
is to offload as much work as possible from the CPU to programmable modules that are
placed in the data streams.
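The Gray code property is easy to verify in plain C. gray_encode below is the standard binary-to-Gray conversion (not code from the cited studies); consecutive addresses always differ in exactly one bit:

```c
#include <assert.h>

/* Standard binary-to-Gray conversion: consecutive inputs produce codes
 * that differ in exactly one bit, so sequential memory accesses toggle
 * only one address line. */
unsigned gray_encode(unsigned b)
{
    return b ^ (b >> 1);
}

/* Number of address lines that toggle between two Gray-coded values. */
int gray_bit_toggles(unsigned a, unsigned b)
{
    unsigned x = gray_encode(a) ^ gray_encode(b);
    int n = 0;
    while (x) { n += (int)(x & 1u); x >>= 1; }
    return n;
}
```

E.g., the binary step 7 → 8 flips four address bits, but its Gray-coded counterpart flips only one.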
2.1.3 EDA tools level power reduction
The design of low-power systems cannot be achieved without good power-conscious
EDA tools. EDA tools are used at all levels of hardware design: behavioral, architectural,
logic and physical. For a detailed exposition of power-conscious EDA tools, the reader
is referred to tutorials by [58] [59] [14].
2.1.4 Compiler level power reduction
Compiler design techniques contribute to energy saving in several ways [60] [61]. Kolson
and Nicolau [62] [40] [63] address the problem of allocating memory to variables in em-
bedded DSP (digital signal processing) software. The goal is to maximize simultaneous
data transfers from different memory banks to registers [64] [65] [66]. In several DSP
applications mentioned in [67] [68], two registers are loaded with the required data and
an arithmetic operation is performed. Loading two registers with a single double transfer
instruction draws a little more current than a move instruction. Both the instructions
take one clock cycle each. However, energy is saved by using the double transfer, be-
cause the double transfer instruction loads the two registers in one clock cycle, whereas
we need two clock cycles to sequentially load the registers. Experimental results for a
few applications on a Blackfin DSP processor in [30] show that up to 47% of energy
can be saved by this approach. Instructions with memory operands have much higher
energy costs than instructions with register operands [30]. This suggests that energy
can be saved by suitably assigning the live variables of a program to registers. But, a
processor has only a small number of registers. When the number of simultaneous live
variables is larger than the number of available registers, some of the variables must be
spilled to memory. Register assignment for loop variables is important because loops
are typically executed many times. Algorithms for optimal register assignment to loop
variables are presented in [69] [70] [71] [62]. These algorithms can be included in the
code generation part of a compiler.
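The double-transfer idea can be mimicked in portable C: one aligned 64-bit access stands in for the double-load instruction that fetches both 32-bit operands in a single cycle. This is a sketch, not Blackfin intrinsic code; memcpy gives well-defined type punning, and the sum is independent of byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum adjacent 32-bit pairs, fetching each pair with one wide load.
 * A DSP compiler would map the 8-byte access to a double-transfer
 * instruction, loading two registers in one clock cycle instead of two. */
int64_t sum_pairs(const int32_t *data, size_t n_pairs)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n_pairs; i++) {
        uint64_t pair;
        memcpy(&pair, &data[2 * i], sizeof pair);        /* one wide load */
        int32_t w0 = (int32_t)(uint32_t)(pair & 0xFFFFFFFFu);
        int32_t w1 = (int32_t)(uint32_t)(pair >> 32);
        acc += (int64_t)w0 + w1;                         /* order-independent */
    }
    return acc;
}
```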
2.1.5 Low power data structures
Kondo et al. [72] propose a method of implementing set data types with minimum power
consumption. In a programming language, one can implement the set data type using a
variety of concrete data structures such as arrays, pointer arrays, linked list and binary
tree [73]. Thus, to implement the set operations, such as locating, inserting, and removing
a record from a set, one has to manipulate the memory elements in a concrete data
structure, as proposed in [74] [75] [33] [42]. It is the memory accesses in the process
of set operations that actually consume power. Thus, the power consumption in set
operations is a function of the number of memory elements used in implementing a set
data type, the number of read and write operations performed in the implementation,
and some logic details such as capacitance of memory elements, voltage level, and
frequency of operation. The concrete data structures are compared on the basis of a
filling factor, which is the fraction of the locations that would be filled if the implementa-
tion were in arrays [76] [77] [78]. It has been shown that for different levels of filling factor,
different concrete data structures lead to low values of the power cost function. E.g.,
for filling factors greater than 60%, arrays are better in implementing energy efficient
set data types [72].
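The reported 60% rule can be captured in a tiny selector. The threshold is taken from the text; the type names and the binary array-versus-list choice are our simplification of [72]:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SET_AS_ARRAY, SET_AS_LINKED_LIST } set_impl;

/* Pick the lower-energy concrete data structure for a set, using the
 * filling factor (occupied fraction of the would-be array). Above ~60%,
 * arrays were reported to be the energy-efficient choice [72]. */
set_impl choose_set_impl(size_t used_slots, size_t universe_size)
{
    double filling = (double)used_slots / (double)universe_size;
    return filling > 0.60 ? SET_AS_ARRAY : SET_AS_LINKED_LIST;
}
```

Real cost functions also depend on the mix of locate, insert and remove operations, which this sketch deliberately ignores.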
2.1.6 Idle mode power reduction
The doze mode is an innovative approach to conserving energy [79] [80] [81] [60]. It is
very attractive in a communication environment where a mobile system may occasionally
send or receive messages. In the doze mode, the clock speed is reduced and no user
process is executed. Rather, a mobile host simply waits for any incoming message. Upon
receiving a message, the host resumes its normal mode of operation. The energy saving
due to this mode depends on the local computations on a mobile host and the pattern of
communication between the mobile host and a support station [82]. Simulation studies in [41]
show that energy saving due to this mode spreads over a wide range of 2 ∼ 98%.
2.1.7 Power reduction in distributed computing systems
Agent based computation is a relatively new idea in distributed computing [83] [81]
[84]. General agent-based distributed computing systems have been designed using the
concept of Linda's tuple space [85]. Wei et al. [86] discuss how energy-efficient
distributed algorithms in a mobile computing environment can be designed using a tuple
space managed on the fixed network of a mobile system. Lin et al. [22] propose a power-
efficient commit protocol which supports conventional two-phase commit services. A
distributed autonomous system called Noah (Network oriented application harmony),
built in the Mitsubishi laboratory, has been proposed in [87]. Though the purpose of
Noah is not to save energy, it demonstrates how agent based systems can be built using
a tuple space as the medium for process communication. By shifting most workload
to peer fixed hosts, the load, the power consumption and the messages exchanged via
expensive wireless links on a mobile host are greatly reduced.
2.1.8 Power reduction in communication systems
The receiver subsystem of a mobile station need not be active all the time [88]. Most
digital cellular and cordless systems provide power cycling at the mobile units. Mobile
stations can periodically relax (power cycle) their receivers as a means of conserving
energy. Since the receiver of a mobile unit is not continuously ready to receive messages
from the local support station (base station), some kind of coordination between a base
station and a mobile unit is necessary. Salkintzis et al. [89] propose a page-and-answer
protocol. Intuitively, the protocol works as follows:
When a base station has a message for a mobile unit, the base station sends a small
paging packet to the mobile unit. If the mobile unit receives the paging packet, that
is if the mobile receiver is up, the mobile sends an answer packet to the base station.
Obviously, if the paging message is sent at a time when the receiver is powered off, no
answer packet is generated by the mobile and the base station will once again page the
mobile after some time. Upon receiving an answer packet, the base station sends the
desired message to the mobile unit.
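The exchange can be sketched as a retry loop. Everything here (names, the attempt bound, the duty-cycle callback) is our illustration, not the cited protocol specification, which also fixes the timing between pages:

```c
#include <assert.h>
#include <stdbool.h>

/* Models whether the mobile's receiver is powered at a given attempt. */
typedef bool (*receiver_up_fn)(int attempt);

/* Base-station side of page-and-answer: page until the mobile answers.
 * Returns the attempt on which the answer arrived, or -1 if the mobile
 * never answered within max_attempts. */
int page_and_answer(receiver_up_fn receiver_up, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        /* base station sends a small paging packet */
        if (receiver_up(attempt))
            return attempt;  /* answer packet received; send the real message */
        /* no answer: base station waits, then pages again */
    }
    return -1;
}

/* Example duty cycle: the receiver is powered on every third attempt. */
bool every_third(int attempt)
{
    return attempt % 3 == 2;
}
```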
Kravets and Krishnan [90] propose power saving by selectively choosing short periods
of time to suspend communications and shut down the communication device. Applying
this method to a transport protocol and using three simulated communication patterns,
they have achieved up to an 83% saving in the energy consumed by the communication
system. Chlamtac et al. [91] address the problem of wireless access protocols which
include an energy constraint and develop three energy conserving protocols for various
loads: grouped-tag TDMA, directory, and pseudorandom. Singh et al. [92] argue that
there is a need for using power-aware metrics, such as minimize energy consumed per
packet, minimize variance in node power levels, maximize time to network partition, etc.,
in the design of power-efficient routing protocols. They show that these metrics in a
shortest-cost routing algorithm reduce the cost per packet of routing packets by 5 ∼ 30%
over shortest-hop routing.
2.1.9 Battery aware power reduction
Chiasserini and Rao [18] have shown how battery behavior can be exploited to prolong
battery life. In particular, they identify the phenomenon of charge recovery that takes
place under pulsed discharge conditions as a mechanism that can be exploited to enhance
the capacity of an energy cell. The bursty nature of many data traffic sources suggests
that there might be a natural fit between the two. Bai and Lai [93] implement some
methods to let the low power CPU efficiently do some kind of computation intensive
tasks, such as graphic image processing and displaying. Their methods include reducing
the computation complexity of bitmap file processing, using fixed-point math instead
of floating point math, prestoring the table of trigonometric functions, and using a few
lines of assembly language code in the inner loop of graphic image processing program
to improve its performance. These methods lead to a speed up of the programs by a
factor of three to six.
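One of these methods, replacing floating-point with fixed-point math, can be sketched in Q15 format (the format choice is ours, not from [93]):

```c
#include <assert.h>
#include <stdint.h>

/* Q15 fixed point: a 16-bit value v represents v / 32768. Multiplying
 * in integer arithmetic avoids the floating-point unit entirely. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* Q30 intermediate */
    return (int16_t)(product >> 15);            /* back to Q15 (truncating) */
}
```

E.g., 0.5 × 0.5: q15_mul(16384, 16384) yields 8192, i.e. 0.25 in Q15.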
In [44], we argue that mobile application development requires us to rethink the concept
of an algorithm from the viewpoint of battery life. Instead of asking for the best result,
a user may say:
’Give me the best result you can find, using no more than X units of resource R.’
Or, one can let the system make the tradeoff between fidelity and resource consumption
by saying:
’Give me the best result you can obtain economically.’
2.2 Multimedia DSPCPU Architecture
A multimedia processor is a media processor for high-performance multimedia applica-
tions that deal with high-quality video and audio. Typically, an extended general-
purpose CPU (called the DSPCPU) makes it capable of implementing a variety of
multimedia algorithms from popular multimedia standards such as MPEG-1 and MPEG-
2. The key features behind this powerful processor are as follows:
• A general-purpose VLIW processor core coordinates all the on-chip activities.
In addition to implementing the non-trivial parts of multimedia algorithms, this
processor runs a small real-time operating system that is driven by interrupts from
the other units.
• DMA-driven multimedia input/output units that operate independently and that
properly format data to make software media processing efficient.
• DMA-driven multimedia coprocessors that operate independently and in parallel
with the DSPCPU to perform operations specific to important multimedia algo-
rithms.
• A high-performance bus and memory system that provides communication between
the processing units.
• A flexible external bus interface.
A typical multimedia processor is based on a three-level hierarchy of operators:
• Instructions
• Operations
• RISC operations
One instruction may contain five operations, as depicted in Figure 2.1. Each operation
may execute multiple arithmetic operations. E.g., for the TriMedia DSP processor TM130x,
one such operation is the command IFIR(a, b). This command contains a total of three
arithmetic operations: two multiplications and one addition (aHI × bHI + aLO × bLO).
Up to five operations including two IFIR commands can be issued in each machine
cycle. The ability of TriMedia’s VLIW architecture to execute multiple operations in
parallel gives it a big advantage over traditional RISC and CISC architectures found in
current mass-market microprocessors.
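A plain-C model of the arithmetic described above (only the two multiplies and the add; corner-case behavior of the real operation is not modeled here):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the IFIR(a, b) arithmetic: multiply the signed 16-bit high
 * and low halves of two 32-bit registers and sum the products,
 * aHI*bHI + aLO*bLO -- three arithmetic operations packed into one
 * machine operation. */
int32_t ifir_model(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16), a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16), b_lo = (int16_t)(b & 0xFFFFu);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}
```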
Fig. 2.1: TriMedia VLIW instruction [4].
2.2.1 Multimedia Processor Execution Model
The multimedia processor provides a large set of general-purpose registers,
generally named r0, r1, and so on. In addition to the hardware program counter PC,
there are a few user-accessible special purpose registers to hold CPU branch addresses.
The CPU issues one long instruction every clock cycle. Each instruction consists of
several operations (five operations for the TM1300 microprocessor) [4]. Each operation
is comparable to a RISC machine instruction, except that the execution of an operation
is conditional upon the content of a general purpose register. Examples of operations
are:
IF r10 iadd r11 r12 → r13 (if r10 true, add r11 and r12 and write sum in r13)
IF r10 ld32d(4) r15 → r16 (if r10 true, load 32 bits from mem[r15+4] into r16)
IF r20 jmpf r21 r22 (if r20 true and r21 false, jump to address in r22)
Each operation has a specific, known execution latency (in clock cycles). For example,
in the case of the TM1300, iadd takes one cycle. This means that the result of an iadd operation
started in clock cycle ’i’ is available for use as an argument to operations issued in cycle
’i+1’ or later. The other operations issued in cycle ’i’ cannot use the result of iadd.
Similarly the ld32d operation has a latency of 3 cycles. The result of an ld32d operation
started in cycle ’j’ is available for use by other operations issued in cycle ’j+3’ or later.
Branches, such as the jmpf example above have three delay slots. This means that if a
branch operation in cycle ’k’ is taken, all operations in the instructions in cycle k+1, k+2
and k+3 are still executed. In the above examples, r10 and r20 control the conditional
execution of the operations. This is also referred to as guarding, where r10 and r20
contain the guard of the operation.
The implementation of the architecture restricts the choice of operations that can be per-
formed in parallel or can be packed into an instruction. For example, the DSPCPU in the
TM1300 allows no more than two load/store class operations to be packed into a single
instruction, as shown in Figure 2.2. Also, no more than five results (of previously started
operations) can be written during any one cycle. The packing of operations is not nor-
mally performed by the programmer. Instead, the instruction scheduler takes care of
converting the parallel intermediate format code into packed instructions ready for the
assembler. The rules are formally described in the VLIW Description File (VDF) used
by the instruction scheduler and other tools.
Fig. 2.2: TriMedia functional unit assignment [4].
2.2.2 Multimedia Processor Operations Overview
In this section we present a brief overview of the multimedia processor instruction set.
Readers are encouraged to refer to [4] for details.
Conditional Execution: In multimedia processor architectures, all operations are op-
tionally ’guarded’. A guarded operation executes conditionally, depending on the value
in the ’guard’ register. For example, a guarded add is written as:
IF R23 iadd R14 R10 → R13.
This should be taken to mean: if R23, then R13 ← R14 + R10. The ’IF R23’ clause
controls the execution of the operation based on the LSB of R23. Hence, depending
on the LSB of R23, R13 is either unchanged or set to contain the integer sum of R14
and R10. Guarding applies to all TM1300 operations, except the iimm and uimm (load-
immediate) operations. Guarding controls the effect on all programmer visible state of
the system, i.e. register values, memory content, exception raising and device state.
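The guarding semantics can be modeled compactly; this sketch reduces registers to scalar arguments and returns the destination's new value (names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Model of "IF Rg iadd Rs1 Rs2 -> Rd": the add takes effect only if the
 * LSB of the guard register is 1; otherwise the destination (and all
 * other visible state) is left unchanged. */
int32_t guarded_iadd(int32_t guard, int32_t s1, int32_t s2, int32_t old_d)
{
    return (guard & 1) ? s1 + s2 : old_d;
}
```

Note that a guard of 4 disables the operation, since only the least-significant bit is inspected.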
Load and Store Operations: Memory is byte addressable. Loads and stores have to
be naturally aligned, i.e., a 16-bit load or store must target an address that is a multiple
of two, and a 32-bit load or store an address that is a multiple of four. For the
TM1300, the BSX bit in the PCSW (program control status word) register determines
the byte order of loads and stores (see, e.g., ld32 and st32 in Appendix A of [4]). Only
32-bit load and store operations are allowed to access MMIO registers in the MMIO
address aperture; the results of other loads and stores are undefined. A load from
a non-existent MMIO register returns an undefined result. A store to a non-existent
MMIO register times out and has no further effect; there are no other side effects of
an access to a non-existent MMIO register. The state of the BSX bit has no effect on
the result of MMIO accesses. Loads are allowed to be issued speculatively. Loads that
are outside the range of valid data memory addresses for the active process return an
implementation-dependent value and do not generate an exception. Misaligned loads
also return an implementation-dependent value and do not generate an exception.
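The natural-alignment rule can be captured by a one-line predicate (helper name ours, for illustration only):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A load/store of `size` bytes is naturally aligned when its target
 * address is a multiple of the access size (helper name ours). */
static bool naturally_aligned(uint32_t addr, uint32_t size)
{
    return addr % size == 0;    /* e.g. 32-bit access: addr % 4 == 0 */
}
```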
Compute Operations: Compute operations are register-to-register operations. The
specified operation is performed on one or two source registers and the result is written
to the destination register.
Immediate Operations load an immediate constant (specified in the opcode) and produce
a result in the destination register.
Floating-Point Compute Operations are register-to-register operations. The specified
operation is performed on one or two source registers and the result is written to the
destination register. Unless otherwise mentioned, all floating-point operations observe
the rounding-mode bits defined in the PCSW register. All floating-point operations
whose mnemonics do not end in 'flags' update the PCSW exception flags. All operations
ending in 'flags' compute the exception flags as if the operation were executed and
return the flag values (in the same format as in the PCSW); the exception flags in the
PCSW itself remain unchanged.
Multimedia Operations are special compute operations. They are like normal compute
operations, but the specified operations are not usually found in general-purpose CPUs.
These operations provide special support for multimedia applications.
Special-Register Operations: Special register operations operate on special registers,
such as the program control status word, branch-address holding registers, etc.
Control-Flow Operations: Control-flow operations change the value of the program
counter. Conditional jumps test the value in a register, and based on this value, change
the program counter to the address contained in a second register or continue execution
with the next instruction. Unconditional jumps always change the program counter
to the specified immediate address. Control-flow operations can be interruptible or
non-interruptible. The execution of an interruptible jump is the only occasion where a
multimedia processor allows special event handling to take place.
2.3 Workload Description
Our workload consists of two major application domains, multimedia and bioinformatics.
Both use compute- and data-intensive algorithms. In this section we present in detail
the diversity found in these application domains, which we selected for rigorous testing
of our ECACF. The variability in the input data streams is also discussed.
2.3.1 Multimedia Applications
The multimedia application set consists of encoders and decoders (transcodecs) encom-
passing three media types - speech, video, and audio (music) - and is summarized in
Table 2.2 to Table 2.5. We obtained codes for these applications from various public
domain sources [94] [95] [96] [21]. The applications were chosen for their importance
in real systems and because (we believe) they are representative enough to support the
inferences in this study. We evaluated all our applications with four inputs, summarized in Table 2.6.
Here, we only report results from a single input for each application. We chose the input
that gave the highest (normalized) standard deviation in per frame execution time on
our base system. We call these inputs the default inputs, and list them in the second
column of Table 2.6. Results with the other inputs are similar, both quantitatively and
qualitatively. The G.728, H.263, and MPEG codecs statically distinguish multiple frame
types. G.728 uses an adaptive algorithm, where certain parameters are updated every
four frames. The processing of each frame in a single four-frame cycle is different due
to the calculation of these parameters. Thus, we treat these as different types of frames
(numbered one through four). The H.263 and MPEG codecs use almost the same video
compression scheme. A key difference is that MPEG uses three different types of frames
- I frames do not exploit inter-frame redundancy, P frames exploit inter-frame redun-
dancy using a previous frame, and B frames exploit such redundancy using a previous
and a later frame. Our H.263 codecs do not use B frames. They use a single I frame at
the beginning of the video and P frames for the rest. We do not include the I frame in
our analysis. It takes excessively long to simulate a frame with the MPEG codecs using
the frame sizes specified by the MPEG-2 standard (about 4 to 16 hours per frame for
MPEGenc). We scaled down the frame size to 176x144 pixels so that we could simulate
a reasonable number of frames to assess execution time variability. We ensured that
the scaling did not affect the cache behavior by performing a working set analysis and
running representative experiments with larger frame sizes and different cache sizes. As
the chosen frame size conforms to the H.263 standard, we used the same size for the
H.263 codecs for consistency. Also for consistency, we used the same set of four inputs
for both MPEG and H.263 codecs. These inputs contain a great deal of motion to
stress the applications. H.263 was designed for low bit-rate applications such as video
conferencing (which typically contains less motion); therefore, our results from these
inputs represent an upper bound on the expected variability for H.263.
Application | Description | Input Vector | Sample Rate/Throughput
GSMenc / GSMdec | Low bit-rate speech coding based on the European GSM 6.10 provisional standard. Uses RPE/LTP (regular pulse excitation / long-term prediction) coding at 13 Kb/s. Compresses frames of 160 16-bit samples into 264 bits. | orignova / homemsg | 20 ms (160 samples), 8 KHz
G728enc / G728dec | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code-excited linear prediction) coding at 16 Kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe / homemsg | 625 µs (5 samples), 8 KHz
G723enc / G723dec | Speech coding based on the G.723.1 standard. Uses MP-MLQ/ACELP coding at 6.3/5.3 Kb/s. | lpcqutfe / homemsg | 625 µs (5 samples), 8 KHz
G729enc / G729dec | Speech coding based on the G.729 standard. Uses CS-ACELP (conjugate-structure algebraic CELP) coding at 8 Kb/s. | lpcqutfe / homemsg | 625 µs (5 samples), 8 KHz

Tab. 2.2: Multimedia Benchmarks (Speech Transcodecs).
2.3.2 Bioinformatics Workload
Due to a significant increase in biological threats against humans, plants and other
species during the last two decades, there is a growing realization that bioinformatics
and molecular biology equipment should be available in small form factors that can be
readily deployed in the field [97]. This has led to the development of battery- as well as
execution-time efficient handheld devices for bioinformatics applications.

Application | Description | Input Vector | Sample Rate/Throughput
H263enc / H263dec | Low bit-rate video coding based on the H.263 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova / buggy | 40 ms, 25 frames/s
H264Lenc / H264Ldec | Low bit-rate video coding based on the H.264 standard. Primarily uses inter-frame coding (P frames). | orignova / buggy | 40 ms, 25 frames/s
MPEGenc / MPEGdec | High bit-rate video coding based on the MPEG-2 video coding standard. Uses intra-frame (I) and inter-frame (P, B) coding. Typical bit rate is 1.5-6 Mb/s. | buggy / flwr | 33 ms, 30 frames/s
MPEG-1 encoder / MPEG-1 decoder | High bit-rate video coding based on the MPEG-1 video coding standard. | buggy / flwr | 33 ms, 30 frames/s
NLIVQ | Non-linear interpolative vector quantization, an image processing codec. | cameraman.tif | 512x512 resolution, gray scale

Tab. 2.3: Multimedia Benchmarks (Video Transcodecs).

Application | Description | Input Vector | Sample Rate/Throughput
MP3enc / MP3dec | Audio coding based on the MPEG Audio Layer-3 standard. The decoder synthesizes an audio signal out of coded spectral components. Typical bit rate is 16-256 Kb/s. | filter / filter | 26 ms (1152 samples), 44.1 KHz

Tab. 2.4: Multimedia Benchmarks (Audio Transcodecs).

Application | Description
FFT | Fast Fourier Transform
IDCT | Inverse Discrete Cosine Transform
T64 | Matrix Transpose 64x64
M100 | Matrix Multiplication 100x100

Tab. 2.5: Generic DSP Application Benchmarks [7].

Domain | Test Vector | Description | Features
Audio | CatSteven | Soft rock song | 2500 frames, average length 65.25 seconds
Audio | Sting | Pop song |
Audio | Beethoven | Classical piece |
Video | Flwr | Drive-by of houses | 450 frames, each 18 seconds for H.263 and 15 seconds for MPEG
Video | Cact | Panoramic view |
Video | Buggy | Buggy race |
Video | Tens | Table tennis match |
Speech | Homemsg | An answering message | Average frame size for GSM codecs is 500, for G.72x is 19000; length: 20 seconds
Speech | Orignova | Sentences read by different adults |
Speech | lpcqutefe | Sentence read by a boy |

Tab. 2.6: Test Vectors Characterization.

Bioinformatics is an
interdisciplinary research area that helps to produce ’sensible’ and ’useful’ information
from the wealth of data that has been produced by the genome sequencing projects.
We categorize the basic functionality offered by all bioinformatics tools into four groups:
1. Pattern recognition algorithms: probability formulae are used to determine the
statistical similarity between two or more given sequences.
2. Rule-based analysis: defines how a mathematical or statistical technique can be applied.
Different sets are defined with memberships, and a set of rules is created to elaborate
associativity. Basic set theory is used to fire a rule.
3. Biological databases: uniformly and efficiently maintained archives of consistent
data that contain information and annotation of DNA and protein sequences, DNA
and protein structures, as well as DNA and protein expression profiles [98] [99]. An
important feature of these databases is their simplicity of access and query management.
In addition, some websites [100] [101] [102] provide visualization tools to aid biological
interpretation.
4. Biological taxonomy: records the differences in sequences across different classes,
helping further to reduce similarity errors.
We chose the applications for their importance in real systems and because they are
representative enough to support the inferences in this study. They are summarized in
Table 2.7. We obtained the codes for these applications from various public domain
sources. For lack of space, we only report their underlying algorithms; details may be
found in [99] [97] [102]. The input databases are obtained from the NIH genetic sequence
database 'GenBank', the NCBI assembly archive 'Genome Assembly Archive', the
homologous structure alignment database 'HOMSTRAD', the NIMH-NCI protein-disease
database 'PDD' and 'The Lens' [100] [102].
Application | Pseudonym | Features | Algorithms
GENESPLICER | A01 | Detect splice sites in genomic DNA | High accuracy and computationally efficient
TIGRSCAN | A02 | DNA modeling | Generalized Hidden Markov Model (GHMM), HMM
TRANSTERMIS | A03 | Rho-independent transcriptional terminators | Statistical estimation techniques
GENSCAN | A04 | Predict complete gene structure | Search algorithms
MUMMER | A05 | Genome sequence alignment | Tree algorithms
GLIMMERHMM | A06 | Find gene sequences in eukaryotes | IMM, splice site models, maximal dependence decomposition techniques
GENIE | A07 | Gene finder in vertebrate and human DNA | GHMM, neural networks
FGENE | A08 | Find splice sites, genes, promoters | Linear discriminant analysis
GRAIL | A09 | Analysis of DNA sequence | Automated computation
GENEMARK | A10 | Find genes in bacterial DNA sequence | Markov chains
NetPlaneGene | A11 | Sequence analysis | Neural network
GLIMMER | A12 | Coding regions in microbial DNA | Interpolated Markov Models (IMM)

Tab. 2.7: Bio-Computation Applications Benchmark.
2.4 Energy Cycle Aware Compilation Framework Methodology
The ECACF is shown in Figure 2.3. The source code is processed successively for
static code analysis, post-compiler analysis and finally for scheduling analysis. A VLIW
Description File (VDF) is used to provide architecture information to the compiler, the
scheduler and finally to the machine code generator. The VDF contains a list of
pseudo and machine operations, the latency of the operations, opcodes, slot assignment
schemes, the processor operating frequency, instruction cache features (associativity,
block size, number of sets) and main memory features (size, order, read/write latencies).
This file format is compatible with those mentioned in [103] [4] [81] [104]. Here, we
follow the same VLIW naming convention as used in [104]. This feature has made our
scheme architecture independent. A list of parameters is generated in each step during
the methodology flow. Intermediate trace files are generated during the code processing
flow to produce the Application Expression Profile (AEP), comprising parameters such
as code size, execution time, number of cache misses (for both instruction and data
caches), data cache conflicts, data bank alignment, highway usage, scheduling factor and
slot utilization. After the simulation these parameters are used to compute
transformation control factors such as the unrolling factor, grafting depth and blocking
metrics. These control factors are further explained in [25]. After each iteration all
these parameters are recorded again and compared to the preset user constraints given
in a User Constraint File (UCF). This file contains desired values for code size,
execution time, energy and the allowed cache-miss percentage. Energy is measured at
the target platform (the setup is explained in Section 2.5). All these parameters are fed
back to the transformation cost analyzer. After each successive transformation it is
decided whether the energy-cycle performance has been optimized or not. The source
code is optimized by undergoing code restructuring schemes known as loop unrolling,
decision tree grafting and loop tiling. Additional benefits are gained by combining
traditional compiler optimization algorithms, such as constant and variable propagation,
dead code elimination, strength reduction, etc.
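The UCF feedback test at the heart of this loop can be sketched in C. The struct fields mirror the parameters named in the text, but all names (aep, ucf, meets_constraints, first_satisfying) are our own illustrative inventions, not the actual ECACF interfaces:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-iteration profile and user-constraint records; the field
 * set mirrors the parameters named in the text (code size, execution time,
 * energy, cache-miss percentage). Names are illustrative only. */
struct aep { double code_size, exec_time, energy_mj, cache_miss_pct; };
struct ucf { double max_code_size, max_exec_time, max_energy_mj, max_cache_miss_pct; };

/* Iteration stops once every monitored parameter meets its preset constraint. */
static bool meets_constraints(const struct aep *p, const struct ucf *u)
{
    return p->code_size      <= u->max_code_size
        && p->exec_time      <= u->max_exec_time
        && p->energy_mj      <= u->max_energy_mj
        && p->cache_miss_pct <= u->max_cache_miss_pct;
}

/* Given the profiles produced by successive transformations, return the
 * index of the first one satisfying the UCF, or -1 if none does. */
static int first_satisfying(const struct aep *profiles, int n, const struct ucf *u)
{
    for (int i = 0; i < n; i++)
        if (meets_constraints(&profiles[i], u))
            return i;
    return -1;
}
```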
Fig. 2.3: Transformation methodology.
2.4.1 Application Expression Profile
From a 'C' source code to an executable binary, an embedded application has to pass
through many tools: the text editor, compiler, scheduler, linker, and loader. In embedded
systems emerging from software-hardware co-design, the software leads and the hardware
follows, within its technological limitations. The behavior a software implementation can
express on a hardware platform is limited by the freedom offered by the hardware
architecture and by the ability of programmers to express their intent in code. These
issues indicate that for a 'good' energy-cycle performance there is a need to gather
more detailed profiles, containing information about system behavior at various levels,
as shown in Figure 2.4. The main goal of such vertical profiling is to further improve the
understanding of system behavior through correlation of profile information at different
levels.
Fig. 2.4: Vertical application profile layers.
The executable application development hierarchy is composed of compilation,
scheduling, linking, and binary code generation. Finally, the code is downloaded to
the SDRAM attached to the multimedia processor. Our Application Profile Monitor
(APM) extracts application behavioral parameters as mentioned above. This
information is extracted from the vertical profile layers shown in Figure 2.4. An
application is profiled both in terms of its static and its run-time (dynamic) behavior.
The way an application expresses itself on a given hardware architecture we call its
Application Expression Profile (AEP). We characterize an application expression profile
using the following conventions:
1) Name: the name of the profile monitor.
2) Definition: defines the profile monitor as used in our ECACF.
3) Location: the location of the monitor in the application development hierarchy,
such as compilation, scheduling, linking, etc.
4) Type: there are two possible types, static or dynamic.
5) Range: the possible range of values a monitor can have.
6) Level: if a parameter is measured directly from the code, it is called a primary
monitor; if it is computed from one or more other parameters, we call it a secondary
monitor.
E.g., a primary monitor can be written as:
Name: Processor Frequency
Definition: The operating frequency of the microprocessor
Location: VDF
Type: static
Range: Typical 100MHz - 233MHz (depends on given hardware architecture)
Level: Primary
Similarly, a secondary monitor can be written as:
Name: Scheduling Factor
Definition: Computed by dividing the infinite-machine cycle time by the finite-machine
cycle time
Location: Transformation Engine and Scheduler
Type: Dynamic
Range: 0 to 1
Level: Secondary
A complete list of profile monitors is provided in Appendix A.
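A monitor record following the six-point characterization above might look like this in C. The struct, the enums and the `scheduling_factor` helper are our own sketch, not code from the ECACF:

```c
#include <assert.h>

/* Hypothetical record for one profile monitor, following the six-point
 * characterization in the text; all names here are our own sketch. */
enum apm_type  { APM_STATIC, APM_DYNAMIC };
enum apm_level { APM_PRIMARY, APM_SECONDARY };

struct apm_monitor {
    const char    *name;         /* 1) Name       */
    const char    *definition;   /* 2) Definition */
    const char    *location;     /* 3) Location   */
    enum apm_type  type;         /* 4) Type       */
    double         range_min;    /* 5) Range      */
    double         range_max;
    enum apm_level level;        /* 6) Level      */
};

/* Secondary-monitor example: the scheduling factor divides the cycle time
 * of an idealized infinite-resource machine by that of the finite machine,
 * so values close to 1 indicate little scheduling loss. */
static double scheduling_factor(double infinite_cycle_time,
                                double finite_cycle_time)
{
    return infinite_cycle_time / finite_cycle_time;   /* in [0, 1] */
}
```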
2.5 Experimental Setup
Knowing the energy consumption of an application on a realtime platform is a first
step in any energy-constrained embedded system and can be used to estimate the
battery lifetime of the system. In this section, we describe an energy measurement
method for a software application running on a realtime multimedia VLIW processor.
The method is described for the Philips TM1300 DSP processor, but it is applicable to
other multimedia processors, e.g., the Blackfin ADSP533S. The measurement framework
has been incorporated into our ECACF, which allows a software application programmer
to measure realtime energy consumption by running the candidate 'C' source code.
2.5.1 Related Work for Energy Measurement
The energy consumption of a software application running on target hardware depends
on the processor, data path and instruction set architecture [31]. The switching energy
consumed depends linearly on the operating frequency and quadratically on the supply
voltage. Other architectural parameters which strongly affect the energy consumption
of a processor are cache size, datapath width, number of functional units (multipliers,
shifters), register file size, legacy support hardware, multimedia extension support, etc.
In general, it is practically impossible to predict how much energy a software application
will consume on another processor given the energy consumption profile on one processor,
without some prior calibration and measurement on the other one.
Software energy estimation through exhaustive instruction energy profiling was first pro-
posed in [5]. The basic experimental setup used in [5] is shown in Figure 2.5. The
approach proposed in [5] is based on measuring the current drawn by the processor as
it repeatedly executes a certain instruction or sequence of instructions. This is achieved
by putting the sequence in a loop and measuring the current values. The measured
values correspond to the base current cost of the instructions. The program is broken up into
basic blocks and the base cost of each instance of a block is obtained by adding the base
cost of instructions in the block. These costs are provided in a base cost table. Tiwari
et al. in [5] obtained a run-time execution profile (instruction trace) for the program.
Using this information the number of times the basic block is executed is determined
and the overall base cost is computed. The effect of the circuit state (inter-instruction
effects) is incorporated by analyzing pairs of instructions. A cache simulation is also
performed to determine stalls and a cache penalty is also added to the final estimate.
The principal disadvantage of this approach is that it involves an elaborate instruction
trace analysis. Assuming an Instruction Set Architecture (ISA) with K instructions, K²
instruction energy profiles have to be obtained to accurately determine base and inter-
instruction costs. Moreover, most instructions have many variations and an exhaustive
characterization is very time consuming.
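The base-cost accounting of [5] can be sketched as follows. The function and array names are ours, and the inter-instruction and cache-penalty terms that [5] adds on top are deliberately omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Toy base-cost accounting in the spirit of [5]: the program cost is the
 * sum over basic blocks of (block base cost) * (execution count from the
 * trace). Inter-instruction and cache-penalty terms are omitted here. */
static double program_base_cost(const double *block_cost,
                                const unsigned *exec_count,
                                size_t n_blocks)
{
    double total = 0.0;
    for (size_t i = 0; i < n_blocks; i++)
        total += block_cost[i] * exec_count[i];
    return total;
}
```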
2.5.2 Proposed Methodology
Our energy measurement setup, shown in Figure 2.6, is similar to that of Figure 2.5.
Given that energy is the time integral of power, and keeping the supply voltages fixed,
the energy depends on the current variations appearing in the CPU and memory.
During program execution, the current variation in the CPU and memory depends on
the following major factors:
1. The switching activity caused by instruction execution in CPU and load/store activity
to/from memory.
Fig. 2.5: Experimental setup for instruction/program current measurement [5].
2. The cache misses; they lead to CPU or cache stalls and hence require extra cycles,
as mentioned in [4].
The instantaneous current drawn by a CPU varies rapidly with time showing sharp spikes
around clock edges. This is because at clock edges the processor circuits switch (i.e., get
charged or discharged), resulting in switching current consumption. Expensive hardware
with a lot of measurement bandwidth and low noise is required to accurately track the
CPU instantaneous current consumption. From an energy measurement perspective,
however, we are interested in average current consumption. To a first order, battery
lifetime depends on the average current and the amount of stored energy in the cell.
Measuring average current is simpler and can be achieved by using a current meter.
The current meter averages the instantaneous current over an averaging window and
the corresponding readings are stable average values. The average current itself varies
as the program executes.
We captured current variations using an HP54710 oscilloscope. This scope has a
4 Gsample/second sampling rate. In addition, some real-time arithmetic functions can
also be performed on the input signals at single or multiple channels. As shown in
Figure 2.6, current is captured in terms of the differential voltage drop across a 0.01 Ω
sense resistor
Fig. 2.6: Proposed experimental setup for application current measurement at processor and memory.
inserted in the current path (only input to channel 1 is shown, for channel 2 and 3,
the connections for differential input are similar). The input differential voltage drop
at each channel is divided (using oscilloscope divide function) by 0.01 Ω to obtain the
current consumption. When an application runs on the target hardware, switching
activity is produced; this leads to current variations on the current paths at the CPU
core voltage input, the CPU peripheral voltage input and the memory supply voltage input.
Figure 2.7 shows the CPU core current variation captured for a vector quantization
(VQ) application, plotted against the application execution time. The application
is allowed to run until completion, i.e., for 2800 ms. This plot clearly indicates the
current consumption profile at different time instants during the application execution
life cycle. It may be noted that during the application's execution lifetime, one or more
code blocks might have been executed more than once. A basic code block is a piece of
code containing sequential instructions. As mentioned, in the APM, time-capture
monitors are added to the original source code. They are inserted at the
beginning and end of each code block to obtain the number of times that a block is
accessed and the duration of its execution. Correspondingly, we generate an address
range versus current consumption plot. Figure 2.8 shows a plot over the address range
0x600000 to 0x603000 at address interval or step size 1024 bytes (note that we are using
the prefix ’0x’ to show hexadecimal values). There are 13 code blocks in the application.
The length of the vertical bar at address 0x601C00 corresponds to the average current
consumption over the address range 0x601C00 to 0x602000, i.e., one step of 1024 (0x400) bytes.
Fig. 2.7: Current consumption for vector quantization (VQ) application execution life cycle.
Fig. 2.8: CPU core current consumption versus address range for VQ application.
The step size is adjusted according to the granularity of the program's basic code
blocks. It may be set to 256, 512, 1024, 2048 or 4096 bytes. A code block may span a
varying number of address ranges; e.g., in Figure 2.8 code block 1 (CB01) occupies the
address range 0x600000 to 0x600400, while code block 3 (CB03) occupies the address
range 0x600800 to 0x600C00.
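The mapping from a sampled address to its plot bin can be sketched by a small helper (name ours, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Map a sampled address to its histogram bin of width `step` bytes
 * starting at `base` (helper name ours). With base 0x600000 and step
 * 0x400 (1024 bytes), address 0x601C00 falls into bin 7. */
static unsigned addr_bin(uint32_t addr, uint32_t base, uint32_t step)
{
    return (addr - base) / step;
}
```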
Now we formulate the energy consumption. Let a given program source code X have m
code blocks; then
X = {CB_1, CB_2, CB_3, ..., CB_m}.
The total energy of the program code is the sum of the energy consumed in the
individual code blocks. If the j-th code block CB_j is executed from time t_a to t_b,
then the energy consumed by the code block is

E_{CB_j} = \int_{t_a}^{t_b} P_{CB_j} \, dt.   (2.2)
The power consumed by each individual code block is the sum of the power consumed
over the code block's address ranges. For code block CB_j it is

P_{CB_j} = \sum_{i=1}^{n} P_i,

where n is the total number of address ranges (vertical bars) and P_i is the power
consumed in the i-th address range:

P_i = v_c i_{c,i} + v_p i_{p,i} + v_m i_{m,i}.   (2.3)
Here v_m, v_c and v_p are the operating voltages of the memory, the processor core and
the processor peripheral, respectively. Similarly, i_{c,i}, i_{p,i} and i_{m,i} are the
current consumptions of the processor core, the processor peripheral and the memory,
respectively, for the i-th address range. For our experiments we set v_m = 3.3 V,
v_c = 2.2 V and v_p = 3.3 V. We obtain the execution time for each address range from
the time base scale of the oscilloscope.
Example: We capture the current consumption in memory, CPU core and CPU
peripheral for the address range 0x200400 to 0x202C00 in the G-728 audio transcodec
[94] at a step size of 0x400 (note that the current consumption is captured in terms of
execution time and later converted to address ranges using the procedure mentioned
above). They are shown in Figure 2.9 to Figure 2.11, respectively. This address range
corresponds to the code block (function) 'code book' in the baseline source code. The
shape of the current consumption profile depends on the coded algorithm; here we
restrict our discussion to elaborating the energy measurement
methodology. There are nine current consumption bars in Figure 2.9 to Figure 2.11,
corresponding to nine address ranges. The power dissipated in memory for 'code book' is

v_m \sum_{k=1}^{9} i_{m,k},

where k indexes the address ranges of the code block ('code book').
Then the power consumed by the 'code book' in memory is:
3.3 V × (4.37 + 8.02 + 0.47 + 1.90 + 2.91 + 2.60 + 6.63 + 7.25 + 0.35) mA = 113.85 mW.
Similarly, from Figure 2.10 and Figure 2.11 the power consumed in CPU core and CPU
peripheral is 73.01 mW and 166.83 mW, respectively.
The total power consumption is P_total = P_memory + P_CPUcore + P_CPUperipheral,
i.e., P_total = 353.69 mW.
From the oscilloscope time base the sample duration for the ’code book’ is 0.267 seconds.
Thus the total energy consumed by the ’code book’ is 94.44 mJ.
The total energy consumed by the program is obtained by summing the energy con-
sumption of each individual code block.
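Eq. (2.3) and the 'code book' figures above can be reproduced numerically. The helper names below are ours; the nine bar readings are the values quoted from Figure 2.9, and summing them gives 34.50 mA, i.e. 113.85 mW on the 3.3 V memory rail:

```c
#include <assert.h>
#include <math.h>

/* Eq. (2.3): power of the i-th address range from its three measured
 * currents (in amperes) and the fixed supply voltages of the setup. */
static double range_power(double ic, double ip, double im)
{
    const double vc = 2.2, vp = 3.3, vm = 3.3;   /* volts */
    return vc * ic + vp * ip + vm * im;
}

/* Memory-side power of the 'code book' block: v_m times the sum of the
 * nine bar readings of Fig. 2.9 (milliamperes), giving milliwatts. */
static double codebook_memory_power_mw(void)
{
    const double im_ma[9] = {4.37, 8.02, 0.47, 1.90, 2.91,
                             2.60, 6.63, 7.25, 0.35};
    double sum_ma = 0.0;
    for (int k = 0; k < 9; k++)
        sum_ma += im_ma[k];
    return 3.3 * sum_ma;
}

/* Energy in mJ = average power in mW times the sample duration in s. */
static double block_energy_mj(double p_total_mw, double duration_s)
{
    return p_total_mw * duration_s;
}
```

With the CPU core and peripheral contributions from Figures 2.10 and 2.11 added, the total power over the 0.267 s sample duration yields the roughly 94.44 mJ quoted in the text.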
Fig. 2.9: Memory current consumption versus address range for G-728 audio transcodec.
2.6 Conclusions
In this chapter we laid the foundation for the development of our energy-cycle aware
iterative compilation framework. Our methodology optimizes a software application for
energy consumption, execution time, as well as efficient hardware architecture utilization.
Fig. 2.10: CPU core current consumption versus address range for G-728 audio transcodec.
Fig. 2.11: CPU peripheral current consumption versus address range for G-728 audio transcodec.
We elaborate our method for generic multimedia processors and define a software
application in terms of its architectural behavior. We provide a simplified overview of
typical multimedia processors. Unlike conventional complex multimedia operation
models, we use a simplified multimedia operation model that views the instruction set
in terms of load/store operations, compute operations, special register operations and
control flow operations. We elaborate the importance of measuring the energy
consumption of an application on a realtime platform, which is a first step in any
energy-constrained embedded system and can be used to estimate the battery lifetime
of the system. We present a measurement platform that is generic and applicable to
most off-the-shelf multimedia processors. It is based on current measurement at both
the processor and memory input lines. We propose a simplified energy consumption
model based on code blocks, and expose a step-by-step procedure for measuring a
software application's energy consumption on a target hardware architecture. Compared
to contemporary work, our framework is tested on two major application domains,
multimedia and bioinformatics. The multimedia application set consists of encoders
and decoders (transcodecs) encompassing three media types (speech, video, and audio),
whereas we categorize the basic functionality offered by all bioinformatics tools into
four groups: pattern recognition algorithms, rule-based analysis, biological databases
and biological taxonomy. Moreover, our results reveal the utility of our framework
across diversified application domains.
3. GRADIENT MODE ITERATIVE
COMPILATION (GMIC)
In Chapter 2, a framework was presented for executing a single application under
several source transformation settings. The basic idea is to first identify the compute-
and data-intensive code blocks in programs, which we call Energy-Cycle Hungry Code
Blocks (ECHCB), and then execute a series of experiments, with each ECHCB assigned
a predetermined Transformation Scheme (TS). A simplified flow of the methodology is
shown in Figure 3.1.
As explained in Chapter 2, we obtain the application expression in our ECACF that is
further used by a Transformation Engine block and Code Evaluation block as shown in
Figure 3.1. Based on the desired objectives, the transformation engine decides whether
a given application should go through successive transformations and hence
recompilation. If the energy-cycle constraints in the UCF are not met, the
transformation engine block transforms the code according to the Gradient Mode
Iterative Compilation (GMIC) algorithm and provides it to a native Application Build
Environment block. This block produces the machine code for the transformed
application source code, which is then executed on the target platform to obtain the
dynamic application expression profile. The whole process is repeated until each
successive transformation meets the desired optimization objective as given in the UCF.
We implement our scheme by first investigating the 'C' source code of the application for
cycle- and energy-taxing blocks, based on trace data collected during a profile of the appli-
cation as described in Chapter 2. For χ code blocks in an application and λ possible
TS, there are λ^χ unique solutions, where a solution is an assignment of a transforma-
tion scheme to each code block. We present a novel heuristic that helps to search the
solution space and eventually finds solutions (or transformation schemes) that satisfy the
desired energy-time tradeoff for a given application. In each step the heuristic takes one
code block and tries to optimize it with the available set of transformation schemes.
It proceeds to the next code block only when the previous code block is optimized or
when no more TS are available. The minutiae of the proposed heuristic are elaborated in
Section 3.2.
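The gap between exhaustive and greedy search can be made concrete with a quick back-of-the-envelope computation, using χ = 5 candidate blocks and λ = 4 schemes (the values of the MPEG-1 example later in this chapter):

```python
# Search-space sizes for the transformation-assignment problem.
# chi = number of candidate code blocks, lam = number of schemes.
chi, lam = 5, 4

brute_force = lam ** chi   # every block/scheme assignment: 4^5 = 1024 builds
greedy_max = lam * chi     # the per-block heuristic: at most 4*5 = 20 builds

print(brute_force, greedy_max)   # 1024 20
```

Already at this small size the greedy heuristic needs two orders of magnitude fewer program builds than brute force.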
In the following section we explain our profiling technique for determining an efficient
TS for each code block. First, it describes our scheme for identifying and prioritizing
candidate code blocks in the source code. Second, it discusses the mechanism that
collects performance data and the energy-time tradeoff. Finally, it discusses our method
for choosing an assignment of a TS to each code block.

Fig. 3.1: Gradient Mode Iterative Compilation Methodology (GMIC).
3.1 GMIC Architecture
We use a straightforward programming model, which primarily applies to multimedia
and streaming applications. Specifically, it starts by obtaining a trace of the application
in question, which we call the baseline code. From there, it divides the program into Code
Blocks (CBs). A code block is composed of procedure blocks and independent sequential
code blocks. The division is performed by examining the trace and using an ad hoc approach
that conforms to the following principles:
First, all CBs whose essential profile is larger than the profile specified in the UCF
are considered. In the UCF they are listed with their cyclomatic complexity, nesting
depth, and paths.
Second, priority conflicts are resolved by the weighted values of the CBs, i.e., if two CBs
have the same access profile, then the CB with the highest cyclomatic complexity is
considered first, and so on. For the latter rule, the priority is indexed as access rate,
cyclomatic complexity, nesting depth, and paths.
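The two principles above can be sketched as a threshold filter followed by a sort with a composite key, access rate first and the remaining metrics as tie-breakers. The block records and the threshold below are illustrative, not taken from the framework:

```python
# Hypothetical CB records: (name, access_rate, cyclomatic, nesting_depth, paths).
code_blocks = [
    ("fb07", 120, 14, 3, 9),
    ("fb17", 120, 22, 4, 12),   # same access rate as fb07, higher complexity
    ("fb31", 300, 8, 2, 5),
]

# Rule 1: only CBs whose profile exceeds the UCF threshold are considered.
ucf_access_threshold = 100
candidates = [cb for cb in code_blocks if cb[1] > ucf_access_threshold]

# Rule 2: priority indexed as access rate, cyclomatic complexity,
# nesting depth, paths (all descending).
queue = sorted(candidates, key=lambda cb: cb[1:], reverse=True)

print([cb[0] for cb in queue])   # ['fb31', 'fb17', 'fb07']
```

Tuple comparison gives the tie-breaking for free: fb17 and fb07 share an access rate, so fb17 wins on cyclomatic complexity.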
Fig. 3.2: Fraction of JPMO per CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.
3.1.1 Performance Qualifier Measurement
We introduce Joules Per Million of Operations (JPMO) as a performance measure
for the selection of candidate CBs. This measure is computed as the average energy con-
sumption per CB per million operations. We have found JPMO to be effective in
determining energy-cycle-hungry code blocks. For example, Figure 3.2 gives an example
of how JPMO varies in an MPEG-1 video encoder, and Figure 3.3 shows a window of
7 CBs. Here, JPMO clearly helps to partition the code into ECHCBs. These CBs were
determined by hand, but there is potential to automate the partitioning.
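JPMO as defined above is a plain ratio of measured energy to operation count; a minimal sketch, with invented measurement values for illustration:

```python
def jpmo(energy_joules: float, operations: int) -> float:
    """Joules Per Million of Operations for one code block."""
    return energy_joules / (operations / 1e6)

# Illustrative measurements for two code blocks.
assert jpmo(0.5, 10_000_000) == 0.05   # 0.5 J over 10 M ops -> 0.05 JPMO
assert jpmo(0.5, 1_000_000) == 0.5     # same energy, fewer ops -> a hungrier block
```

Ranking blocks by this value directly yields the descending ECHCB order used throughout the chapter.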
3.1.2 Code Block Queuing
Our method for transforming ECHCBs requires queuing these blocks such that blocks
with a similar expression profile, which most likely benefit from the same TS, are adjacent.
This reduces the search time for finding the transformation scheme of the next candidate
code block when it is similar in expression profile to the previous one: the same TS can
then be applied to the next CB. Thus, our approach requires distinguishing a code block that
has a good energy-time tradeoff from one that does not. That is, we must estimate the
effect on the energy consumption and execution time of the block when executing such
blocks in a successive transformation. The key here again is JPMO (introduced above),
which identifies CBs that have a good energy-time tradeoff.
Fig. 3.3: Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view of seven blocks).
3.1.3 Code Block Expression Profile
The first step gathers profile data during an execution of the program. Application
expression data are collected for the different code blocks, following the scheme
described in the next section. The information we collect includes the type of function
call and its location (program counter). It shows the status (TS, time, energy, etc.) and
metrics (number of useful instructions, instruction cache misses, data cache misses, etc.).
The extraction of the code block expression profile is explained in Chapter 2.
3.1.4 Transformation Scheme
The dynamics of GMIC is powered by a set of transformation schemes. They are chosen
according to their rate of appearance during compilation. We examined a wide range
of transformation schemes and grouped them from the highest rate of appearance in
compilation to the lowest [68]. We use four sets of transformation schemes (TS1,
TS2, TS3, TS4); they are listed in Table 3.1. The second column gives the optimizations
corresponding to each TS. Some transformations use one or more lower-
level transformation schemes as well. Overall, loop transformation is considered the
most beneficial in our framework, but its blind use may lead to increased cache misses and
eventually high energy consumption. The third column shows the rate of the TS in the
order of their appearance in conventional DSP compilers; the rate decreases from TS1
to TS4. The fourth column shows the optimization level of each transformation scheme.
The sequence of transformations is in order of aggressiveness, as shown in the last
column of Table 3.1. TS1 is at the lowest optimization level. Our algorithm increases the
level of the transformation scheme according to the performance objective defined in the
performance tuple ρ(Energy, Cycles, Cache Misses, Functional Unit Utilization).
Note that the proposed sequence may not be the best one; we found it effective
across our benchmark applications.
Transformation Scheme | Optimization Types | Rate | Optimization Level
TS1 | Basic block; Value propagation; Hoisting loop invariants; Variable optimization | Highest | Low
TS2 | TS1; Functional block; Loop normalization; Break up large expression trees; Loop optimization | High | Medium
TS3 | TS1; Global optimization; Dismantle array instructions; Loop optimization | Medium | High
TS4 | TS2; Aggressive decision tree grafting; TS3 | Low | Highest

Tab. 3.1: Transformation Schemes.
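Table 3.1 can be encoded directly as data for a transformation engine. A sketch in which, following the second column, the higher schemes include the lower ones (the string names are our shorthand for the optimizations in the table):

```python
# Transformation schemes from Table 3.1; higher schemes include lower ones.
TS = {
    "TS1": ["basic block", "value propagation",
            "hoisting loop invariants", "variable optimization"],
}
TS["TS2"] = TS["TS1"] + ["functional block", "loop normalization",
                         "break up large expression trees", "loop optimization"]
TS["TS3"] = TS["TS1"] + ["global optimization", "dismantle array instructions",
                         "loop optimization"]
TS["TS4"] = TS["TS2"] + ["aggressive decision tree grafting"] + TS["TS3"]

# Optimization level (last column) and the GMIC order of increasing aggressiveness.
LEVEL = {"TS1": "low", "TS2": "medium", "TS3": "high", "TS4": "highest"}
SEQUENCE = ["TS1", "TS2", "TS3", "TS4"]
```

Encoding the nesting explicitly makes the "TS4 subsumes TS2 and TS3" relation checkable rather than implicit in prose.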
3.2 Implementation
The gradient mode iterative compilation is steered by the algorithm depicted in Fig-
ure 3.4. Our ultimate goal is to find a transformation scheme that is "acceptable" for the
energy-cycle constraints. Determining which of two solutions is "better" depends
on how a user wants to trade off energy savings and time delay. Given a program
partitioned into executable code blocks, we proceed to our method for determining an
effective assignment of transformations to code blocks. If there are χ code blocks and
λ transformation schemes, then the number of possible solutions (block-transformation
combinations) for the program is λ^χ. In general, this search space is too large to explore by
brute force. Therefore, the second part of our method is a heuristic that we use to find
the "best" solution. The heuristic finds the "best" TS for a CB, then moves on to the
next code block. Once it moves on to another CB, the TS for the preceding CB has
been determined. Therefore, it is important that the CBs are sorted. Initially, the solution
(a vector of TS) is set to the baseline value, all zeroes. The recursive function is invoked
on the 0th code block. It executes the program using the next TS in this CB (all other
CBs are as before). If the energy-time tradeoff (defined in the UCF) of this new solution is
better than that of the current solution, it is accepted. The algorithm then recursively tries
the next aggressive TS on this CB.
The TS is determined when the new tradeoff is worse than the current one or when there
are no TS left. After setting the TS, the heuristic moves on to the next code block. It runs at
most λ·χ times. After each program execution, the energy and time are measured and
compared via the user-defined relationship. For our tests, we use a simple and intuitive
evaluation of the tradeoff based on the slope of the line between two solutions. The
slope (Ω) is defined as the ratio of energy savings to time delay:
slope (Ω) is defined as the ratio of energy savings to time delay:
Ω =Jk − Jk+1
Ck − Ck+1(3.1)
Where, k and k+1 are two consecutive solutions, J=Energy consumption, and C=Execution
times.
The following conventions are implicitly true:
• Ω = -1 (i.e., 45 degrees below the horizontal) means savings and delay are equally
weighted.
• Ω = 0 means minimize energy.
• Ω = ∞ means minimize time.
We consider a new solution with a larger slope (in magnitude) than the user-defined
limit to be better. We advocate this metric because it is reasonable and it is easy to
visualize.
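The slope test can be sketched in a few lines. The sample values below are invented, chosen only to reproduce the ≈ -25.667 slope of the first row of Table 3.2:

```python
def slope(j_k, j_k1, c_k, c_k1):
    """Energy-cycle slope between consecutive solutions k and k+1, Eq. (3.1)."""
    return (j_k - j_k1) / (c_k - c_k1)

def in_window(omega, lo, hi):
    """True if the slope lies inside the user-defined window (lo, hi)."""
    return lo < omega < hi

# Illustrative: solution k+1 saves 7.7 J while adding 0.3 ms relative to k.
omega = slope(j_k=100.0, j_k1=92.3, c_k=10.0, c_k1=10.3)
print(round(omega, 3))   # ≈ -25.667: a steep slope
```

A candidate is accepted only when `in_window` holds for the current gradient window, as in the "Direction" column of Table 3.2.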
3.3 Example: Optimization of an MPEG-1 encoder
In this section we elaborate our methodology step by step with an example. We study
an MPEG-1 encoder (MPEGencoder) in depth. In the energy-cycle graph shown in
Figure 3.5, the baseline is marked as the 'bb' point, where no optimization is applied.
Note that the higher of two points uses more energy, and the point further to the right
takes more time. For all other points, at least one CB is transformed. Each point
is labeled with a tuple; e.g., point 21 means the 2nd code block with transformation
scheme 1.
Our analysis of the JPMO identified five code blocks in MPEGencoder: CB7, CB17,
CB18, CB31, and CB33. For convenience, we use the pseudonyms CB1, CB2, CB3, CB4,
Definitions:
X = {CB01, CB02, ..., CBm} is an application vector composed of Energy Cycle Hungry Code Blocks (ECHCBs) in descending order.
TS = {TS1, TS2, TS3, TS4} is the Transformation Scheme (TS) vector composed of the available transformations.
ρ(Total Energy Consumption, Execution Time, Cache Misses, Functional Unit Utilization) is the performance tuple as specified in the user constraint file (UCF).
Ω is the energy-cycle slope between two consecutive solutions.
S_sts_count is the executable application binary that has been transformed sts_count times.
S_0 is the executable application binary optimized by the native compiler for minimum execution time.

Source-to-source transformation parameters:
1. X, the array of ECHCBs
2. TS, the array of available transformation schemes
3. W, the slope window
4. sts_count

S_sts_count ← StS(X, TS, W, sts_count)  // source-to-source transformation call to proceed to the optimal S_f

Algorithm StS(X, TS, W, sts_count):
  Build the application S_sts_count and obtain the AEP
  Compute the performance tuple from the AEP and store it in ρ
  If ρ satisfies the UCF: return X
  Initialize CB_count, TS_count
Next_Iter:
  Get CB_{CB_count}
  Apply TS_{TS_count}; sts_count++
  Build the application S_sts_count and obtain the AEP
  Compute the performance tuple from the AEP and store it in ρ'
  If ρ' satisfies the UCF: return X
  Compute Ω for ρ and ρ'
  If Ω ∈ W:
    /* the slope lies within the user-defined limits W: the current TS for the
       current CB is acceptable, so make it an anchor point for the next iteration */
    CB_ter = CB_count; TS_ter = TS_count
  else:
    /* the current TS for the current CB is not satisfactory; keep the previous
       TS and get the next code block */
    CB_count++; TS_count = TS_ter
  If TS_count++ > TS_max:  /* all TS applied, proceed to the next code block */
    CB_count++
  If CB_count > m:  /* all code blocks have been considered for transformation */
    return X
  Goto Next_Iter
Fig. 3.4: GMIC algorithm.
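The flow of Figure 3.4 can be restated as a runnable sketch. The build-and-profile step is stubbed out as a caller-supplied evaluate function returning (energy, cycles); the names and the exact anchoring bookkeeping are ours, not the framework's:

```python
def gmic(blocks, schemes, window, evaluate, satisfies_ucf):
    """Greedy gradient-mode assignment of one TS per code block.

    blocks        -- ECHCB names in descending JPMO order
    schemes       -- transformation schemes, least to most aggressive
    window        -- (lo, hi) acceptable slope limits from the UCF
    evaluate      -- assignment dict -> (energy_J, cycles)
    satisfies_ucf -- (energy_J, cycles) -> bool
    """
    assignment = {cb: None for cb in blocks}       # baseline: no TS applied
    energy, cycles = evaluate(assignment)
    for cb in blocks:
        if satisfies_ucf(energy, cycles):
            return assignment                      # UCF met: stop early
        for ts in schemes:
            trial = dict(assignment)
            trial[cb] = ts
            e2, c2 = evaluate(trial)
            if c2 == cycles:                       # no cycle change: slope undefined
                continue
            omega = (energy - e2) / (cycles - c2)  # Eq. (3.1)
            if window[0] < omega < window[1]:      # acceptable tradeoff: anchor it
                assignment, energy, cycles = trial, e2, c2
            else:                                  # worse tradeoff: next code block
                break
    return assignment
```

With a toy cost model in which each applied scheme saves 10 J and costs one cycle unit, the loop anchors one scheme per block and terminates after at most λ·χ evaluations, matching the complexity argument above.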
Fig. 3.5: Heuristic track of CT-Tuple for an MPEG-1 encoder application.
CB5, listed in descending order w.r.t. JPMO. We also assign a unique tuple to each
code block/transformation scheme assignment, written CTxy, where x is the number of the
candidate CB and y is the number of the TS. In this example x = 1, 2, 3, 4, 5 and
y = 1, 2, 3, 4; e.g., CT14 means code block 1, optimized with transformation scheme 4.
Any point on the graph shows a unique solution of our heuristic; note
that it is not an exact solution, but one that satisfies the constraints. Initially, the application is
optimized for minimum execution time, without taking into account optimal architecture
utilization, energy consumption, etc. Such a solution is labeled "00", meaning our heuristic
is in an inactive state for the source code. In the figure we have drawn a polygonal line
connecting 5 points with the bb point, forming an Energy Cycle Bay (ECB) of
"good" choices under a simple slope-based energy/execution-time metric. Any solution
"inside" the ECB is not a "good" choice according to this metric, except those lying
on the lower edge of the bay. Recall that the goal of our profiling algorithm is to do
"better" than the baseline. As our chosen metric is the slope between two solutions, Ω,
our algorithm will select a unique point on the ECB. This is illustrated in Table 3.2, which
shows the CTxy tuple track for only four gradient windows. The steps of the algorithm
are shown in the table. The first column shows the user-defined slope window for
the convergence of the heuristic to the closest solution on the ECB. The second column
shows the two solutions being compared, and the arrow indicates the direction of the
slope. Of these two solutions, the energy and time of the one on the left are known, so
only one run of the program is necessary to complete this step. The third column shows
the EC slope, and the fourth column indicates whether it lies within the user-defined
limit, i.e., column one. The last column shows the final point that is anchored by the
heuristic to find the next optimum.
Slope Window | CTxy Tuple Track | Ω | Direction | Solution
Ω < -10 | 00 → 11 | -25.667 | True | 11 is anchored
        | 00 → 52 | -2.48 | False |
-10 < Ω < -3 | 11 → 12 | -3 | False |
             | 11 → 21 | -3.23 | True | 21 is anchored
-3 < Ω < -1 | 21 → 31 | -2.7 | True | 31 is anchored
-1 < Ω < 0 | 31 → 32 | -1.2 | False |
           | 31 → 41 | -0.87 | True | 41 is anchored

Tab. 3.2: Gradient Table.
In the first row of Table 3.2, Ω is steep (a large negative number). Such a value favors
time delay over energy savings. In this case and all others, the algorithm selects an
appropriate TS in the order of its optimization level, i.e., from lowest to highest. Thus,
we start with the candidate solution 11. In the first step, the slope from 00 to 52 is
greater than Ω (not as steep); therefore, it is rejected. We back up to the previous
solution (11) and try the next TS. Next, solution 12 is rejected. Thus, the algorithm
selects the next most energy-cycle-hungry code block, 2, with the last successful TS, 1. Here,
we find a slope Ω that is sufficiently steep; in this case, 21 is accepted and 12
is rejected. The algorithm had moved to the next aggressive transformation scheme, 2, for
the same code block 1, and thereby determined the transformation for the first code block,
ultimately selecting 11. Similarly, in the next case the slope is within the limit, resulting in
the CT31 tuple. Following this, from 31 to 32 the limit is slightly lower (steeper),
which results in the acceptance of the CT41 tuple and the rejection of the
CT32 tuple. In this section we have focused on elaborating our technique while
ignoring the energy-cycle benefits; although they have implicitly been met, we defer
this discussion to the next section.
3.4 Discussion
Figure 3.6 to Figure 3.10 show results for the five benchmark applications that we
executed: FFT, IDCT, T64, M100, and H264L [105]. We plot all applications
according to their average time of execution, as explained in Chapter 2. Each graph
shows execution time in milliseconds against energy consumption. Based on the impact
of our heuristic on the optimization of these applications, we observe different types of
slope sensitivities when tracking solutions from one anchored CTxy tuple to the next candi-
date. The applications are discussed below in the context of their track slope sensitivities.
Low EC Gradient Applications: We start our energy-cycle benefit discussion with
Figure 3.5, where the baseline code was at the extreme right side of the graph. Our
heuristic was aiming for low energy and a low cycle count for the MPEG-1 encoder
application. Tracking from the CT00 tuple to the CT11 tuple reduced the energy by 6% with
a time penalty of 3%. Following the whole track, the energy-cycle benefit
from the CT00 tuple to the CT41 tuple is -89%/10%, meaning that energy decreases by
10 percent while execution cycles increase significantly, by 89%, compared to the
CT00 tuple. Note that the CT00 tuple is the rightmost point on the graph, which
aims for minimum execution cycles at maximum energy cost. Therefore the CT41 tuple is
a tradeoff between energy gains and an execution cycle penalty compared to the
baseline code: our algorithm saves 4% energy at a penalty of 5% in execution cycles.
Fig. 3.6: Heuristic track of CTxy tuple for FFT application.
Fig. 3.7: Heuristic track of CTxy tuple for IDCT application.
Fig. 3.8: Heuristic track of CTxy tuple for T64 application.
Figure 3.6 shows the heuristic flow for the Fast Fourier Transform (FFT) algorithm,
where the cycle/energy gain is 77%/-10%. Being compute intensive, such algorithms
are always energy consuming, but their loop structure makes them a favorite choice for
energy reduction, especially when they are used for high-order filters. Similar behavior can
be observed in Figure 3.7 and Figure 3.8 for the IDCT and the T64 application. Overall,
the behavior of the IDCT application varies widely: each code block benefits from
the next TS, hence the slope is steeper. For T64 it is important to consider the
array size; arrays that favor high localization in the on-chip cache memory show less
energy dissipation.
High EC Gradient Applications: H264L shows a significant gain in execution cycles
as well as in energy reduction (see Figure 3.10). Compared to the baseline code, the
time penalty is 17% at an energy saving of 32%. H264L is mostly used in handheld
devices, where energy saving is of prime importance. Our profiling shows that the H264L
source code has a large number of localized procedure calls that fit well in an on-chip
cache. In the same vein, the size of the input frame sequences also suits the
size of our data cache.
Non-sensitive Applications: In the case of a matrix multiplication of order 100 (M100), our
heuristic shows no benefit (see Figure 3.9). M100 has only one main procedure call, no
communication (except once at the end of the program), and is CPU bound. Therefore,
transformation schemes have no effect on the program's performance. In such cases the
native compiler is sufficient to produce an optimal application.
Fig. 3.9: Heuristic track of CTxy tuple for M100 application.
Fig. 3.10: Heuristic track of CTxy tuple for H-264L application.
3.5 Conclusions
In this chapter we introduced our slope-directed technique to drive the iterative compi-
lation in our energy-aware framework. The 'C' source code is divided into code blocks
depending on their energy cost. We introduced Joules Per Million of Operations as a
performance measure for a CB. The execution time and energy consumption of transformed
applications are compared against the user constraints. Once a solution is achieved
for the highest energy cost, our heuristic starts tracking the next available low-energy
transformation. Successively, it finds lower energy solutions at the cost of a time penalty.
Our technique is sensitive to the order of the code blocks. Due to the greedy search, our
heuristic's search for the next available CT tuple is very slow. We improve on this in the next
chapter, where the optimization objectives are modeled as a multiobjective problem and the
solution space is searched with the help of a genetic algorithm.
4. MULTICRITERIA STOCHASTIC ITERATIVE COMPILATION (MSIC)
4.1 Introduction
In contrast to a general-purpose computer, an embedded system typically runs one appli-
cation for its lifetime. With GMIC as proposed in Chapter 3, only a moderate improvement
is achieved, as it effectively restricts itself to trying different back-end optimizations.
The major impediment to such an approach is the heuristic search technique itself. In this
chapter we consider the optimization problem as a single task, where all desired aims
have to be taken into account simultaneously. The new method is based on the opti-
mization of a multicriteria objective function. The desired aims of architecture-based
energy-cycle optimization are formulated as penalty terms of such an objective function.
The maximization of the objective function is achieved using a Genetic Algorithm (GA).
A simplified flow of the methodology is shown in Figure 4.1. As explained in Chapter 2, we
obtain the application expression in our ECACF, which is further used by a Transformation
Engine block and an MSIC block as shown in Figure 4.1. Based on the desired objec-
tives, the transformation engine decides whether a given application should go through
successive transformations and hence compilation. If the energy-cycle constraints in the
UCF are not met, the transformation engine block transforms the code according to the
Multicriteria Stochastic Iterative Compilation (MSIC) algorithm and provides it to a na-
tive Application Build Environment block. This block produces the machine code for
the transformed application source code, which is then executed on the target
platform to obtain the dynamic application expression profile. The whole process is re-
peated until each successive transformation meets the desired optimization objective as
specified in the UCF.
In the next section we propose source-level optimization as a multicriteria problem. We
expose the minutiae of our methodology, e.g., the selection of constraints, the development of
the fitness function, and the formation of the Hertz Matrix (HM). We discuss two
multimedia applications in depth to elaborate the advantages of the proposed algorithm.
Fig. 4.1: A simplified view of framework with multicriteria methodology extension.
4.2 Model Development
Multicriteria optimization is very different from single-objective optimization. In the
latter, the aim is to obtain the best design, which is usually the global minimum or
global maximum, depending on the desired objective. In the former, there may not
exist one solution that is the best with respect to all objectives.
Instead, there exists a set of solutions that are superior to the rest of the solutions in the
search space when all objectives are considered, but inferior to other solutions
in the space in one or more objectives. These solutions are known as Pareto-
optimal solutions or nondominated solutions [106]. Since genetic algorithms work with
a population of points, a number of Pareto-optimal solutions may be captured using
GAs. A genetic algorithm belongs to the class of stochastic optimization methods
[106] [6] [107]. Although it does not guarantee finding the globally optimal solution,
the result is typically a good approximation of it. The GA works in parallel with many
feasible solutions (individuals) by operating on these solutions. Because it works with
many solutions in parallel, it is improbable that the genetic algorithm stalls in a local
optimum, and thus likely that it finds the global solution. The algorithm is well suited to
our problem, where the objective function is non-smooth, non-differentiable, and
discontinuous, because the GA does not demand any of these properties. However, the
following two properties regarding the search space and objective function are demanded:
• Firstly, every point of the search space must be codable as a finite-length
string.
• Secondly, every point of the search space must have a positive fitness described
by the objective function.
Assume that all possible transformations are known. This assumption is sound because
the optimization space is in practice limited by architectural constraints, e.g., the number
of available functional units, or the best fit of a code block in the cache. Using the AEPs,
the transformation scheme is determined for every possible code restructuring.
4.2.1 Objectives and Constraints
We have two objectives for the optimization:
1. Instructions per cycle (η) and
2. Energy saving (ξ).
For every measured point of the optimization space, it is observed that:
• The successive architecture utilization (in terms of functional units, internal reg-
ister usage, best cache fit) must be greater than a predefined, system dependent
limit (i.e., execution cycle and energy threshold).
• The predecessor transformation scheme must overlap the successor in order to
follow a smooth optimization. The smooth optimization over two samples of code
is defined by the minimum and maximum limits of the transformed code. If the
output profile of the code is between these limits, this point must lie on a smooth
curve for optimization.
The problem is now to find the number γ, γ < Γ, of Γ transformation possibilities and
their yielded code profile (i.e., AEP) that maximizes our two objectives. We formulate
the above optimization problem as the following multicriteria optimization problem with
two components η and ξ:

MAX_ρ f(ρ) = MAX_ρ [α η(ρ) + β ξ(ρ)]     (4.1)

subject to an individual ρ and two possible weighting terms α and β, which are explained
below.
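Equation (4.1) is a weighted sum of two normalized fractions; a minimal sketch, with the ranges 0 ≤ η, ξ, α, β ≤ 1 made explicit:

```python
def fitness(eta, xi, alpha, beta):
    """f(rho) = alpha*eta(rho) + beta*xi(rho), cf. Eq. (4.1).

    eta   -- achieved fraction of IPC, 0 <= eta <= 1
    xi    -- fraction of points where the energy saving is fulfilled, 0 <= xi <= 1
    alpha, beta -- weight factors, 0 <= alpha, beta <= 1
    """
    assert 0.0 <= eta <= 1.0 and 0.0 <= xi <= 1.0
    assert 0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0
    return alpha * eta + beta * xi

# alpha=1, beta=0 optimizes IPC only; alpha=beta=0.5 weights both equally.
assert fitness(0.75, 0.4, alpha=1.0, beta=0.0) == 0.75
assert fitness(1.0, 0.5, alpha=0.5, beta=0.5) == 0.75
```

Since both terms are non-negative, the fitness is positive whenever either objective is partially met, satisfying the GA's positive-fitness requirement stated above.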
Algorithm Flow of the GA: We use the GA and consider ρ as an individual. An individual
contains information about the transformation space and the previous iteration. Our popula-
tion for this multiobjective GA is composed of dominated and nondominated individuals.
The basic outline of the algorithm is derived from a steady-state genetic algorithm given in
[6], where only one replacement occurs per generation. The first modification we have
made to the GA lies in the selection step: the selection phase implements roulette
wheel selection. The crossover and mutation operators are then applied. The crossover
is applied to both selected individuals, generating one child. The mutation is applied
to the best individual. The best resulting individual is integrated into the population,
replacing the worst-ranked individual in the population. Figure 4.2 presents the model of
the algorithm. Initial solutions are randomly generated using a uniform random number
of transformation schemes. As a result, the initial population is spread across the search
space in terms of the number of transformation schemes.
Fig. 4.2: Simplified Genetic Algorithm Model [6].
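The steady-state loop described above can be sketched as follows: roulette-wheel selection of two parents, one child by crossover, mutation of the best individual, and replacement of the worst-ranked member, one per generation. Individuals here are simply vectors of scheme indices, the fitness function is a caller-supplied stand-in, and details such as one-point crossover are our assumptions:

```python
import random

def roulette(pop, fit):
    """Roulette-wheel selection: probability proportional to (positive) fitness."""
    total = sum(fit(ind) for ind in pop)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for ind in pop:
        acc += fit(ind)
        if acc >= pick:
            return ind
    return pop[-1]

def steady_state_ga(fit, n_blocks, n_schemes, pop_size=20, generations=200, seed=1):
    """Steady-state GA: one replacement per generation, after the outline of [6]."""
    random.seed(seed)
    # Individuals are vectors of scheme indices, one gene per code block.
    pop = [[random.randrange(n_schemes) for _ in range(n_blocks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        parent_a, parent_b = roulette(pop, fit), roulette(pop, fit)
        cut = random.randrange(1, n_blocks)          # one-point crossover
        child = parent_a[:cut] + parent_b[cut:]
        mutant = max(pop, key=fit)[:]                # mutate the best individual
        mutant[random.randrange(n_blocks)] = random.randrange(n_schemes)
        winner = max((child, mutant), key=fit)       # best resulting individual
        worst = min(range(pop_size), key=lambda i: fit(pop[i]))
        pop[worst] = winner                          # replace the worst ranked
    return max(pop, key=fit)
```

The single replacement per generation keeps the population diverse while steadily pulling the worst member toward better scheme combinations.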
Development of the Fitness Function and Selection of Weights: The first term of
the fitness function in Equation (4.1), 0 ≤ η(ρ) ≤ 1, denotes the achieved fraction
of the Instructions Per Cycle (IPC) for the total transformation space. The second
term, 0 ≤ ξ(ρ) ≤ 1, denotes the fraction of points where the energy saving is fulfilled.
The coefficients 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1 are weight factors for the criteria and define
the importance of the different criteria with respect to each other; e.g., if α=1 and β=0
the method optimizes only IPC, and for α=0.5 and β=0.5 the method optimizes
the overlapped IPC and energy functions. The values of α and β depend on the user
requirements associated with the available CPU cycles and energy budget for a candidate
application.
The Choice of Individuals: The individual sample points in the transformation space
are chosen with a uniform probability distribution. They are later profiled by evaluating
the application expression profile on the target architecture. The selected
individual transformations are updated based on their success, i.e., the IPC and energy sav-
ing factor of the sequence as a whole. The constraints are modeled as a penalty term
of the fitness function f(ρ). Transformations contributing to better performance are
rewarded, while those resulting in performance losses are penalized. Thus, future sample
points are more likely to include previously successful transformations
and to search their neighborhood more intensively.
4.2.2 Case Study I - Arbitrary Application
As an example we solve the following code optimization problem.
We assume that the search space consists of 29 transformation schemes:
• 7 loop transformations,
• 12 variable operations,
• 5 data packaging schemes,
• 5 cache optimizations.
In addition it contains 20,000 transformation points (the resolution is controlled by steering
factors such as grafting depth, cache block size, etc. [25]). For simplicity, their IPC is
considered only for the useful instructions that are executed during the run-time profiling.
The problem is to find among them the optimal transformation scheme that
maximizes the fitness function mentioned above. We optimized IPC with and without
overlapping energy goals.
Case 1: TS1 (α=1, β=0), only IPC is optimized and
Case 2: TS2 (α=1, β=1), both IPC and E are overlapping goals.
The two transformation schemes are depicted in Figure 4.3 to Figure 4.5 as TS1 (α=1,
β=0) and TS2 (α=1, β=1). The steps are calculated over 200 generations: Figure 4.3
shows the development of the total fitness (as a fraction of the maximum fitness), Figure 4.4
shows the fraction of IPC, and Figure 4.5 shows the fraction of points where the IPC and
energy overlapping conditions are fulfilled. Note that each successive point on these
graphs shows the improvement over the baseline version of the same code. They
are computed as follows:

f_norm = (f − f_baseline) / f_baseline,
η_norm = (η − η_baseline) / η_baseline,
ξ_norm = (ξ − ξ_baseline) / ξ_baseline.
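These normalized quantities are plain relative changes over the baseline; a one-line sketch:

```python
def normalized(value, baseline):
    """Improvement over the baseline code: (x - x_baseline) / x_baseline."""
    return (value - baseline) / baseline

assert normalized(1.5, 1.0) == 0.5     # 50% above the baseline
assert normalized(0.75, 1.0) == -0.25  # 25% below the baseline
```

The same helper serves f_norm, η_norm, and ξ_norm alike, since all three use the identical formula.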
Fig. 4.3: Development of fitness function for Case Study 1 in TS1 and TS2.
Fig. 4.4: Fraction of IPC for Case Study 1 in TS1 and TS2.
Fig. 4.5: Fraction of IPC and Energy overlapping for Case Study 1 in TS1 and TS2.
Figure 4.3 shows that in both cases the fitness values increase. The fitness function
in TS1 only maximizes the IPC, but energy is implicitly related to the IPC because
both are being optimized on the same hardware platform. Along the same lines, many
factors contribute to both IPC and energy, such as cache misses, functional unit
utilization, and other architectural attributes. The fitness curve for TS2 rises more slowly
than that for TS1. As TS1 was looking only for optimized IPC, any optimization
of the IPC implicitly leads to a reduction in cache misses through appropriate code block sizes,
higher functional unit utilization, and an increase in the scheduling factor. The goals
are different if the optimization is made only for energy saving: an increase in functional
unit utilization reduces the energy significantly, but it might lead to an increase in cache
misses. In this case the increase in cache misses is due to the compaction of code to achieve
higher parallelism and hence increase the functional unit utilization. The slower rise
in IPC for TS2 is observed in Figure 4.4, because here the objectives were both energy
saving and IPC maximization. Despite the importance set by α=1 in TS1, the applied
optimization schemes did optimize energy, as we expected and as depicted in Figure 4.5.
4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)
In this section we consider a more complete source-to-source transformation methodology
for an image compression application, NLIVQ, from our benchmark. Our aim is to
optimize the energy saving such that both IPC and architectural utilization are taken
into consideration. The purpose of the architectural-usage objective is to ensure that
on-chip cache and functional units are utilized efficiently. In order to enhance
architectural utilization, the application expression profile for CPU usage is needed
for every software application. This data contains the actual CPU utilization of each
CB composing the application. We can build a table of application CBs versus the
percentage of the maximum CPU load they impose. E.g., for a CPU operating at 100 MHz,
if a code block CB1 needs 40% of the maximum CPU operational time, then code block
CB1 can be said to consume 40 MHz of the CPU. We call such a table a Hertz Matrix
(HM); it lists the CPU cycles (in Hz) used by each function or code block. In an HM,
the distribution of CPU cycles thus corresponds to the distribution of the frequency
of code blocks inside the application. This requirement can be satisfied by selecting
a fixed number of high-frequency code blocks and then applying transformation schemes
to those code blocks. Using a fixed number of code blocks makes it possible to
calculate the proportional distribution of CPU cycles. Figure 4.6 shows the CPU usage
of different code blocks in the NLIVQ application on a processor running at 145 MHz.
The total workload of the application is 14.046% of the available CPU computation
power. The individual contribution of each code block can be computed as the ratio of
its CPU usage to the total application workload. E.g., the CPU usage of code block
F01 is (0.99767/14.046)·100 = 7.1%. The percentage CPU workload of some code
blocks is shown in Table 4.1.
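A minimal sketch of such a Hertz Matrix computation (function and variable names are ours; apart from F01's 0.99767 and the 14.046 total taken from the text, any other figures would be hypothetical):

```python
def hertz_matrix(cb_load_mhz, total_load_mhz):
    """Map each code block to its share (%) of the total application workload."""
    return {cb: load / total_load_mhz * 100 for cb, load in cb_load_mhz.items()}

# F01 consumes 0.99767 MHz of a 14.046 MHz total application workload
hm = hertz_matrix({"F01": 0.99767}, 14.046)
share_f01 = round(hm["F01"], 1)  # 7.1, matching Table 4.1
```

By construction, the shares of all CBs of an application sum to 100% of its workload.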
Let us suppose that the CPU utilization is divided into several cycle slots based on the CB
lifetime (i.e., the time each CB needs on the processor in terms of CPU cycles). It is then
easy to calculate the fraction of CPU cycles in each time slot with respect to the
maximum allowed CPU cycles for the whole application. The aim of the optimization is
to find a transformation scheme such that the obtained IPC respects the available CPU
cycles. We call this the architecture utilization optimization problem. We optimize energy
and IPC simultaneously, as well as functional unit utilization. We formulate the
optimization problem as an extension of Equation (4.1) as follows:

max_ρ f(ρ) = max_ρ [α·η(ρ) + β·ξ(ρ) + δ·ζ(ρ)]  (4.2)

subject to the individuals ρ, where the individuals have the same characteristics as explained in
Case Study 1.
The architectural constraint is modeled as a penalty term ζ(ρ), 0 ≤ ζ(ρ) ≤ 1, which
measures the observed CPU utilization (in terms of functional unit utilization) under an
Fig. 4.6: Fraction of CPU cycles per CB lifetime (CBLT) in the NLIVQ application (25 CBs are numbered from F01 to F25).
individual ρ. The other penalty terms are explained in the previous section. When the
weights α, β, δ are equal, all objectives are equally important. This means that we try
to find results that give reasonably good IPC, energy saving and functional unit
utilization on the underlying hardware.
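A sketch of how the weighted-sum objective of Equation (4.2) could be evaluated for one individual ρ (all names and the sample values of η, ξ, ζ are ours; in the framework they would come from the profiled IPC, energy-saving and functional-unit-utilization measurements):

```python
def fitness(eta, xi, zeta, alpha=1.0, beta=1.0, delta=1.0):
    """Weighted-sum fitness of Eq. (4.2): alpha*eta + beta*xi + delta*zeta."""
    assert 0.0 <= zeta <= 1.0, "architectural penalty term is bounded in [0, 1]"
    return alpha * eta + beta * xi + delta * zeta

# equal weights: all three objectives equally important
f_equal = fitness(eta=0.8, xi=0.5, zeta=0.6)
# TS09-style weights favour IPC and architecture utilization over energy saving
f_ts09 = fitness(eta=0.8, xi=0.5, zeta=0.6, alpha=0.6, beta=0.1, delta=0.9)
```

Raising one weight relative to the others steers the stochastic search towards the corresponding objective, as the weight studies below illustrate.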
We consider the NLIVQ example to elaborate the concept. NLIVQ is composed of 25
CBs, as shown in Figure 4.6. Based on the CBLT, we considered only the 16 code blocks
listed in Table 4.1 for the transformation; since they cover 84% of the CPU cycles,
this is an appropriate choice. The aim of reducing the CPU cycles is simply set as a
50% improvement over the original CPU cycles. For example, CB F01 will be optimized
towards 3.55% of the total CPU cycles (145 MHz). In order to demonstrate the working
of multicriteria optimization, we optimize the fitness function using different weight
values (α, β, δ) for the energy saving, functional unit utilization and IPC to optimize
the architecture utilization.
Figures 4.7 to 4.10 depict the development of the fitness function (as a fraction
of the maximum fitness), the fraction of the IPC, the fraction of the energy saving,
and the fraction of the Functional Unit Utilization (FUU). As discussed in Case Study 1,
these values are plotted relative to their values for the baseline version of the same
source code. They are computed as follows:

f_norm = (f − f_baseline) / f_baseline,  η_norm = (η − η_baseline) / η_baseline,  ξ_norm = (ξ − ξ_baseline) / ξ_baseline,  ζ_norm = (ζ − ζ_baseline) / ζ_baseline

There were 400 generations in each run, and the runs were repeated several times in
order to obtain statistically reliable results. For brevity, we selected only three
transformation schemes out of nine. Their weight settings are:
Code Blocks  Actual CPU Cycles (%)  Desired CPU Cycles (%)
F01 7.1% 3.55%
F03 6.8% 3.40%
F15 6.6% 3.31%
F24 6.4% 3.19%
F16 6.2% 3.11%
F08 6.0% 3.11%
F23 5.8% 3.02%
F21 5.3% 2.88%
F19 5.3% 2.67%
F20 5.2% 2.66%
F17 5.2% 2.59%
F05 4.9% 2.46%
F06 3.5% 1.75%
F14 3.4% 1.69%
F25 3.4% 1.68%
F13 3.0% 1.50%
Tab. 4.1: CBLT in CPU cycles for NLIVQ.
TS04(α=1, β=0, δ=1);
TS07(α=1, β=1, δ=1);
TS09(α=0.6, β=0.1, δ=0.9);
As shown in Figure 4.7, the fitness values do not grow as high in TS04 and TS07 as in
TS09. Although energy saving does not contribute to the fitness function in TS04 (β=0),
it still grows in TS04 as well as in TS09. As we discussed in Case Study 1, the
application of a transformation scheme implicitly affects the energy consumption as
well. The ripples in TS07 (see Figure 4.10) reflect the negative impact on the fraction
of energy saving achieved due to the application of the optimization scheme. Figure 4.8
and Figure 4.10 reveal an implicit relation between the FUU and the growth of the IPC.
This was expected for the NLIVQ algorithm; the two follow each other almost linearly,
but that may not be the case in general, as demonstrated in [30].
Careful weight adjustment may produce the desired results. For TS09, the choice
of weights was made after many experimental iterations; a random selection of
weights may require several compiler iterations. E.g., although TS04(α=1, β=0, δ=1)
aims for better IPC and architectural utilization (in terms of FUU), we observed
very poor fitness-function development and achieved a low fraction of IPC as well
as of FUU. In TS07(α=1, β=1, δ=1) all criteria are equally significant,
and it shows good average results for all criteria. We have found that the
choice of weights is also sensitive to the underlying application algorithm and the
coding style. E.g., for a typical MPEG-1 encoder optimization these weights were
selected as α=0.7, β=0.4, δ=0.1 (discussed in next section).
Fig. 4.7: Development of the fitness function for NLIVQ.
Fig. 4.8: Fraction of IPC for NLIVQ.
Fig. 4.9: Fraction of energy saving for NLIVQ.
Fig. 4.10: Fraction of functional unit utilization for NLIVQ.
In order to visualize the optimization results in terms of CPU load, the numerical test
results are listed in Table 4.2. The first column shows the candidate code
blocks; the desired percentage of CPU load is shown in the second column (it corresponds
to the third column of Table 4.1). The achieved fractions of the CPU target load of the
listed CBs for the three schemes TS04, TS07 and TS09 are shown in the third, fourth
and fifth columns, respectively. In TS09 the desired and optimized values are very close
to each other. To see this more clearly, Table 4.3 presents the sum of the absolute
differences between the desired and optimized values for each TS. Table 4.3 shows
very clearly that TS09 gives the best result with respect to obeying the architecture
utilization target: by this measure, TS09 achieves a clearly better result than TS04
(0.1850 versus 0.2860).
Code Blocks Target CPU Cycles (%) TS04 (%) TS07 (%) TS09 (%)
F01 3.55% 5.97% 5.47% 1.99%
F03 3.40% 5.71% 5.24% 1.90%
F15 3.31% 5.57% 5.10% 1.86%
F24 3.19% 5.36% 4.92% 1.79%
F16 3.11% 5.23% 4.79% 1.74%
F08 3.02% 5.08% 4.66% 1.69%
F23 2.88% 4.84% 4.44% 1.61%
F21 2.66% 4.46% 4.09% 1.49%
F19 2.65% 4.45% 4.08% 1.48%
F20 2.59% 4.36% 3.99% 1.45%
F17 2.59% 4.35% 3.99% 1.45%
F05 2.46% 4.13% 3.79% 1.38%
F06 1.75% 2.94% 2.70% 0.98%
F14 1.69% 2.84% 2.60% 0.95%
F25 1.68% 2.83% 2.59% 0.94%
F13 1.50% 2.53% 2.31% 0.84%
Tab. 4.2: Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.
Transformation Schemes TS04 TS07 TS09
Sum of Abs. Difference 0.2860 0.2271 0.1850
Tab. 4.3: Sum of absolute differences for TS04, TS07, TS09.
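The deviation measure of Table 4.3 can be sketched as below (the function name and the two-block sample values are ours, for illustration only; the reported sums use the full 16-block data):

```python
def utilization_deviation(target, achieved):
    """Sum of absolute differences between desired and achieved CPU-load fractions."""
    return sum(abs(target[cb] - achieved[cb]) for cb in target)

# illustrative two-block excerpt (as fractions, not percent)
target = {"F01": 0.0355, "F03": 0.0340}
ts09 = {"F01": 0.0199, "F03": 0.0190}
dev = utilization_deviation(target, ts09)  # 0.0156 + 0.0150 = 0.0306
```

A smaller sum means the transformation scheme tracks the desired architecture utilization more closely.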
4.3 Performance Comparison with GMIC
Although we discussed our scheme in detail for the NLIVQ image compression application,
the method is well suited for other compute- and data-intensive multimedia applications,
e.g., MPEG-1, a video codec. The application features are listed in Table 2.3. In
this section we optimize an MPEG-1 encoder for our target hardware and compare the
improvements with the iterative compilation scheme discussed in Chapter 3. Our aim
is to optimize the MPEG-1 application for its five most energy-cycle-hungry code blocks
among 35 CBs, such that the following constraints are fulfilled:
• the maximum processor speed is 180 MHz,
• the cycles available to the MPEG-1 encoder correspond to 120 MHz (60 MHz are reserved
for other activities, e.g., user interface, panel display etc.).
Table 4.4 shows the optimized fractions of IPC, energy saving and functional unit
utilization, and their improvement after optimization. The objective function of
Equation (4.2), with parameters α=0.7, β=0.4, δ=0.1, was used. There were 410 iteration
steps in each run, and the runs were repeated several times in order to obtain
statistically reliable results.
MPEG-1 encoder
Parameter              Results (MSIC)  % Improvement of MSIC over GMIC  % Search Time Reduction in MSIC compared to GMIC
IPC                    87%             15%
Energy Saving          23%             46%                              49%
Functional Unit Util.  77%             7%
Tab. 4.4: Performance comparison between GMIC and MSIC.
The results show clearly that the optimization scheme is beneficial even for compute-
and data-intensive multimedia applications. Energy saving and IPC are improved by
factors of 0.46 and 0.15, respectively. The improvement in functional unit utilization
is small because its weight δ was low. The salient feature of this scheme is its faster
convergence to a good solution compared to GMIC; slow convergence is otherwise a
significant impediment to the implementation of such offline optimization schemes.
4.4 Conclusions
In this chapter we considered the source-to-source transformation as a multicriteria
optimization problem in which IPC and energy saving are optimized simultaneously. The
optimization approach was demonstrated for real-time multimedia applications. The
optimized result is more reliable than that of traditional methods, where compilation
is usually performed off-line without considering the architectural benefits. We
demonstrated that architecture utilization is an important consideration while
satisfying the desired aims for IPC and energy constraints. IPC was taken into account
in the sense that the obtained IPC increases the target CPU utilization while reducing the
energy consumption. Compared to GMIC, the proposed methodology is faster, and
the target of the source-to-source transformation is to find an efficient source under
the given hardware constraints. Different kinds of properties were demanded as
constraints for the optimal solution, such as maximum computation power, low energy
consumption and effective target hardware utilization in terms of cache, functional
units and on-chip registers, in order to obtain a high architecture-application
correlation.
5. APPLICATION-ARCHITECTURE
CHARACTERIZATION
Embedded systems are software running on hardware. An efficient embedded system is
one in which the software application fully utilizes the underlying architecture to
deliver optimal energy-cycle performance. Application-architecture correlation is a
bidirectional process, matching the algorithmic structure with the hardware architecture
and vice versa [108] [109] [110]. The programmer benefits from this efficient mapping
and produces better source code. The mapping of algorithms and data structures onto
the machine architecture includes processor scheduling, memory maps and inter-processor
communication, to name a few. These activities are usually architecture dependent.
Optimal mappings are sought for various processor architectures, and their
implementation relies on efficient compiler and operating system support. Parallelism
can be exploited at algorithm design time, at programming time, at compile time and at
run time. In Chapter 3 and Chapter 4 we illustrated with examples how a native compilation
environment for VLIW processors can be exploited for efficient code generation, i.e.,
code that takes advantage of the benefits offered by the architecture. We
showed that a multi-layer profile mechanism can be used to optimize the embedded system
efficiency. Our energy cycle aware compilation framework (ECACF) is of great interest
to designers working in mobile computing embedded system development, whose
design goal is to measure the application behavior across different architectures.
Applications of similar functionality may yield similar expression profiles and hence
be suitable for similar hardware platforms. We tested our ECACF on diverse application
domains ranging from multimedia to bioinformatics. Despite the simplicity of our
methodology, the analysis of the large matrices of application expression profiles
under different levels of transformation on different architectures is not trivial and
requires advanced knowledge discovery processes. Several kinds of representations are
available to express the knowledge that can be extracted from AEPs. Knowledge discovery
in available data, also known as data mining, is the efficient discovery of previously
unknown, valid, potentially useful and understandable patterns in large volumes of data
[111]. Patterns in the data can be represented in many different forms, including
classification rules, association rules, clusters, sequential patterns, time series,
contingency tables and others [112]. Typically, the number of patterns generated is very large, but
only a few of these patterns are likely to be of any interest to the domain expert
analyzing the data, since many of the patterns are either irrelevant or obvious and do
not provide new knowledge. To increase the utility, relevance and usefulness of the
discovered patterns, techniques are required to reduce the number of patterns that need
to be considered. Techniques that satisfy this goal are broadly referred to as
interestingness measures [113] [112]. The analysis of relationship measures among
variables is a fundamental task at the heart of such interestingness measures.
In this chapter we propose to analyze AEP data with the help of multivariate statistical
techniques, in order to determine the application-architecture (A-A) correlation between
different applications on one platform and between similar applications across different
platforms. We use scatter plots, box plots, scree plots and Principal Component Analysis
(PCA) biplots to explore the correlation between an application and the underlying
hardware architecture. In the next section we introduce the basic concepts and
definitions used in our methodology.
5.1 Terminologies
5.1.1 Principal Component Analysis (PCA):
PCA is used for dimensionality reduction of a data set: it retains those characteristics
of the data set that contribute most to its variance by keeping lower-order principal
components (e.g., PC1, PC2, PC3) and ignoring higher-order ones (such as PC4, PC5
and higher). Such low-order components often contain the most important aspects of
the data, although this is not necessarily the case and depends on the application. PCA
is an orthogonal linear transformation that transforms the data into a new coordinate
system such that the greatest variance under any projection of the data comes to lie on
the first coordinate (called the first principal component), the second greatest variance
on the second coordinate, and so on. PCA is thus a way of identifying patterns in data
and expressing the data so as to highlight their similarities and differences.
Since patterns can be hard to find in high-dimensional data, where a graphical
representation is not available, PCA is a powerful analysis tool. We use PCA
biplots to visualize the black-box impact of the compiler and the hardware architecture
on the software applications.
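As a minimal sketch of the idea (pure Python, two attributes only; a real AEP analysis would use all eight attributes and a linear algebra library): centering the data, forming the covariance matrix and taking its eigenvalues gives the proportion of variance explained by each principal component, which is exactly what a scree plot displays.

```python
import math

def pca_2d_variances(xs, ys):
    """Proportion of variance explained by the two PCs of 2-D data.

    Centers the data, builds the 2x2 covariance matrix and solves its
    characteristic equation in closed form (possible only for 2x2).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    d = math.sqrt(tr * tr / 4 - det)       # discriminant of lambda^2 - tr*lambda + det
    lam1, lam2 = tr / 2 + d, tr / 2 - d    # eigenvalues, lam1 >= lam2
    return lam1 / (lam1 + lam2), lam2 / (lam1 + lam2)

# two strongly correlated attributes: PC1 should explain almost all variance
p1, p2 = pca_2d_variances([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.2])
```

For two nearly collinear attributes, p1 is close to 1, i.e., one principal component suffices — the same reasoning we apply below when reading the scree plots.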
5.1.2 Scree Plot:
The Scree plot shows the relative fit of each principal component. It does this by
plotting the proportion of the data variance that is fit by each component versus the
component number. The plot shows the relative importance of each component in
fitting the data. The numbers beside the points provide information about the fit
of each component. The first number is the proportion of the data variance that is
accounted for by the component. The second number is the difference in variance from
the previous component. The third number is the total proportion of variance accounted
for by the component and the preceding components.
The scree plot can be used to aid the decision about how many components are
useful. We make this decision by looking for an elbow (bend) in the curve. If
there is one (and often there is not), the components following the bend
account for relatively little additional variance and are good candidates to be ignored.
5.1.3 Box Plot:
The Box, Diamond and Dot plot uses boxes, diamonds and dots to form a schematic of a
set of observations. The schematic can give you insight into the shape of the distribution
of observations. Some Box, Diamond and Dot plots have several schematics. These side-
by-side plots help to see if the distributions have the same average value and the same
variation in values.
The plot always displays dots. They are located vertically at the value of the observations
shown on the vertical scale. The dots are ’jittered’ horizontally by a small random amount
to avoid overlap.
The plot can optionally display boxes and diamonds. Boxes summarize information about
the quartiles of the variable distribution. Diamonds summarize information about the
moments of the variable distribution. The box plot is a simple schematic of a variable
distribution. The schematic gives information about the shape of the distribution of the
observations. The schematic is especially useful for determining if the distribution of
observations has a symmetric shape. If the portion of the schematic above the middle
horizontal line is a reflection of the part below, then the distribution is symmetric.
Otherwise, it is not. In the box plot, the center horizontal line shows the median, the
bottom and top edges of the box are at the first and third quartile, and the bottom and
top lines are at the 10th and 90th percentile. Thus, half the data are inside the box,
half outside. Also, 10% are above the top line and another 10% are below the bottom
line. The width of the box is proportional to the total number of observations.
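The box-plot summary statistics described above can be computed with the Python standard library as a sketch (the sample data is ours; `statistics.quantiles` with `n=4` gives the quartiles and with `n=10` the 10th/90th percentiles used for the outer lines):

```python
import statistics

data = [2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 3.9, 4.4]

median = statistics.median(data)              # center line of the box
q1, _, q3 = statistics.quantiles(data, n=4)   # bottom and top edges of the box
deciles = statistics.quantiles(data, n=10)
p10, p90 = deciles[0], deciles[-1]            # bottom and top outer lines
```

Half of the observations fall between q1 and q3, and 10% lie above p90 and below p10, matching the schematic described above.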
5.1.4 Scatter Plot:
The scatter plot matrix is designed to display the relationship between all pairs of several
variables. The plot matrix consists of plot cells containing little scatter plots formed from
a pair of variables. The variables are represented by the X-axis and Y-axis of each plot
cell. The observed values on the two variables are represented by points in the little
scatter plot. Each point represents the values for one observation on two variables.
Normally distributed variables will have scatter plots which have the greatest density in
the middle, are roughly elliptical in shape, and have no obvious outliers. The scatter
plot matrix can be used as a control panel for selecting variables, pairs of variables and
triples of variables.
5.1.5 Differential Application Expression Profile (dAEP):
An application may behave differently in the following scenarios:
1. the same application is executed on two different platforms, and
2. two versions of the same application, compiled with different optimization
settings, are executed on the same platform.
In both scenarios we obtain two application expression profiles. We call the
performance difference between the two platforms an architecture-centric
differential application expression profile, while the performance difference between
the two versions is called a compiler-centric differential application expression profile.
An example of a compiler-centric dAEP is shown in Table 5.1. The table shows the
application expression profile (code size, execution time, energy consumption, slot
utilization etc.) across the transformation iterations Iter-1 to Iter-7. Each iteration
shown in Table 5.1 corresponds to the percent relative change with respect to the
original profile of the baseline version of MPEGdec. E.g., the successive iterations
grow the code size from 15% to 87%, while the first iteration reduces the execution
time by 6% (see −6% in the Iter-1 column). Similarly, the energy consumption decreases
by 1% (see −1% in the Iter-1 column). The Iter-7 column shows the optimal application
expression profile improvement over the baseline version; we call it the dAEP with
respect to the baseline version.
Relative Measures Iter-1 Iter-2 Iter-3 Iter-4 Iter-5 Iter-6 Iter-7
CodeSize 15% 28% 13% 26% 72% 79% 87%
ExecutionTime -6% -14% -19% -50% -66% -73% -80%
EnergyConsump. -1% -8% -4% -14% -19% -21% -23%
SlotUtilization 17% 19% 54% 45% 64% 70% 77%
SchedulingFactor 4% 4% 10% 17% 36% 40% 44%
HighwayUsage 94% 182% 221% 319% 327% 359% 395%
InstrucCacheMiss -6% -13% -9% -18% -28% -30% -33%
Tab. 5.1: MPEGdec profile for successive transformations [8].
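A compiler-centric dAEP entry, the percent relative change of each attribute against the baseline profile, can be sketched as follows (the function name is ours; the absolute measurements are hypothetical values chosen only to reproduce two Table 5.1 entries):

```python
def daep_percent(profile, baseline):
    """Percent relative change of each AEP attribute against the baseline."""
    return {k: round((profile[k] - baseline[k]) / baseline[k] * 100)
            for k in baseline}

# hypothetical absolute measurements (bytes, ms) reproducing two Iter-1 entries
baseline = {"CodeSize": 100_000, "ExecutionTime": 50.0}
iter1 = {"CodeSize": 115_000, "ExecutionTime": 47.0}
d = daep_percent(iter1, baseline)  # {'CodeSize': 15, 'ExecutionTime': -6}
```

Negative entries thus denote reductions relative to the baseline, as in the ExecutionTime and EnergyConsump. rows of Table 5.1.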
5.2 Application Characterization
Our objective is to characterize software applications on three hardware platforms. We
choose 20 applications from our benchmark set described in Chapter 2, and we use
application pseudonyms instead of their full names, as listed in Table D.1. We optimized
these applications for the following processors:
1. Philips TriMedia processor PNX1302
2. Analog Devices Blackfin ADSP533S
3. Intel PIII 850 embedded processor.
We obtain the AEPs with the help of our energy cycle aware compilation framework and
choose eight attributes in order to characterize the applications on each hardware
architecture:
• Cache Miss (CMISS)
• Code Size (CODESIZE)
• Highway Usage (HIUSE)
• Slot Utilization (SLTUTIL)
• Register Usage (REGUSE)
• Scheduling Factor (SCHFAC)
• Cycle Efficiency (CYCEFF)
• Energy Saving (ENSAVING)
To find any potential relation between these attributes, we plot them on a scatter plot
for the 20 applications. A visual inspection for direct or indirect relationships between
the attributes guides the characterization procedure further. Later, we use PCA biplots
to explore the impact of the compiler and the underlying hardware architecture on these
applications. We explain this with three case studies on the above-mentioned hardware
platforms.
5.2.1 Case Study 1
We obtain the eight application attributes for the TriMedia processor. It is a media
processor for high-performance multimedia applications dealing with high-quality video
and audio. An extended general-purpose CPU (called the DSPCPU) makes it
capable of implementing a variety of multimedia algorithms from popular multimedia
standards such as MPEG-1 and MPEG-2 [4]. A scatter plot for our applications is
shown in Figure 5.1; the underlying values are listed in Table D.2, Appendix D.
Figure 5.1 displays the relationship between all pairs of application attributes. The
plot matrix consists of cells, labeled with the application pseudonyms (A01-A20),
containing little scatter plots formed from pairs of attributes. The attributes are
represented by the X-axis and Y-axis of each cell, and each point represents the values
of one observation on two attributes.
We draw a line to signify any potential relation between two attributes. The vertical
lines in the REGUSE-versus-ENSAVING and REGUSE-versus-CYCEFF cells show that REGUSE
has lower variability than ENSAVING and CYCEFF. Similarly, a linear relation exists
between ENSAVING and CYCEFF. Although it is well known that SCHFAC and SLTUTIL are
linearly related [110], the inverse relation between the two in the corresponding cell
shows the compiler's inefficiency in exploiting the parallelism offered by the TriMedia
platform. SLTUTIL versus ENSAVING and SLTUTIL versus CYCEFF show a linear relation.
This is expected, because an increase in parallelism increases the cycle efficiency as
well as the energy saving [2].
A preliminary analysis of the scatter plot clearly indicates a potential relation between
the application profiles on the TriMedia architecture, which we analyze further with PCA.
We compute the PCA for the data shown in Table D.2, Appendix D. In order to identify
the number of necessary principal components, we plot them on a box plot, as shown in
Figure 5.3. The first principal component PC1 shows the maximum variability, whereas
PC2 and PC3 are the next largest principal components. All principal components and
their proportional contributions to the variability are depicted as a bar plot in
Figure 5.2. Although this plot is a discontinuous function, a dotted line is drawn between
the PCs to highlight the scree plot elbow (bend). It shows that PC1 and PC2 are
sufficient to represent the variability in the application expression profiles for the
TriMedia platform. We plot the PCA on a biplot to explore the application behavior
further. The biplot is drawn from PC1, PC2 and PC3, which cover approximately 90% of
the data variability, as shown in Figure 5.4.
Generally PCA is used to reduce the data dimension; here, we focus on a qualitative
analysis of the biplot. To the best of our knowledge, this is the first attempt to explore
application expression profiles and application-architecture correlations on PCA biplots.
We first explain how we analyze the biplot shown in Figure 5.4.
• Application names are mentioned as solid dots.
• Thick lines show the Application Expression Vectors (AEV), they correspond to
eight application attributes.
• Thin lines show the principal components (PC1, PC2, PC3).
Fig. 5.1: Scatter plot for 20 applications at the TriMedia processor.
• Though the biplot is three dimensional, it is depicted here so as to show the
maximum association between the AEVs, the PCs and the applications.
The spread of the application dots around the PCs and AEVs shows how much an application
exploits the architectural benefits. The plane formed by all of them corresponds
to the architectural liberty offered to the compiler, as well as to the application, in
terms of AEVs. An embedded system runs an application binary, which is the outcome of an
application build flow environment (see Figure 2.4). The PCA biplot helps reveal
potential deficiencies in the compiler as well as in the application coding.
Fig. 5.2: PCA Scree plot for 20 applications at the TriMedia processor.
Fig. 5.3: PCA box plot for 20 applications at the TriMedia processor.
Fig. 5.4: PCA biplot for 20 applications at the TriMedia processor.
All AEVs heading in the same direction support each other, e.g.:
HIUSE and SLTUTIL;
ENSAVING, SCHFAC and REGUSE.
All AEVs heading in opposite directions negate each other, e.g.:
CYCEFF and CMISS.
Applications close to AEVs support them; e.g., A05, A12, A18 and A03 exploit
the parallelism offered by the TriMedia platform, as the vectors HIUSE and SLTUTIL
point in the same direction as ENSAVING and CYCEFF. Applications A09 and A15 are the
most energy efficient, while A04, A09, A14 and A20 are cycle efficient. Despite the
aggressive transformation scheme, the applications A17, A01, A06, A13, A16 and A02 are
not able to exploit the architectural benefits. The increase in cache misses has led to
a decrease in cycle efficiency, since these applications are located exactly opposite to
the CYCEFF expression vector. Similarly, applications A19, A07, A06, A13 and A02 are
energy inefficient. These applications are dominated by branch operations and hence
incur a higher number of cache misses, which eventually leads to more energy
consumption; the TriMedia architecture lacks a branch prediction unit. Our ECACF
has produced very compact code for applications A04 and A11. These applications
take advantage of TriMedia custom operations [4], which offer many single
commands to perform array operations on data streams. For the data manipulation in
many algorithms, however, 32-bit data and operations are wasteful of expensive silicon
resources. Important multimedia applications, such as the decompression of MPEG
video streams, spend significant amounts of execution time dealing with eight-bit data
items. Using 32-bit operations to manipulate small data items makes inefficient use of
the 32-bit execution hardware. If these 32-bit resources could instead operate on four
eight-bit data items simultaneously, performance would improve by a significant factor
at only a tiny increase in implementation cost.
Our aim in characterizing applications on TriMedia is mainly concerned with the porting
issue. The trend is increasingly towards assembling off-the-shelf hardware and porting
applications from Independent Software Vendors (ISV). To find an application that
suits the target hardware, the PCA biplot proved to be a useful tool. As analyzed above,
applications such as A17, A01, A06, A13, A16 and A02 are not well suited for the
TriMedia architecture because of their branch-dominated operations, while applications
such as A05, A12, A18, A10, A09, A14 and A20 are very well suited for TriMedia
processors. We can conclude that applications dominated by large matrix operations,
data streaming and localized operations achieve better performance both in terms of
cycle efficiency and energy saving.
Fig. 5.5: Scatter plot for 20 applications at the Blackfin processor.
5.2.2 Case Study 2
We optimized our 20 applications for the Blackfin processor. The results are listed in
Table D.3, Appendix D. The relationships between the application attributes are shown in
Figure 5.5. The vertical lines in the REGUSE-versus-ENSAVING, -CYCEFF and -CODESIZE
cells show that REGUSE has lower variability than ENSAVING, CYCEFF and CODESIZE.
Similarly, a linear relation exists between ENSAVING and CYCEFF. This behavior is
similar to what we observed on TriMedia (Case Study 1). A linear relation is also
observed between CMISS and CODESIZE, although in general there need be no apparent
relation between the two, because cache misses (CMISS) are a run-time behavior of the
application, while code size (CODESIZE) is a static feature.

Fig. 5.6: PCA biplot for 20 applications at the Blackfin processor.

In our
opinion, this behavior is an outcome of the Blackfin compiler. During optimization,
it increases the size of the code to handle branch operations. Attempts to increase
the spatial access in iterative function calls and the temporal access in multiply
nested loops result in an increase in code size. Although this reduces the cycle count,
the increase in cache misses leads to an increase in energy consumption. Apparently,
the Blackfin compiler exploits branching for better cycle performance, but at the cost
of energy performance.
The PCA biplot in Figure 5.6 shows a different response for all applications compared
with Figure 5.4 for the TriMedia architecture. The expression vectors CODESIZE, HIUSE
and CMISS are very well correlated with each other as well as with PC2, as we have
already commented, whereas SLTUTIL, ENSAVING, SCHFAC, CYCEFF and REGUSE
are well correlated with each other as well as with PC1. PC1 corresponds to the maximum
variability in terms of architectural usage by the applications. The biplot shows that the
Blackfin processor offers better performance for A03, A14, A10, A18, A20, A01, A02
and A09, while applications on the left side of the biplot, such as A12, A07, A19, A04,
A11, A17, A13 and A05, do not exploit any architectural benefit offered by the processor.
Most of these applications are data dominated and involve pointer operations. This
points to the poor ability of native compilers to handle pointer operations. If the
aim is to port these applications to such a processor, it is recommended to transform the
underlying algorithm into small functions with localized array operations. Although the
energy saving shown in Table D.3 is not very promising, in practice the Blackfin is known
as an energy-efficient processor. We assume that the energy performance in practice is
gained by using the Power Management Unit (PMU) available in the Blackfin processor.
It may be noted that our ECACF optimizes a given application explicitly at the source
code level (i.e., source-to-source transformation); during optimization iterations we always
turn the native power optimization unit off. The primary advantage of this methodology
is that we first optimize the application binary and can later reduce energy further by
scheduling the power management units.
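The principal component analysis behind these biplots can be sketched in a few lines. The matrix below is a stand-in for a measured AEP table (rows are applications, columns the eight attributes); the values are illustrative, not the measurements of Appendix D.

```python
import numpy as np

# Hypothetical AEP matrix: rows = applications, columns = the eight
# attributes (SCHFAC, REGUSE, HIUSE, SLTUTIL, CMISS, CODESIZE,
# ENSAVING, CYCEFF). Values here are illustrative only.
aep = np.array([
    [0.16, 0.02, 0.07, 0.84, 0.16, 0.28, 0.41, 0.80],
    [0.14, 0.11, 0.10, 0.98, 0.26, 0.20, 0.48, 0.74],
    [0.02, 0.01, 0.39, 0.02, 0.27, 0.43, -0.03, -0.41],
    [0.08, 0.02, 0.09, 0.77, 0.19, 0.27, 0.35, 0.53],
])

# Center each attribute, then take the SVD: the right singular
# vectors give the principal directions (the expression vectors of
# the biplot), the left ones the application scores.
centered = aep - aep.mean(axis=0)
scores_u, sing_vals, vt = np.linalg.svd(centered, full_matrices=False)

scores = scores_u * sing_vals          # application coordinates
loadings = vt                          # attribute (AEV) directions
explained = sing_vals**2 / np.sum(sing_vals**2)
print("variance explained by PC1, PC2:", explained[:2])
```

Plotting the first two columns of `scores` together with the first two rows of `loadings` reproduces the biplot layout used throughout this chapter.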
5.2.3 Case Study 3
We optimized our 20 applications for a general-purpose Intel PIII 850 processor; the
results are listed in Table D.4, Appendix D. This processor is implemented with a baseline
version of a VLIW architecture [4]. The relationship between the application attributes
is shown in Figure 5.7. Unlike the TriMedia and Blackfin processors, we do not observe
a large correlation between the attributes. For the sake of completeness, we have shown
their PCA biplot as well (see Figure 5.8). There is a large variability in the application
spread; we do not observe any cluster of applications that exploits any of the eight
attributes explicitly. Moreover, the code size is a big issue in the PIII native compilation
environment. HIUSE is exactly opposite to the expression vector of CMISS; this shows
the optimal use of internal buses, which reduces cache misses and eventually energy. A
closer look into the applications A18, A14 and A13 reveals that, being audio codecs, they
perform most of their operations in local loops on small chunks of data. The sizes of the
instruction blocks and data blocks are well matched to the instruction and data caches.
We assume that this feature is owed to the native compiler, which ensures a local
optimization rather than a global or inter-procedural one.
5.3 Architecture-Centric Application Characterization
In the previous section we explored the application variability on a given hardware archi-
tecture. In this section we explore the application portability across platforms. We
observe the differential application expression profile (dAEP) between our three target
hardware platforms; the basic idea is depicted in Figure 5.9. The absolute difference
between the two AEPs across two different platforms is used as the dAEP. As the difference
in application behavior is taken across two platforms, we call it the architecture-centric
application expression profile. We obtain the dAEP for the following scenarios:
Fig. 5.7: Scatter plot for 20 applications at the PIII 850 processor.
1. Across the TriMedia processor and the Blackfin processor for 20 applications.
2. Across the Blackfin processor and the PIII 850 processor for 20 applications.
3. Across the TriMedia processor and the PIII 850 processor for 20 applications.
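The dAEP itself is a simple element-wise computation. The sketch below is loosely based on the A01 rows of Tables D.2 and D.3, with rounded, illustrative values:

```python
# Differential application expression profile (dAEP): the
# element-wise absolute difference between one application's AEP
# on two platforms. The attribute values below are illustrative.
ATTRIBUTES = ("SCHFAC", "REGUSE", "HIUSE", "SLTUTIL",
              "CMISS", "CODESIZE", "ENSAVING", "CYCEFF")

def daep(aep_platform_a, aep_platform_b):
    """Absolute per-attribute difference between two AEPs."""
    return [abs(a - b) for a, b in zip(aep_platform_a, aep_platform_b)]

# Hypothetical AEP of one application on TriMedia and Blackfin.
aep_trimedia = [0.08, 0.04, 0.16, 0.50, 0.43, 0.33, 0.14, 0.15]
aep_blackfin = [0.16, 0.02, 0.07, 0.84, 0.16, 0.28, 0.41, 0.80]

for name, d in zip(ATTRIBUTES, daep(aep_trimedia, aep_blackfin)):
    print(f"{name:9s} {d:.2f}")
```

A small dAEP entry means the application expresses that attribute similarly on both platforms; large entries flag attributes that do not port well.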
We analyze the behavior of the eight attributes explained in the previous section. Table
D.5, Table D.6 and Table D.7 in Appendix D list the dAEPs for the three scenarios. The
PCA biplots for each scenario are shown in Figure 5.10, Figure 5.11 and Figure 5.12,
respectively.
Fig. 5.8: PCA biplot for 20 applications at the PIII 850 processor.
Fig. 5.9: Differential AEP across three hardware platforms.
Fig. 5.10: PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.
Applications close to the AEVs are those favored by both platforms; for example, A01,
A06 and A09 on average perform better in cycle efficiency on both the TriMedia and the
Blackfin processors (see Figure 5.10). Similarly, A14, A04 and A11 are not well suited
to either platform, due to a higher cache miss rate. Application clusters on the left are
not suited for portability; they perform well on only one of the two platforms. For
example, A10 and A15 are energy and cycle efficient on the TriMedia processor but show
poor performance on the Blackfin processor. Here, the biplot clearly identifies the cluster
of applications well suited for portability across the two platforms.
Figure 5.11 shows that A12, A14, A01, A08, A20 and A09 are both energy and cycle
efficient on both the Blackfin and PIII 850 processors, while the application cluster A02,
A18, A05, A11, A10 and A15 is not well suited for portability.
Fig. 5.11: PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.
Figure 5.12 shows a high portability between the TriMedia and PIII 850 processors. The
AEVs SLTUTIL, REGUSE, SCHFAC, ENSAVING and CYCEFF lie very close to each
other and head in the same direction; their contribution to PC1 is also very high. The
applications in their vicinity, e.g., A11, A18, A02, A06, A17, A04, A03, A20, A09 and
A12, are very well suited for portability, while the application cluster on the left, containing
A01, A07, A08, A15, A10, A05, A14, A13 and A19, performs well on only one of the
two platforms and shows poor portability.
Fig. 5.12: PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.
5.4 Conclusions
In this chapter we show that our energy cycle aware compilation framework (ECACF)
is of great interest to designers working in mobile computing embedded system devel-
opment, whose design goal is to measure application behavior across different archi-
tectures. Applications of similar functionality may yield similar expression profiles,
and hence can be suitable for similar hardware platforms. We introduce a new methodol-
ogy to evaluate application portability using multivariate statistics. We demonstrate
how box plots, Scree plots, and PCA biplots can be used to characterize an application
on a given hardware architecture. We expose the details of our methodology by ex-
ploring the AEPs of diversified applications across three different hardware platforms.
Finally, we demonstrate how the dAEP can be used to assess legacy code portability
across platforms.
6 CONCLUSIONS
In this thesis we propose a framework in which software applications optimally utilize
the hardware architecture to deliver energy-cycle performance within user-defined con-
straints. Our energy-aware framework in [25] meets this demand by incorporating the
following features into native multimedia DSP compilation environments.
1) The framework transforms the legacy application source code into optimized ’C’ source
code, taking advantage of the different slacks appearing in the application-to-binary devel-
opment hierarchy.
2) Unlike conventional techniques, the ’C’ source code is iteratively compiled for different
performance goals, both in terms of execution time and energy dissipation.
3) Our post-profiling techniques published in [26] evaluate the application performance
not only at the compilation layer (as a conventional compiler does) but also at the
scheduling layer, linker layer, machine code generation layer and, finally, the loader layer.
4) We measure the real-time performance of the application running on actual hardware.
These measured parameters are further used to tune the transformation scheme for the
legacy software application.
5) We tested our framework on different applications belonging to diversified industrial
domains such as audio transcodecs [27], video transcodecs [8], speech codecs, and
bioinformatics [28] [29].
6) The work is further extended in [30] [27] to characterize application-architecture
correlations, which are well suited for a pre-design assessment of an embedded system
design. It answers the question of whether a given hardware architecture is an appropriate
choice for a given multimedia software application.
APPENDICES
A. LIST OF APPLICATION EXPRESSION PROFILE (AEP) MONITORS
Name: Processor Frequency
Definition: The operating frequency of a multimedia processor
Location: VDF
Type: Static
Range: Typically 100 MHz to 233 MHz (depends on the given hardware architecture)
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Execution Time
Definition: The total execution time of a software application for a given input test
vector.
Location: Target HW
Type: Dynamic
Range: Measured in seconds
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Energy
Definition: Amount of energy consumed by the software application for a given input
test vector.
Location: Target HW
Type: Dynamic
Range: Measured in milli joules (mJ)
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Scheduling Factor
Definition: Computed by dividing the infinite machine cycle time by the finite machine
cycle time [110] [114] [115].
Location: Transformation Engine and Scheduler
Type: Dynamic
Range: 0 to 1
Level: Secondary
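A minimal sketch of this computation (the cycle counts are placeholders, not measurements):

```python
def scheduling_factor(infinite_machine_cycles, finite_machine_cycles):
    """Ratio of the cycle count on an idealized machine with
    unlimited resources to the count on the real (finite) machine.
    A value near 1 means the scheduler loses little to resource
    constraints."""
    return infinite_machine_cycles / finite_machine_cycles

# Hypothetical cycle counts for one code block.
print(scheduling_factor(1200, 4800))  # 0.25: heavily resource-bound
```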
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Average cycles on finite machine
Definition: The finite machine cycle time averaged according to the probabilities of
execution of the block of code [110] [114] [115].
Location: Target HW
Type: Dynamic
Range: Measured in cycles
Level: Secondary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Useful Issues per Cycle
Definition: The number of useful operations issued dynamically per dynamic instruction [110] [114] [115].
Location: Target HW
Type: Dynamic
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Slot Utilization
Definition: The percentage of VLIW issue slots filled with useful operations [110] [114] [115].
Location: Transformation Engine
Type: Dynamic
Range: Measured in percentage
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Ideal cycles
Definition: The estimated infinite machine cycle time for static code.
Location: Target HW
Type: Dynamic
Range: Measured in cycles
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: JPMO
Definition: Joules per million of operations, computed as the measured energy divided
by the number of operations in millions
Location: Target Hardware
Type: Dynamic
Range: Measured in joules per million of operations
Level: Secondary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: OPC
Definition: Operations per cycle, computed as the number of operations per total number
of executed cycles [110].
Location: Scheduler
Type: Dynamic
Range: Measured in operations per cycle (integer)
Level: Secondary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: IPC
Definition: Instructions per cycle, computed as the number of instructions per total
number of executed cycles [110].
Location: Native simulator (for e.g., tmSim for TriMedia TM130x)
Type: Dynamic
Range: Measured in instructions per cycle (integer)
Level: Secondary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Architecture Affinity Number (AAN)
Definition: It is computed as
AAN = (number of static operations) / ((number of static instructions) × (number of issue slots))    (A.1)
Location: Transformation Engine
Type: Dynamic
Range: 0 to 1
Level: Secondary
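Equation (A.1) can be sketched directly; the operation and instruction counts below are placeholders:

```python
def architecture_affinity_number(static_ops, static_instrs, issue_slots):
    """AAN per Eq. (A.1): static operations divided by static
    instructions times the number of VLIW issue slots. A value of
    1.0 means every issue slot of every instruction carries an
    operation."""
    return static_ops / (static_instrs * issue_slots)

# Hypothetical counts for a 5-issue-slot VLIW (e.g. TriMedia).
print(architecture_affinity_number(3000, 1000, 5))  # 0.6
```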
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Register Usage
Definition: The maximum number of live registers at any time during program execution [110].
Location: Scheduler
Type: Dynamic
Range: Integer number (depends on VLIW architecture)
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
The following profile monitors are obtained with the help of the tool ’csource’ from [116]
Name: Code Size
Definition: Size of the executable binary
Location: Linker
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Nesting
Definition: Maximum nesting level of control constructs
Location: Compiler
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Paths
Definition: Number of possible paths, not counting abnormal exits or gotos
Location: Compiler
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Cyclomatic
Definition: The measure of the complexity of a function’s decision structure. The
cyclomatic complexity is also the number of basis, or independent, paths through a
module. Also sometimes called the McCabe Complexity after its originator.
Location: Compiler
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Modified
Definition: Cyclomatic complexity, except that individual case statements are not
counted; the entire switch counts as 1
Location: Compiler
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Strict
Definition: Cyclomatic complexity, except that logical operators are counted as 1
Location: Compiler
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Name: Essential
Definition: Measure of the amount of unstructured code in a function
Location: Compiler
Type: Static
Range: Integer
Level: Primary
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
B VLIW DESCRIPTOR FILE (VDF) FORMAT
Our ECACF is generic with respect to the VLIW architecture. In this section we explain
the structure of our VDF, which is similar to [4]. In order to compile for a specific
target machine, the compilation tools are parameterized through a textual description
known as the VLIW descriptor format, which can be integrated as shown in Figure 2.3.
As the entries in the VDF are generic, any VLIW processor description can be added to
our VDF. The different fields of the VDF are explained below:
Operation:
The operation section defines operation names and the properties associated with them.
The section consists of the reserved keyword OPERATIONS, followed by any number of
operation groups. Each operation group consists of the arity, operation properties, and
operation names in the operation group.
E.g.,
UNARY PARAMETRIC (UNSIGNED 0 TO 127) iaddi isubi
indicates that both iaddi and isubi take a single argument, and that both operations
contain a parameter that is unsigned and in the range 0 to 127.
Pseudo-Operation:
The pseudo-operation section consists of the reserved keyword PSEUDO-OPERATIONS,
followed by any number of pseudo-operation mapping rules. Each rule consists of the
tree operation name, followed by a string in quotation marks that defines the mapping,
ending with a semicolon to terminate the rule entry. The string defines how an operation
is expanded. Each use of the pseudo-operation is rewritten to the form specified by the
string. E.g., the following string defines the iles operation (integer less than) as igtr
(integer greater than), with its arguments swapped:
iles ”igtr 21” ;
Unit Type:
The unit type section defines a functional unit type in the machine (such as data mem-
ory unit, integer arithmetic/logic unit, floating point divider unit, and so forth). It
consists of the keyword FUTYPE, followed by the name of the unit type, followed by
unit type properties, then the keyword OPERATIONS, followed by a list of all operations
implemented in that functional unit type. E.g.,
FUTYPE shifter LATENCY 1 OPERATIONS asli roli asri lsri asl rol asr lsr ;
Target Machine:
This section describes the target machine configuration. The ISSUESLOTS entry defines
the number of issue slots in the machine. The REGISTERS entry declares the size of
the register file. The WRITEBUSES entry defines the number of writeback buses used
to write back the results of computations into the register file.
E.g., a typical description of the TriMedia architecture [4] is as follows:
MACHINE
ISSUESLOTS 5
REGISTERS 128
WRITEBUSES 5
FUTYPE const SLOT 1 2 3 4 5
FUTYPE alu SLOT 1 2 3 4 5
FUTYPE dmem SLOT 4 5
FUTYPE shifter SLOT 1 2
FUTYPE dspalu SLOT 1 3
FUTYPE branch SLOT 2 3 4
Instruction Format: The instruction format section consists of the reserved IFORMAT
keyword, followed by the bitfields subsection and then the opcodes subsection. The
bitfields subsection consists of the keyword BITFIELDS and six bitfield length specifiers.
These bitfield specifiers can appear in any order, though the assembler always packs the
bitfields in a particular order. This section specifies the bitfield sizes (in bits).
E.g., the IFORMAT section description for the TriMedia architecture [4] is as follows:
OPCODES
iimm 95
uimm 95
iadd 12
isub 13
imax 15
imin 14
igtr 17
igeq 16
ieql 37
nop 255
Readers are encouraged to refer to [103] [4] [81] [104] for further details about the
entries of the VDF structure.
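As a sketch, the MACHINE section shown above could be parsed as follows; only the entries appearing in the example (ISSUESLOTS, REGISTERS, WRITEBUSES and FUTYPE ... SLOT lines) are handled, and the function name is our own:

```python
def parse_machine_section(text):
    """Parse the MACHINE block of a VDF into a dictionary.
    Handles only the entries shown in the example: ISSUESLOTS,
    REGISTERS, WRITEBUSES, and FUTYPE ... SLOT ... lines."""
    machine = {"futypes": {}}
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "MACHINE":
            continue
        if tokens[0] in ("ISSUESLOTS", "REGISTERS", "WRITEBUSES"):
            machine[tokens[0]] = int(tokens[1])
        elif tokens[0] == "FUTYPE" and "SLOT" in tokens:
            slot_idx = tokens.index("SLOT")
            name = tokens[1]
            machine["futypes"][name] = [int(s) for s in tokens[slot_idx + 1:]]
    return machine

vdf = """
MACHINE
ISSUESLOTS 5
REGISTERS 128
WRITEBUSES 5
FUTYPE dmem SLOT 4 5
FUTYPE branch SLOT 2 3 4
"""
m = parse_machine_section(vdf)
print(m["ISSUESLOTS"], m["futypes"]["branch"])  # 5 [2, 3, 4]
```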
C. USER CONSTRAINTS FILES (UCF) FORMAT
The UCF format has the following fields:
Processor Operating Frequency: The actual operating frequency of the processor,
although the processor could be driven at a much higher frequency. (Range = processor
dependent)
Main Memory Size: The size of the attached main memory, although the actual memory
size that can be glued to the processor chip may be higher. (Range = processor dependent)
Slot Utilization: The percentage of slot utilization for an application; the higher the
percentage, the longer the application compilation time. Our ECACF may not meet this
parameter, because it is directly related to the parallelism offered by the application
itself. (Range = 0 to 100%)
Total CPU Load: The workload offered by an application to the CPU. The user sets this
parameter based on his constraints for available CPU cycles, to increase CPU productivity.
(Range = 0 to 100%)
Total Energy Dissipation: The energy consumed by an application. The user sets this
parameter based on his constraints for the available battery budget, to increase the energy
saving. (Range = 0 to 100%)
Scheduling Factor: Associated with CPU utility; briefly, the higher the number, the
greater the CPU utilization. It contributes significantly to energy saving. (Range = 0 to 1)
Tree Depth: The number of times a tree can be replicated at the execution exit is called
the tree depth. It increases the parallelism in a VLIW processor and reduces the execution
time. (Range = 0 to 1)
Unfolding Depth: The number of times a loop can be unfolded/unrolled. (Range =
1, 2, 4, 8)
Search Time Out: Time out for the search algorithm. (Range = user dependent)
Search Generations: The maximum number of generations, used only in our MSIC
algorithm. (Range = user dependent)
Transformation Schemes: The set of transformation schemes to be used by our
ECACF; this entry is generally provided by the user. (Range = user dependent)
C.1 UCF for MPEG-1 encoder example in Section 3.3
Processor Operating Frequency: 180 MHz
Main Memory Size: 32 Mbyte
Slot Utilization: 80%
Total CPU Load: 40%
Total Energy Dissipation: 14,000 mJoules
Scheduling Factor: 0.5
Tree Depth: 0.5
Unfolding Depth: 8
Search Time Out: Manual
Search Generations: None
Transformation Schemes: TS1, TS2, TS3, TS4
C.2 UCF for NLIVQ example in Section 4.2.3
Processor Operating Frequency: 145 MHz
Main Memory Size: 16 Mbyte
Slot Utilization: 80%
Total CPU Load: 5%
Total Energy Dissipation: 5,000 mJoules
Scheduling Factor: 0.5
Tree Depth: 0.5
Unfolding Depth: 8
Search Time Out: None
Search Generations: 700
Transformation Schemes: TS01, TS02, TS03,..., TS17
D APPLICATION ATTRIBUTES
Application Pseudonym Domain
A01 G728enc Speech
A02 GENESSPLICER Bioinformatics
A03 TRIGRSCAN Bioinformatics
A04 MPEGdec Video
A05 H263enc Video
A06 M100 General
A07 G728dec Speech
A08 NLIVQ Image Processing
A09 GENIE Bioinformatics
A10 H263dec Video
A11 M64 General
A12 MPEGenc Video
A13 GSMdec Speech
A14 GSMenc Speech
A15 GRAIL Bioinformatics
A16 G723enc Speech
A17 G723dec Speech
A18 MP3enc Audio
A19 G728enc Speech
A20 MP3dec Audio
Tab. D.1: Pseudonyms for 20 applications.
Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff
A01 0.08 0.04 0.16 0.5 0.43 0.33 0.14 0.15
A02 0.04 0.02 0.09 0.59 0.39 0.24 0.16 0.26
A03 0.02 0 0.19 0.66 0.34 0.28 0.21 0.27
A04 0.15 0.09 0.22 0.36 0.2 0.18 0.23 0.16
A05 0.08 0.04 0.39 0.8 0.35 0.29 0.36 0.33
A06 0.07 0.05 0.26 0.24 0.38 0.28 0.06 -0.13
A07 0.12 0.02 0.15 0 0.39 0.18 -0.08 -0.24
A08 0.15 0.07 0.02 0.7 0.2 0.44 0.34 0.5
A09 0.12 0.02 0.05 0.59 0.13 0.21 0.29 0.46
A10 0.06 0.02 0.3 0.98 0.26 0.34 0.45 0.56
A11 0.1 0.02 0.06 0.27 0.07 0.05 0.16 0.22
A12 0.02 0.02 0.24 0.74 0.23 0.43 0.32 0.34
A13 0.12 0 0.1 0.36 0.39 0.35 0.07 0.07
A14 0.01 0 0.17 0.85 0.03 0.01 0.43 0.67
A15 0.15 0.11 0.09 0.77 0.45 0.16 0.3 0.48
A16 0.2 0.16 0.24 0.08 0.17 0.47 0.16 -0.13
A17 0.13 0.04 0.34 0.15 0.26 0.46 0.09 -0.21
A18 0.15 0.11 0.37 0.61 0.2 0.39 0.38 0.25
A19 0.17 0 0.03 0.08 0.49 0.36 -0.1 -0.18
A20 0.11 0.07 0.16 0.84 0.06 0.15 0.48 0.67
Tab. D.2: AEP for optimized 20 applications at the TriMedia processor.
Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff
A01 0.157 0.020 0.067 0.838 0.159 0.282 0.413 0.797
A02 0.140 0.111 0.100 0.980 0.257 0.201 0.484 0.737
A03 0.197 0.055 0.052 0.605 0.462 0.122 0.203 0.375
A04 0.032 0.024 0.320 0.268 0.460 0.364 0.016 -0.221
A05 0.049 0.027 0.126 0.378 0.217 0.107 0.143 0.171
A06 0.125 0.001 0.175 0.535 0.108 0.001 0.289 0.415
A07 0.020 0.011 0.234 0.061 0.351 0.246 -0.063 -0.287
A08 0.075 0.018 0.040 0.498 0.235 0.407 0.183 0.254
A09 0.197 0.106 0.012 0.825 0.000 0.280 0.520 0.784
A10 0.080 0.017 0.090 0.769 0.186 0.274 0.345 0.534
A11 0.148 0.011 0.081 0.068 0.475 0.071 -0.093 -0.145
A12 0.008 0.007 0.386 0.023 0.271 0.434 -0.027 -0.412
A13 0.115 0.022 0.357 0.314 0.241 0.116 0.166 0.024
A14 0.159 0.097 0.224 0.620 0.284 0.152 0.322 0.361
A15 0.158 0.117 0.263 0.795 0.445 0.493 0.355 0.333
A16 0.098 0.029 0.072 0.380 0.121 0.215 0.194 0.250
A17 0.073 0.051 0.159 0.265 0.313 0.266 0.075 -0.021
A18 0.160 0.008 0.315 0.783 0.167 0.079 0.423 0.539
A19 0.190 0.052 0.095 0.080 0.434 0.303 -0.034 -0.154
A20 0.044 0.033 0.328 0.885 0.174 0.155 0.443 0.534
Tab. D.3: AEP for optimized 20 applications at the Blackfin processor.
Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff
A01 0.184 0.098 0.026 0.779 0.49 0.044 0.286 0.651
A02 0.106 0.013 0.022 0.428 0.236 0.295 0.155 0.296
A03 0.012 0.004 0.288 0.193 0.459 0.455 -0.042 -0.368
A04 0.045 0.025 0.089 0.862 0.481 0.081 0.256 0.597
A05 0.11 0.048 0.165 0.207 0.313 0.18 0.06 -0.037
A06 0.02 0.002 0.361 0.354 0.185 0.314 0.165 -0.025
A07 0.133 0.081 0.362 0.011 0.145 0.139 0.096 -0.239
A08 0.19 0.113 0.092 0.557 0.489 0.37 0.201 0.278
A09 0.095 0.065 0.296 0.384 0.485 0.229 0.099 -0.054
A10 0.187 0.005 0.066 0.772 0.077 0.398 0.42 0.763
A11 0.128 0.1 0.178 0.496 0.438 0.19 0.18 0.2
A12 0.13 0.083 0.367 0.874 0.056 0.104 0.549 0.752
A13 0.18 0.032 0.358 0.445 0.311 0.201 0.227 0.148
A14 0.12 0.016 0.348 0.593 0.215 0.393 0.308 0.272
A15 0.075 0.032 0.299 0.589 0.156 0.048 0.313 0.403
A16 0.095 0.036 0.017 0.632 0.002 0.222 0.357 0.672
A17 0.012 0.006 0.083 0.916 0.314 0.309 0.33 0.668
A18 0.192 0.12 0.359 0.347 0.005 0.324 0.356 0.198
A19 0.169 0.052 0.259 0.309 0.402 0.065 0.112 0.039
A20 0.028 0.018 0.25 0.251 0.286 0.435 0.065 -0.16
Tab. D.4: AEP for optimized 20 applications at the PIII 850 processor.
Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff
A01 0.075 0.016 0.095 0.333 0.271 0.052 0.276 0.649
A02 0.101 0.096 0.014 0.39 0.13 0.035 0.326 0.479
A03 0.182 0.052 0.135 0.051 0.118 0.155 0.008 0.11
A04 0.12 0.068 0.102 0.095 0.261 0.183 0.213 0.384
A05 0.036 0.015 0.265 0.421 0.129 0.185 0.216 0.157
A06 0.055 0.053 0.085 0.295 0.269 0.284 0.234 0.547
A07 0.099 0.009 0.088 0.057 0.042 0.066 0.021 0.047
A08 0.073 0.053 0.018 0.197 0.04 0.031 0.16 0.242
A09 0.073 0.087 0.043 0.239 0.126 0.074 0.228 0.324
A10 0.016 0 0.211 0.212 0.074 0.063 0.103 0.023
A11 0.05 0.01 0.017 0.201 0.402 0.02 0.249 0.366
A12 0.012 0.009 0.145 0.719 0.046 0.006 0.347 0.751
A13 0.008 0.022 0.254 0.046 0.144 0.239 0.092 0.041
A14 0.144 0.096 0.053 0.229 0.257 0.138 0.11 0.309
A15 0.008 0.007 0.178 0.022 0.008 0.333 0.053 0.143
A16 0.101 0.128 0.167 0.298 0.045 0.252 0.034 0.377
A17 0.053 0.012 0.185 0.118 0.05 0.194 0.011 0.191
A18 0.011 0.1 0.055 0.175 0.031 0.313 0.043 0.285
A19 0.025 0.048 0.068 0.001 0.056 0.062 0.068 0.025
A20 0.071 0.032 0.165 0.047 0.115 0.004 0.037 0.141
Tab. D.5: dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.
Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff
A01 0.028 0.078 0.041 0.058 0.331 0.238 0.127 0.145
A02 0.034 0.099 0.078 0.552 0.021 0.095 0.329 0.442
A03 0.185 0.051 0.237 0.412 0.003 0.334 0.245 0.744
A04 0.013 0.002 0.231 0.595 0.022 0.284 0.24 0.818
A05 0.062 0.021 0.039 0.171 0.096 0.073 0.083 0.208
A06 0.105 0.001 0.186 0.181 0.078 0.313 0.123 0.44
A07 0.113 0.07 0.128 0.05 0.205 0.107 0.159 0.048
A08 0.115 0.095 0.052 0.059 0.255 0.038 0.018 0.025
A09 0.101 0.041 0.284 0.441 0.485 0.051 0.421 0.839
A10 0.107 0.012 0.024 0.003 0.109 0.125 0.075 0.229
A11 0.021 0.088 0.097 0.428 0.037 0.119 0.273 0.346
A12 0.121 0.077 0.019 0.851 0.215 0.329 0.576 1.164
A13 0.064 0.009 0.001 0.13 0.07 0.086 0.061 0.124
A14 0.039 0.081 0.124 0.026 0.069 0.24 0.014 0.089
A15 0.083 0.084 0.036 0.206 0.288 0.445 0.042 0.069
A16 0.003 0.007 0.055 0.252 0.118 0.007 0.163 0.422
A17 0.061 0.045 0.076 0.652 0 0.042 0.255 0.689
A18 0.032 0.113 0.044 0.436 0.163 0.245 0.067 0.341
A19 0.021 0 0.164 0.23 0.032 0.237 0.145 0.194
A20 0.016 0.015 0.078 0.634 0.112 0.28 0.378 0.694
Tab. D.6: dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.
Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff
A01 0.102 0.062 0.137 0.275 0.06 0.29 0.149 0.503
A02 0.068 0.003 0.064 0.162 0.151 0.059 0.003 0.037
A03 0.003 0.002 0.102 0.464 0.115 0.178 0.253 0.634
A04 0.107 0.067 0.129 0.499 0.283 0.101 0.027 0.434
A05 0.026 0.006 0.226 0.592 0.033 0.112 0.299 0.366
A06 0.049 0.052 0.1 0.114 0.191 0.029 0.11 0.107
A07 0.014 0.061 0.216 0.007 0.247 0.041 0.18 0
A08 0.043 0.042 0.069 0.139 0.294 0.069 0.142 0.218
A09 0.028 0.045 0.242 0.202 0.36 0.022 0.193 0.514
A10 0.122 0.012 0.235 0.209 0.183 0.062 0.028 0.206
A11 0.029 0.078 0.114 0.226 0.365 0.139 0.024 0.02
A12 0.109 0.068 0.126 0.133 0.169 0.324 0.229 0.413
A13 0.056 0.031 0.255 0.084 0.074 0.154 0.153 0.082
A14 0.105 0.015 0.177 0.255 0.188 0.379 0.124 0.398
A15 0.075 0.077 0.213 0.184 0.297 0.111 0.011 0.074
A16 0.104 0.121 0.222 0.55 0.163 0.245 0.197 0.799
A17 0.114 0.033 0.261 0.77 0.05 0.152 0.244 0.879
A18 0.043 0.013 0.011 0.261 0.193 0.068 0.024 0.057
A19 0.004 0.048 0.232 0.231 0.087 0.299 0.213 0.219
A20 0.086 0.047 0.088 0.587 0.227 0.284 0.415 0.835
Tab. D.7: dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.
E LIST OF ACRONYMS
AEP Application expression profile
AEV Application expression vector
APM Application profile monitor
BSX Byte sex (little endian or big endian)
CB Code block
CISC Complex instruction set computer
CMISS Cache miss
CODESIZE Code size
CPU Central processing unit
CYCEFF Cycle efficiency
dAEP Differential application expression profile
DSP Digital signal processor
DSPCPU DSP CPU
EC Energy cycle
ECACF Energy cycle aware compilation framework
ECB Energy cycle bay
ECHCB Energy cycle hungry code block
EDA Electronic design automation
ENSAVING Energy saving
FFT Fast Fourier transform
FUU Functional unit utilization
GA Genetic algorithm
GMIC Gradient mode iterative compilation
HIUSE Highway usage
HM Hertz Matrix
IC Integrated circuits
IDCT Inverse discrete cosine transform
ILP Instruction level parallelism
IPC Instructions per cycle
ISA Instruction set architecture
ISV Independent software vendor
JPMO Joules per million of operations
LSB Least significant bit
M100 Matrix of order 100x100
M64 Matrix of order 64x64
MES Mobile embedded systems
MMIO Memory mapped input output
MSIC Multicriteria stochastic iterative compilation
NI-CD Nickel cadmium
NI-MH Nickel metal hydride
NLIVQ Non-linear vector quantization
OPC Operations per cycle
PC Personal computer
PCA Principal component analysis
PCSW Program counter status word
PDA Personal digital assistant
PMU Power management unit
REGUSE Register usage
RISC Reduced instruction set computer
SCHFAC Scheduling factor
SDRAM Synchronous dynamic random access memory
SLTUTIL Slot utilization
TS Transformation scheme
UCF User constraint file
VDF VLIW descriptor file
VLIW Very long instruction word
VQ Vector quantization
WCET Worst case execution time
BIBLIOGRAPHY
[1] “Desktop CPU Power Consumption Guide,”
http://www.techarp.com/.
[2] “Intel Processor Chronicle,” http://developer.intel.com/design/.
[3] Natibo, “Rechargeable Battery/Systems for Communication/Electronic Applica-
tions,” http://www.acq.osd.mil/ott/natibo/docs/BatryRpt-2.pdf.
[4] Philips Electronics North America Corporation, “TM1300 Data Book,” Oct. 1999.
[5] V. Tiwari, S. Malik, and A. Wolfe, “Power Analysis of Embedded Software: A First
Step Towards Software Power Minimization,” IEEE Transactions on VLSI Systems,
vol. 2, Dec. 1994.
[6] M. Lorenz, T. Draeger, R. Leupers, P. Marwedel, and G. P. Fettweis, “Low-Energy
DSP Code Generation Using a Genetic Algorithm,” in Proceedings of the IEEE on
Computer Design 2001, Austin, Texas, Jan. 2001.
[7] A. V. Oppenheim and R. Schafer, Discrete Time Signal Processing. New Jersey:
Prentice Hall, 1989.
[8] N. Z. Azeemi, “Probabilistic Iterative Compilation for Source Optimization of
Embedded Programs,” in Proceeding of the IEEE 2006 International SoC Design
Conference, Seoul, Korea, Oct. 2006, pp. 323 – 328.
[9] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Ap-
proach, 2nd ed. Morgan Kaufmann Publishers, 1996.
[10] “Advanced Configuration and Power Interface Specification,”
http://www.teleport.com/ acpi.
[11] “The StarCore DSP,” http://www.starcore-dsp.com/.
[12] “Intel StrongARM Processors,” http://developer.intel.com/design/strong/.
[13] “Intel SpeedStep Technology,” http://www.intel.com/mobile/pentiumIII/ist.htm.
[14] Intel StrongARM SA-1110 Microprocessor - Advanced Developer’s Manual. Intel
Corp., Jun. 2006.
[15] C. Small, “Shrinking Devices Puts the Squeeze on System Packaging,” in EDN
39(4), Feb. 1994, pp. 41–46.
[16] V. Gutnik and A. P. Chandrakasan, “An Embedded Power Supply for Low-Power
DSP,” in Proceedings of IEEE Transactions on VLSI Systems, ser. 4, vol. 5, Dec.
1997, pp. 425–435.
[17] “Moore’s Law,”
http://www.intel.com/intel/museum/25anniv/hof/moore.htm.
[18] C. Chiasserini and R. Rao, “Pulsed battery discharge in communication devices,”
in MOBICOM (1999), 1999, pp. 88–95.
[19] J. Eager, “Advances in Rechargeable Batteries Spark Product Innovation,” in
Proceedings of the 1992 Silicon Valley Computer Conference, Santa Clara, Aug.
1992, pp. 243–253.
[20] D. Stepner, N. Rajan, and D. Hui, “Embedded Application Design Using a Real-
Time OS,” in Proceedings of DAC 1999, New Orleans, 1999, pp. 151–156.
[21] P. S. R. Diniz, Adaptive Filtering: Algorithms and Practical Implementation.
Kluwer Academic, 1997.
[22] A. Hoffmann and H. Meyr, Architecture Exploration for Embedded Processors
with LISA. Kluwer Academic Publishers, 2002.
[23] “Advanced RISC Machines Architectural Reference Manual,” Advanced RISC
Machines Ltd., Prentice Hall, 1996.
[24] V. Tiwari, S. Malik, A. Wolfe, and M. T.-C. Lee, “Instruction Level Power Analysis
and Optimization of Software,” in Journal of VLSI Signal Processing, vol. 13, no. 2,
Aug. 1996.
[25] N. Z. Azeemi and M. Rupp, “Energy-Aware Source-to-Source Transformations for
a VLIW DSP Processor,” in Proceedings of the IEEE 17th International Conference
on Microelectronics, Islamabad, Pakistan, Dec. 2005, pp. 133–138.
[26] ——, “Multicriteria Low Energy Source Level Optimization of Embedded Pro-
grams,” in Tagungsband zur Informationstagung Mikroelektronik 06, IEEE Austria,
Vienna, Austria, Oct. 2006, pp. 150–158.
[27] N. Z. Azeemi, “Multicriteria Energy Efficient Source Code Compilation for De-
pendable Embedded Applications,” in Proceedings of the IEEE International Con-
ference on Information Technology IIT 2006, Dubai, UAE, Nov. 2006.
[28] N. Z. Azeemi, A. Sultan, and A. Muhammad, “Parameterized Characterization of
Bioinformatics Workload on SIMD Architecture,” in Proceedings of the IEEE Inter-
national Conference on Information and Automation 2006, Colombo, Sri Lanka,
Dec. 2006, pp. 189–194.
[29] N. Z. Azeemi and A. Sultan, “Characterization of Bioinformatics Applications on
Multimedia Processor,” in Proceedings of the IEEE Cairo International Biomedical
Engineering Conference 2006, Cairo, Egypt, Dec. 2006.
[30] N. Z. Azeemi, “Compiler Directed Battery-Aware Implementation of Mobile Ap-
plications,” in Proceedings of the IEEE 2nd International Conference on Emerging
Technologies 2006, Peshawar, Pakistan, Nov. 2006, pp. 151–156.
[31] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, “Reducing
Power in High-performance Microprocessors,” in Proceedings of the 35th Design
Automation Conference, San Francisco, CA USA, Jun. 1998.
[32] “JouleTrack - A Web Based Tool for Software Energy Profiling,”
http://dry-martini.mit.edu/JouleTrack/.
[33] A. Sinha and A. Chandrakasan, “Energy Aware Software,” in Proceedings of the
XIII International Conference on VLSI Design, Calcutta, Jan. 2000.
[34] ——, “JouleTrack - A Web Based Tool for Software Energy Profiling,” in Pro-
ceedings of the 38th Design Automation Conference, Las Vegas, Jun. 2001.
[35] ——, “Operating System and Algorithmic Techniques for Energy Scalable Wire-
less Sensor Networks,” in Proceedings of the Second International Conference on
Mobile Data Management, Hong-Kong, Jan. 2001.
[36] ——, “Energy Efficient Real-Time Scheduling,” in Proceedings of the Interna-
tional Conference on Computer Aided Design (ICCAD), San Jose, Nov. 2001.
[37] U. Thoeni, Programming real-time multicomputers for signal processing.
Prentice-Hall, 1994.
[38] R. B. Lee, “Subword Parallelism with MAX-2,” in IEEE Micro, ser. 4, vol. 16,
Aug. 1996, pp. 51–59.
[39] “The Intel XScale Microarchitecture,”
http://developer.intel.com/design/intelxscale/.
[40] “eCos Users Guide,”
http://sources.redhat.com/ecos/docs-latest/pdf/user-guides.pdf.
[41] M. Mehendale, A. Sinha, and S. D. Sherlekar, “Low Power Realization of FIR
Filters Implemented Using Distributed Arithmetic,” in Proceedings of Asia South
Pacific Design Automation Conference, Yokohama, Japan, Feb. 1998.
[42] A. Sinha, A. Wang, and A. Chandrakasan, “Algorithmic Transforms for Efficient
Energy Scalable Computation,” in Proceedings of the 2000 IEEE International
Symposium on Low-Power Electronic Design (ISLPED 00), Italy, Aug. 2000.
[43] W. Fornaciari, P. Gubian, D. Sciuto, and C. Silvano, “Power Estimation of Em-
bedded Systems: A Hardware/Software Codesign Approach,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 6, pp. 266–275, 1998.
[44] N. Z. Azeemi, “A Framework for Architecture Based Energy-Aware Code Transfor-
mations in VLIW Processors,” in Proceedings of the IEEE International Symposium
on Telecommunications 2005, Shiraz, Iran, 2005, pp. 393–398.
[45] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform,” in IEEE
Transactions on Computers, vol. C-23, Jan. 1974, pp. 90–93.
[46] W. Chen, C. H. Smith, and S. C. Fralick, “A Fast Computational Algorithm for the
Discrete Cosine Transform,” in IEEE Transactions on Communications, vol. 25,
Sep. 1977, pp. 1004–1009.
[47] L. McMillan and L. A. Westover, “A Forward-Mapping Realization of the Inverse
Discrete Cosine Transform,” in Proceedings of the Data Compression Conference
(DCC 92), Mar. 1992, pp. 219–228.
[48] A. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-Power CMOS Design,”
in IEEE Journal of Solid State Circuits, Apr. 1992, pp. 472–484.
[49] A. Chandrakasan and R. Brodersen, “Low Power CMOS Design,” IEEE Press,
1998.
[50] P. J. M. Havinga and G. J. M. Smit, “Octopus: embracing the energy efficiency of
handheld multimedia computers,” in Proceedings of MOBICOM, 1999, pp. 77–87.
[51] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, “Predictive System
Shutdown and Other Architectural Techniques for Energy Efficient Programmable
Computation,” in Proceedings of IEEE Transactions on VLSI Systems, ser. 1,
vol. 4, Mar. 1996, pp. 42–54.
[52] H. Zhang and J. Rabaey, “Low-Swing Interconnect Interface Circuits,” in Pro-
ceedings of the International Symposium on Low Power Electronics and Design
1998, 1998, pp. 161–166.
[53] W. Athas et al., “Low Power Digital Systems Based on Adiabatic Switching
Principles,” in IEEE Transactions on VLSI Systems, ser. 4, vol. 2, Dec. 1994.
[54] S. H. Chow, Y. Ho, and T. Hwang, “Low power realization of finite state ma-
chines: a decomposition approach,” in ACM Transactions on Design Automation
of Electronic Systems, Jul. 1996, pp. 315–340.
[55] T. Burd et al., “A Dynamic Voltage Scaled Microprocessor System,” in
Proceedings of International Solid State Circuits Conference 2000, 2000, pp. 294–
295.
[56] K. Govil, E. Chan, and H. Wasserman, “Comparing Algorithms for Dynamic Speed
Setting of a Low-Power CPU,” in Proceedings of the ACM International Conference
on Mobile Computing and Networking, 1995, pp. 13–25.
[57] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic
speed-setting of a low-power CPU,” in Proceedings of MOBICOM, 1995, pp.
13–25.
[58] R. Min, T. Furrer, and A. P. Chandrakasan, “Dynamic Voltage Scaling Techniques
for Distributed Microsensor Networks,” in Proceedings of the IEEE Computer
Society-Workshop on VLSI (WVLSI 00), Apr. 2000.
[59] “Intel StrongARM SA-1100 Microprocessor Developer’s Manual,”
http://developer.intel.com/design/strong/manuals/278088.htm.
[60] “AMD K6 PowerNOW,”
http://www.amd.com/products/cpg/mobile/powernow.html.
[61] “AMPS Operating System and Software,” http://gatekeeper.mit.edu.
[62] D. J. Kolson, A. Nicolau, and N. Dutt, “Optimal register assignment to loops
for embedded code generation,” in ACM Transactions on Design Automation of
Electronic Systems, vol. 1(2), Apr. 1996, pp. 251–279.
[63] “eCos Reference Manual,”
http://sources.redhat.com/ecos/docs-latest/pdf/ecos-ref.pdf.
[64] “The µITRON API,”
http://sources.redhat.com/ecos/docs-latest/ref/ecos-ref.a.html.
[65] “The EL/IX Homepage,” http://sources.redhat.com/elix/.
[66] “eCos Downloading and Installation,”
http://sources.redhat.com/ecos/getstart.html.
[67] M. O. Tokhi, Parallel Computing for Real-time Signal Processing and Control.
Springer, 2003.
[68] T. V. K. Gupta, R. E. Ko, and R. Barua, “Compiler-directed Customization
of ASIP Cores,” in International Symposium on Hardware/Software Co-Design,
2002, pp. 97–102.
[69] W. Horn, “Some Simple Scheduling Algorithms,” Naval Research Logistics
Quarterly, vol. 21, 1974.
[70] K. Ramamritham and J. A. Stankovic, “Dynamic Task Scheduling in Distributed
Hard Real-Time Systems,” in Proceedings of IEEE Software, ser. 3, vol. 1, Jul.
1984.
[71] F. Yao, A. Demers, and S. Shenker, “A Scheduling Model for Reduced CPU
Energy,” in Proceedings of IEEE Annual Foundations of Computer Science, 1995,
pp. 374–382.
[72] T. Kondo, M. Inoue, and K. Nakai, “Application of autonomous decentralized
system to the steel production computer control,” in 3rd International Workshop
on Future Trends of Distributed Computing Systems, 1992, pp. 419–423.
[73] G. Buttazzo, Hard Real-Time Computing Systems - Predictable Scheduling Algo-
rithms and Applications. Kluwer Academic Publishers, 1997.
[74] “ARM Software Development Toolkit Version 2.11 : User Guide,” Advanced RISC
Machines Ltd., May 1997.
[75] S. H. Nawab et al., “Approximate Signal Processing,” in Journal of VLSI
Signal Processing Systems for Signal, Image, and Video Technology, ser. 1/2,
vol. 15, Jan. 1997, pp. 177–200.
[76] “Microsoft Windows CE,” http://www.microsoft.com/windows/embedded/ce/.
[77] “The Palm OS Platform,” http://www.palmos.com/platform/architecture.html.
[78] A. S. Tanenbaum, Modern Operating Systems. Prentice Hall, Feb. 2001.
[79] “VIS Speeds New Media Processing,” in IEEE Micro, ser. 4, vol. 16, Aug. 1996,
pp. 10–20.
[80] M. D. Jennings and T. M. Conte, “Subword Extensions for Video Processing on
Mobile Systems,” in IEEE Concurrency, July-Sept. 1998, pp. 13–16.
[81] “Intel Pentium III SIMD Extensions,”
http://developer.intel.com/vtune/cbts/simd.htm.
[82] “Solution Engine,”
http://semiconductor.hitachi.com/tools/solution-engine.html.
[83] “MAX1717: Dynamically Adjustable, Synchronous Step-Down Controller for
Notebook CPUs,” Maxim Integrated Products data sheet,
http://pdfserv.maxim-ic.com/arpdf/MAX1717.pdf.
[84] W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, “Energy Efficient Routing Protocols for
Wireless Microsensor Networks,” in Proceedings of the 33rd Hawaii International
Conference on System Sciences (HICSS 00), Jan. 2000.
[85] Q. Qiu and M. Pedram, “Dynamic Power Management Based on Continuous-Time
Markov Decision Processes,” in Proceedings of the Design Automation Conference
(DAC 99), New Orleans, 1999, pp. 555–561.
[86] G. Wei and M. Horowitz, “A Low Power Switching Power Supply for Self-Clocked
Systems,” in Proceedings of International Symposium on Low Power Electronics
and Design, 1996, pp. 313–318.
[87] C. L. Liu and J. W. Layland, “Scheduling Algorithms for Multiprogramming in
a Hard Real-Time Environment,” in Journal of ACM, ser. 1, vol. 20, 1973, pp.
46–61.
[88] M. Satyanarayanan and D. Narayanan, “Multi-fidelity algorithms for interactive
mobile applications,” in 3rd International Workshop on Discrete Algorithms and
Methods for Mobile Computing and Communications (DIAL M99), 1999, pp.
1–6.
[89] A. Salkintzis, C. Chamzas, and C. Koukourlis, “An energy saving protocol for mo-
bile data networks,” in International Conference on Advances in Communications
and Control (COMCON 5), Jun. 1995, pp. 26–30.
[90] R. Kravets and P. Krishnan, “Power management techniques for mobile commu-
nication,” in Proceedings of MOBICOM (1998), 1998, pp. 157–168.
[91] I. Chlamtac, C. Petrioli, and J. Redi, “Energy-conserving access protocols for
identification networks,” in IEEE/ACM Transactions on Networking, vol. 7(1),
Feb. 1999, pp. 51–59.
[92] S. Singh, M. Woo, and C. Raghavendra, “Power-aware routing in mobile ad hoc
networks,” in Proceedings of MOBICOM (1998), pp. 181–190.
[93] Y. Bai and C. Lai, “A bitmap scaling and rotation design for SH1 low power
CPU,” in 2nd International Workshop on Modeling, Analysis and Simulation of
Wireless and Mobile Systems (1999), 1999, pp. 101–106.
[94] “Speech Codecs G.71x and G.72x,” http://www.compression-links.info/G.711-G.72x.
[95] “MPEG Pointers and Resources,” http://www.mpeg.org/.
[96] T. Xanthopoulos and A. Chandrakasan, “A Low-Power DCT Core Using Adaptive
Bit-width and Arithmetic Activity Exploiting Signal Correlations and Quantiza-
tions,” in Proceedings of the Symposium on VLSI Circuits, Jun. 1999.
[97] JHMI, “Genome Sequencing,” www.bis.med.jhmi.edu.
[98] “The NCBI Bacteria genomes database,”
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/.
[99] NCBI, “Genome Sequencing,” http://www.ncbi.nlm.nih.gov/.
[100] GDB, “Genome Sequencing,” www.gdb.org.
[101] SEQ, “Genome Sequencing,” www.sequenceanalyses.org.
[102] oupjournal, “Genome Databases,” www.nar.oupjournal.org.
[103] “AMD-3DNOW! Technology Manual-Instruction Set Architecture Specification,”
http://www.amd.com/K6/k6docs/.
[104] “TMS320C54x DSP Function Library,”
http://www.ti.com/sc/docs/products/dsp/c5000/.
[105] N. Z. Azeemi, “Power Aware Framework for Dense Matrix Operations in Multi-
media Processors,” in Proceedings of the IEEE International Multitopic Conference
2005, Karachi, Pakistan, Dec. 2005, pp. 157–168.
[106] T. Baeck, Evolutionary Algorithms in Theory and Practice. Oxford University
Press, 1996.
[107] N. Z. Azeemi, “A Multiobjective Evolutionary Approach for Constrained Joint
Source Code Optimization,” in Proceedings of the ISCA 19th International Confer-
ence on Computer Application in Industry, Las Vegas, USA, Nov. 2006,
pp. 175–180.
[108] ——, “Handling Architecture-Application Dynamic Behavior in Set-top Box Ap-
plications,” in Proceedings of the IEEE International Conference on Information
and Automation 2006, Colombo, Sri Lanka, Dec. 2006, pp. 195–200.
[109] ——, “Architecture-Aware Hierarchical Probabilistic Source Optimization,” in Pro-
ceedings of the ISCA 19th International Conference on Parallel and Distributed
Computing Systems, San Francisco, USA, Sep. 2006.
[110] K. Hwang, Advanced Computer Architecture. McGraw-Hill, 2001.
[111] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Data Mining with
optimized two-dimensional association rules,” in ACM Transactions on Database
Systems (TODS), ser. 2, vol. 26, Jun. 2001.
[112] A. A. Nanavati, K. P. Chitrapura, S. Joshi, and R. Krishnapuram, “Association
Rule Mining: Mining generalised disjunctive association rules,” in Proceedings of
the tenth international conference on Information and knowledge management
CIKM ’01, Oct. 2001.
[113] P.-N. Tan, “Selecting the Right Interestingness Measure for Association Patterns,”
in ACM SIGKDD 02, Alberta, Canada.
[114] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, “Source-Level Execution
Time Estimation of C Programs,” in International Symposium on Hardware/Soft-
ware Co-Design, 2001, pp. 98–104.
[115] P. Puschner and C. Koza, “Calculating the maximum execution time of real-time
programs,” Journal of Real-Time Systems, vol. 1, no. 2, pp. 159–176, September
1989.
[116] “Software Testing,” http://hissa.ncsl.nist.gov/swassurance/strtest.html.
[117] “The GNU Project,” http://www.gnu.org/.
[118] “Intel StrongARM SA-1110 Linecard,”
http://developer.intel.com/design/strong/linecard/sa-1110.