Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices
Saim Ahmad
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Science and Application
Matthew Hicks, Chair
Changwoo Min
Ali Butt
May 7, 2021
Blacksburg, Virginia
Keywords: Compiler Optimizations, Intermittent Computation, Energy Harvesting, Flash
Storage
Copyright 2021, Saim Ahmad
Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices
Saim Ahmad
(ABSTRACT)
With the advent of energy harvesters, supporting fast and efficient computation on energy
harvesting devices has become a key challenge for ubiquitous devices. Computation on an
energy harvesting device amounts to spreading the execution of a long-running application
over short, frequent cycles of power. However, we must ensure that executing an application
intermittently produces results congruent to those produced by executing the application
on a device with a continuous source of power. The current state-of-the-art systems that
enable intermittent computation on energy harvesters make use of novel compiler analysis
techniques as well as on-board hardware to measure the energy remaining for useful
computation. However, currently available programming models, which mostly target devices
with FRAM as the NVM, would cause failure on devices that employ Flash as the primary NVM,
thereby resulting in a non-universal solution that is restricted by the choice of NVM. This
is primarily the result of Flash's limited write/erase endurance.
This research aims to contribute to the world of energy harvesting devices by providing
solutions that enable intermittent computation regardless of the choice of NVM on a device,
by utilizing only the SRAM to save state and perform computation. Utilizing the SRAM further
reduces run-time overhead, as SRAM reads/writes are less costly than NVM reads/writes. Our
proposed solutions rely on programmer guidance and compiler analysis to ensure correct and
efficient intermittent computation. We then extend our system to provide a complete
compiler-based solution without programmer intervention. Our system is able to run
applications that would otherwise render any device with Flash as NVM useless in a matter
of hours.
Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices
Saim Ahmad
(GENERAL AUDIENCE ABSTRACT)
As batteries continue to take up space and make small-scale sensors hefty, battery-less devices
have grown increasingly popular for non-resource-intensive computations. From tracking air
pressure in vehicle tires to monitoring room temperature, battery-less devices have countless
applications in various walks of life. These devices function by periodically harvesting energy
from their surroundings to power short bursts of computation. When device energy levels reach
a lower-bound threshold, these devices must power off and scavenge useful energy from the
environment before performing further short bursts of computation. Usually, energy harvesting
devices draw power from solar, thermal, or RF energy; which source a device uses depends
largely on its build, at whose core sits a microcontroller (a processing unit built to perform
small-scale computations). Because these devices constantly power on and off, performing
continuous computation on them is considerably more difficult than on systems with a
continuous source of power.
Since an application can require more time to complete than one power cycle of such devices,
by default, an application running on such a device will restart execution from the beginning
at the start of every power cycle. It is therefore necessary for such devices to have mechanisms
to remember where they were before the device lost power. The past decade has seen many
solutions proposed to help an application resume execution rather than recompute everything
from the beginning. Solutions utilize different categories of devices with different storage
technologies, as well as the different software and hardware utilities available to programmers
in this domain. In this research, we propose two different low-overhead, long-life computation
models to support intermittent computation on the subset of energy harvesting devices that
use Flash-based memory to store persistent data. Our approaches rely on programmer guidance
and program analysis techniques to sustain computation across power cycles.
Acknowledgments
A special thanks to my research advisor, Matthew Hicks, and my research associate, Harrison
Williams, for helping me prepare my first research publication, which forms a part of this
dissertation.
Contents
List of Figures x
List of Tables xii
1 Introduction 1
2 A Difference World: High-performance, NVM-invariant, Software-only
Intermittent Computation 4
2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Why Intermittent Computation on Flash Devices? . . . . . . . . . . 10
2.3.2 Existing Programmer-guided Systems Kill Flash . . . . . . . . . . . . 12
2.3.3 SRAM’s Time-dependent Non-volatility . . . . . . . . . . . . . . . . 13
2.3.4 Intermittent Off Times are Short . . . . . . . . . . . . . . . . . . . . 15
2.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Detecting Unexpectedly Long Off Times . . . . . . . . . . . . . . . . 18
2.4.3 Bimodal Recovery Routine . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.4 CAMEL Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.5 CAMEL Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 Compiler Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.2 Compiler Modifications . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.3 Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.4 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.2 Time to death—Flash Failure . . . . . . . . . . . . . . . . . . . . . . 32
2.6.3 Run-time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.4 Binary size overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.5 Automatic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 SABLE: A Compiler-only Alternative to CAMEL 42
3.1 Programmer-intervention vs Compiler-analysis — tradeoff . . . . . . . . . . 43
3.2 SABLE Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.1 Naive Canary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Idempotent Canary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.3 Naive CRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.4 Batch CRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 SABLE Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 SABLE Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Conclusion 52
Bibliography 53
List of Figures
2.1 Flash/device lifetime for existing programmer-guided intermittent computa-
tion approaches. Incessant checkpointing to Flash quickly renders the device
unusable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 The maximum time before a single bit fails in SRAM across temperature
changes for three capacitor sizes. The horizontal bars represent the maximum
off-times from our meta-analysis of off-times reported in the energy harvesting
literature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Interaction amongst components within CAMEL. . . . . . . . . . . . . . . . . 16
2.4 (a) Shows unmodified source code (b) shows the task divided code according
to the conventions described §2.4.4 (c) shows the execution of the code in (b)
after it has been instrumented by the compiler. . . . . . . . . . . . . . . . . 21
2.5 (1) Shows the start of a task (2) shows a power failure midway through the
execution of a task (3) shows undo-logging before any task begins execution. The
state of the non-volatile and volatile buffers is shown after each of the three steps. 23
2.6 Shows how tasks use data in the differential buffers. The only non-idempotent
variable is result since it undergoes write-after-read. It is first read in line 11
and also written to. This sequence of instructions in assembly would result
in a write-after-read violation. . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Shows the pipeline for the generation of a CAMEL-certified executable. . . . . 26
2.8 CAMEL run-time overhead for Flash-based devices. The global buffer size for
each benchmark is stated in parentheses on the x-axis. . . . . . . . . . . . . 34
2.9 CAMEL run-time overhead for FRAM-based devices. The global buffer size
for each benchmark is stated in parentheses on the x-axis. . . . . . . . . . . 34
2.10 CAMEL binary size increase compared to current state-of-the-art. . . . . . . . 38
3.1 Cumulative density curve for SABLE batch CRC, which helps in determining
the ideal number of stores to batch . . . . . . . . . . . . . . . . . . . . . . . 48
List of Tables
2.1 Deployment lifetime (in hours) for existing programmer-guided systems on
Flash-based devices with expected time until a silent data corruption for
several CAMEL configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Checkpoints each system makes per benchmark. We can see that each system
makes a comparable number of checkpoints, hence the difference in run-time
and binary-size overheads cannot be a result of a different number of checkpoints. 37
3.1 Summary of checkpoints placed in Naive and Idempotent variants of the ca-
nary version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Summary of checkpoints placed in Naive and Batch variants of the CRC version. 49
3.3 Relative run-time for different SABLE implementations . . . . . . . . . . . . . . . . . 51
3.4 Relative binary size for different SABLE implementations . . . . . . . . . . . . . . . . 51
List of Abbreviations
AR: Activity Recognition
BC: Bit Count
CEM: Cold-Chain Equipment Monitoring
CF: Cuckoo Filter
CRC: Cyclic Redundancy Check
FRAM: Ferroelectric Random Access Memory
IDEM: Idempotent
IR: Intermediate Representation
LLVM: Low-level Virtual Machine
MCU: Micro-controller Unit
NVM: Non-volatile memory
SRAM: Static Random Access Memory
TDNV: Time-dependent non-volatility
Chapter 1
Introduction
No matter what electronic device we use, it is always accompanied by a battery to power
it, often occupying at least half of the device's total space. Energy harvesting [33, 39, 48]
is an essential first step toward abandoning batteries permanently, thereby freeing that
space for more essential components. To attain complete ubiquity [42], we require that
future devices take advantage of the capabilities of energy harvesters, which scavenge
energy from the surroundings in the form of solar, thermal, and RF energy.
However, abandoning batteries and moving to a system that scavenges energy from the
environment proves to be a challenge of its own. Energy harvesters require frequent power-offs
to salvage energy from the environment before they can power on again. This means that
an application executing on such a device may fail to complete if the time required for it
to finish execution is greater than the average power-on time of an energy harvesting
device. Such challenges have given rise to intermittent computation, a research area which
works to ensure software can be intermittently executed in a correct and efficient way,
always resulting in completion of the running application.
The last decade saw an influx of programming models for intermittent computation on
energy harvesting devices. Probably the first system to revolutionize the field of energy
harvesting was Mementos [37], which utilized compile-time analysis techniques along with
hardware that measures available voltage at run-time to successfully execute programs
intermittently. This opened the gateway for multiple ideas that could be utilized to build
intermittent computation models. Since the dawn of Mementos [37], we have seen different
classes of models arise. Some techniques utilize software-only and compiler-based techniques
to enable applications to successfully execute on intermittent power [46]. Recent systems
have seen the incorporation of programmer-guidance along with compiler analysis to make
software techniques easy to implement [4, 27, 30]. Programmer-guidance involves the pro-
grammer rewriting source code, usually by dividing it into a set of functions that comply with
the primitives defined by the underlying system the application is being ported to. These
schemes are termed continuous checkpointing models coupled with programmer-guidance.
Another class of systems uses only hardware to measure the amount of energy available at
run-time and checkpoints accordingly before the device powers off [2]. These techniques are
just-in-time approaches, as they checkpoint just before losing power. All of the aforementioned
systems utilized either Flash or FRAM as non-volatile memory to checkpoint
and store state. However, a new class of the state-of-the-art has emerged that is NVM
invariant [44]. TotalRecall is a just-in-time programming model that uses the SRAM to
checkpoint data thus reducing the checkpointing overhead to a fraction of what other sys-
tems could achieve. This is made possible by the observation that the SRAM can retain
data for short periods of time after a device loses power. This discovery was made by [44]
and is elaborated further in §2.3.3.
This research will introduce programming models that are built using the observation that
the SRAM can retain state successfully for minutes after an energy harvesting device powers
off. These programming models are continuous-checkpointing, compiler-based, software-only
and NVM invariant. They can work on either the Flash or FRAM based systems without
requiring any modifications to the application source or programming model. Furthermore,
they do not require on-board hardware to continuously monitor available energy, thus making
these systems more adaptable to devices that do not provide energy measuring hardware.
§2 presents a manuscript of ours that builds a system by coupling continuous checkpointing
with programmer-direction. §3 introduces an alternative to the system in §2 that requires no
programmer-direction and is a compiler-only approach. However, this system is still in its
development and testing phase.
Chapter 2
A Difference World:
High-performance, NVM-invariant,
Software-only Intermittent
Computation
In this chapter we present our research paper, edited and modified to meet the requirements
of this dissertation. This paper was submitted to SOSP’21 on May 6th, 2021.
2.1 Abstract
Supporting long life, high performance, intermittent computation is an essential challenge
in allowing energy harvesting devices to fulfill the vision of smart dust. Intermittent compu-
tation is the extension of long-running computation across the frequent, unexpected, power
cycles that result from replacing batteries with harvested energy. The most promising ap-
proaches combine programmer direction and compiler analysis to minimize run-time over-
head and provide programmer control—without specialized hardware support. While such
strategies succeed in reducing the size of non-volatile memory writes due to checkpoint-
ing, they must checkpoint continuously. Unfortunately, for Flash-based devices, writing
checkpoints is slow and gradually kills the device. Without intervention, Flash devices
and software-only intermittent computation are fundamentally incompatible. To enable
programmer-guided intermittent computation on Flash devices, we design and implement
CAMEL. The key idea behind CAMEL is the systematic bifurcation of program state into two
“worlds” of differing volatility. Programmers compose intermittent programs by stitching
together atomic units of computation called tasks. The CAMEL compiler ensures that all
within-task data is placed in the volatile world. CAMEL places all data that is communi-
cated between tasks in the non-volatile world. Between tasks, CAMEL swaps the worlds,
atomically locking-in the forward progress of the preceding task. In preparation for the next
task, CAMEL resolves differences in world view by copying only differences due to the pre-
ceding task’s updates. This systematic decomposition into a mixed-volatility memory allows
programmer-guided intermittent computation on Flash devices while improving performance
for all NVM types. CAMEL extends correct operation from minutes to 1000s of years, in-
creases performance up to 42x on Flash devices, and improves performance on FRAM devices
by 50%.
2.2 Introduction
Energy harvesting [33, 39, 48] is the key to realizing the vision of ubiquitous computing [23,
42]. Aggressive transistor scaling brings us to an inflection point: computing devices are
smaller than a grain of rice and operate on nano-watts of power [47], but the batteries
required to power them remain largely unchanged, leaving them large, heavy, expensive, and
sometimes flammable [3]. This asymmetric scaling means attaining ubiquity demands that
current and future devices shed batteries in favor of harvested energy.
The transition to harvested energy brings a new challenge: how can we support long-running
programs in the face of the frequent, unpredictable, power cycles brought on by the rela-
tive trickle of energy supplied by energy harvesting? Existing programs and programmers
alike assume a continuous supply of energy, while energy harvesters provide only enough
energy for short bursts of computation. Attempting to execute unmodified programs on
such short bursts of power dooms long-running programs to a never-ending series of restarts.
A naive application of existing checkpointing schemes [37] is inadequate as previous work
shows that, without careful attention to the memory durability ramifications of power cycles,
semantically incorrect executions occur [36]. Thus intermittent computation is born.
Intermittent computation approaches fit into one of two high-level classes:
• Just-in-time checkpointing: special-purpose hardware monitors available power,
committing a checkpoint and ceasing computation when power dips below a pre-defined
voltage threshold [1, 2, 22, 37, 44].
• Continuous checkpointing: a program is decomposed (by a programmer [4, 27, 30],
compiler [31, 46], or hardware [10, 28, 29, 41]) into a series of inherently-restartable sub-
computations, glued together by checkpoints. This results in power-failure-agnostic
program execution.
Programmer-guided continuous checkpointing intermittent computation systems are favored
when programmers require guarantees about execution [4, 27, 30]. In many real-world de-
ployments, some operations cannot be interrupted and then resumed arbitrarily following a
power cycle. Consider interacting with an external radio or a sensor; these devices cannot
handle power loss mid-transaction [27]. This forces the programmer to anticipate
the effects of a power loss at every point in the code and write routines to mitigate the
consequences. This is unscalable and error-prone.
Programmer-guided approaches relieve programmers of this burden through a combination
of compiler analysis and a C programming interface that exposes forward progress atomicity
as a first-class programming abstraction. Compiler analysis is comprehensive and error-free,
while the programming interface allows programmers to reason about the system-level effects
of computation—at a granularity that they are comfortable with or that mirrors the device’s
interface/protocol.
Programmer-guided approaches divide programs into a series of checkpoint-connected tasks.
Tasks represent atomic, restartable, units of computation, i.e., they either complete entirely
or not at all. Regardless of any power cycles, the result of task execution is semantically
consistent with the code. The fundamental principle is that tasks keep their changes private
until they complete. Early work conservatively versions all within-task data [27], while
follow-on work introduces novel classes of cross-task data-communication channels [4]. The
most recent approach further reduces overhead by using idempotence analysis to minimize
data copying [30].
This paper addresses two limitations in state-of-the-art programmer-guided intermittent
computation approaches:
• Flash device lifetime: the frequent non-volatile memory writes of continuous check-
pointing strategies—no matter the size—exhaust Flash’s limited write endurance [17].
• Performance: current approaches copy redundant data to within-task buffers due to
not reusing existing data.
Fixing the first flaw makes programmer-guided intermittent computation possible
and performant on Flash-based devices. Fixing the second flaw increases perfor-
mance across both FRAM and Flash-based devices.
We design and implement CAMEL, an extension to C and compiler support that enables
long-life, low-overhead intermittent computation on Flash-based systems—without hardware
support—as well as increasing performance on both FRAM and Flash devices. Our solu-
tion leverages the idea of two worlds from ARM TrustZone [34], but replaces security with
data non-volatility. Two worlds exist within a CAMEL-instrumented program: a non-volatile
world across tasks (and for recovery) and a volatile world within a task. To implement this
selective mixed-volatility world abstraction on top of SRAM, we leverage recent work on
time-dependent non-volatility [44]. Instead of creating a wholly-non-volatile SRAM, which
§2.6 shows is not a viable solution, CAMEL reserves non-volatility for the non-volatile world
alone. Treating within-task data as volatile makes CAMEL performant on Flash-based de-
vices. Fine-grain idempotence analysis coupled with differential state analysis allows CAMEL
to efficiently update and transition between worlds. We validate CAMEL’s ability to ex-
tend long-running computation across frequent power cycles using Flash- and FRAM-based
MSP430 microcontrollers and a superset of benchmarks from previous work. Experiments
show that CAMEL provides practically unbounded deployment lifetime, while reducing av-
erage run-time overhead by 7x–42x over previous systems running on Flash-based devices.
Compared to a naïve software-only variant of a recent Flash-based intermittent computation
system, CAMEL improves performance by up to 455%. Even on FRAM devices, CAMEL’s
advanced compiler analyses cut run-time overhead in half compared to the state-of-the-art.
This paper makes the following contributions:
1. We expose that existing continuous checkpointing approaches have poor performance
and eventually kill Flash-based systems due to checkpoint-induced Flash memory
writes/erases (§2.3.1).
2. We propose the notion of controlled-volatility worlds to enable high-performance
programmer-guided intermittent computation on Flash devices (§2.4.4).
3. We present an NVM-invariant performance improvement: reusable differential buffers
(§2.4.5).
4. We expose to the designer and quantify the trade space between pre-deployment effort
and run-time overhead (i.e., canary vs. CRC) (§2.4.2).
5. We evaluate CAMEL against state-of-the-art programmer-guided intermittent compu-
tation systems; results show that CAMEL outperforms previous approaches in both
lifetime and performance, regardless of non-volatile memory type (§2.6).
2.3 Motivation
There exists a succession of programmer-guided intermittent computation systems, each
refining the interface exposed to programmers and reducing run-time overhead. Why is
another approach needed? This section answers this question through analysis that shows
that by ignoring Flash-based energy-harvesting platforms, we exclude the most
ubiquitous, most available, lowest cost, and highest performance systems from
the benefits of intermittent computation. Experiments with existing approaches show
that due to the performance and lifetime consequences of Flash writes/erases and the high-
frequency checkpoints endemic to continuous checkpointing, a new approach is required.
Lastly, we show that achieving suitable performance is more challenging than a direct
extension of previous work targeting Flash devices. This analysis motivates a new,
in-place checkpointing, approach to programmer-guided intermittent computation: CAMEL.
Figure 2.1: Flash/device lifetime for existing programmer-guided intermittent computation approaches. Incessant checkpointing to Flash quickly renders the device unusable.
2.3.1 Why Intermittent Computation on Flash Devices?
Frequent checkpointing allows programmer-guided approaches to remove the requirement
of special-purpose voltage monitoring hardware. Frequent checkpointing also means that
the performance of existing programmer-guided systems depends on the performance of
the non-volatile memory (NVM) that it commits checkpoints to. For several decades, the
only mass-market option for NVM in energy-harvesting-class devices was Flash memory. In
the last five years, a new NVM emerged: Ferroelectric Random-Access Memory (FRAM).
Following this trend, early energy-harvesting platforms used Flash-based microcontrollers
(e.g., WISP 4 [39] and Moo [48]), while the more recent energy-harvesting platforms use the
more esoteric FRAM-based microcontrollers (e.g., WISP 5 [33]). According to the WISP 5
developers, the impetus for the transition to FRAM-based devices is the lower cost of writes
compared to Flash.
While write latency is one metric to compare NVM technologies, other metrics become
important in a world where NVM writes/erases are no longer the limiting factor. Flash-
based devices provide several advantages over similar FRAM-based devices: Flash devices
are more available and have a larger pool of developers and suppliers, since they have been
around for decades, compared to less than a decade for FRAM devices. This trend is
unlikely to change soon. Flash also provides
a performance advantage, as shown by comparing Dhrystone results [44]. Even with the
same processor core, operating at the same clock frequency, FRAM requires memory access
wait-states when the clock surpasses 8 MHz, while Flash operates wait-state-free up to 25
MHz [15, 16]. Sub-linear power-frequency scaling in low-power microcontrollers enables more
energy-efficient operation at high clock speeds: for example, the MSP430F5529 consumes
360 µA/MHz at 1 MHz and 333 µA/MHz at 12 MHz [19]. Recent work highlights other
advantages of switching to a high-energy, high-efficiency operating point [5]. FRAM devices
require additional memory-access wait states as clock speeds increase, eliminating much of
the advantage of faster operation. Finally, Flash devices tend to contain more SRAM at
equivalent NVM sizes [18, 20].
Given the advantages of Flash, why do the most recent energy harvesting platform [33] and
programmer-guided intermittent computation systems [4, 27, 30] target FRAM? Despite the
availability and performance advantages of Flash, its slow, high-energy writes/erases are
antithetical to the high-frequency checkpoints of continuous checkpointing systems. Pro-
gramming Flash (i.e., writing) is energy and time-intense as it requires collecting enough
charge to raise the voltage of a Flash cell high enough to force charge to flow across the
cell’s dielectric (e.g., from 2.2V up to 12V). Worse, this process is uni-directional. Thus,
to change any single bit of Flash requires copying a segment (512 B) to SRAM, erasing
the entire segment, updating the desired bits in SRAM, and writing the updated segment
back to Flash. The common-case nature of checkpointing in programmer-guided systems,
the cost of writing/erasing Flash memory, and Flash’s untimely failure eclipse any bene-
fit Flash offers. Without an alternative to checkpointing to Flash, the vast majority of
microcontrollers, including the most performant, will not support software-only intermittent computation.
The goal of this paper is to provide the most performant programmer-guided
checkpointing approach that works across popular NVM technologies.
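To make the cost concrete, the following C sketch shows the read-modify-write cycle needed to change even a single byte within a 512 B Flash segment. This is an illustration only; flash_erase_segment() and flash_write_segment() are hypothetical stand-ins for the MSP430 Flash controller sequence, not CAMEL or vendor APIs.

    #include <stdint.h>
    #include <string.h>

    #define SEGMENT_SIZE 512

    extern void flash_erase_segment(uint8_t *segment_base);             /* hypothetical */
    extern void flash_write_segment(uint8_t *segment_base,
                                    const uint8_t *data, uint16_t len); /* hypothetical */

    void flash_update_byte(uint8_t *segment_base, uint16_t offset, uint8_t value)
    {
        static uint8_t shadow[SEGMENT_SIZE];         /* SRAM staging buffer */

        memcpy(shadow, segment_base, SEGMENT_SIZE);  /* 1. copy the segment to SRAM  */
        shadow[offset] = value;                      /* 2. update the byte in SRAM   */
        flash_erase_segment(segment_base);           /* 3. erase the entire segment  */
        flash_write_segment(segment_base, shadow, SEGMENT_SIZE); /* 4. write it back */
    }

Every checkpoint, however small, pays for a full segment erase and rewrite, which drives both the latency and the endurance costs discussed above.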
2.3.2 Existing Programmer-guided Systems Kill Flash
Flash cells can endure only a limited number of write/erase cycles before they fail [17].
To understand how this impacts existing programmer-guided intermittent computation ap-
proaches, we evaluate how long each system takes to render the Flash—and therefore the
system—unusable. We use the benchmark set from §2.6, which is a superset of bench-
marks from previous work. We start by adapting each system for the limitations of Flash’s
write/erase granularity. As a consequence, for systems like Alpaca that use idempotence to
reduce NVM writes, the entire buffer must be updated in Flash environments, regardless.
We apply two types of wear leveling to maximize lifetime: (1) we pack as many whole-
checkpoints as will fit in a Flash segment. For example, the AR buffer is 164 bytes in DINO;
this allows for three buffers per Flash segment. (2) we employ optimistic wear leveling.
Figure 2.1 shows the time taken to exhaust the average-case Flash endurance if each ap-
plication runs continuously on the MSP430G2553 microcontroller. We determine Flash’s
lifetime by calculating how long it takes for Flash to reach its maximum write endurance
(100,000 for the MSP430G2553 [17]). For this we use each benchmark’s checkpoint size and
rate, Flash segment size, and the number of free Flash segments for each benchmark. We as-
sume constant operation and perfect wear-leveling—with no added cost or complexity. Most
configurations last for just hours, with a best-case lifetime of less than 40 hours. Thus,
existing approaches kill Flash devices quickly.
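The lifetime computation reduces to a simple product. With per-segment write/erase endurance E, N free segments, k whole checkpoints packed per segment, and one checkpoint every t seconds, perfect wear leveling yields

    lifetime = E × N × k × t.

As an illustration with hypothetical but plausible numbers (not taken from our benchmarks): E = 100,000 cycles, N = 10 segments, k = 3 checkpoints per segment, and t = 10 ms give 100,000 × 10 × 3 × 0.01 s = 30,000 s, roughly 8.3 hours, consistent with the hours-scale lifetimes in Figure 2.1.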
2.3.3 SRAM’s Time-dependent Non-volatility
Many existing intermittent computation works discuss SRAM as if it loses its state com-
pletely as soon as the microcontroller stops computing. We observe that, due to capacitance
in the system, the voltage of a system’s power rail gradually reduces from the microcon-
troller’s brown-out voltage to 0V. Due to the difference in the microcontroller’s brown-out
voltage (e.g., 1.6V) and SRAM's data retention voltage (≈0.4 V [13, 35]), SRAM scavenges
the otherwise wasted charge to retain state. We refer to this as SRAM’s time-dependent
non-volatility: for a period after computation ceases, SRAM acts as a non-volatile memory.
This presents an opportunity to leverage SRAM’s time-dependent non-volatility to serve as a
non-volatile checkpoint storage location—as long as SRAM retains data perfectly for longer
than the off time.
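A first-order model, assuming leakage is the only load, makes this opportunity quantitative: with an effective leakage resistance R across the storage capacitor C, the rail decays exponentially, so the retention window from the brown-out voltage V_bod down to the data retention voltage V_ret is approximately

    t_retain ≈ R · C · ln(V_bod / V_ret).

For V_bod = 1.6 V and V_ret = 0.4 V, ln(V_bod/V_ret) ≈ 1.4. Under this simplified model, retention time scales linearly with capacitance, and rising temperature shrinks it by increasing leakage (lowering the effective R).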
To verify this opportunity, we first quantify how long SRAM provides perfect data retention
for. For this experiment, we use a Flash-based MSP430 development board [21] that is
representative of energy harvesting-class devices. The literature indicates that two factors
dominate the discharge time of a capacitor: capacitor size and temperature. To explore
the impact of these variables on SRAM’s data retention time, we modify the development
board, replacing its 10µF decoupling capacitor with 47µF, 100µF, and 330µF versions. We
select the 47µF capacitor as it represents what the most popular energy harvesting devices
use [33, 39]. We use larger capacitor sizes to show how system designers can tune the
retention time through the capacitor. To control temperature, we perform the experiments
in a Test Equity 123H thermal chamber, varying temperature between 20℃ and 50℃.
[Figure 2.2 plot: 100% SRAM data retention time (seconds, 0–4500) versus temperature (20 °C, "Office", through 55 °C, "Death Valley") for 47 µF, 100 µF, and 330 µF capacitors. Horizontal bars mark the maximum reported off-times: Solar: 300 s, Thermal: 14 s, RF: 10 s, Piezoelectric: 2 s.]
Figure 2.2: The maximum time before a single bit fails in SRAM across temperature changes, for three capacitor sizes. The horizontal bars represent the maximum off-times from our meta-analysis of off-times reported in the energy harvesting literature.
Because SRAM fails bi-directionally, in a board- and noise-dependent pattern [11], for each
temperature/capacitor combination, we perform five trials where we write all 1’s and five
where we write all 0’s, checking for data loss at each trial. Figure 2.2 shows the retention time
of our MSP430 microcontroller across a range of temperatures and the three energy storage
capacitor sizes. These results show that—even without system designer awareness of SRAM’s
time-dependent non-volatility—current energy harvesting platforms provide relatively long
data retention times.
2.3.4 Intermittent Off Times are Short
Given that SRAM provides perfect data retention for between 50 seconds and almost 4 hours,
the next question to answer is how long unexpected off-times1 are for the most common
energy sources. To answer this question, we perform a meta-analysis of the energy harvesting
literature. The goal is to identify common energy sources and, for each source, a realistic
upper bound for off times. This task is complicated by the fact that previous work focuses
on on-times due to its reliance on the long-term data retention guarantees of non-volatile
memories. Fortunately, by looking at on-times and the frequency of power-on events, we
are able to deduce approximate off times. We add the off-times for four energy sources
as horizontal lines in Figure 2.2: RF [8, 28, 37], Thermal [28], Piezoelectric [28, 40], and
Solar [8, 28]. When a given capacitor’s line is above the horizontal line, the capacitor
provides enough perfect data retention time to support operation at the temperature and
below. To summarize the results of our meta-analysis: off-times for most sources are
much shorter than the data retention time provided by existing energy harvesting
platforms. In this paper, we design and implement a system that reliably uses SRAM
as a low overhead, long lifetime, non-volatile memory for the short off times
common to intermittent computing, falling back to existing checkpointing to support
longer and expected power-off events.
[Figure 2.3 diagram: the programmer decomposes uninstrumented code into task-divided code; compiler analysis yields an instrumented executable; during program execution, CRC/canary checkpoints are written to SRAM (with Flash as the fallback store) and commits lock in progress; after a failure, recovery performs CRC/canary verification and either restores from the checkpoint or restarts, running to completion.]
Figure 2.3: Interaction amongst components within CAMEL.
2.4 Design
We develop CAMEL, a programmer-guided, continuous checkpointing system with the goal
of enabling long-life, high-performance intermittent computation on Flash-based devices.
CAMEL avoids continuously writing program state to non-volatile memory by preserving
in-place SRAM data using differential reusable buffers. Our differential buffer model al-
lows checkpointing just enough data to restore program state, as opposed to the entire
SRAM as implemented previously [44]. We maintain semantically correct execution by ensur-
ing the in-place data remains consistent across power cycles and that tasks always re-execute
with known-good data. CAMEL is an amalgamation of three components, working together
to guarantee the correct execution of a program on harvested energy. These components
are: (1) CAMEL Recovery Routine; (2) CAMEL Tasks; and (3) CAMEL Compiler.
¹We differentiate between expected and unexpected off times. The challenge for intermittent computation is dealing with unexpected power-cycles and their off times; thus that is our focus. In contrast, solar-powered systems experience long off times at night, but this is (predictable) power loss akin to turning off your computer—not intermittent computation.
2.4.1 System Overview
Figure 2.3 gives a high-level overview of how different components interact to make CAMEL
function. The CAMEL programming model allows the programmer to ensure forward progress
of applications on any energy harvesting platform by decomposing source code into a set of
individually re-executable tasks. Tasks manipulate data in the differential buffer to perform
useful computation. The CAMEL compiler analyzes how tasks interact with shared data
in the differential buffers to ensure in-place SRAM data is consistent at run time, despite
re-executions after power failures. CAMEL performs idempotence analysis and produces a
ready-to-run executable that can be flashed to a board of the programmer’s choosing. CAMEL
implements the volatile and non-volatile world concept using two differential, swappable
buffers—at any given point in execution, a volatile world and a non-volatile world exist.
Tasks interact with data exclusively in the volatile world, whereas recovery pulls
data exclusively from the non-volatile world. Between tasks, CAMEL atomically swaps
which buffer represents each world—locking-in forward progress by rendering the updated
buffer effectively non-volatile (§2.4.2). After the swap and before the next task, CAMEL
resolves the differences between the up-to-date newly-non-volatile world and the outdated
newly-volatile world by copying only the data that the preceding task modified. Following a
power failure, CAMEL invokes the recovery routine which continues execution either (1) from
the most recent checkpoint in the common case when SRAM retains its data, or (2) from the
beginning of the program, when an uncommonly long power failure causes SRAM to lose its
data.
2.4.2 Detecting Unexpectedly Long Off Times
SRAM transitions from non-volatility to semi-volatility, gradually approaching full-volatility
as supply voltage falls. For any stage beyond non-volatility, the SRAM cells begin losing
state, jeopardizing recovery. We employ two methods to detect unexpectedly long off times.
Canary Values: During a power failure, the SRAM cells that fail first and the direction
of failure are decided by manufacturing-time process variation; hence each device exhibits
unique yet temporally consistent failure patterns [11, 12]. We leverage this predictable failure
pattern for a low-overhead check of SRAM data retention by writing a pre-determined value
to the canary memory and checking for it after a power failure. If the first-to-fail SRAM
cells retain their canary values, then we know all of the SRAM data is intact and can
restart from a checkpoint; otherwise, data may be corrupted and we must restart execution
from the beginning. SRAM canary values require chip characterization for two purposes:
1) identifying the cells that fail first and 2) identifying the value those cells fail to, which
prevents silent failures from the cell failing into the chosen canary value.
Pre-deployment characterization works for three reasons: (1) given a device, SRAM cells fail
to retain state in a mostly total ordered fashion—especially the tail cells [12]; (2) SRAM cell
failure ordering is preserved across temperature and voltage fluctuations [9]; and (3) The
weakest cells have a reliable power-on state [45]. Thus, to determine, for a given device, the
location and values of the canary cells, the user performs a binary search of off-times, looking
for the first cells to fail at the shortest off-time. The cells are set to all 1’s, then the power is
disconnected, then reconnected after the desired off time, then the SRAM state is read-back,
looking for failed cells. The user does the same for all 0’s. If a cell fails for either case, then
it is marked as a failure for that time. For our experiments, we perform 3 such (Bernoulli)
trials at each time step to eliminate noise. We then store the address(es) of the first-to-fail
SRAM cell(s) in the non-volatile memory and the value that exposes their failure. The
checkpoint routine writes the value to the address(es), while the recovery routine validates it
before resuming execution. Programmers can always use multiple canary cells for increased
resilience. Canary values guarantee data integrity for the cost of a few memory comparisons,
but pre-characterization of each chip may be impractical for some deployments.
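A minimal sketch of the canary mechanism, assuming the characterization step has already stored the first-to-fail cell addresses and the values that expose their failure (all names here are illustrative, not CAMEL's actual interface):

    #include <stdint.h>

    #define NUM_CANARIES 4
    extern volatile uint8_t *const canary_addr[NUM_CANARIES]; /* first-to-fail cells     */
    extern const uint8_t canary_value[NUM_CANARIES];          /* values exposing failure */

    void canary_arm(void)                   /* run as part of each checkpoint */
    {
        for (int i = 0; i < NUM_CANARIES; i++)
            *canary_addr[i] = canary_value[i];
    }

    int canary_intact(void)                 /* run at recovery: 1 = SRAM trusted */
    {
        for (int i = 0; i < NUM_CANARIES; i++)
            if (*canary_addr[i] != canary_value[i])
                return 0;                   /* off time too long: restart program  */
        return 1;                           /* resume from the in-place checkpoint */
    }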
Algorithm 1 Software CRC routine.
 1: MASK ← 0xFF00
 2: CRC ← 0x0
 3: P_TABLE ← &CRC_TABLE
 4: P_BUF ← SAFE
 5: while P_BUF ≠ SAFE_END do
 6:     INDEX ← CRC[7:0]
 7:     INDEX ← *P_BUF xor INDEX
 8:     P_BUF ← P_BUF + 1
 9:     INDEX ← INDEX + INDEX
10:     INDEX ← INDEX + P_TABLE
11:     CRC ← CRC and MASK
12:     CRC ← *INDEX xor CRC
13: end while
14: return CRC
Cyclic Redundancy Checks (CRC): To verify data integrity without chip characteri-
zation, we provide a second implementation based on Cyclic Redundancy Checks. The basic
algorithm is shown in Algorithm 1. CRCs are common mechanisms for communication sys-
tems and applications that need to verify the integrity of received data with high confidence.
Hardware support for CRCs is also common to low-powered micro-controllers that send and
receive data; both of our evaluation devices include hardware CRC engines; software CRC
implementations also exist.
The CRC algorithm divides the data by a predetermined generator polynomial using repeated
shifts and XOR operations. The output of the CRC algorithm is the remainder of this
division, which is stored alongside the data in volatile memory. The state of the application
data in the SRAM changes after every task, hence we recompute the CRC between tasks.
To verify the integrity of the data after recovery, we recalculate the CRC over the trusted
in-place application data and compare the resulting remainder to the one previously stored
in memory. The two remainders must match to conclude that data remains integral.
CRCs guarantee up to G bits of error detection, where G is determined by the variant of
CRC used; a 16-bit CRC detects up to 3 flipped bits whereas a 32-bit CRC detects up to
5 flipped bits. Both CRC variants additionally detect all odd-bit errors. For other errors,
CRCs provide probabilistic error detection with a chance of missing an error of 1/2m, where
m is the width of the CRC [24]. For a multi-bit error to go undetected, the checksum of
the corrupted and un-corrupted data must be the same. The probability of undetected data
corruption is further reduced because there is a 50% chance that a failing cell will fail into
the value it is already holding; thus, CRCs provide a high-confidence general solution to
verify SRAM’s data integrity.
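For concreteness, here is a table-driven software CRC-16 in C, equivalent in spirit to Algorithm 1; crc16_table is assumed to be a precomputed 256-entry table for the chosen generator polynomial, and devices with a hardware CRC engine would use that instead:

    #include <stdint.h>
    #include <stddef.h>

    extern const uint16_t crc16_table[256]; /* precomputed from the generator polynomial */

    uint16_t crc16(const uint8_t *buf, size_t len)
    {
        uint16_t crc = 0;
        while (len--)                       /* one table lookup per input byte */
            crc = (uint16_t)(crc << 8) ^ crc16_table[((crc >> 8) ^ *buf++) & 0xFF];
        return crc;
    }

Between tasks, CAMEL would run this over the saved register file and the non-volatile buffer, storing the result; recovery recomputes it and compares.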
2.4.3 Bimodal Recovery Routine
As shown in Figure 2.3, following a power failure during program execution, the recovery
routine passes control to the program from either the start of the main function or the
the last executed task. We arrive at this decision by re-computing the CRC or checking the
canary value, depending on the variant of CAMEL deployed.
To resume execution from the last in-place checkpoint, (1) the recomputed CRC must
match the one stored in the SRAM or (2) the canary value must be correct, indicating that
the data in the non-volatile world was preserved over the power cycle. After passing the
integrity check, CAMEL copies the data in the non-volatile world over the volatile world,
ensuring that the first task begins with correct data. CAMEL restores control to the program
at the beginning of the last partially-completed task by copying the saved register values back
(a) Source Code:

    main() {
        int x = 0;
        int y = 2;
        int z = 5;
        while (1) {
            x = getReading();
            int offset = y + z;
            x = x + offset;
        }
    }

(b) Task-divided Code:

    struct {
        int x;
        int y;
        int z;
    } global;

    task_sample() {
        int i = getReading();
        GV(x) = i;
    }

    task_transform() {
        int offset = GV(y) + GV(z);
        GV(x) = GV(x) + offset;
    }

    main() {
        while (1) {
            task_sample();
            task_transform();
        }
    }

(c) Execution:

    main()
        task_sample()
            int i = getReading()
            GV(x) = i
        task_transform()
            int offset = GV(y) + GV(z)
            FAILURE
        RECOVERY
        task_transform()
            int offset = GV(y) + GV(z)
            GV(x) = GV(x) + offset
Figure 2.4: (a) Shows unmodified source code; (b) shows the task-divided code according to the conventions described in §2.4.4; (c) shows the execution of the code in (b) after it has been instrumented by the compiler.
to the register file, restoring the saved program counter value at the last checkpoint at the
end. In the uncommon case—when SRAM fails to retain data due to an unexpectedly long
power failure—CAMEL passes control to the beginning of the program and restarts execution.
2.4.4 CAMEL Tasks
The CAMEL programming model is task-based, providing the programmer with an interface to
divide source code into small, reusable and atomic tasks. This facilitates the implementation
and management of our volatile and non-volatile worlds. This division enables tracking intra-
task idempotence [46] by the compiler. We label a section of code as idempotent if no variable
undergoes a Write-after-Read dependency [25] within that section. A variable is subject to
the Write-after-Read dependency [25] when it is first read and then later written in the
same task. The CAMEL compiler identifies this sequence by a load followed by a store to the
same memory location. Variables within a task are tracked to ensure the consistency of the
differential buffers upon task entry or re-entry after a power failure.
Programmers define functions serving as volatile tasks using the task_ keyword to mark
them for tracking by the compiler. In accordance with our task-based model, CAMEL expects
all variables that are to be used by multiple tasks or reused by multiple executions of the
same task to be declared as task-shared variables. All task-shared variables are declared as
part of a global structure which serves as the buffer for the volatile and non-volatile worlds.
Task-shared variables need not be passed to every task individually; instead, they are directly
accessible by the use of the GV() keyword.
After dividing the application into different worlds using tasks, the flow of the program must
be described in main—each task must be called in an order that would result in an identical
execution to that of the unmodified version of the program. We limit the use of main to
calling tasks and read-only conditionals that determine the next task, as CAMEL does not
track idempotency outside of tasks.
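For example, a conforming main might look like the following, where the conditional only reads task-shared state; task_filter, GV(mode), and RAW_MODE are hypothetical names, while task_sample and task_transform come from Figure 2.4:

    main() {
        while (1) {
            task_sample();
            if (GV(mode) == RAW_MODE)  /* read-only conditional picking the next task */
                task_transform();
            else
                task_filter();
        }
    }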
2.4.5 CAMEL Compiler
Our static analysis ensures (1) idempotency of tasks across power cycles and (2) consistency
of the differential shared buffers across tasks. Any writes made to the buffer by a task will
persist across power failures, causing the re-execution of the task after recovery to yield
different results than expected. The compiler statically tracks data in the volatile world
and inserts code between tasks to ensure data idempotence upon entry or re-entry in a task
after a power failure. This ensures the system-level atomicity of tasks—the results of a
task are never committed to the non-volatile world until the task is complete. To achieve
[Figure 2.5 diagram: a task executing "GV(curr) = GV(prev) + 4; ... GV(prev) = GV(curr);" with the per-step contents of the volatile ("unsafe") and non-volatile ("safe") buffers for the variables prev and curr shown at (1) task start, (2) a power failure midway through the task, and (3) undo-logging before the task re-executes.]
Figure 2.5: (1) Shows the start of a task; (2) shows a power failure midway through execution of a task; (3) shows undo-logging before any task begins execution. The state of the non-volatile and volatile buffers is shown after each of the three steps.
idempotency and atomicity of a task, we implement a differential double buffer solution,
using the difference between the two buffers to ensure forward progress as well as re-execute
a task in case of a power failure. At any given point in the program, we have two live copies
of the buffer termed volatile and non-volatile. Tasks work on global variables in the volatile
buffer and do not interact directly with the non-volatile buffer. The non-volatile buffer serves
as the fail-safe against inconsistencies in memory due to power failures. CAMEL calculates
the CRC of the checkpoint registers and the non-volatile buffer, which is copied over the
volatile working buffer in the case of a power cycle to prevent memory inconsistencies. We
refer to this process as undo-logging, whereby we undo all non-idempotent variables changed
by the task in the volatile buffer. We illustrate this process in Figure 2.5. Undo-logging
takes effect before entry into a task, undoing all changes made in the volatile buffer by a
partially-executed task interrupted by a power failure.
The successful execution of a task is marked by a commit, which involves swapping the
volatile and non-volatile buffers and re-calculating the CRC on the updated non-volatile
buffer. Swapping the two buffers is essential because the volatile buffer contains updated
 1. struct {
 2.     int x;
 3.     int y;
 4.     int z;
 5.     int temp;
 6.     int result;
 7. };
 8.
 9. void task_compute() {
10.     GV(temp) = GV(x) + GV(y);
11.     GV(result) = GV(result) + GV(temp);
12. }
Figure 2.6: Shows how tasks use data in the differential buffers. The only non-idempotent variable is result, since it undergoes a write-after-read: it is first read on line 11 and then written to. This sequence of instructions in assembly would result in a write-after-read violation.
state after the successful execution of a task. Crucially, the buffer swap can be implemented
as a pointer re-assignment, which reduces redundant data movement: instead of copying
data from a dedicated non-volatile buffer, modifying it in a private volatile buffer, and re-
committing it to the non-volatile store, tasks work directly on the data in the volatile world,
which is later rendered non-volatile by CAMEL as part of the commit process. We discuss
enforcing the atomicity of the commit in §2.5.3 to prevent incorrect execution stemming from
an interrupted commit. In order to ensure correct forward progress, the CAMEL compiler
resolves differences between the two worlds to keep program state consistent between tasks.
The compiler establishes idempotence of tasks by only undo-logging variables that undergo
the write-after-read dependency in a task. These variables cause memory inconsistencies and
incorrect execution if a power failure and re-execution occur after the write and before the
commit, as the preceding read will read a different value from the last execution. We refer
to these variables as non-idempotent; non-idempotent variables must be undo-logged before
a task to ensure the task’s idempotence.
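The mechanism can be pictured with a small C sketch, a simplification of the actual commit (Algorithm 2, §2.5.3) with illustrative buffer and field names:

    struct world { int x; int y; int z; int temp; int result; }; /* task-shared variables */

    static struct world buf1, buf2;
    static struct world *volatile_buf = &buf1;     /* tasks read/write only this buffer */
    static struct world *nonvolatile_buf = &buf2;  /* recovery trusts only this buffer  */

    void commit(void)                              /* runs after each completed task    */
    {
        struct world *tmp = volatile_buf;          /* pointer swap: the task's updates  */
        volatile_buf = nonvolatile_buf;            /* become the new non-volatile world */
        nonvolatile_buf = tmp;                     /* with no bulk data movement        */
        /* then: save registers and refresh the CRC/canary over nonvolatile_buf */
    }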
Maintaining buffer consistency requires identifying the difference between the volatile and
non-volatile buffers after the commit following every task. Each changed variable must be
updated in the volatile buffer to ensure the two buffers remain consistent between tasks. We
resolve the volatile/non-volatile difference using the functions defined below.
Data Types: The compiler is packaged with the functionality to copy different types of
variables from the non-volatile buffer to the volatile buffer. We cover the three types of
variables supported by C: scalars, compounds, and unions.
The copy_scalar method copies different types of scalars from the non-volatile buffer to the
volatile buffer. For contiguous variables, we implement several different methods to efficiently
handle the required logging. The compiler can choose from three different mechanisms
while logging arrays. The most basic mechanism is logging the entire array, referred to as
copy_array, which is essentially a memcpy over the entire array. copy_array_scalar provides
the functionality to log only one index of an array given that the index is stored in a scalar
which is part of the differential buffers. Finally, we implement the copy_array_scalar_local
mechanism for when tasks use local variables to index into and modify global shared arrays.
This method saves the value of the local variable when it is used to index into the shared
array and uses it to perform only the required copies during logging. As structures and
unions are also contiguous variables, much like arrays, we reuse the mechanisms developed
for arrays for both of these types.
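Possible shapes for these helpers, assuming both buffers share a single layout (signatures are illustrative; CAMEL's actual ones may differ):

    #include <stddef.h>
    #include <string.h>

    void copy_scalar(void *dst, const void *src, size_t size)
    {
        memcpy(dst, src, size);               /* one variable     */
    }

    void copy_array(void *dst, const void *src, size_t nelem, size_t elem_size)
    {
        memcpy(dst, src, nelem * elem_size);  /* the entire array */
    }

    /* Log only the element whose index was recorded in the differential buffer. */
    void copy_array_scalar(char *dst, const char *src, size_t idx, size_t elem_size)
    {
        memcpy(dst + idx * elem_size, src + idx * elem_size, elem_size);
    }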
2.5 Implementation
We implement the compiler portion of CAMEL as an LLVM pass [26]. LLVM's ability to
generate a detailed intermediate representation of code written in C using the Clang frontend
proves beneficial for us in functionally verifying CAMEL. We use LLVM version 10.0.0 to
[Figure 2.7 diagram: Source Code → Clang → LLVM IR → static analysis (set formation over the global buffer: reads, writes, read-first, idempotent) and compiler modification → Instrumented IR → msp430-gcc → Executable.]
Figure 2.7: Shows the pipeline for the generation of a CAMEL-certified executable.
develop our compiler pass. The IR is then compiled to msp430 native assembly by the
LLVM compiler using the target=msp430 tag. The final step uses msp430-gcc to generate
an executable, ready to be flashed on a board.
2.5.1 Compiler Analysis
The compiler’s aim is to populate sets of read and written variables to find non-idempotent
memory accesses. Figure 2.7 illustrates the pipeline of our analysis from source code to an
executable. Our pass statically analyzes the structured, architecture-independent LLVM IR
generated using the task-divided code, written by the programmer using the conventions
highlighted in §2.4.4. LLVM provides interfaces to traverse, interact with, and change the IR.
Our pass analyzes every function declared with the prefix task_. We focus our analysis on
instructions in IR that are directly involved in interacting with memory locations, namely
load, store and memcpy. Furthermore, we are only interested in said instructions if their
operands are a part of the volatile world global buffer as only that buffer is impacted by
task execution.
Our pass begins by performing intra-module static analysis on the LLVM IR, examining all
function declarations to determine whether they are tasks. Once a task is identified, our pass
traverses the control-flow graph of the function, searching for loads, stores and memcpys.
After identifying the instructions of interest in the control-flow graph, we backtrack from the
operands of the instructions to their first use in a task. At this stage, we only add variables
backed by the global world buffer to their respective read/write set. In addition to a set of
read and written variables, we maintain a set of read-first variables—variables that are read
before they are written in a task. Once all sets are populated, we take the intersection of the
read-first and write sets. This produces a set with the variables subject to a write-after-read
dependency, which Figure 2.6 demonstrates. Note that our analysis is context-insensitive:
when the compiler cannot predict which branch of a conditional statement will execute and
one of the paths would mark the variable as read-first, the compiler considers the variable
read-first regardless of the execution path. This static analysis guarantees the detection of
all idempotence violations within a task by pessimistically analyzing all execution paths.
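Conceptually, the output of the analysis is a set intersection. Representing each global-buffer variable as one bit, the undo-log set is computed as below (illustrative C, not the LLVM pass itself):

    typedef unsigned long var_set;  /* one bit per global-buffer variable */

    /* Variables both read-first and written in a task must be undo-logged. */
    var_set war_set(var_set read_first, var_set writes)
    {
        return read_first & writes;
    }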
2.5.2 Compiler Modifications
The compiler inspects main() to locate task call sites. It proceeds to insert 1) code to undo-
log data before a task call site and 2) code to copy data between the volatile and non-volatile
world to ensure buffer consistency after successfully executing a task. We copy variables in
the write-after-read-vulnerable set from non-volatile to volatile using the functions imple-
mented in §2.4.5.
For arrays, the user may choose to update a specific index of the array using a variable
defined in the volatile world buffer or a local, task-defined variable. The compiler can
insert logging code for either of these variants. If the index is part of the volatile world
buffer, the compiler loads the value of the variable from the buffer, uses the built-in LLVM
getelementptr instruction to get the array from the buffer and logs the variable. If, however, the index is a local variable (i.e., not part of the buffer), we insert code in tasks to store the
value of the index in the volatile world buffer at the time of change. We then use this global
variable to log the specific index of the array. For structures, we choose to log the entire
structure rather than specific elements using memcpy() since our benchmarks use structures
which consist of mostly two or three scalars. Hence, logging the entire structure is not
significantly costlier than logging a scalar within the structure.
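As a concrete illustration of the inserted code, the sketch below shows what a task call site looks like after instrumentation. All names, offsets, and signatures here are illustrative assumptions, not CAMEL's actual symbols; the real pass emits this logic directly in LLVM IR rather than C.

    #include <string.h>

    #define BUF_SIZE 64
    static char buf1[BUF_SIZE], buf2[BUF_SIZE];
    static char *volatile_world = buf1;       /* world the task works in  */
    static char *non_volatile_world = buf2;   /* world used for recovery  */
    enum { COUNT_OFF = 0 };                   /* a WAR-vulnerable int     */

    static void task_count(void) {            /* stand-in programmer task */
        int c;
        memcpy(&c, volatile_world + COUNT_OFF, sizeof c);   /* read...  */
        c++;
        memcpy(volatile_world + COUNT_OFF, &c, sizeof c);   /* ...write */
    }

    static void commit(void) { /* Algorithm 2; body elided here */ }

    void run_task_count(void) {
        /* 1) undo-log: refresh the WAR-vulnerable variable from the
              non-volatile world so a re-executed task sees its old value */
        memcpy(volatile_world + COUNT_OFF,
               non_volatile_world + COUNT_OFF, sizeof(int));
        task_count();
        /* 2) after success, propagate results back for consistency */
        memcpy(non_volatile_world + COUNT_OFF,
               volatile_world + COUNT_OFF, sizeof(int));
        commit();
    }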
2.5.3 Recovery
The recovery component has two major elements: the commit and the recovery functions.
Algorithm 2 describes our commit procedure. Volatile and non-volatile worlds are imple-
mented as pointers to global buffers, which allows us to swap the values of each pointer to
swap world views. To ensure the atomicity of commits, the decision of which pointer points
to which buffer is determined by a flag value that is inverted at the end of each commit.
The commit procedure saves all registers to a protected region in the non-volatile buffer
then updates the canary value or re-calculates the CRC, enabling the recovery routine to
correctly verify SRAM’s data. We implement the function to save registers and calculate
the CRC in native MSP430 assembly; the argument to the SAVE_REGISTERS function is the
memory location to place the registers in. When the CRC is used, it is calculated over the
saved register file and non-volatile world buffer, excluding the CRC result itself.
Algorithm 2 Task Commit Routine
1: NON_VOLATILE ← FLAG ? &BUF1 : &BUF2
2: VOLATILE ← FLAG ? &BUF2 : &BUF1
3: SAVE_REGISTERS(NON_VOLATILE->reg_file)
4: GUARD_BUFFER_REGS_INTEGRITY(CRC_MODE)
5: FLAG ← not FLAG
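A C-level rendering of Algorithm 2 might look as follows; save_registers and crc16 stand in for our assembly routines, and the struct layout is an assumption made for the sketch (the CRC field sits last so one CRC call can cover everything before it).

    #include <stddef.h>
    #include <stdint.h>

    #define BUF_SIZE 64
    typedef struct {
        uint16_t reg_file[16];     /* MSP430 register file             */
        char     data[BUF_SIZE];   /* non-volatile world contents      */
        uint16_t crc;              /* sentinel over everything above   */
    } world_t;

    static world_t BUF1, BUF2;
    static uint8_t FLAG;           /* selects which buffer is which    */

    extern void save_registers(uint16_t *dst);           /* assembly   */
    extern uint16_t crc16(const void *p, size_t len);    /* CRC kernel */

    void commit(void) {
        world_t *non_volatile = FLAG ? &BUF1 : &BUF2;    /* line 1     */
        world_t *vol          = FLAG ? &BUF2 : &BUF1;    /* line 2     */
        (void)vol;  /* the volatile view is used by the next task      */
        save_registers(non_volatile->reg_file);          /* line 3     */
        /* CRC over registers and data, excluding the CRC field itself */
        non_volatile->crc = crc16(non_volatile,
                                  offsetof(world_t, crc)); /* line 4   */
        FLAG = !FLAG;              /* line 5: atomically flips worlds  */
    }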
We modify the MSP430 reset vector to point to the recovery function when the device
regains power. The recovery function passes control to the program by either restarting
from the beginning of the program or from the last in-place checkpoint in the SRAM, based
on whether the SRAM integrity check passes. The recovery routine first reads the flag
value, which determines which global buffer represents which world. Then, depending on
the integrity check strategy, the routine either recomputes the CRC over the non-volatile
world buffer or validates the canary value’s integrity. If the non-volatile world’s contents
are integral, recovery commences: 1) the non-volatile world is copied to the volatile world;
2) the platform is re-initialized; 3) register values are restored from the non-volatile world
buffer; and 4) the program counter is restored, resuming execution from the last in-place
checkpoint.
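Continuing the sketch above (reusing world_t, FLAG, BUF1, BUF2, and crc16), the recovery routine installed at the reset vector could look like this; start_from_beginning, platform_init, and restore_registers are assumed helpers, with restore_registers restoring the program counter last, which is what resumes execution at the checkpoint.

    #include <string.h>

    extern void start_from_beginning(void);   /* restart; does not return */
    extern void platform_init(void);          /* clocks, peripherals      */
    extern void restore_registers(const uint16_t *src);    /* assembly   */

    void recovery(void) {          /* target of the MSP430 reset vector   */
        world_t *nv = FLAG ? &BUF1 : &BUF2;   /* read the flag first      */
        world_t *v  = FLAG ? &BUF2 : &BUF1;
        if (crc16(nv, offsetof(world_t, crc)) != nv->crc)
            start_from_beginning();           /* SRAM decayed too far     */
        memcpy(v, nv, sizeof *v);             /* 1) copy NV -> volatile   */
        platform_init();                      /* 2) re-initialize         */
        restore_registers(nv->reg_file);      /* 3)+4) registers, then PC */
    }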
2.5.4 Correctness
Correctness was a first-class part of our design and implementation processes. Our correctness measures validate that our CAMEL implementation: 1) generates instrumented programs capable of running on harvested energy; 2) generates programs that result in final states equivalent to those of the uninstrumented programs on continuous power, regardless of power cycle frequency; and 3) detects, via both the CRC and Canary strategies, unexpectedly long off times
that corrupt SRAM. To obtain a golden reference of what to expect from the compiler, we
manually instrument all benchmarks and manually compare the generated assembly against
the CAMEL-instrumented assembly. Our comparison shows that there are no differences be-
tween the two, meaning CAMEL is capable of inserting fault-free code to log data used across
different benchmarks. Our second line of defense is a set of regression tests that capture
corner cases, the data types available in C, and bugs in earlier versions of CAMEL. Third,
we conduct 10 trials of execution for every benchmark to ensure correctness. In each trial,
we execute the uninstrumented and instrumented benchmarks, semantically comparing the
final state of the system after completion. While executing the instrumented programs, we
introduce approximately 20 random on- and off-times, reflective of real-world energy harvest-
ing traces [8]. Finally, we simulate the uncommon case of extended off times to validate the
effectiveness of the CRC and the canary.
2.6 Evaluation
We evaluate CAMEL against the only existing in-place SRAM-based system [44] and other
programmer-guided, task-based systems [4, 27, 30]. For the competing systems, we use their
publicly-available implementations, without modification. However, for [44], we adapt it
to a continuous checkpointing version of itself to fairly evaluate against CAMEL. To compare
performance across these systems, we evaluate each on benchmarks that are used to evaluate
previous programmer-guided systems. Our evaluation demonstrates that CAMEL:
• enables long-life, hardware-support-free intermittent computation on Flash-based de-
vices
• outperforms existing systems on both Flash and FRAM devices
We implement CAMEL on the MSP430 platform using both Flash and FRAM-based devices.
We evaluate CAMEL on the MSP430F5529, a Flash-based device, and show that CAMEL
enables long-life, high-performance computation using continuous in-place checkpointing on
a wider range of devices than past work by eschewing common-case NVM writes. Next,
we evaluate CAMEL on the MSP430FR6989, an ultra-low-power MCU containing FRAM
as its NVM. Implementing CAMEL on the MSP430FR6989 allows direct comparison between
CAMEL prior work. We choose these devices because they are representative of the capabilities
and limitations of microcontrollers found in deployed energy harvesting systems [33, 39, 48].
Additionally, to facilitate reproducibility, they are available from Texas Instruments as part
of development boards.
We test CAMEL on five benchmarks developed in past work [4, 27, 30] to represent the types
of applications found in energy harvesting systems:
• Activity Recognition (AR): AR uses simulated samples from a three-axis ac-
celerometer to train a nearest neighbor classifier to determine whether a device is
stationary or moving.
• Bit Count (BC): BC uses seven different algorithms to count the set bits in a given
sequence.
• Cold-Chain Equipment Monitoring (CEM): CEM simulates input data from a
temperature sensor and later compresses the data using LZW compression [43].
• Cuckoo Filter (CF): A Cuckoo filter is a data structure used to efficiently test for
set membership [6]. This benchmark stores random data in a Cuckoo filter and later
queries the filter to recover the data.
• Data Encryption (RSA): RSA is a widely used public-key cryptosystem [38]. It
encrypts a given string using a user-defined encryption key that is stored in memory.
2.6.1 Experimental Setup
Our experimental setup draws motivation from previous work [8] on emulating environmental
conditions for real-world energy harvesting use cases in experimental and in-lab setups. We
run our benchmarks on actual hardware, MSP430F5529 and MSP430FR6989, connected to
a variable voltage supply to emulate intermittent computation on harvested energy. For
                  AR     BC     CEM    CF     RSA    avg.
DINO [27]         32.02  39.48  11.53  11.26  16.86  22.23
Chain [4]         28.98  78.75  11.90  11.28  23.63  30.91
Alpaca [30]       22.29  40.15  10.79  10.56  13.05  19.37
TOTALRECALL [44]  -      -      -      -      -      ∞
CAMELcrc (20°C)   -      -      -      -      -      1126k yrs
CAMELcrc (55°C)   -      -      -      -      -      256k yrs
CAMELcanary       -      -      -      -      -      ∞

Table 2.1: Deployment lifetime (in hours) for existing programmer-guided systems on Flash-based devices, with expected time until a silent data corruption for several CAMEL configurations.
voltage control, we use an in-house power control platform that is capable of delivering
arbitrary voltage that can mimic arbitrary energy traces; this allows for both randomization
and replay of energy availability scenarios. We flash each benchmark discussed previously
on the desired device and connect the device to our power controller. We instrument the
benchmarks to drive the GPIO to high on successful completion of the benchmark and
connect it to a low-powered LED to signify the successful completion and hence, correct
execution under harvested energy. Furthermore, we conduct all experiments at 20◦C to
reduce the impact of temperature on SRAM’s data retention time. As in §2.5.4, we perform
10 trials of each benchmark to filter randomness.
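The completion signal itself is a one-liner on MSP430. A sketch of what we add to each benchmark follows; the pin choice is an assumption made for the example.

    #include <msp430.h>

    /* Drive a GPIO high on successful completion; P1.0 is assumed. */
    static void signal_done(void) {
        P1DIR |= BIT0;             /* configure P1.0 as an output    */
        P1OUT |= BIT0;             /* LED on: benchmark finished     */
    }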
2.6.2 Time to death—Flash Failure
We extend the experiments in §2.3.2 to include CAMEL. We consider the comparable time
to system failure with CAMEL to be the first Silent Data Corruption (SDC) stemming from
the CRC failing to detect an error. The time until SDC varies with power-cycle frequency
and length; the worst-case for CAMEL is repeated power-cycles just long enough to produce
corruption that might evade the CRC check. We compare the system lifetimes in Table 2.1
and observe that CAMEL extends platform lifetime to well beyond typical deployment times. (CAMEL lifetime does not vary with program behavior; therefore, we only model CAMEL-related values once.)
All implementations of CAMEL avoid premature failure by avoiding Flash checkpoints. In the
Canary implementation of CAMEL, platform lifetime is unbounded—carefully-chosen canary
locations prevent all SDCs stemming from SRAM data volatility.
We determine Flash’s lifetime on the state-of-the-art by calculating how long it would take
for Flash to reach its maximum write/erase endurance (100,000 write/erase cycles [17] for
MSP430; TI's Flash endurance estimates match what we see experimentally). Like we did in §2.3.2, Table 2.1 values are calculated using optimistic conditions (e.g., perfect wear-leveling for free) given each benchmark's checkpoint size, checkpoint
rate, Flash segment size, and the number of free Flash segments for each benchmark. The
checkpoint size is simply the size of data that is written to the NVM at the time of taking
the checkpoint. The checkpoint rate is obtained by running the unmodified C application
and calculating the CPU cycles to completion on the FRAM-based MSP430FR6989 board.
We then divide the CPU cycles by the number of checkpoints each system makes for each benchmark, as listed in Table 2.2. The Flash segment size comes from the MSP430
documentation [17] and is important, because only entire segments of the Flash can be
erased and written. Note that TOTALRECALL [44] has an unbounded Flash lifetime since it
only utilizes the SRAM for checkpoints.
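To make the lifetime calculation concrete, the sketch below models it with placeholder numbers; only the 100,000-cycle endurance and the 512-byte main-Flash segment size come from the MSP430 documentation [17], while the checkpoint size, rate, and free-segment count are illustrative rather than measured values.

    #include <stdio.h>

    int main(void) {
        double endurance     = 100000.0;  /* write/erase cycles [17]   */
        double segment_bytes = 512.0;     /* MSP430 main Flash segment */
        double free_segments = 4.0;       /* illustrative              */
        double ckpt_bytes    = 64.0;      /* checkpoint size, assumed  */
        double ckpts_per_sec = 100.0;     /* checkpoint rate, assumed  */

        /* With perfect wear-leveling for free, each erase cycle of a
           segment absorbs segment_bytes / ckpt_bytes checkpoints.     */
        double total_ckpts = endurance * free_segments
                           * (segment_bytes / ckpt_bytes);
        printf("hours to Flash death: %.1f\n",
               total_ckpts / ckpts_per_sec / 3600.0);
        return 0;
    }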
2.6.3 Run-time Overhead
To evaluate CAMEL against its predecessors in terms of run-time, we measure the CPU cycles
to completion for each benchmark using the MSP430FR6989 and model the overhead each
system would incur on a Flash-based device. We evaluate all systems and benchmarks on
continuous power rather than harvested energy as we aim to categorize the overhead of the

Figure 2.8: CAMEL run-time overhead for Flash-based devices. The global buffer size for each benchmark is stated in parentheses on the x-axis.

Figure 2.9: CAMEL run-time overhead for FRAM-based devices. The global buffer size for each benchmark is stated in parentheses on the x-axis.
in-place checkpoint as it is the most expensive and recurring operation in contrast to the
recovery routine. Figures 2.8 and 2.9 present the overhead incurred by CAMEL along with
the overhead for systems we evaluate against. Our numbers are similar to what one would
see when executing on harvested energy.
CAMEL vs TOTALRECALL: We compare CAMEL to the only SRAM-based system developed
to date, TOTALRECALL [44]. We adapt TOTALRECALL from a just-in-time to a continuous
checkpointing system for a fair comparison with CAMEL and other continuous checkpointing,
task-based approaches we evaluate against. While the canary is easily adaptable, the CRC
proves to be much more of a challenge since every potential write to the memory violates
the existing CRC. A correct solution provides the abstraction of concurrent, atomic memory
writes and corresponding CRC update. To fulfill this abstraction, we record both the cur-
rent and future value of a soon-to-be-overwritten memory location as part of the checkpoint.
Upon recovery, the system uses both values and the CRC over memory to construct two
potential CRCs. No matter where the power cycle occurs, one of the CRCs will match if
memory contents are integral. This naive approach results in a large number of checkpoints
being taken by the continuous TOTALRECALL extension, as can be seen in Table 2.2. We im-
plement both of these systems and evaluate their run-time overhead using our benchmark set.
These run-time overheads are over an order-of-magnitude worse than existing approaches.
Observe that CAMEL outperforms TOTALRECALL by 3x and 4x for the canary and CRC vari-
ants of the systems, respectively. We identify that the cause of the poor performance
is a surplus of non-volatility. CAMEL introduces the notion of selective non-volatility
through the introduction of the volatile and non-volatile worlds (tasks) to ensure that not
every part of the SRAM is treated as non-volatile between power-offs, resulting in a
significant decrease in checkpointing overhead.
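One way to realize the two-CRC abstraction described above: the checkpoint records both a CRC that assumes the guarded store has not yet landed and one that assumes it has, and recovery accepts memory if the recomputed CRC matches either. Names and the 16-bit width are assumptions made for this sketch.

    #include <stdint.h>

    extern uint16_t crc16(const void *p, unsigned len);  /* CRC kernel */

    struct tr_ckpt {
        uint16_t crc_old;   /* CRC if the guarded store did not land   */
        uint16_t crc_new;   /* CRC if the guarded store did land       */
    };

    int memory_is_integral(const uint8_t *mem, unsigned len,
                           const struct tr_ckpt *c) {
        uint16_t actual = crc16(mem, len);
        /* Whichever side of the store the power cycle hit, one of the
           two recorded CRCs matches if memory contents are integral. */
        return actual == c->crc_old || actual == c->crc_new;
    }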
CAMEL vs task-based state-of-the-art: We compare CAMEL against the state of the art
on both platforms. For Flash-based systems (Figure 2.8), CAMEL performs significantly bet-
ter than each of the systems we evaluate against. CAMELcanary performs 50x better while
CAMELcrc performs 7x better than previous systems on average, highlighting the run-time
advantages of avoiding NVM writes on Flash platforms. We isolate the effects of CAMEL’s
differential buffer design on execution time by running it on the FRAM platform, where
CAMEL does not have to calculate a CRC to ensure non-volatile shared data integrity. The
results in Figure 2.9 indicate that CAMEL’s buffer design to reduce data movement overhead
yields a significant improvement over the state of the art, finishing each benchmark on av-
erage twice as fast as the next best system (Alpaca).
State-of-the-art—FRAM vs Flash: The state-of-the-art systems exhibit different run-time over-
heads when evaluated on boards with different choices of persistent memory (apart from
TOTALRECALL [44]). Systems showcase opposite trends for some benchmarks on Flash-based
devices when compared to FRAM-based devices. This is exhibited in the numbers for CF in
Figure 2.8 where DINO [27] performs better than it’s successor, Alpaca [30]. However, we
observe a different trend in Figure 2.9. This change is due to the different number of NVM
checkpoints made by the two competing systems as can be seen in Table 2.2. FRAM writes
are significantly less costly than Flash writes, hence a surplus of checkpoints on FRAM-based
boards will not affect the run-time overhead by a large percentage. However, on Flash-based
boards, a larger number of checkpoints result in a more significant difference in run-time
performance due to the time cost of Flash writes.
Trade-off—CAMELcanary vs CAMELcrc: Due to a less costly data corruption detection mech-
anism, CAMELcanary performs better than CAMELcrc for every benchmark. Computing the
CRC over the non-volatile differential world (buffer) after every task makes the run-time
                  AR    BC   CEM   CF    RSA   avg.
DINO [27]         1136  717  259   324   1830  788
Chain [4]         2008  717  231   452   315   744
Alpaca [30]       2008  717  225   452   315   743
TOTALRECALL [44]  124k  18k  2272  6720  27k   36k
CAMEL             1999  709  114   385   254   692

Table 2.2: Checkpoints each system makes per benchmark. Each system makes a comparable number of checkpoints; hence, the difference in run-time and binary-size overheads cannot be the result of a different number of checkpoints.
impact of each checkpoint heavily dependent on the size of the buffer. Canary values trade
off run-time overhead for pre-deployment effort: by characterizing each device and deter-
mining which SRAM cells fail first, users can reduce the CAMEL integrity check from a
CRC calculation to a handful of canary data comparisons. Our evaluation indicates that
CAMELcanary finishes benchmarks approximately 5 times faster than CAMELcrc.
Commit routine: The commit routine, which follows every task, is a major source of run-time
overhead. The commit can either deploy with the canary or the CRC as the sentinel value,
depending on the variant of CAMEL in use. The commit coupled with canary values results
in a constant run-time overhead across all benchmarks (∼20 CPU cycles) since CAMELcanary
only stores and compares against a pre-determined value at run-time. However, since the
CRC guards the differential buffers by computing a value over the data in them, the run-time
overhead for CAMELcrc is a function of the size of the buffers. The average run-time overhead
incurred by the commit across all benchmarks is 2007 CPU cycles.
2.6.4 Binary size overhead
We compare the binary size overhead between CAMEL and state-of-the-art in Figure 2.10.
CAMELcanary produces smaller binaries when compared to CAMELcrc, because the canary re-
Figure 2.10: CAMEL binary size increase compared to current state-of-the-art.
covery routine takes up fewer lines of code than the CRC.
Our evaluation shows that CAMELcanary reduces binary size when compared to past work,
while CAMELcrc’s binary overhead impact is comparable (approximately 1% larger than Al-
paca [30]).
The commit routine which accompanies every CAMEL task can be implemented as an inline
or a naked function. Figure 2.10 shows the overhead incurred by both of these versions. The
trade-off for both approaches is run-time for binary size. Both approaches produce binaries
that scale linearly with the number of tasks in a benchmark; the inline version produces
larger but faster executables, while the naked function approach produces smaller but slower
executables. However, it is noteworthy that a naked function call incurs a fraction of the run-
time overhead of a regular function call. It increases the run-time overhead of the commit
routine by a constant number of CPU cycles (∼5), resulting in a negligible change in the
overall run-time performance.
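The two implementation options look roughly as follows with msp430-gcc; the commit bodies are elided, and __attribute__((naked)) is the GCC mechanism that omits the prologue/epilogue, which is why the call costs only a handful of cycles.

    /* Inline commit: duplicated at every task call site; larger binary,
       no call overhead.                                                */
    static inline void commit_inline(void) {
        /* ... commit body ... */
    }

    /* Naked commit: one shared copy with no compiler-generated
       prologue/epilogue; smaller binary, a few extra cycles per call.  */
    __attribute__((naked)) void commit_naked(void) {
        /* ... commit body in inline asm ... */
        __asm__ volatile ("ret");   /* naked functions return manually  */
    }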
2.6.5 Automatic Systems
Automatic continuous checkpointing intermittent computation systems [31, 46] are an alternative to programmer-guided systems that removes all burden from the programmer, but also removes all power from them, as the programmer can no longer control forward-progress-level atomicity.
In addition to removing all programmer control, fully-automatic approaches exacerbate the
problems that CAMEL addresses: they have a higher rate of checkpoints that kill Flash even
faster than programmer-guided approaches. Therefore, the best-case Flash performance for
these systems is much worse than the worst-case listed in Table 2.1.
Because it is not the focus of the paper, we exclude extensive overhead results for automatic
systems, but our experiments show that CAMEL is approximately 4x better than Ratchet [46]
on FRAM-based boards. Extending this to Flash-based boards results in even higher over-
head for Ratchet [46], because of time cost of Flash writes/erases. As for Chinchilla [31] (an
extension of Ratchet that aims to dynamically elide checkpoints), given that its run-time
overhead is comparable to Alpaca [31] and CAMEL performs 2x better than Alpaca [30] on
FRAM-based devices (Figure 2.9), we expect a similar margin for Chinchilla.
2.7 Related Work
While CAMEL shows the possibilities of combining time-dependent non-volatility [44] with
programmer-guided continuous checkpointing, other checkpointing approaches exist.
One-time checkpointing approaches back up volatile state to non-volatile memory in a just-in-
time manner, i.e., with just enough energy remaining to write the checkpoint [1, 2, 22, 37].
This requires the ability to measure the amount of energy stored in the system’s energy
storage capacitor. This increases system complexity and cost, as well as increasing its overall
energy usage—limiting the energy available for useful computation [46]. While one-time
checkpointing approaches minimize run-time overhead, the need for hardware support limits
their deployment.
Continuous checkpointing approaches eschew a single, large, just-in-time, checkpoint of the
entire volatile program state for many small checkpoints of only the essential volatile pro-
gram state to resume execution. Continuous checkpointing enables intermittent computation
without special hardware support at the cost of decreased performance. No matter if archi-
tecture, programmer, or compiler driven, all existing continuous checkpointing approaches
are incompatible with Flash devices due Flash’s endurance, performance, and power limita-
tions [17].
Continuous checkpointing fits naturally with sequential hardware design if you consider the
flip-flops as non-volatile state and the combinational logic between them as volatile state. Ide-
tic [32] employs this model to support intermittent computation at the circuit level using
existing high-level synthesis tools. While this works for simple applications that readily
map to hardware circuits, it is not generalizable. Conventional processor pipelines are also
compatible with continuous checkpointing if you consider pipeline registers as non-volatile
state and the operations between stages as volatile state. Non-volatile processors [28, 29]
leverage this observation by implementing pipeline registers with non-volatile flip-flops (e.g.,
FRAM). Recent work improves performance [7] by allowing for approximate results. An
alternative to non-volatile processors is Clank [10], which enforces dynamic memory idem-
potency. Architecture-driven continuous checkpointing approaches tend to be at an extreme,
with a large number of small checkpoints.
CAMEL builds on existing programmer-guided intermittent computation systems: combin-
ing Chain’s [4] expressive programming interface with idempotence analysis as used by Al-
paca [30]. From this, CAMEL introduces the idea of swappable mixed-volatility worlds
backed by a differential analysis that allows data reuse across tasks. This improves perfor-
mance regardless of NVM type by reducing redundant data copying. To enable programmer-
guided approaches on Flash devices, CAMEL bifurcates program data into a non-volatile world
between tasks and a volatile world within a task.
Ratchet [46] and Chinchilla [31] replace programmer reasoning with compiler analysis to
produce an automatic approach to supporting intermittent computation—at the cost of
removing the abstraction of forward progress atomicity from the programmer. Ratchet de-
composes programs into restartable units using idempotence analysis while Chinchilla [31]
builds on Ratchet with a smart timer and basic-block-level energy estimation to elide check-
points at run time. While Chinchilla eliminates up to 99% of Ratchet’s checkpoints, it too
quickly kills Flash devices.
2.8 Conclusion
This paper exposes and addresses the unexpected lifetime and performance limitations of cur-
rent programmer-guided approaches to intermittent computation on both Flash and FRAM
devices. The improvements center on the abstraction of two worlds that co-exist during
program execution: a non-volatile world that contains the data that tasks use to communi-
cate between each other and that is used for post-power-cycle recovery and a volatile world
that contains data used by a task. The non-volatile world’s state is protected from corrup-
tion on Flash-based systems by either a CRC or a canary location. The proposed approach
also advances performance on FRAM-based platforms by minimizing the data copied when
transitioning between worlds as execution moves between tasks. The result is the first programmer-
guided intermittent computation system that runs on both Flash and FRAM devices while
providing the highest performance on both.
Chapter 3
SABLE: A Compiler-only Alternative to
CAMEL
In this chapter we present SABLE, an alternative to CAMEL's (§2) programmer-guided, continuous checkpointing model. Unlike CAMEL (§2), SABLE does not require any intervention from the programmer and depends entirely on compiler analysis and modification to store and restore program state before and after a power failure. Like CAMEL (§2), SABLE relies on continuous run-time checkpoints to take frequent snapshots of the program state that can later be used to continue execution. SABLE exploits the SRAM's time-dependent non-volatility, an idea that CAMEL (§2) borrows from [44]. However, extending the SRAM's time-dependent non-volatility to a continuous, task-based approach required significant novel techniques, as discussed in §2.
SABLE draws motivation from Ratchet [46], which is the only wholly automatic, compiler-
based intermittent computation model. Furthermore, extending the SRAM’s time-dependent
non-volatility [44] to a continuous checkpointing system without programmer reasoning re-
quires refining and developing new checkpoint and recovery routines for SABLE.
The following will be explored in the coming sections:

1. We first explore the trade space between programmer-guided and compiler-based approaches to identify how one approach compares to the other in terms of design, implementation, complexity, and run-time overhead (§3.1).

2. We then describe SABLE's design and the implementation techniques employed to bring SABLE to life (§3.2, §3.3).

3. Lastly, we showcase preliminary numbers that exhibit SABLE's run-time and binary size overhead (§3.4).
3.1 Programmer-intervention vs. Compiler-analysis Trade-off
In this section, we explore the trade space between programmer-intervention and compiler-analysis in terms of intermittent computation on energy harvesters. Programmer-intervention requires a programmer to decompose the structure of an application using a set of pre-defined rules, similar to those defined in §2.4.4. Such decomposition aims to lower the complexity of the analysis that needs to be performed to enable continuous checkpointing and to decrease the run-time checkpointing overhead. However, the degree to which the overhead is lowered depends on the nature of the application and how well a programmer can reason about its decomposition. Being too optimistic about decomposing an application can be a hindrance to forward progress: if a task requires more energy than can possibly be available in a single power cycle, the application will fail to complete. However, being too pessimistic about application decomposition can lead to a higher degree of checkpointing overhead. Hence, decomposing an application into its most efficient version can prove to be quite a challenge for programmers.
Using a wholly compiler-based approach can help solve the aforementioned challenges. Compiler-
analysis requires no intervention from the programmer, making such a system easy to deploy without having to modify any real-world applications. However, performing
analysis on raw source code can prove to be more difficult since programming languages
consist of a large number of constructs and it would be difficult and time-consuming to
handle all edge cases which are otherwise removed by application decomposition. Further-
more, compiler-based approaches tend to have a higher overhead than programmer-guided
approaches. In programmer-guided approaches, programmers can reason about whether a
checkpoint is needed, whereas a compiler must insert checkpoints according to a pre-defined set of rules.
3.2 SABLE Design
We design SABLE as a series of small systems, each building upon the previous one to arrive at the most efficient solution. All of our systems make use of either canary values or the CRC to validate data retention in the SRAM upon recovering from a power failure. In this section, we introduce SABLE's four independent variants that can be used to support intermittent computation and the challenges we faced while coming up with their design.
3.2.1 Naive Canary
This is the first iteration of our attempt at using canary values with a compiler-based, continuous checkpointing system without programmer intervention. SABLE's naive canary variant places a checkpoint before every memory altering instruction, which we identify to be only the store and memcpy instructions in the LLVM IR. Like in CAMEL (§2), the checkpoint
stores the state of the register file in the SRAM. These values can then be used to recover
the state of the register file upon regaining power. Since a checkpoint is placed before every
memory altering instruction, this ensures that the state of the SRAM remains consistent
across power cycles, ensuring correct execution of long-running applications.
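A minimal version of this pass, using illustrative names (LLVM 10 legacy pass API; the checkpoint symbol is an assumption):

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/InstIterator.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/IntrinsicInst.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Type.h"
    #include "llvm/Pass.h"
    using namespace llvm;

    namespace {
    struct NaiveCanary : public FunctionPass {
      static char ID;
      NaiveCanary() : FunctionPass(ID) {}

      bool runOnFunction(Function &F) override {
        Module *M = F.getParent();
        // void checkpoint(void): saves the register file (name assumed)
        FunctionCallee CP = M->getOrInsertFunction(
            "checkpoint", Type::getVoidTy(M->getContext()));
        bool Changed = false;
        for (Instruction &I : instructions(F)) {
          // every memory altering instruction gets a checkpoint
          if (isa<StoreInst>(&I) || isa<MemCpyInst>(&I)) {
            IRBuilder<> B(&I);       // insert just before the store
            B.CreateCall(CP);
            Changed = true;
          }
        }
        return Changed;
      }
    };
    } // namespace
    char NaiveCanary::ID = 0;
    static RegisterPass<NaiveCanary>
        X("naive-canary", "checkpoint before every store (sketch)");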
SABLE detects unexpectedly long off-times by ensuring the canary value remains in place (§2.4.2)
when the device reboots. This, coupled with hardware enrollment and pre-characterization
§2.4.2, provides a high degree of corruption detection with low checkpointing and recovery
run-time overhead.
3.2.2 Idempotent Canary
Idempotent canary is the final and most efficient variant of SABLE to make use of the canary
values. We reuse CAMEL’s idempotent memory analysis for this variant.
As an extension of our naive implementation of the canary version, we optimize and extend
our solution to improve performance. Using the idea of instruction idempotence [25], we reduce the total number of checkpoints that need to be inserted in our module, thereby
reducing the run-time checkpointing and recovery overhead.
Idempotent canary only places a checkpoint before a memory altering instruction to a vari-
able that is subject to the write-after-read dependency (§2.4.4). Variables that are only written in the scope of a function cannot cause a program to showcase inconsistent behaviour. A variable can only cause inconsistency if it is read first before being written [25]. This is similar to what we described in §2.4.4. Hence, we can reduce the total number of checkpoints by only considering stores to variables that are subject to the write-after-read dependency.
To compare Naive and Idempotent canary, we summarize the number of checkpoints placed
by each approach in Table 3.1. We see a reduction of at least 33% in the number of checkpoints placed in our benchmarks. This shows the efficiency gained by going from Naive to Idempotent canary.
       AR   BC   CEM  CF   RSA
Naive  92   75   64   64   158
Idem   34   50   14   13   65

Table 3.1: Summary of checkpoints placed by the Naive and Idempotent variants of the canary version.
3.2.3 Naive CRC
The naive CRC implementation is the first variant of SABLE to make use of the CRC [24]
to detect unexpectedly long off-times. One might think adapting the CRC calculation to an
application that is continuously taking checkpoints is a trivial task, but that is not the case.
Much like CAMEL §2, we rethink our approach of how to calculate a recovery value which
is not rendered useless whilst the program is executing. CAMEL employs a double buffer
routine, ensuring that one of the buffers remains in a state that corresponds to the CRC
value stored in memory. However, doing that without programmer intervention can prove
to be tricky.
Naive CRC follows the naive canary implementation by placing checkpoints before every
memory altering instruction to store a snapshot of the program which can be used to resume
execution. As described in §2.4.2, the CRC calculates a recovery value over the program
stack and the entire SRAM. Upon regaining power, the recovery routine recalculates this
value and asserts against our pre-calculated value. If the assertion is successful, we restore
program state. Otherwise, we restart the execution of the program.
The caveat here is that each memory altering instruction renders the CRC useless since it changes the program stack. We employ a simple fix to adapt our CRC to ensure it stays valid until the next checkpoint is taken. At every checkpoint, SABLE calculates the CRC
without taking into account the memory locations that are written by the memory altering
instruction after the checkpoint. This ensures that if power is lost after a checkpoint and its succeeding store have executed, our stored CRC value remains valid until the SRAM begins
losing state [44]. One may argue that skipping the memory location that is written in the
CRC calculation would mean that SABLE would not detect the corruption of that specific
memory location. However, we do not care about that specific memory location since it
is being written. Consider the case where the specific memory location is corrupted before
regaining power. On power on, the same memory location is going to have a value written
back to it, hence rewriting the corrupted data with new data. This is possible because
MSP430 is a register-to-memory architecture [14], which means that the contents of a memory location being copied to another memory location must first be loaded into a register before they can be written to the destination memory location.
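A sketch of the checkpoint-time CRC with a hole for the pending store; the CRC-16 step uses the common 0xA001 reflected polynomial, and the exact skip bookkeeping is our assumption rather than SABLE's actual kernel.

    #include <stdint.h>

    static uint16_t crc16_step(uint16_t crc, uint8_t b) {
        crc ^= b;
        for (int k = 0; k < 8; k++)          /* CRC-16, poly 0xA001 */
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
        return crc;
    }

    /* CRC over mem[0..len), skipping the bytes the next store will
       write, so the saved value stays valid whether or not that
       store lands before power is lost.                             */
    uint16_t crc_excluding(const uint8_t *mem, unsigned len,
                           unsigned skip_off, unsigned skip_len) {
        uint16_t crc = 0xFFFF;
        for (unsigned i = 0; i < len; i++) {
            if (i >= skip_off && i < skip_off + skip_len)
                continue;                    /* hole for pending store */
            crc = crc16_step(crc, mem[i]);
        }
        return crc;
    }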
3.2.4 Batch CRC
We extend naive CRC to reduce the number of checkpoints placed by SABLE and thereby
reduce the run-time checkpointing overhead. The general idea of this version of SABLE remains the same: placing checkpoints before stores. However, instead of placing a checkpoint
before every memory altering instruction, we batch these instructions and skip the memory
locations written by all of these instructions. This allows us to merge multiple naive CRC
calculations into one, optimal CRC calculation. However, a number of challenges needed to
be addressed before we could develop Batch CRC.
The first problem to be solved before implementing batched CRC is to figure out an opti-
mum number of stores to batch together into one CRC calculation. To do this, we analyze
the number of stores per idempotent section and count the occurrence of each number of
stores in the program. An idempotent section is the code between two non-idempotent stores. We summarize our results as cumulative probability distribution curves for our
Figure 3.1: Cumulative density curves of the number of stores per idempotent region for each benchmark (AR, BC, CEM, CF, RSA), used to determine the ideal number of stores to batch.
benchmarks in Figure 3.1. This helped us choose an optimal batch size of four.
We isolate all functions that are part of the program source by placing checkpoints at the
start of the function and before every return instruction. This ensures that the number of
batched stores is always 0 when we enter or exit a function that has been declared and defined
as a part of our module. Furthermore, this ensures that our analysis is intra-procedural only.
We perform an intra-procedural control-flow analysis to batch stores pessimistically in all the
possible execution paths taken by a specific procedure. Pessimistic analysis enables taking
       AR   BC   CEM  CF   RSA
Naive  92   75   64   64   158
Batch  70   77   30   32   108

Table 3.2: Summary of checkpoints placed by the Naive and Batch variants of the CRC version.
into account all possible dynamic execution paths, ensuring we never exceed the total number
of stores to be batched. We traverse the control-flow graph from the function exit block
backwards to batch stores.
In dealing with loops (for and while), we isolate the loop to ensure that zero stores are batched at the start of the loop. We then treat the loop as a continuous block of
instructions. This ensures that the maximum number of batched stores in a loop will not
exceed our maximum batch number.
We only batch stores between non-idempotent stores, until a maximum of four stores are batched or we encounter a call to a procedure that is defined inside our module.
The reduction in checkpoints when we go from Naive CRC to Batch CRC is summarized in Table 3.2. This shows that even though Batch CRC is not as efficient at reducing checkpoints as idempotent memory analysis, it still manages to reduce the number of checkpoints by at least 20% on every benchmark except BC, where we observe no reduction. This is largely due to the nature of the BC application, which is filled with idempotent violations: by the definition of Batch CRC, we must checkpoint before every idempotent violation and only batch between idempotent violations.
3.3 SABLE Implementation
We implement SABLE as a series of independent LLVM passes that can be incorporated in
the compilation pipeline of any C application which targets the MSP430 architecture. The
LLVM passes traverse the IR of the module under inspection function by function. We locate
stores in each function as they are the insertion points of our checkpoints in all variants. The
LLVM infrastructure provides an API to check the type of every instruction, which we utilize
to locate all memory altering instructions. For non-naive versions of SABLE, we also make
use of the memory idempotence analysis developed as part of CAMEL. This analysis lists all
possible memory altering instructions that can potentially cause an idempotent violation.
The details of this analysis can be found in §2.4.5. Lastly, for Batch CRC, as described in §3.2.4, we must determine the maximum number of stores at the entry point of every basic block in the control-flow graph of a function. This is done by traversing the control-flow graph multiple times using a breadth-first search traversal and calculating the maximum number of stores at the entry of every basic block. This is repeated until the point of convergence: the maximum value at the start of every basic block no longer changes for any basic block. Our analysis draws inspiration from the data-flow analysis framework for computing data-flow problems; we use it to perform a static control-flow analysis that ensures applications execute faultlessly at run-time.
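The convergence loop can be sketched over a toy CFG as below. This is a round-robin iteration rather than BFS for brevity, it assumes loop back edges have already been removed by the loop isolation described in §3.2.4 (so the graph is acyclic and the fixpoint terminates), and it omits the cap-at-four and reset-at-checkpoint details.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Block { std::vector<int> preds; int stores; };

    // max_in[b]: pessimistic count of batched stores at b's entry.
    std::vector<int> batch_counts(const std::vector<Block> &cfg) {
        std::vector<int> max_in(cfg.size(), 0);
        bool changed = true;
        while (changed) {                    // repeat until convergence
            changed = false;
            for (std::size_t b = 0; b < cfg.size(); b++) {
                int best = 0;
                for (int p : cfg[b].preds)   // worst predecessor wins
                    best = std::max(best, max_in[p] + cfg[p].stores);
                if (best > max_in[b]) { max_in[b] = best; changed = true; }
            }
        }
        return max_in;
    }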
3.4 SABLE Evaluation
Since SABLE is currently under development, we only evaluate SABLE across its different variants to show the effects of our compiler optimization techniques. We evaluate
the four different systems that we built using two metrics: run-time overhead and binary
size overhead.
Table 3.3 shows how the run-time overhead differs for every SABLE variant. These numbers
are attained by running the unmodified benchmark to completion and measuring the CPU
       Naivecanary  Naivecrc  Idemcanary  Batchedcrc
AR     1.30         6.59      1.15        4.77
BC     5.11         33.1      4.33        36.5
CEM    3.14         80.6      1.53        7.95
CF     4.61         50.5      2.55        29.0
RSA    4.39         80.0      3.59        37.22
avg.   3.71         50.18     2.63        23.1

Table 3.3: Relative run-time for different SABLE implementations.
       Naivecanary  Naivecrc  Idemcanary  Batchedcrc
AR     1.11         1.23      1.07        1.09
BC     1.45         1.89      1.38        1.47
CEM    1.35         1.75      1.26        1.31
CF     1.20         1.43      1.13        1.17
RSA    1.48         2.10      1.29        1.42
avg.   1.32         1.68      1.21        1.29

Table 3.4: Relative binary size for different SABLE implementations.
cycles taken. We then obtain the CPU cycles each benchmark takes on every SABLE variant and divide these by the CPU cycles of the unmodified benchmark. The overhead is a direct function of the number of statically placed checkpoints in every benchmark. We can
also observe that our final solutions (Batch CRC, Idempotent Canary) perform significantly
better than their naive predecessors.
Similarly, for binary size, we can see a downward trend from the naive implementations
of SABLE to their final, efficient solutions. This aligns with the fact that the number of
checkpoints decreases as we go from our naive to final solutions for both the CRC and the canary versions.
Chapter 4
Conclusion
In this research we present two different continuous-checkpoint based intermittent computa-
tion models. We build our systems to utilize the time-dependent non-volatility of the SRAM
(§2.3.3), an idea first explored by TOTALRECALL [44]. Our systems employ novel compiler analysis techniques to arrive at an optimal solution. We then showcase in §2.6.2
why our systems are a necessity. Our systems provide long lifetimes for Flash-based devices,
which the current state-of-the-art fail to do.
The current state-of-the-art continuous checkpointing systems cannot be adapted to support
the Flash as they will end up rendering it useless within a matter of hours. This is due to the
high volume of NVM checkpoints taken by these systems to store snapshots of a program.
Our systems are the first continuous checkpointing models that ensure low-overhead and
long-life intermittent computation on devices regardless of the choice of NVM. Our evaluation
shows that CAMEL does indeed extend the lifetime of Flash-based devices and does so
efficiently, minimizing all sources of overhead. Furthermore, SABLE follows in the footsteps of CAMEL. Though it is currently just a prototype, SABLE further aims to make energy harvesting devices more accessible to programmers by ridding the programmer of the burden of making alterations to the application source code. Both systems presented in this thesis have trade-offs, but both can be used for high-performance, NVM-invariant, software-only intermittent computation without rendering a Flash-based device useless.
Bibliography
[1] D. Balsamo, A. S. Weddell, A. Das, A. R. Arreola, D. Brunelli, B. M. Al-Hashimi,
G. V. Merrett, and L. Benini. Hibernus++: A self-calibrating and adaptive system for
transiently-powered embedded devices. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 35(12):1968–1980, March 2016.
[2] Domenico Balsamo, Alex Weddell, Geoff Merrett, Bashir Al-Hashimi, Davide Brunelli,
and Luca Benini. Hibernus: Sustaining Computation during Intermittent Supply for
Energy-Harvesting Systems. In IEEE Embedded Systems Letters, 2014.
[3] BBC News. Samsung confirms battery faults as cause of Note 7 fires, January 2017.
https://www.bbc.com/news/business-38714461.
[4] Alexei Colin and Brandon Lucia. Chain: Tasks and channels for reliable intermit-
tent programs. In International Conference on Object-Oriented Programming, Systems,
Languages, and Applications, OOPSLA, pages 514–530, October 2016.
[5] H. Desai and B. Lucia. A power-aware heterogeneous architecture scaling model for
energy-harvesting computers. IEEE Computer Architecture Letters, 19(1):68–71, 2020.
[6] Bin Fan, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. Cuckoo
filter: Practically better than bloom. In Proceedings of the 10th ACM International on
Conference on Emerging Networking Experiments and Technologies, CoNEXT ’14, page
75–88, New York, NY, USA, 2014. Association for Computing Machinery.
[7] K. Ganesan, J. San Miguel, and N. Enright Jerger. The what’s next intermittent com-
puting architecture. In IEEE International Symposium on High Performance Computer
Architecture, HPCA, pages 211–223, Feb 2019.
[8] Josiah Hester, Timothy Scott, and Jacob Sorber. Ekho: Realistic and repeatable ex-
perimentation for tiny energy-harvesting sensors. In Proceedings of the 12th ACM
Conference on Embedded Network Sensor Systems, 2014.
[9] Josiah Hester, Nicole Tobias, Amir Rahmati, Lanny Sitanayah, Daniel Holcomb, Kevin
Fu, Wayne P. Burleson, and Jacob Sorber. Persistent clocks for batteryless sensing
devices. ACM Transactions on Embedded Computer Systems, 15(4):77:1–77:28, August
2016.
[10] Matthew Hicks. Clank: Architectural support for intermittent computation. In Inter-
national Symposium on Computer Architecture, ISCA, pages 228–240, 2017.
[11] D. E. Holcomb, W. P. Burleson, and K. Fu. Power-Up SRAM State as an Identifying
Fingerprint and Source of True Random Numbers. IEEE Transactions on Computers,
58(9):1198–1210, September 2009.
[12] Daniel E. Holcomb, Amir Rahmati, Mastooreh Salajegheh, Wayne P. Burleson, and
Kevin Fu. Drv-fingerprinting: Using data retention voltage of sram cells for chip identi-
fication. In Proceedings of the 8th International Conference on Radio Frequency Identi-
fication: Security and Privacy Issues, RFIDSec’12, pages 165–179, Berlin, Heidelberg,
2013. Springer-Verlag.
[13] G. Huang, L. Qian, S. Saibua, D. Zhou, and X. Zeng. An efficient optimization based
method to evaluate the drv of sram cells. IEEE Transactions on Circuits and Systems
I: Regular Papers, 60(6):1511–1520, June 2013.
[14] Texas Instruments. MSP430x2xx Family User’s Guide (Rev. J), 2013. http://www.ti.
com/lit/ug/slau144j/slau144j.pdf.
[15] Texas Instruments. Maximizing Write Speed on the MSP430 FRAM, 2015. https:
//www.ti.com/lit/an/slaa498b/slaa498b.pdf.
[16] Texas Instruments. MSP432P4xx SimpleLink microcontrollers technical reference man-
ual, March 2015. http://www.ti.com/lit/ug/slau356i/slau356i.pdf.
[17] Texas Instruments. MSP430 Flash Memory Characteristics (Rev. B), 2018. http:
//www.ti.com/lit/an/slaa334b/slaa334b.pdf.
[18] Texas Instruments. MSP430F5438A—MSP430F543xA, MSP430F541xA Mixed-
Signal Microcontrollers, September 2018. http://www.ti.com/lit/ds/symlink/
msp430f5438a.pdf.
[19] Texas Instruments. MSP430F552x, MSP430F551x Mixed-Signal Microcontrollers,
September 2018. https://www.ti.com/lit/ds/symlink/msp430f5529.pdf.
[20] Texas Instruments. MSP430FR5964—MSP430FR599x, MSP430FR596x Mixed-Signal
Microcontrollers, August 2018. http://www.ti.com/lit/ds/symlink/msp430fr5964.
pdf.
[21] Texas Instruments. MSP430G2553 LaunchPad Development Kit (MSP‑EXP430G2ET),
2018. http://www.ti.com/lit/ug/slau772/slau772.pdf.
[22] Hrishikesh Jayakumar, Arnab Raha, and Vijay Raghunathan. QUICKRECALL: A
Low Overhead HW/SW Approach for Enabling Computations across Power Cycles in
Transiently Powered Computers. In International Conference on VLSI Design and
International Conference on Embedded Systems, 2014.
[23] Joseph Kahn, Randy Katz, and Kristofer Pister. Next Century Challenges: Mobile
Networking for "Smart Dust". In Conference on Mobile Computing and Networking
(MobiCom), 1999.
[24] P. Koopman and T. Chakravarty. Cyclic redundancy code (crc) polynomial selection for
embedded networks. In International Conference on Dependable Systems and Networks,
2004, pages 145–154, June 2004.
[25] Marc de Kruijf, Karthikeyan Sankaralingam, and Somesh Jha. Static analysis and
compiler design for idempotent processing. In Conference on Programming Language
Design and Implementation, PLDI, pages 475–486, 2012.
[26] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program
analysis & transformation. In In International Symposium on Code Generation and
Optimization, CGO, pages 75–86, 2004.
[27] Brandon Lucia and Benjamin Ransford. A simpler, safer programming and execution
model for intermittent systems. In Conference on Programming Language Design and
Implementation, PLDI, pages 575–585, 2015.
[28] K. Ma, Y. Zheng, S. Li, K. Swaminathan, X. Li, Y. Liu, J. Sampson, Y. Xie, and
V. Narayanan. Architecture exploration for ambient energy harvesting nonvolatile pro-
cessors. In IEEE International Symposium on High Performance Computer Architecture,
HPCA, pages 526–537, Feb 2015.
[29] Kaisheng Ma, Xueqing Li, Karthik Swaminathan, Yang Zheng, Shuangchen Li, Yongpan
Liu, Yuan Xie, John Sampson, and Vijaykrishnan Narayanan. Nonvolatile Processor
Architectures: Efficient, Reliable Progress with Unstable Power. In IEE Micro Volume
36, Issue 3, 2016.
[30] Kiwan Maeng, Alexei Colin, and Brandon Lucia. Alpaca: Intermittent execution with-
out checkpoints. In International Conference on Object-Oriented Programming, Sys-
tems, Languages, and Applications, OOPSLA, pages 96:1–96:30, October 2017.
[31] Kiwan Maeng and Brandon Lucia. Adaptive dynamic checkpointing for safe efficient
intermittent computing. In USENIX Conference on Operating Systems Design and
Implementation, OSDI, pages 129–144, November 2018.
[32] A. Mirhoseini, E. M. Songhori, and F. Koushanfar. Idetic: A high-level synthesis ap-
proach for enabling long computations on transiently-powered ASICs. In International
Conference on Pervasive Computing and Communications, PerCom, pages 216–224,
March 2013.
[33] University of Washington. WISP 5 GitHub, April 2014. http://www.github.com/
wisp/wisp5.
[34] Sandro Pinto and Nuno Santos. Demystifying ARM TrustZone: A comprehensive sur-
vey. ACM Computing Surveys, 51(6), January 2019.
[35] Huifang Qin, Yu Cao, Dejan Markovic, Andrei Vladimirescu, and Jan Rabaey. Sram
leakage suppression by minimizing standby supply voltage. In Proceedings of the 5th
International Symposium on Quality Electronic Design, ISQED ’04, pages 55–60, Wash-
ington, DC, USA, 2004. IEEE Computer Society.
[36] Benjamin Ransford and Brandon Lucia. Nonvolatile Memory is a Broken Time Machine.
In Workshop on Memory Systems Performance and Correctness, 2014.
[37] Benjamin Ransford, Jacob Sorber, and Kevin Fu. Mementos: System Support for
Long-Running Computation on RFID-Scale Devices. In Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOS), 2011.
[38] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures
and public-key cryptosystems. Commun. ACM, 21(2):120–126, February 1978.
[39] A. P. Sample, D. J. Yeager, P. S. Powledge, A. V. Mamishev, and J. R. Smith. Design
of an rfid-based battery-free programmable sensing platform. IEEE Transactions on
Instrumentation and Measurement, 57(11):2608–2615, Nov 2008.
[40] Henry Sodano, Gyuhae Park, and Daniel Inman. Estimation of Electric Charge Output
for Piezoelectric Energy Harvesting. In Strain, Volume 40, 2004.
[41] Fang Su, Kaisheng Ma, Xueqing Li, Tongda Wu, Yongpan Liu, and Vijaykrishnan
Narayanan. Nonvolatile processors: Why is it trending? In Proceedings of the Confer-
ence on Design, Automation & Test in Europe, DATE ’17, pages 966–971, 3001 Leuven,
Belgium, Belgium, 2017. European Design and Automation Association.
[42] Mark Weiser. Ubiquitous computing. Computer, 10:71–72, 1993.
[43] Welch. A technique for high-performance data compression. Computer, 17(6):8–19,
1984.
[44] Harrison Williams, Xun Jian, and Matthew Hicks. Forget failure: Exploiting SRAM
data remanence for low-overhead intermittent computation. In International Conference
on Architectural Support for Programming Languages and Operating Systems, ASPLOS,
pages 69–84, March 2020.
[45] Harrison Williams, Alexander Lind, Kishankumar Parikh, and Matthew Hicks. Silicon
Dating. arXiv, abs/2009.04002, 2020.
[46] Joel Van Der Woude and Matthew Hicks. Intermittent computation without hardware
support or programmer intervention. In USENIX Symposium on Operating Systems
Design and Implementation, OSDI, pages 17–32, November 2016.
[47] X. Wu, I. Lee, Q. Dong, K. Yang, D. Kim, J. Wang, Y. Peng, Y. Zhang, M. Saliganc,
M. Yasuda, K. Kumeno, F. Ohno, S. Miyoshi, M. Kawaminami, D. Sylvester, and
D. Blaauw. A 0.04mm3 16nW wireless and batteryless sensor system with integrated
cortex-m0+ processor and optical communication for cellular temperature measurement.
In IEEE Symposium on VLSI Circuits, pages 191–192, June 2018.
[48] Hong Zhang, Jeremy Gummeson, Benjamin Ransford, and Kevin Fu. Moo: A Battery-
less Computational RFID and Sensing Platform. In Technical Report UMCS-2011-020,
2011.