

FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things

Xiaying Wang, Student Member, IEEE, Michele Magno, Senior Member, IEEE, Lukas Cavigelli, Member, IEEE, Luca Benini, Fellow, IEEE

Abstract—The growing number of low-power smart devices in the Internet of Things is coupled with the concept of “Edge Computing”, that is, moving some of the intelligence, especially machine learning, towards the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithmic, and firmware implementations. This paper presents FANN-on-MCU, an open-source toolkit built upon the Fast Artificial Neural Network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform. The toolkit takes multi-layer perceptrons trained with FANN and generates code targeted to low-power microcontrollers. This paper also presents detailed analyses of energy efficiency across the different cores, and the optimizations to handle different network sizes. Moreover, it provides a detailed analysis of parallel speedups and degradations due to parallelization overhead and memory transfers. Further evaluations include experimental results for three different applications using a self-sustainable wearable multi-sensor bracelet. Experimental results show a measured latency in the order of only a few microseconds and power consumption of a few milliwatts while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22x and a 69% reduction in energy consumption with respect to a single-core implementation on Cortex-M4 for continuous real-time classification.

Index Terms—Edge AI, TinyML, Machine Learning, IoT Low Power Devices, Wearable, Multi-layer Perceptron, Neural Networks, embedded systems.

I. INTRODUCTION

MACHINE LEARNING has been introduced into many tasks related to the Internet of Things (IoT) and mobile applications to address the major challenge of extracting relevant information from many sensors and data spread in the physical world. Because of its high efficiency in extracting actionable information from large amounts of noisy raw data, machine learning will play a critical role in future IoT devices and services. Recent results on machine learning models demonstrate impressive classification accuracy, in some cases even outperforming humans [1].

X. Wang, M. Magno, L. Cavigelli, and L. Benini are with the Integrated Systems Laboratory, ETH Zurich, 8092 Zurich, Switzerland (e-mail: [email protected]).

L. Benini is also with the Department of Electrical, Electronic and Information Engineering, University of Bologna, 40136 Bologna, Italy.

Manuscript received October XX, XXXX; revised August XX, XXXX.

One essential feature of machine learning, and in particular of neural networks (NNs), is their flexibility, which makes them suitable for a wide range of applications, including computer vision [2], natural language processing [3], biomedical [4], and several others [5]–[7].

Today, machine learning on IoT devices is applied with the traditional cloud computing paradigm, where the whole data processing is performed in the cloud, and the IoT devices stream the data out in raw form or possibly after simple filtering and/or compression [8], [9]. However, the number of IoT devices is expanding rapidly, and the massive amount of collected data is hard to manage by central clouds. The reasons are the massive workload on the IoT network, the cost of the communication infrastructure, the required energy for data transmission, and, more generally, reliability, latency, and privacy concerns [10]. The new trend is for IoT devices to be “smart” and make decisions on their own, without streaming all the raw data to the cloud [11]. The edge computing paradigm is pushing the data processing to the edge of the IoT (comprising gateways and embedded end-devices), close to the sensors where the data is collected [9]. In many IoT applications, the computation can be distributed over different layers. For instance, the IoT device may perform the pre-processing of the data and transmit the intermediate results to the fog, where the rest of the processing is performed [12]–[14].

This new generation of IoT devices is supplied by small-size batteries, which limits the available energy and computational resources. Thus, bringing intelligence to the edge is creating fascinating challenges for industrial and academic researchers [6], [8]. Many research efforts towards specialized hardware and optimized inference algorithms to run such NNs on power-constrained devices have been made over the last few years [15]–[17]. Today's IoT devices host microcontrollers, especially from the ARM Cortex-M family, which are able to achieve power consumption in the order of mW and computational resources in the order of hundreds of MOPS [1], [18], [19]. Power consumption in the range of milliwatts is required for battery-operated devices to avoid frequent battery recharges. On the other hand, the computational resources of microcontrollers (MCUs) are often insufficient to perform on-board processing for complex algorithms and sensors in several application scenarios [4], [12], [20]. This results in very few examples of NNs running on milliwatt-powered microcontrollers, which are the most common compute engines available at the edge of the IoT [21]–[24].


To attenuate these computational limits, researchers are proposing new processing units to match the computational resources required by on-board data processing with state-of-the-art machine learning algorithms. The most promising approaches to improve the performance of ultra-low-power processors are parallelism, low-power fixed-function hardware accelerators [25], and near-threshold technology [26]. Parallel architectures for near-threshold operation, based on multi-core clusters, have been explored in recent years with different application workloads [26] and low-power systems [27].

On the software side, the majority of approaches in cloud computing use deep convolutional NNs, which are incredibly accurate and well-suited to classify image frames but require a massive amount of memory and computational resources. However, there are many application scenarios, especially those dealing with low-bandwidth time-series recordings from low-power sensors, where multi-layer fully-connected networks are just as effective [11], [19], [28], [29]. A well-known model is the multi-layer perceptron (MLP), a deep learning approach with multiple layers of interconnected neurons. The Fast Artificial Neural Network (FANN) library is an open-source neural network library [29], which implements multi-layer artificial neural networks (ANNs) in C. To push FANN to its best in terms of energy efficiency on microcontrollers, it is essential that the implementation of the MLP model is optimized for the hardware architecture of the processors, exploiting special instructions, parallelism, scratchpad memories, and hardware accelerators.

This paper presents FANN-on-MCU: an open-source framework for easy deployment of NNs trained with the FANN library on both ARM Cortex-M cores and the new parallel ultra-low-power (PULP) RISC-V-based processors [27]. The former is the dominant processor core present in most MCUs, while the latter represents the forefront of open-source multi-core processors implementing the RISC-V instruction set architecture (ISA) and many custom instruction set extensions, achieving high energy efficiency and featuring widely-tunable performance for ultra-low-power embedded systems. Our framework offers automated deployments on MCUs with and without a floating-point unit, i.e., ARM Cortex-M0+/M4F and PULP-based processors such as Mr. Wolf or GAP8 [30], [31].

We present both the framework and the tools to train ANNs, as well as detailed performance measurements on both ARM Cortex-M and PULP processors and their comparison when varying the computational complexity and the memory footprint of ANNs. Moreover, we evaluate the performance of FANN-on-MCU on a custom-designed wearable system, which includes both an ARM Cortex-M4 core and a PULP-based processor, with three real-world applications using three different network sizes. Experimental results demonstrate that a parallel implementation on the PULP processor reaches up to 22× runtime speedup with a 69% reduction in energy consumption with respect to a single-core implementation on Cortex-M4 for continuous real-time classification.

The main contributions of the paper are:

• We present FANN-on-MCU: an open-source framework based on the FANN library. The framework allows building optimized multi-layer artificial neural network-based classifiers on all ARM Cortex-M family processors, both with (i.e., ARM Cortex-M4F and M7F) and without a floating-point unit (i.e., ARM Cortex-M0–M3), and on the very novel class of PULP processors based on the open-source RISC-V instruction set.

• We provide extensive measurements and comparisons of the performance of our framework on both ARM Cortex-M and PULP processors implementing NNs of variable sizes, taking into account the memory footprint. Such a detailed performance analysis has not been presented before in the literature.

• We demonstrate optimizations to handle different network sizes, considering the memory hierarchy of the underlying embedded processor.

• We provide a detailed analysis of parallel speedups and degradations due to parallelization overhead and memory transfers when a parallel ultra-low-power processor, such as the PULP-based Mr. Wolf, is used in an embedded system.

• We implement fully-connected networks for real-world application scenarios such as hand gesture recognition, fall detection for elderly people, and human activity classification, using a multi-sensor wearable device based on an ARM Cortex-M4 MCU and a PULP processor.

• We present experimental results with in-field measurements of memory usage, accuracy, feasible network sizes, and power consumption.

• We have released FANN-on-MCU as open-source software1 to provide engineers and academics with a powerful and easy-to-use tool to deploy ANNs on ultra-low-power IoT devices.

We have organized the remainder of this paper as follows. In Section II, we shortly summarize the concept of MLPs, the FANN library, and FANNTool, and provide an overview of related work on deploying NNs at the edge. In Section III, we provide an overview of our target platforms, the ARM Cortex-M series and the RISC-V-based PULP series MCUs, together with a description of an application testbed we designed, named InfiniWolf, which includes both an ARM Cortex-M processor and a PULP processor. Then, in Section IV, we describe our automated deployment toolkit and the highly-optimized implementation for ARM Cortex-M targets and particularly also PULP-based systems. We extensively evaluate its performance in Section V on the supported MCU families and show results for several application showcases in Section VI using our designed wearable platform. We discuss the limitations and future work in Section VII before concluding the paper in Section VIII.

II. RELATED WORK

In the following subsections, we first provide a summary of MLPs and introduce the FANN library and FANNTool used to train the networks, which we then automatically deploy on different MCUs. Subsequently, we discuss currently available frameworks and libraries for deploying NNs on MCUs.

1Available at https://github.com/pulp-platform/fann-on-mcu


[Figure 1: (a) a multi-layer perceptron with an input layer (input features), one or more hidden layers, and an output layer (output classes); (b) a single neuron, which sums its weighted inputs x_1, ..., x_m (weights w_k1, ..., w_km) and a bias b_k and applies an activation function σ(·) to produce the neuron output y_k.]

Fig. 1. Multi-layer perceptron architecture (a) and building block (b).

A. Multi-Layer Perceptrons

A multi-layer perceptron is a type of feed-forward ANN, as illustrated in Figure 1a, which consists of three or more layers of nodes: an input layer, one or more hidden layers, and an output layer. Each layer contains a fixed number of nodes, also called perceptrons or neurons, which (except for the input nodes) compute a weighted sum of the previous layer's nodes and a bias, followed by a non-linearity such as the sigmoid or ReLU function. An illustration of the artificial neuron can be seen in Figure 1b. In mathematical terms, each node computes

x_k^(ℓ+1) = σ( Σ_{i=1}^{m} w_{k,i}^(ℓ) x_i^(ℓ) + b_k ),    (1)

where x_k^(ℓ) is the output of the k-th node in the ℓ-th layer, x_i are the elements of the input vector x with dimension m, w_{k,i}^(ℓ) is the weight of the i-th input element to the node, and, finally, σ is the activation function.
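To make (1) concrete, the following minimal C sketch evaluates one fully-connected layer in floating point; the function and variable names, and the choice of a sigmoid activation, are illustrative and not the FANN-on-MCU implementation.

    #include <math.h>

    /* Illustrative sketch: evaluate one fully-connected layer as in (1).
     * weights is stored row-major: weights[k*m + i] = w_{k,i}. */
    static float sigmoid(float a) {
        return 1.0f / (1.0f + expf(-a));
    }

    void fc_layer_forward(const float *input, int m,       /* previous layer, m nodes   */
                          const float *weights,            /* n_out x m weight matrix   */
                          const float *bias,               /* one bias per output node  */
                          float *output, int n_out)        /* this layer, n_out nodes   */
    {
        for (int k = 0; k < n_out; k++) {
            float acc = bias[k];                           /* start from the bias b_k   */
            for (int i = 0; i < m; i++) {
                acc += weights[k * m + i] * input[i];      /* weighted sum of inputs    */
            }
            output[k] = sigmoid(acc);                      /* activation sigma(.)       */
        }
    }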

The entire network is typically trained end-to-end by optimizing its parameters (weights and biases) using backpropagation and stochastic gradient descent, such that it maps the samples from the dataset to the labels to the best of its abilities (measured by a loss function). The MLP is specified by the number of hidden layers, the number of nodes within each layer, and which non-linearities are used (e.g., sigmoid).

MLPs are commonly used as the classifier after applying suitable non-learnable feature extractors due to their relatively moderate resource demand [32]. In fact, MLPs are much less demanding than deep convolutional networks in terms of memory and computing requirements, and yet they are widely used and very effective in many application areas such as medical data analysis and, in general, applications where hand-engineered feature extractors are preferred for interpretability [33].

B. Fast Artificial Neural Network (FANN) Library and FANNTool

FANN [29] is an easy-to-use, mature, and well-documented framework to train and perform inference on multi-layer perceptrons. It is written entirely in C and has bindings to many languages, such as Python, MATLAB, and Rust. Due to its popularity, many graphical tools to aid training ANNs and selecting the right architecture and hyperparameters have become available. FANN also includes an automatic hyperparameter tuner and can optimize ANNs for fixed-point inference. The library provides cascade training, which automatically optimizes the number of hidden layers and neurons in an ANN by starting with an empty neural network and then adding neurons one by one while it trains the neural network [29].

FANN uses a simple file format for storing the dataset and the trained ANN model. It does not support training on GPUs, which simplifies the framework and is not required for networks of a size range suitable for deployment on low-power MCUs [34]. The CPU implementation is highly optimized, with features such as cache optimizations and approximations of various activation functions as step-linear functions out of the box.
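For reference, a typical desktop-side training flow with the FANN C API looks roughly as follows; the network sizes, file names, and training parameters are placeholders chosen for illustration.

    #include <fann.h>

    int main(void)
    {
        /* Illustrative topology: 5 inputs, two hidden layers of 100 neurons, 3 outputs. */
        struct fann *ann = fann_create_standard(4, 5, 100, 100, 3);

        fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
        fann_set_activation_function_output(ann, FANN_SIGMOID_SYMMETRIC);

        /* Train on a dataset stored in the FANN file format. */
        fann_train_on_file(ann, "train.data", 1000 /* max epochs */,
                           10 /* report interval */, 0.001f /* desired error */);

        /* Save the float model and, optionally, a fixed-point version for
         * targets without a floating-point unit. */
        fann_save(ann, "network.net");
        fann_save_to_fixed(ann, "network_fixed.net");

        fann_destroy(ann);
        return 0;
    }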

To facilitate the usage of the library, the FANN community provides the FANNTool [35]. Its graphical interface allows for easy modification of the network architecture, activation function, training method, and weight initialization, as well as monitoring of the training progress. It further supports the fully-automated selection of the network's hyperparameters by iteratively testing all the available options present in FANN [35].

C. Frameworks for Deploying NNs on MCUs

With the growing attention to NNs, many frameworks have been developed over the last few years, such as PyTorch, TensorFlow, and Caffe2, with a focus on NN training and cloud-scale deployment in GPU-accelerated data centers. However, only very recently has some focus been put on low-power edge inference on MCUs. To the best of our knowledge, there are only two frameworks that are effectively working and have reached popularity: TensorFlow Lite for Microcontrollers and ST Microelectronics' STM32Cube.AI.

• STM32Cube.AI can take trained models from Keras, TensorFlow Lite, and others to generate optimized code to run them on a wide range of MCUs of the STM32 series [36]. For Keras, it also supports quantized models to reduce the model size and speed up the computation.

• TensorFlow Lite for Microcontrollers is not vendor-locked and much more generally supports platforms based on ARM Cortex-M and RISC-V [37]. It can export models from TensorFlow, its runtime takes up 16 kB on a Cortex-M3, and it comes with implementations for floating-point layers as well as 8-bit weights and activations with 32-bit accumulation.

These two frameworks highlight the commercial interest in deploying NNs on milliwatt-range edge devices. In contrast with those two commercial tools, the toolkit proposed in this paper provides a solution to deploy optimized MLPs on both the ARM Cortex-M family (not limited to STMicroelectronics chips as with Cube.AI) and the PULP family, which is the leading-edge processor family based on the RISC-V ISA. With respect to TensorFlow Lite for Microcontrollers, which is still under development, our toolkit supports parallel processing of ANNs on RISC-V-based processors and provides optimized code for deployment on microcontroller systems with and without a floating-point unit.

In parallel, ARM has been developing CMSIS-NN [38], an additional library for their Cortex Microcontroller Software Interface Standard (CMSIS), in order to provide optimal-performance implementations of layers commonly present in deep NNs, with speedups in the order of 4.6× over a simple baseline implementation.


Our toolkit for the ARM Cortex-M family works on top of the CMSIS library to generate highly optimized code.

FANN-on-MCU, the toolkit proposed in this paper, has been developed as an extension of FANNCortexM [34]. The latter first introduced the deployment flow shown herein, taking a trained MLP from FANN and exporting code runnable on ARM Cortex-M MCUs with only minutes of engineering effort. In this work, we optimize the implementation, provide much more detailed measurement results, introduce support for fixed-point models, and extend it with an optimized backend for deployment on the RISC-V-based PULP platform processors. Moreover, this paper shows the benefits of parallel processing for ANNs by evaluating the performance with three different application scenarios implemented in a working prototype.

III. LOW-POWER PROCESSORS

As mentioned in the previous section, one of the main contributions of this paper is the FANN-on-MCU toolkit, which enables MLPs on low-power processors, in particular microcontrollers. Microcontrollers are the backbone of the majority of low-power smart devices for the IoT. In the following, we provide a short overview of the ARM Cortex-M family of microcontrollers, and we introduce the Mr. Wolf SoC, which we use as a representative of the RISC-V-based family of PULP processors. Both processors have a power consumption of milliwatts and are suitable for operation from small-size (hundreds of mAh) batteries. Moreover, we describe InfiniWolf, an embedded platform we designed that incorporates both an ARM Cortex-M and a PULP processor to enable low-power real-time wearable applications.

A. Low-Power Embedded Processing: ARM Cortex-M Family

The ARM Cortex-M (M0, M3, M4, M7) family features different computational capabilities and operating frequencies and has power consumption in the milliwatt range. The typical frequency is 16 MHz for the M0 and up to 300 MHz for the newer and more powerful M7. The majority of these processors have no floating-point unit; only the ARM Cortex-M4F and M7 implement one. This family of MCUs is characterized by an on-chip SRAM of a few hundred kB (256–512 kB) and a non-volatile flash memory with a maximum size of 1–2 MB. The flash memory is typically used to preload the program code and static data, while the SRAM is used for the runtime code and the main data memory. Thus, one of the constraints to be taken into account is the size of the non-volatile memory. Moreover, as MCUs are designed with low power and low cost in mind, the achievable operations per second and a small memory footprint need to be taken into consideration. Finally, it is important to note that some ARM Cortex-M cores have an integrated digital signal processing (DSP) instruction set. This is the case for all ARM Cortex-M3, M4, and M7 cores. The DSP instructions can be used to accelerate many of the operations used for signal processing and data analysis (e.g., the Fast Fourier transform).

[Figure 2: block diagram of the Mr. Wolf SoC, showing the SoC domain (fabric controller with a RISC-V CPU, µDMA-based I/O subsystem with peripheral interfaces, shared and private L2 memories, power manager, and clock generation) and the cluster domain (eight RISC-V cores with a shared instruction cache, two shared FPUs, a hardware synchronizer, a DMA unit, and a multi-banked L1 memory behind a low-latency TCDM interconnect).]

Fig. 2. Block diagram of Mr. Wolf.

Most notably, the Cortex-M4 and Cortex-M7 integrate single instruction, multiple data (SIMD) instructions and multiply-and-accumulate (MAC) operations that can be exploited to accelerate computation in NNs.

The Cortex Microcontroller Software Interface Standard (CMSIS) is a vendor-independent hardware abstraction layer for the Cortex-M processor series and defines generic tool interfaces [30]. CMSIS enables consistent device support and simple software interfaces to the processor and the peripherals, simplifying software reuse, reducing the learning curve for MCU developers, and reducing the time to market for a new device. ANNs involve a high number of multiplications, so minimizing the computation time increases the efficiency of the solution. We extensively use the optimized CMSIS floating-point and fixed-point multiplication function arm_dot_prod, for which we measured a decrease of the execution time by 36%, which shows the effectiveness of CMSIS. Among others, we have also used arm_fill and arm_copy, which also gave an improvement in the range of 30% in execution time. Moreover, CMSIS optimizes the computational time of many functions, such as DSP operations. The DSP library includes over 60 digital signal processing related functions that are optimized for the Cortex-M processors. DSP functions can be handy for feature extraction (e.g., to perform Fast Fourier transforms) but are not required for this toolkit, which runs with optimized performance also on the ARM Cortex-M0 and other processors without DSP instructions.
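As an illustration of how such CMSIS-DSP calls can be used (a sketch under our own naming, not the actual FANN-on-MCU code), the weighted sum of a single neuron can be delegated to the optimized dot-product routine:

    #include "arm_math.h"   /* CMSIS-DSP */

    /* Sketch: pre-activation of one neuron computed with arm_dot_prod_f32,
     * which uses MAC/SIMD instructions where the core provides them. */
    float32_t neuron_preactivation(float32_t *weights, float32_t *input,
                                   uint32_t num_inputs, float32_t bias)
    {
        float32_t dot;
        arm_dot_prod_f32(weights, input, num_inputs, &dot);
        return dot + bias;
    }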

B. Parallel Ultra-Low-Power Platform: PULP Processors

The PULP platform is an open-source, multi-core platform based on the RISC-V ISA, achieving leading-edge energy efficiency and featuring widely-tunable performance within a power envelope of a few mW [27]. PULP aims to satisfy the computational demands of IoT applications, which require flexible and fast processing of data streams generated by multiple sensors, such as accelerometers, microphone arrays, low-resolution cameras, and vital signs monitoring sensors. As opposed to single-core MCUs, a parallel ultra-low-power programmable architecture provides the ability to meet the computational requirements of applications in IoT domains where low latency and low energy consumption are the central keys for solving tasks on miniaturized, battery-powered systems.


[Figure 3: assembly implementations of a 100-element byte-wise vector addition (for (i = 0; i < 100; i++) d[i] = a[i] + b[i];) on RISC-V: the baseline RV32IMC code takes 11 cycles/output, post-incremental load and store instructions reduce this to 8 cycles/output, the hardware loop to 5 cycles/output, and packed SIMD instructions to 1.25 cycles/output.]

Fig. 3. RISC-V ISA extensions of PULP.

Despite being an academic research platform, PULP offers the maturity of a commercial device, with OpenMP, OpenCL, and OpenVX support to enable agile application porting, development, performance tuning, and debugging. GreenWaves Technologies produces commercial devices based on the open-source PULP platform to design ultra-low-power embedded solutions for image, sound, and vibration AI processing in sensing devices [39].

In this work, we chose the PULP Mr. Wolf processor [31] for its ultra-low-power scalable performance, designed explicitly for always-on AI-powered IoT applications. We report its block diagram in Figure 2. Mr. Wolf features a hierarchical architecture with a small RISC-V core in the so-called fabric controller (FC) subsystem. It is coupled with an autonomous I/O subsystem for efficient data transfers from a broad range of peripherals, exploiting a multi-channel I/O direct memory access (µDMA) unit in the SoC domain. The small core can offload compute-intensive tasks to a parallel eight-core processing engine in the cluster domain, which is activated only on demand. 512 kB of L2 memory is available in the SoC domain; it is divided into a 448 kB shared L2 memory, arranged in four banks for easy access from both the µDMA and the processors, and a private L2 memory for the FC to store, for example, the program, the stack, and private data, in order to minimize bank conflicts. On the other hand, the cluster domain is equipped with a multi-bank L1 memory (sixteen 4 kB SRAM banks) to serve the parallel accesses from the eight cores. The data transfers between the L2 memory and the L1 memory are handled by an autonomous DMA unit. The memory organization of the whole system is designed in such a way that interactions and access conflicts are minimized.

The small core in the FC is called IBEX [40] and implements the basic RV32IMC ISA, while the cluster domain comprises eight RI5CY cores with custom instruction set extensions for digital signal processing (DSP), including hardware loops, post-incremental load and store instructions, and additional ALU instructions. Figure 3 shows the cycle reduction achieved by the ISA extensions. With the post-incremental load and store and the hardware loop, we gain a 2× speedup with respect to the baseline RV32IMC ISA; with packed SIMD instructions, we achieve up to approximately a 10× speedup. Finally, the operational frequency of the SoC can be configured between 32 kHz and 450 MHz, while for the cluster the range is 32 kHz–350 MHz.

[Figure 4: block diagram of InfiniWolf: a smart power unit with dual-source energy harvesting (BQ25570 solar harvester and BQ25505 TEG harvester), a BQ27441 fuel gauge, and a 1.8 V LDO, powering an nRF52832 MCU/BLE SoC and a Mr. Wolf MCU connected over SPI, which access a 9-axis motion sensor (ICM-20948), a pressure sensor (BMP280), a microphone (ICS-43434), and an ECG analog front-end (MAX30001) over SPI, I2S, and I2C.]

Fig. 4. Block diagram of InfiniWolf and the smart power unit that is able to harvest energy from dual sources.

Fig. 5. InfiniWolf prototype used to carry out experimental measurements.

In this work, we fully exploit the memory hierarchy of PULP chips and the custom ISA extensions, i.e., the hardware loop and the post-incremental load and store instructions, and, more significantly, we apply the cluster parallelism to boost the computational capacity of the whole system.

C. Testbed Platform: InfiniWolf

To apply FANN-on-MCU to real-world applications, we designed a hardware platform and used it as a testbed. The designed platform is called InfiniWolf, a wearable, battery-operated, multi-sensor device that aims to work perpetually with small-size energy harvesting and can be worn as a smartwatch. Figure 4 shows the block diagram and the architecture of the designed platform, which features two processors: a Nordic nRF52832 Bluetooth Low Energy SoC with an ARM Cortex-M4 processor and Mr. Wolf. The Nordic SoC handles communication with a remote host and offers auxiliary support for small data processing if needed. It provides Bluetooth Low Energy (BLE) 5 communication capabilities, performs power management in various modes of operation (sleep, raw data streaming, data acquisition, and processing), and keeps track of the battery charging status. The dual-processor architecture of InfiniWolf allows local end-to-end processing (i.e., on-board classification using ML) with lower power and higher energy efficiency than streaming the data out for remote analysis [41].


Moreover, this architecture allows lower latency, in the range of µs, with respect to wireless connectivity.

A dual-source energy harvester for solar and thermoelectric generator (TEG) [42] modules has been included in the design. The main goal is to achieve perpetual operation when the energy transducers are deployed on a wrist band harvesting energy from light and body heat. The choice of two energy harvesting sources is motivated by the increased flexibility and robustness of the energy intake, while the form factor is not compromised, as the two harvesters exploit different sides of the watch (top for solar and bottom for thermal). Assuming InfiniWolf stays in challenging indoor conditions for 6 hours and the worst-case scenario for the TEG energy harvester, it will acquire 21.44 J per day. This energy can be used to prolong the battery lifetime of InfiniWolf or to achieve an energy-autonomous smartwatch [43]. In the latter case, the energy acquired needs to balance the energy consumed during the classification and the power consumption in sleep mode. Thus, it is crucial both to limit the activation time and the energy in active mode and to reduce the power consumption in sleep mode [43]. The smart power supply unit (PSU) includes a TEG energy harvesting integrated circuit based on the BQ25505 from Texas Instruments, while a BQ25570 integrated circuit deals with the solar energy harvesting and provides a 1.8 V LDO. Finally, a fuel gauge integrated circuit (BQ27441) monitors the 120 mAh Li-Ion battery. The smartwatch includes a 9-axis motion sensor (Invensense ICM-20948), a pressure sensor (Bosch Sensortec BMP280), a microphone (Invensense ICS-43432), and an ultra-low-power ECG/EMG and bioimpedance analog front-end (Maxim MAX30001) to acquire biomedical signals, as well as a low-power galvanic skin response (GSR) front-end. The wearable device can be worn on the user's wrist and periodically and opportunistically acquires information from the sensors according to the available energy. A prototype of the smartwatch is shown in Figure 5.

IV. OPEN-SOURCE FANN-ON-MCU FRAMEWORK

This section provides an overview of the steps for developing an embedded machine learning system and illustrates FANN-on-MCU in detail.

A. Embedded ML System Development

Developing an intelligent sensor device with on-board processing capabilities encompasses many steps. We provide a brief overview in Figure 6. Development starts with specifying a precise target application and which platform and sensors to use, collecting data, and annotating it with the desired labels. In the next step, the data should be pre-processed: identifying and applying suitable feature extractors, optionally performing data augmentation, normalizing the data, and preparing it in a suitable data format for training. Then the neural network has to be specified and trained: the network's hyperparameters and its structure have to be explored (number of layers and nodes, which activations), promising networks have to be trained, and the best network has to be identified. If the desired accuracy cannot be obtained, the previous steps have to be revisited (collecting more data, using different feature extractors, or changing the data augmentation for the training samples).

[Figure 6: flow diagram with four stages: (1) create/obtain dataset (specify target application and platform/sensors, collect data, label/annotate data, split the data into training, validation, and test sets); (2) preprocessing (identify and apply feature extractor, optional data augmentation, normalize data, convert to FANN format); (3) define and train NN (explore hyperparameters and network structure, train promising NNs, identify the best network); (4) deploy on device (convert NN to fixed-point where device-dependent, develop an optimized implementation for the device, integrate the NN with sensor read-out and preprocessing, measure performance and power). Feedback arrows return to earlier stages when the accuracy target is not met or the network is too large.]

Fig. 6. Process flow from concept to deployment.

Once the desired accuracy has been achieved, the network must be deployed on the device: converting the neural network to fixed-point (if no hardware floating-point support is available), developing an optimized implementation for the target device, integrating it with the sensor read-out and pre-processing, and measuring the resulting performance and power.

B. The FANN-on-MCU Deployment Framework

The open-source FANN-on-MCU framework allows multi-layer ANNs to be implemented and optimized on resource-constrained ultra-low-power MCUs with a single command. It automatically generates optimized C code that can be directly compiled for the selected platform. In particular, the framework takes into account the platform architecture by exploiting the ARM M-series-specific instructions and the custom XPULP instruction set extensions on PULP. The framework includes a script for the automatic conversion of the trained network into a directly callable, dependency-free C function, including a test method that applies samples from the dataset for verification and benchmarking. Notably, the generated code files include all the parameters of the network to overcome the need for file system support. This allows for a straightforward workflow:

1) Convert the data to the standard FANN format;
2) Train a neural network using FANN and save it;
3) Optionally convert the neural network to fixed-point by rescaling the input data and calling fann_save_to_fixed;
4) Apply the conversion script to the ANN and dataset;
5) Call the fann_type* fann_run(fann_type *input) function from within your code (see the sketch after this workflow);
6) Build your code together with the generated files and evaluate the resulting application.

Our converter supports both fixed-point and floating-point models and can thus run efficiently also on MCUs without a floating-point unit, such as the tiny Cortex-M0 and M0+. We have released the code as open-source software online.
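To make step 5 concrete, a minimal calling sequence could look like the following sketch; the header name, the network sizes, the test input, and the argmax post-processing are placeholders and not part of the generated code.

    #include "fann.h"       /* placeholder name for the generated header */

    #define NUM_INPUT   5   /* hypothetical network: 5 inputs, 3 output classes */
    #define NUM_OUTPUT  3

    int main(void)
    {
        fann_type input[NUM_INPUT] = {0};     /* filled from the sensor read-out      */
        fann_type *output = fann_run(input);  /* weights are compiled into the files  */

        int best = 0;                         /* pick the class with the highest score */
        for (int k = 1; k < NUM_OUTPUT; k++) {
            if (output[k] > output[best]) best = k;
        }
        return best;
    }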


The framework is straightforward to use: with a single-line command, one can generate the C code and header files for the desired platform with the selected data type, i.e., floating-point or fixed-point. The conversion script takes into account the processor family and evaluates the network size to automatically select the level of memory closest to the processing unit that is still big enough to contain the whole network. Specifically, it estimates the required memory for the network and the buffers according to

E_m = (2 · L_{data buffer} + 5 · N_{neurons} + N_{weights} + 2 · N_{fann layers}) · sizeof(dtype),    (2)

where E_m is the estimated memory size; L_{data buffer} is the buffer length of one input sample, multiplied by two to account for the double buffering used for continuous data processing from sensors or other sources; N_{neurons} is the total number of neurons in the network, counting the biases as an additional neuron in every layer, and is multiplied by five because, for each neuron, the indexes of the first and last connections in the layer, the activation steepness, the type of the activation function, and the output are stored; N_{weights} is the total number of weights in the network; and N_{fann layers} is the total number of layers, including the input, hidden, and output layers, multiplied by two for storing the indexes of the first and last neurons in each layer.
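A short C sketch of the estimate in (2) is given below; the function and parameter names are illustrative and do not reproduce the actual conversion script.

    #include <stddef.h>

    /* Sketch of the memory estimate in (2). */
    size_t estimate_memory(size_t data_buffer_len,   /* length of one input sample buffer */
                           size_t num_neurons,       /* neurons incl. bias neurons        */
                           size_t num_weights,       /* total number of weights           */
                           size_t num_fann_layers,   /* input + hidden + output layers    */
                           size_t dtype_size)        /* sizeof(float) or fixed-point type */
    {
        return (2 * data_buffer_len      /* double buffering of the input stream          */
                + 5 * num_neurons        /* connection indexes, steepness, type, output   */
                + num_weights
                + 2 * num_fann_layers)   /* first/last neuron index per layer             */
               * dtype_size;
    }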

According to E_m and the selected processor, the framework automatically stores the network parameters in the level of memory that is closest to the processing unit and still contains the network. For example, if an ARM Cortex-M processor with 96 kB of RAM available for data storage is selected and the estimated network size is 30 kB, the network parameters will be automatically loaded into RAM. Additionally, for PULP-based processors, the framework is aware of the characteristic dual-domain feature of some PULP-based processors and of the DMA unit. For example, the following situations apply for PULP Mr. Wolf:

• FC selected, E_m smaller than the private L2 memory: the network is stored in the private memory of the FC.

• FC selected, E_m bigger than the private L2 memory: the network is stored in the shared L2 memory.

• Cluster selected, E_m smaller than the L1 memory: the network is stored in the L1 memory.

• Cluster selected, E_m bigger than the L1 memory: the network is stored in the shared L2 memory with automatic DMA transfers exploiting double buffering, i.e., while computing on one chunk of data, the next chunk is transferred simultaneously (a simplified sketch of this scheme follows this list). In this case, two types of transfers are possible:

  – When the largest layer fits into the L1 memory, the DMA unit transfers a whole layer at a time from L2 to L1 (i.e., layer-wise transfers).

  – When the largest layer does not fit into the L1 memory, the DMA unit performs neuron-wise transfers, i.e., it transfers the weights of a single neuron at a time.
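The layer-wise double-buffering scheme can be sketched as follows. The helpers dma_start() and dma_wait() are hypothetical stand-ins for the platform's asynchronous cluster-DMA calls (here implemented as blocking memcpy stubs so the sketch is self-contained), and compute_layer() stands in for the per-layer inference step; none of these names are from the FANN-on-MCU code.

    #include <string.h>

    /* Hypothetical DMA stand-ins: on Mr. Wolf these would be asynchronous
     * cluster-DMA transfers between L2 and L1. */
    static void dma_start(float *dst, const float *src, int n_words) {
        memcpy(dst, src, (size_t)n_words * sizeof(float));
    }
    static void dma_wait(void) { /* nothing to wait for with the memcpy stub */ }

    /* Placeholder for the computation on the weights currently in L1. */
    static void compute_layer(const float *l1_weights) { (void)l1_weights; }

    /* Layer-wise double buffering: fetch layer l+1 into one L1 buffer while
     * computing layer l from the other. */
    void run_layers_double_buffered(const float *weights_l2, const int *layer_sizes,
                                    int num_layers, float *l1_buf[2])
    {
        int cur = 0;
        dma_start(l1_buf[cur], weights_l2, layer_sizes[0]);   /* preload layer 0 */
        const float *src = weights_l2 + layer_sizes[0];

        for (int l = 0; l < num_layers; l++) {
            dma_wait();                                       /* layer l is now in L1 */
            if (l + 1 < num_layers) {
                dma_start(l1_buf[cur ^ 1], src, layer_sizes[l + 1]);
                src += layer_sizes[l + 1];
            }
            compute_layer(l1_buf[cur]);
            cur ^= 1;                                         /* swap the two buffers */
        }
    }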

[Figure 7: bar chart of the runtime in cycles, with and without activations, of the example network for FANNCortexM on the STM32, FANN-on-MCU on the STM32, FANN-on-MCU on a single core of the Mr. Wolf cluster, and FANN-on-MCU on the multi-core Mr. Wolf cluster, each in floating point and fixed point.]

Fig. 7. Runtime in number of cycles before and after the first optimization steps on the example network.

Generally, in PULP-based processors, the cluster domain is used for applications where a large amount of computation is required, while the FC is usually engaged in I/O handling or other kinds of scheduling. However, for an application scenario where, for example, a small network is used to detect an onset and, once the onset is detected, a deeper network is used for classification [44], both domains (SoC and cluster) have their own advantages: the FC continuously reads the sensory data and executes the onset detection algorithm, while the cluster domain is activated once the onset is detected, to perform the classification with a deep NN. In this case, our proposed framework stores the small network in the private L2 memory for the FC, while for the classification task the DMA unit transfers the network into the L1 memory using the double-buffering technique for the computation in the cluster domain. With this configuration, access to memory is the fastest for both computation domains. This way, we successfully meet the two main requirements in the IoT domain, low power and low latency, since we power on the powerful computation engine, i.e., the cluster domain, to perform fast computation only when it is strictly necessary, in order to save overall energy. The whole process is automated in our framework to alleviate the user's workload. A simplified sketch of this two-stage pattern follows.
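The sketch below only illustrates the control flow; cluster_power_on()/cluster_power_off() and cluster_run() are hypothetical placeholders for the platform runtime calls that mount the cluster and offload work to it, and onset_detect_fc()/classify_cluster() stand in for the two generated networks. None of these are actual FANN-on-MCU or PULP SDK functions.

    /* Hypothetical placeholders for the runtime and the two networks. */
    static void cluster_power_on(void)  { /* activate the cluster domain  */ }
    static void cluster_power_off(void) { /* power the cluster down again */ }
    static void cluster_run(void (*fn)(void)) { fn(); /* offload to the 8 cores */ }

    static int  onset_detect_fc(const float *window) { (void)window; return 0; }
    static void classify_cluster(void) { /* deep MLP, weights DMA-ed into L1 */ }

    /* The FC screens every sensor window with the small network and wakes the
     * cluster only when an onset is found. */
    void process_window(const float *window)
    {
        if (onset_detect_fc(window)) {
            cluster_power_on();
            cluster_run(classify_cluster);
            cluster_power_off();
        }
    }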

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of FANN-on-MCU on a wide variety of network architectures with varying input and output sizes and numbers of hidden layers and hidden units, before demonstrating its efficiency for specific application showcases in Section VI.

A. Methods

In a first step, we analyzed the code provided by FANNCortexM, which supports only ARM Cortex-M, to find performance bottlenecks and to do profiling, using an example network with 5 input features, 2 hidden layers with 100 neurons each, and 3 output classes. Each neuron is followed by a hyperbolic tangent as activation function. The evaluation is done on an STM32L475VG with an ARM Cortex-M4. Moreover, we run the same network on Mr. Wolf to show comparisons in both floating and fixed point.


[Figure 8: heatmaps of the measured runtime (in thousands of cycles) of a single-layer MLP for input sizes from 4 to 708 and output sizes from 4 to 708, on (a) an ARM Cortex-M4 and (b) the PULP IBEX core.]

Fig. 8. Runtime in number of cycles of a single layer obtained by varying the number of inputs and outputs of the layer. 0.0 marks the cases where the network is too big to be stored in the largest memory. Runtime measurements on (a) an ARM Cortex-M4 and (b) the PULP IBEX core with the basic RV32IMC ISA. In (a), the continuous blue grid delimits the cases where the layer is too large to fit into RAM and is therefore stored in flash memory. In (b), the purple dotted grid delimits the cases where the layer is too large to fit into the private L2 and is hence stored in the shared L2 memory.

We then proceeded with the implementation of FANN-on-MCU to support the deployment of ANNs on both the ARM Cortex-M series and the PULP family, and exhaustively evaluated its performance on the STM32L475VG with ARM Cortex-M4 and on the PULP-based Mr. Wolf with the following two approaches:

• We first considered single-layer performance. We measured the runtime of a single layer by varying the number of inputs and outputs of the layer, namely the length of the input feature vector and the number of neurons in the layer.

• Subsequently, we varied the number of hidden layers with fixed input features (100 input features) and fixed output units (8 classes) and measured the performance of the whole network. The number of hidden units in every hidden layer varies according to the following equation (a small sketch of this rule follows this list):

N_l = (l mod 2 + l div 2) · d,    (3)

where N_l is the number of neurons in the l-th hidden layer, mod is the modulo operation, div is the integer division, and d is a tunable parameter. Hence, the total number of hidden units (N_tot) is

N_tot = Σ_{l=1}^{L} (l mod 2 + l div 2) · d,    (4)

with L the total number of hidden layers. The tunable parameter d determines how fast the number of neurons in each layer increases, i.e., how fast the whole network grows before reaching the memory limitation of the embedded platform.
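A direct C transcription of the sizing rule in (3) and (4) is shown below (function names are illustrative); for example, with d = 10 the hidden layers have 10, 10, 20, 20, 30, ... neurons.

    /* Number of neurons in hidden layer l per (3): N_l = (l mod 2 + l div 2) * d. */
    int hidden_units_in_layer(int l, int d) {
        return (l % 2 + l / 2) * d;
    }

    /* Total number of hidden units over L hidden layers per (4). */
    int total_hidden_units(int num_hidden_layers, int d) {
        int total = 0;
        for (int l = 1; l <= num_hidden_layers; l++) {
            total += hidden_units_in_layer(l, d);
        }
        return total;
    }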

All the measurements are done on-board in number of cycles without any normalization. The results are shown and discussed in the next section.

B. Experimental Results and Discussion

The inference of an MLP comprises two main parts: it first computes the linear function in (1) by calculating the inner dot-products, and then it executes the non-linear activation function. We did our evaluations following this separation. We provide results for both floating-point and fixed-point implementations, as many Cortex-M and RISC-V devices, which are more cost-efficient, do not include a floating-point unit.

In the previous FANNCortexM, the buffer used to keep the intermediate values of each neuron is first filled with the biases and then overwritten immediately afterward. This step causes an unnecessary performance slowdown, which is eliminated in our version of FANN-on-MCU.


[Fig. 9 heatmaps: (a) speedup of a single-layer MLP on a single RI5CY core w.r.t. a single IBEX core; (b) parallel speedup of a single-layer MLP on multiple RI5CY cores w.r.t. a single RI5CY core, as a function of input and output size.]

Fig. 9. Speedup on PULP of a single-layer perceptron by varying the number of inputs and outputs of the layer. The purple dotted grid delimits the case when the layer is too large to fit into private L2, hence it is stored in the shared L2 memory for the FC. The gray dash-dotted grid delimits the case when the layer does not fit into the L1 memory and neuron-wise DMA transfer is applied for the Cluster. 0.0 is when the layer is too big to be stored in the largest memory. (a) Speedup of a single RI5CY core with respect to a single IBEX core, (b) Parallel speedup of multiple RI5CY cores with respect to a single one.

Figure 7 shows the gain in performance and the profiling results of the example network executed on the STM32L475VG's ARM Cortex-M4. In the plot, we can observe three main outcomes on the Cortex-M4: 1) The elimination of the unnecessary initialization improves the runtime by 3.1% and 7.7% for the floating-point and fixed-point implementations, respectively; 2) The fixed-point version is around 15% faster than the floating-point version; 3) The computation of the weight matrices without the activation functions is the most computationally demanding part. More specifically, it comprises approximately 88% of the total compute time for this example network. Based on these observations, we conducted our following analyses focusing on the computation of weight matrices without activation functions.

For Mr. Wolf, we observe that 1) the same network running on a single RI5CY core is around 1.3× and 1.4× faster than on the Cortex-M4 in floating and fixed point, respectively; 2) the parallelization provides up to 6× speedup in both floating and fixed point.

These results are well aligned with the expectations: 1) On the Cortex-M, the floating- and fixed-point implementations require 8 and 7 cycles in the inner-most loop (cf. Table I), respectively, which corresponds accurately to the ratio of the cycle counts reported in Figure 7; similarly, the ratio of the cycle counts between the Cortex-M and single-core RI5CY implementations matches the expected 7/5 and 8/5 factors for fixed/float, respectively; 2) The floating-point and fixed-point implementations on RI5CY both require 5 instructions, each taking 1 cycle, and thus result in similar execution times. Even after parallelization we do not see a significant performance difference between the two versions, despite sharing only 2 floating-point units (FPUs) among the 8 cores. However, as only every fifth instruction uses the FPU, its utilization reaches 80% and is thus not a performance bottleneck.
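As a back-of-the-envelope check of these ratios, using the per-iteration cycle counts of Table I and the instruction mix stated above:

8/7 ≈ 1.14 (Cortex-M4, float vs. fixed);  7/5 = 1.4 and 8/5 = 1.6 (Cortex-M4 vs. single RI5CY, fixed and float);
FPU utilization ≈ (8 cores · 1/5 FPU instruction per cycle) / 2 FPUs = 0.8.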

We further proceeded with a single-layer performance examination by focusing on the computation of weight matrices without activation functions. Figure 8a shows the runtime in number of cycles of a single layer by varying the number of inputs and outputs of the layer, executed on the STM32L475VG with ARM Cortex-M4 with the fixed-point implementation. The blue grid delimits the case when the layer is too large to fit into RAM and is thus stored in flash memory. 0.0 indicates that the layer is even too large to be stored in flash memory. Figure 8b shows the same measurements on the FC of Mr. Wolf, i.e., a single IBEX core with the basic RV32IMC instruction set. The dotted purple grid marks the fact that the layer no longer fits into the private L2 memory and is thus stored in the larger shared L2 memory. These two figures represent the reference in number of cycles for the following discussions, which are conducted in terms of speedups.

Compared to a single IBEX core, we can see from the results in Figure 9a that we gain up to 2.2× speedup using a single RI5CY core by adding the custom instruction set extensions, most importantly hardware loops and post-increment loads.


TABLE I
ASSEMBLY CODE FOR THE INNER-LOOP DOT-PRODUCT. THE NUMBER OF CYCLES TAKEN BY EACH INSTRUCTION IS REPORTED IN PARENTHESES.

ARM Cortex-M4, float (4× loop unrolling):
  vldmia.32 (post-incr. load, 1); vldmia.32; subs (counter subtract, 1); vfma.f32 (fused mult-add, 3); bne (taken conditional branch, 2)

ARM Cortex-M4, fixed (4× loop unrolling):
  ldr (post-incr. load, 1); ldr (post-incr. load, 1); mul (multiply, 1); add (addition, 1); subs (counter subtract, 1); bne (taken conditional branch, 2)

RISC-V RI5CY, float (2× loop unrolling):
  flw (float load, 1); flw; addi (pointer incr., 1); addi; fmadd.s (fused mult-add, 1)

RISC-V RI5CY, fixed (2× loop unrolling):
  p.lw (post-incr. load, 1); p.lw; mul (multiply, 1); sra (shift right arith., 1); add (addition, 1)

[Fig. 10 heatmaps: (a) speedup of a single-layer MLP on a single RI5CY core w.r.t. the ARM Cortex-M4; (b) speedup of a single-layer MLP on multiple RI5CY cores w.r.t. the ARM Cortex-M4, as a function of input and output size.]

Fig. 10. Comparison between the ARM Cortex-M4 and PULP RI5CY cores on a single-layer perceptron by varying the number of inputs and outputs of the layer. The gray dash-dotted grid delimits the case when the layer is too big for the L1 memory and neuron-wise DMA transfer is applied for RI5CY. The continuous blue grid delimits the case when the layer is too large to fit into RAM, hence it is stored in flash for the Cortex-M4. 0.0 is when the layer is too big to be stored in the largest memory. (a) Speedup of a single RI5CY core with respect to the ARM Cortex-M4, (b) Parallel speedup of multiple RI5CY cores with respect to the ARM Cortex-M4.

The grey dash-dotted grid delimits the results when the layer does not fit into the L1 memory and a neuron-wise double-buffering transfer using the DMA unit is applied, i.e., the weights of a single neuron are transferred from the L2 to the L1 memory while the RI5CY processor computes the result of the previous neuron. We can see that the speedup for larger input sizes is higher than the speedup for smaller input sizes because the overhead of activating the DMA transfers becomes negligible with the longer computation time due to larger input feature vectors. We further measured the speedup of parallel execution with respect to single-RI5CY-core execution. The results in Figure 9b demonstrate up to 7.7× speedup with on-board measurements.
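A minimal sketch of this neuron-wise double-buffering scheme is given below, assuming a hypothetical asynchronous copy interface: dma_start/dma_wait are illustrative stand-ins for the PULP runtime's DMA calls, here emulated with memcpy so the snippet is self-contained, and the buffer size and Q-format shift are likewise example values.

```c
#include <stdint.h>
#include <string.h>

#define MAX_INPUTS 1024  /* illustrative capacity of one L1 weight buffer */

/* Stand-ins for the cluster DMA: on Mr. Wolf these would be the runtime's
 * asynchronous DMA calls (names here are illustrative, not the actual API). */
static void dma_start(void *dst, const void *src, size_t bytes, int *id) {
    memcpy(dst, src, bytes);  /* synchronous emulation on the host */
    (void)id;
}
static void dma_wait(int *id) { (void)id; }

/* Neuron-wise double buffering: while the weights of neuron n are used for the
 * dot-product, the weights of neuron n+1 are already being copied from L2 into
 * the other L1 buffer, hiding the transfer behind the computation. */
void layer_neuron_wise(const int32_t *w_l2, const int32_t *in_l1,
                       const int32_t *bias_l1, int32_t *out_l1,
                       int num_in, int num_out)
{
    static int32_t w_buf[2][MAX_INPUTS];
    int id[2];
    int cur = 0;

    dma_start(w_buf[cur], w_l2, (size_t)num_in * sizeof(int32_t), &id[cur]);
    for (int n = 0; n < num_out; n++) {
        int nxt = cur ^ 1;
        if (n + 1 < num_out)  /* prefetch the next neuron's weights */
            dma_start(w_buf[nxt], w_l2 + (size_t)(n + 1) * num_in,
                      (size_t)num_in * sizeof(int32_t), &id[nxt]);

        dma_wait(&id[cur]);
        int32_t acc = bias_l1[n];
        for (int i = 0; i < num_in; i++)
            acc += (w_buf[cur][i] * in_l1[i]) >> 13;  /* example Q-format shift */
        out_l1[n] = acc;
        cur = nxt;
    }
}
```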

Subsequently, we compared the execution on the ARM Cortex-M4 and PULP Mr. Wolf. Figure 10a shows the speedup of a single RI5CY core over the ARM Cortex-M4, while Figure 10b demonstrates the parallel RI5CY speedup over the ARM Cortex-M4. The former reaches almost a 2× speedup, whereas the latter achieves up to 13.5× speedup.

Finally, we measured the performance of entire networks with fixed input and output layers while varying the number of hidden layers and hidden units, as described in Section V. We used the following criteria to choose the parameter d in (3) and (4):

1) To show at least two measurement points in each memory section, for example between the dashed line and the dash-dotted line in Figure 12b, i.e., when the network fits into RAM on the STM32 with ARM Cortex-M4 but not into the L1 of PULP Mr. Wolf.

Fig. 11. Runtime in number of cycles of the whole network with varying number of hidden layers and hidden units.

2) To include all the possible cases of memory transfer, i.e., both layer-wise transfer and neuron-wise transfer. If d is small, the case of neuron-wise transfer will be absent, because the number of neurons in each layer grows too slowly before the whole network becomes too big to fit into the largest available memory.

3) To have the highest energy efficiency with parallelization.

Based on the estimated memory in (2) and several experiments, 6 < d < 12 satisfies criteria 1 and 2 but not criterion 3. With d = 8 all the above-mentioned decision policies are met. With this chosen value, we show the performance measurements from relatively small networks with a single hidden layer of only 8 hidden nodes, which fit into the RAM of the Cortex-M4 and into the L1 and private L2 of Mr. Wolf, up to relatively large networks with 24 hidden layers and 1248 hidden units, where neuron-wise transfer is required for an optimized execution. With this design, networks larger than 24 hidden layers do not fit into the largest memory of our selected processors. Moreover, since Mr. Wolf has eight parallel cores, layers whose number of neurons is a multiple of 8 exploit the parallelization at its highest efficiency. Figure 11 shows the runtime in number of cycles, and Figure 12a and Figure 12b present the speedup.

As can be seen in Figure 12a, the network fits into the L1 memory of Mr. Wolf up to 12 hidden layers, i.e., 336 hidden units. Networks larger than 12 hidden layers are stored in the L2 memory and are transferred piece-wise to the L1 memory with DMA transfers in a double-buffering configuration. The largest layer of the networks with between 13 and 21 hidden layers still fits into the L1 memory, allowing for layer-wise DMA transfers, i.e., the parameters of a whole layer are transferred at a time. For networks with more than 21 hidden layers, neuron-wise DMA transfers are performed, as not all layers fit into the L1 memory even individually. Thus, we transfer the parameters of one neuron at a time.
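The choice among these three regimes can be summarized in a small decision rule. The sketch below is illustrative only (names and thresholds are not the toolkit's actual identifiers); in practice the L1 budget must also leave room for the activation buffers and the double-buffering margin.

```c
#include <stddef.h>

/* Illustrative transfer-policy selection for the Mr. Wolf cluster. */
typedef enum { FITS_IN_L1, LAYER_WISE_DMA, NEURON_WISE_DMA } xfer_policy_t;

static xfer_policy_t choose_policy(size_t network_bytes,       /* all weights and biases */
                                   size_t largest_layer_bytes,
                                   size_t l1_budget_bytes)
{
    if (network_bytes <= l1_budget_bytes)
        return FITS_IN_L1;       /* whole network kept resident in L1 */
    if (largest_layer_bytes <= l1_budget_bytes)
        return LAYER_WISE_DMA;   /* stream one layer at a time from L2 */
    return NEURON_WISE_DMA;      /* stream one neuron's weights at a time */
}
```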

We can see from the plot that when the network size is small, the parallel speedup is lower than in cases where the network size is large. This can be attributed to the overhead of parallelization and is more noticeable for small networks, where the amount of computation is relatively small. However, we can still reach around 4.5× parallel speedup with respect to the single RI5CY core for tiny networks such as the one with only one hidden layer comprising 8 hidden neurons. In general, the parallel speedup increases with the network size.

Figure 12b demonstrates the comparison between PULP Mr. Wolf and the ARM Cortex-M4. We can see that the IBEX core is slightly slower than the Cortex-M4 when the latter accesses RAM, while the performance of the Cortex-M4 degrades slightly when the network is too large to fit into RAM and it has to access the flash memory. In this case, the IBEX core is as fast as the Cortex-M4. The same phenomenon can be observed in the other two curves, i.e., the single-core and multi-core speedup of RI5CY over the Cortex-M4. Specifically, a single RI5CY core is almost twice as fast as the Cortex-M4 as a result of the custom ISA extensions, e.g., hardware loops and post-increment loads. The parallel speedup grows steadily with increasing network size. As expected, we see a speedup drop when the network is too large to be stored in L1 for the RI5CY cores while it still fits into the RAM of the Cortex-M4. Conversely, when the network is stored in flash on the Cortex-M4, the DMA transfer from L2 to L1 in Mr. Wolf offers a much larger gain than flash access on the Cortex-M4; more specifically, the speedup reaches up to 11.1× for the designed network architecture. As mentioned previously, the purpose of this experiment is to show the performance of an entire network starting with very small network sizes, such as the one with only 8 hidden units, and increasing the network size by adding hidden layers and hidden neurons consecutively. With this design, networks with more than 24 hidden layers do not fit into the largest memory. Nevertheless, the performance of larger networks using neuron-wise DMA transfer is exhaustively presented in Figures 9a, 9b, 10a, and 10b, and discussed in the first part of this section.

VI. APPLICATION SHOWCASES

MLPs have been successfully used in a wide range of application scenarios, such as disease detection [45], activity recognition [46], and brain-machine interfaces [47]. Many studies identified MLPs to be the best or one of the best algorithms to solve tasks in the IoT domain using wearable devices [48]–[51]. In this section, we present three application showcases found in the literature using an MLP with different network sizes [46]–[48] in order to demonstrate the usability and the power efficiency of our proposed framework with the supported MCUs. We reproduced the network architectures and executed the classifications on both the Nordic nRF52832 with ARM Cortex-M4 and the PULP Mr. Wolf present on InfiniWolf using our proposed framework and measured the runtime and power consumption.


[Fig. 12 plots: speedup of the whole network versus the number of hidden layers and the total number of hidden units; (a) on PULP Mr. Wolf (Single-RI5CY/Single-IBEX and Multi-RI5CY/Single-RI5CY), (b) Mr. Wolf versus ARM Cortex-M4 (Single-IBEX/Cortex-M4, Single-RI5CY/Cortex-M4, and Multi-RI5CY/Cortex-M4).]

Fig. 12. Speedup measurements of the whole network with varying number of hidden layers and hidden units. (a) Speedup measurements on PULP Mr. Wolf, (b) Comparison between PULP Mr. Wolf and ARM Cortex-M4.

A. Hand Gesture Recognition

The authors in [47] presented a gesture classification method based on a sensor fusion technique using surface electromyography (EMG) and a 9-axis inertial measurement unit (IMU). The EMG and IMU signals were acquired using the Myo Armband [52] placed around the forearm. The goal is to recognize 10 hand gestures. 76 time-domain features are extracted from the EMG and IMU signals and fed as input to the designed MLP with three hidden layers of 300, 200, and 100 hidden units, and an output layer of 10 classes. The highest classification accuracy achieved is 85.58%. As a shorthand, we name it application A.
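For later reference, the multiply-accumulate (MAC) count of this network follows directly from the layer sizes (bias additions neglected):

76·300 + 300·200 + 200·100 + 100·10 = 22800 + 60000 + 20000 + 1000 = 103800 MACs.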

B. Fall Detection for Elderly People

We implement the model proposed in [48], named application B, which combines pressure sensors in the insoles and accelerometers at the head, pelvis, and left and right shanks of the shoes to assess fall risk in elderly people. The authors extracted spatial, temporal, and frequency-domain features from the pressure and accelerometer data and explored various sensor combinations and three different machine learning models, i.e., MLPs, naïve Bayes, and Support Vector Machines. As a result, the best-performing model was an MLP with input parameters from pressure sensors and accelerometers at the head, pelvis, and left shank, reaching the best accuracy of 84%. The network is composed of 117, 20, and 2 nodes in the input, hidden, and output layers, respectively.

C. Human Activity Classification

The authors in [46] proposed an FPGA implementation of MLPs with parallel computation to classify human activity in real time. The dataset is acquired by a 3-axial accelerometer worn on the waist and classified into five activity classes. Various MLP topologies were investigated, and the final proposed architecture consists of 7 input features extracted using a sliding window, 6 hidden nodes arranged in a single layer, and 5 output nodes, with the best accuracy being 94.6%. The trained model runs on a Xilinx FPGA with an execution time of 270 ns and a power consumption of 241 mW. Here we call it application C for convenience.

D. Experimental Evaluation and Results

We reproduced the network architectures described above with sigmoidal activation functions using the FANN library and deployed them with our framework on the two processors present on InfiniWolf, i.e., the Nordic nRF52832 with an ARM Cortex-M4 and Mr. Wolf. The measurements on the nRF52832 are done with the processor running at 64 MHz and with the DC/DC regulator enabled, since the datasheet states that using the DC/DC regulator reduces the current consumption compared to using the LDO regulator. The measurements on Mr. Wolf are done considering both the SoC and Cluster domains with the processors running at 100 MHz, since it is shown that at this frequency the energy efficiency is maximized [31]. The measurements are performed using the Keysight N6705C power analyzer with a minimum sampling interval of 0.1024 ms.

Table II summarizes the measured runtime and the average power consumption of the three MLPs used in the three applications. Let us first analyze application A, which has the largest network architecture, i.e., 76-300-200-100-10, yielding 103800 MACs. We can see that the runtime parallel speedup using multiple RI5CY cores is 7.1× with respect to a single one, which corresponds to the measurements shown in Figure 9b. For example, according to Figure 9b, if we take an input size of 324 and an output size of 292, which requires approximately the same amount of computation (94608 MACs), we gain the same speedup of 7.1×. For comparisons with IBEX and the Cortex-M4 we additionally have to consider the activation, initialization, and deactivation of the cluster, which introduce a constant overhead of 1.2 ms on average, and the DMA transfers of the input data from L2 to L1, which


in this case are negligible with 76 inputs (~2.5 µs). Figure 13 plots the measured power consumption during the end-to-end execution of a single classification, including the cluster activation/deactivation and DMA transfers. Due to Amdahl's law, the overall runtime speedup of multiple RI5CY cores is 3.5× with respect to a single core for one classification. If we do multiple classifications, the constant overhead becomes negligible as the number of classifications increases. Moreover, this overhead can be reduced with improved drivers, which is not the focus of this work.

Considering the end-to-end performance of a single classification, we obtain 5.7× and 8.8× speedup using multiple RI5CY cores with respect to IBEX and the Cortex-M4, respectively. These figures grow asymptotically towards 14.3× and 22× with an increasing number of MLP classifications per cluster activation. Similar conclusions can be drawn for the energy consumption. The constant overhead is around 13 µJ, while the energy consumed for one classification using parallel computation is 54 µJ, which is around 2.2× more energy efficient than the single-core computation. Comparing the end-to-end parallel performance of one classification to the Cortex-M4 and IBEX, we save 2.8× and 1.8× energy, respectively. These gains increase up to 3.4× and 2.3× with a rising number of classifications per cluster activation.

Finally, for very small networks, such as the ones used in applications B and C, the runtime is far below the millisecond range. If the application scenario requires only very few classifications per cluster activation, then the IBEX core is the most energy-efficient one, with a consumption of 2.9 µJ and 0.15 µJ for applications B and C, respectively. Compared to the work in [46] for application C, the IBEX core is 13.5× faster in computation time and 434× more energy efficient than the parallel FPGA implementation. However, if continuous classification is required, which is the case for the vast majority of IoT applications, then the parallel execution, once again, outperforms in terms of speed and energy efficiency. For example, for one classification in application B, the IBEX core consumes 2.86 µJ, while the parallel execution consumes 0.67 µJ in addition to the constant overhead of 13 µJ, which is spent only once; hence, the parallel ultra-low-power approach already pays off when more than 6 classifications are done. If continuous classification is required, the parallel approach is 4× as energy-efficient as the single IBEX core.
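The break-even point quoted above follows directly from the measured energies: with a one-time cluster overhead of about 13 µJ and per-classification energies of 2.86 µJ (IBEX) versus 0.67 µJ (parallel RI5CY),

n > 13 µJ / (2.86 µJ − 0.67 µJ) ≈ 5.9,

so from roughly the sixth classification per cluster activation onwards the parallel execution consumes less total energy.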

VII. DISCUSSION

In this work we presented FANN-on-MCU, a toolkit to implement efficient ANNs on both ARM Cortex-M and RISC-V-based PULP processors. We also analyzed the performance of our toolkit by varying the number of hidden layers and hidden neurons. We demonstrated that the parallelization over eight cores offers up to 7.1× speedup with respect to a single core. However, the usage of more cores means more power consumption, as can be noticed in Figure 13. In future work, the trade-off between the number of active cores, i.e., power consumption, and the parallel speedup is to be analyzed. We further analyzed in more detail three different networks used in three application showcases with runtime and power

[Fig. 13 plot: measured power over time for the Multi-RI5CY and Single-RI5CY executions of application A on Mr. Wolf with the cluster; annotated energies: 12.82 µJ (cluster overhead), 53.58 µJ (parallel classification), 117.41 µJ (single-core classification).]

Fig. 13. End-to-end power measurements of one classification of application A on Mr. Wolf with RI5CY core(s).

measurements. However, the limitation of this approach is that the data acquisition and the preprocessing of the data are not considered. In future work, we are planning to implement the full chain of data acquisition and processing to fully evaluate the complete system.

VIII. CONCLUSION

Artificial intelligence and machine learning for low-power IoT devices that host MCUs are key technologies for near-sensor data analytics and decision making. Multi-layer neural networks are showing impressive accuracy in many applications. However, very few implementations and measurement results exist for low-power MCUs. We have presented FANN-on-MCU, a framework to facilitate the deployment of optimized neural networks trained using the open-source FANN library. Our framework supports not only the very popular ARM Cortex-M series MCUs, but also the RISC-V-based Parallel Ultra-Low-Power (PULP) processors, both with and without a floating-point unit. We have further shown performance comparisons of neural network inference between the two classes of processors and analyzed the parallel speedups and degradations due to parallelization overhead and memory transfers. Experimental results have shown that a parallel implementation on the PULP-based Mr. Wolf can reach up to 7.1× speedup with respect to a single-core implementation, while it outperforms an ARM Cortex-M4 by up to 13.5×. Moreover, we presented three different use cases to evaluate the energy efficiency of our framework. We have demonstrated that ANNs are suitable for deployment on ultra-low-power MCUs in terms of memory usage, compute time, and energy consumption. Specifically, experimental measurements have shown that, using FANN-on-MCU, a relatively big network with 103800 multiply-accumulate operations can be executed within 17.6 ms using 183 µJ on a Nordic nRF52832 MCU with an ARM Cortex-M4, whilst the parallel implementation on Mr. Wolf with 8 RISC-V-based RI5CY cores executes the same inference in less than 1 ms consuming around 50 µJ. More generally, a parallel implementation on Mr. Wolf offers on average an 80% energy reduction and a 14× speedup in


TABLE II
MEASURED RUNTIME AND AVERAGE POWER CONSUMPTION DURING COMPUTATION ON THE APPLICATION SHOWCASES. RELATIVE IMPROVEMENTS PER INFERENCE W.R.T. THE ARM CORTEX-M4 IMPLEMENTATION IN EXECUTION TIME SPEEDUP AND IN ENERGY REDUCTION ARE REPORTED IN PARENTHESES.

App. A (runtime / avg. power / energy per inference):
  nRF52832 ARM Cortex-M4: 17.6 ms / 10.44 mW / 183.74 µJ
  Mr. Wolf IBEX: 11.4 ms (1.54×) / 10.75 mW / 122.55 µJ (-33.30%)
  Mr. Wolf Single-RI5CY: 5.7 ms* (3.09×) / 20.35 mW / 116.00 µJ (-36.87%)
  Mr. Wolf Multi-RI5CY: 0.8 ms* (22×) / 61.79 mW / 49.43 µJ (-73.10%)

App. B (runtime / avg. power / energy per inference):
  nRF52832 ARM Cortex-M4: 0.4 ms / 11.21 mW / 4.48 µJ
  Mr. Wolf IBEX: 0.3 ms (1.33×) / 9.52 mW / 2.86 µJ (-36.16%)
  Mr. Wolf Single-RI5CY: 0.14 ms*,† (2.86×) / 17.54 mW / 2.46 µJ (-45.09%)
  Mr. Wolf Multi-RI5CY: 0.03 ms*,† (13.33×) / 22.18 mW / 0.6654 µJ (-85.15%)

App. C (runtime / avg. power / energy per inference):
  nRF52832 ARM Cortex-M4: 0.03 ms† / 9.74 mW / 0.2922 µJ
  Mr. Wolf IBEX: 0.02 ms† (1.5×) / 7.31 mW / 0.1462 µJ (-49.97%)
  Mr. Wolf Single-RI5CY: 0.01 ms*,† (3×) / 16.91 mW / 0.1691 µJ (-42.13%)
  Mr. Wolf Multi-RI5CY: 0.004 ms*,† (7.5×) / 17.17 mW / 0.06868 µJ (-76.50%)

* In addition, around 1–1.3 ms for cluster activation, initialization, and deactivation with an average power consumption of 11.88 mW.
† The runtime is below or very close to the precision of the measuring instrument, hence it is calculated from the measured number of cycles.

inference compared to an ARM Cortex-M4 implementation. Finally, the framework is released open-source and ready to be used to deploy neural networks for applications on low-power embedded systems.

REFERENCES

[1] K. He, X. Zhang, S. Ren et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in Proc. IEEE ICCV. IEEE, 12 2015, pp. 1026–1034.

[2] Y. Yoshimura, B. Cai, Z. Wang et al., “Deep learning architect: Classification for architectural design through the eye of artificial intelligence,” in International Conference on Computers in Urban Planning and Urban Management. Springer, 2019, pp. 249–265.

[3] J. Ray, O. Johnny, M. Trovati et al., “The rise of big data science: A survey of techniques, methods and approaches in the field of natural language processing and network theory,” Big Data and Cognitive Computing, vol. 2, no. 3, p. 22, 2018.

[4] G. Zazzaro, S. Cuomo, A. Martone et al., “EEG signal analysis for epileptic seizures detection by applying data mining techniques,” Internet of Things, 2019.

[5] A. Zanella, N. Bui, A. Castellani et al., “Internet of things for smart cities,” IEEE Internet of Things Journal, vol. 1, no. 1, pp. 22–32, 2014.

[6] K. E. Jeon, J. She, P. Soonsawad et al., “BLE beacons for internet of things applications: Survey, challenges, and opportunities,” IEEE Internet of Things Journal, vol. 5, no. 2, pp. 811–828, 2018.

[7] A. Chianese, F. Piccialli, and I. Valente, “Smart environments and cultural heritage: a novel approach to create intelligent cultural spaces,” Journal of Location Based Services, vol. 9, no. 3, pp. 209–234, 2015.

[8] Y. Yoshimura, S. Sobolevsky, C. Ratti et al., “An analysis of visitors’ behavior in the Louvre museum: A study using Bluetooth data,” Environment and Planning B: Planning and Design, vol. 41, no. 6, pp. 1113–1131, 2014.

[9] F. Samie, L. Bauer, and J. Henkel, “From cloud down to things: An overview of machine learning in internet of things,” IEEE Internet of Things Journal, 2019.

[10] G. Premsankar, M. Di Francesco, and T. Taleb, “Edge computing for the internet of things: A case study,” IEEE Internet of Things Journal, vol. 5, no. 2, pp. 1275–1284, 2018.

[11] H. Yao, D. Fu, P. Zhang et al., “MSML: A novel multilevel semi-supervised machine learning framework for intrusion detection system,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 1949–1959, 2018.

[12] M. Qin, L. Chen, N. Zhao et al., “Power-constrained edge computing with maximum processing capacity for IoT networks,” IEEE Internet of Things Journal, 2018.

[13] K. Kaewkannate and S. Kim, “A comparison of wearable fitness devices,” BMC Public Health, vol. 16, no. 1, p. 433, 12 2016.

[14] T. Rault, A. Bouabdallah, Y. Challal et al., “A survey of energy-efficient context recognition systems using wearable sensors for healthcare applications,” Pervasive and Mobile Computing, vol. 37, pp. 23–44, 6 2017.

[15] L. Cavigelli, P. Degen, and L. Benini, “CBinfer: Change-Based Inference for Convolutional Neural Networks on Video Data,” in Proc. ACM ICDSC, 2017.

[16] R. Andri, L. Cavigelli, D. Rossi et al., “Hyperdrive: A Systolically Scalable Binary-Weight CNN Inference Engine for mW IoT End-Nodes,” in Proc. IEEE ISVLSI, 2018, pp. 509–515.

[17] L. Cavigelli and L. Benini, “CBinfer: Exploiting Frame-to-Frame Locality for Faster Convolutional Network Inference on Video Streams,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.

[18] A. Rahimi, P. Kanerva, and J. M. Rabaey, “A Robust and Energy-Efficient Classifier Using Brain-Inspired Hyperdimensional Computing,” in Proc. IEEE ISLPED. New York, New York, USA: ACM Press, 2016, pp. 64–69.

[19] M. Magno, M. Pritz, P. Mayer et al., “DeepEmote: Towards multi-layer neural networks in a low power wearable multi-sensors bracelet,” in Proc. IEEE IWASI, 2017, pp. 32–37.

[20] M. Salvaro, S. Benatti, V. J. Kartsch et al., “A minimally invasive low-power platform for real-time brain computer interaction based on canonical correlation analysis,” IEEE Internet of Things Journal, vol. 6, no. 1, pp. 967–977, 2018.

[21] H. Rathore, Mapping Biological Systems to Network Systems. Cham: Springer International Publishing, 2016.

[22] P. Mayer, M. Magno, and L. Benini, “Self-Sustaining Acoustic Sensor With Programmable Pattern Recognition for Underwater Monitoring,” IEEE Transactions on Instrumentation and Measurement, vol. 68, no. 7, pp. 2346–2355, 7 2019.

[23] X. Wang, M. Magno, L. Cavigelli et al., “Rat Cortical Layers Classification extracting Evoked Local Field Potential Images with Implanted Multi-Electrode Sensor,” in Proc. IEEE HEALTHCOM, 2018.

[24] S. Murali, F. Rincon, and D. Atienza, “A wearable device for physical and emotional health monitoring,” in Computing in Cardiology, 2015.

[25] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional Network Accelerator,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, 11 2017.

[26] M. Gautschi, P. D. Schiavone, A. Traber et al., “Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700–2713, 2017.

[27] M. Eggimann, S. Mach, M. Magno et al., “A RISC-V based open hardware platform for always-on wearable smart sensing,” in Proc. IEEE IWASI. IEEE, 2019, pp. 169–174.

[28] M. Chen, Y. Li, X. Luo et al., “A novel human activity recognition scheme for smart health using multilayer extreme learning machine,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 1410–1418, 2018.

[29] S. Nissen, “FANN - Fast Artificial Neural Network Library,” University of Copenhagen (DIKU), Tech. Rep., 2003. [Online]. Available: www.leenissen.dk

[30] ARM Inc., “ARM CMSIS: Cortex Microcontroller Software Interface Standard,” 2017.

[31] A. Pullini, D. Rossi, I. Loi et al., “Mr. Wolf: An energy-precision scalable parallel ultra low power SoC for IoT edge processing,” IEEE Journal of Solid-State Circuits, vol. 54, no. 7, pp. 1970–1981, July 2019.

[32] M. Magno, F. Vultier, B. Szebedy et al., “Long-term monitoring of small-sized birds using a miniaturized Bluetooth-Low-Energy sensor node,” in Proc. IEEE SENSORS, 2017.

[33] Y. Dong, H. Su, J. Zhu et al., “Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples,” arXiv:1708.05493, 8 2017.


[34] M. Magno, L. Cavigelli, P. Mayer et al., “FANNCortexM: An open source toolkit for deployment of multi-layer neural networks on ARM Cortex-M family microcontrollers: Performance analysis with stress detection,” in Proc. IEEE WF-IoT, 04 2019, pp. 793–798.

[35] B. Kuyumcu, “FANN Tool,” 2007. [Online]. Available: https://code.google.com/archive/p/fanntool

[36] ST Microelectronics Inc., “X-CUBE-AI Data brief: Artificial intelligence (AI) software expansion for STM32Cube X-CUBE-AI,” ST Microelectronics Inc., Tech. Rep., 2019.

[37] M. S. Louis, Z. Azad, L. Delshadtehrani et al., “Towards Deep Learning using TensorFlow Lite on RISC-V,” in Proc. ACM CARRV, 2019.

[38] L. Lai, N. Suda, and V. Chandra, “CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs,” arXiv:1801.06601, 2018.

[39] D. Palossi, A. Loquercio, F. Conti et al., “A 64-mW DNN-based visual navigation engine for autonomous nano-drones,” IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8357–8371, Oct 2019.

[40] P. D. Schiavone, F. Conti, D. Rossi et al., “Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications,” in 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS). IEEE, 9 2017. [Online]. Available: http://ieeexplore.ieee.org/document/8106976/

[41] D. Palossi, A. Loquercio, F. Conti et al., “A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones,” IEEE Internet of Things Journal, vol. PP, no. May, pp. 1–1, 2019.

[42] M. Thielen, L. Sigrist, M. Magno et al., “Human body heat for powering wearable devices: From thermal energy to application,” Energy Conversion and Management, vol. 131, pp. 44–54, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0196890416310007

[43] M. Magno, D. Brunelli, L. Sigrist et al., “InfiniTime: Multi-sensor wearable bracelet with human body harvesting,” Sustainable Computing: Informatics and Systems, vol. 11, pp. 38–49, 2016, SI: IGCC 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2210537916300816

[44] E. Coninck, T. Verbelen, B. Vankeirsbilck et al., “Distributed neural networks for internet of things: The big-little approach,” in Proc. Internet of Things 360, 2016.

[45] A. K. Das, F. Catthoor, and S. Schaafsma, “Heartbeat classification in wearables using multi-layer perceptron and time-frequency joint distribution of ECG,” in Proc. IEEE/ACM CHASE, 2018, pp. 69–74.

[46] N. B. Gaikwad, V. Tiwari, A. Keskar et al., “Efficient FPGA implementation of multilayer perceptron for real-time human activity classification,” IEEE Access, vol. 7, pp. 26 696–26 706, 2019.

[47] J. G. Colli-Alfaro, A. Ibrahim, and A. L. Trejos, “Design of user-independent hand gesture recognition using multilayer perceptron networks and sensor fusion techniques,” in Proc. IEEE ICORR, 2019, pp. 1103–1108.

[48] J. Howcroft, E. D. Lemaire, and J. Kofman, “Wearable-sensor-based classification models of faller status in older adults,” PLOS ONE, vol. 11, no. 4, pp. 1–16, 04 2016.

[49] E. Sazonov, N. Hegde, R. C. Browning et al., “Posture and activity recognition and energy expenditure estimation in a wearable platform,” IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 4, pp. 1339–1346, July 2015.

[50] R. LeMoyne, F. Heerinckx, T. Aranca et al., “Wearable body and wireless inertial sensors for machine learning classification of gait for people with Friedreich’s ataxia,” in 2016 IEEE 13th International Conference on Wearable and Implantable Body Sensor Networks (BSN), June 2016, pp. 147–151.

[51] Y. Can, N. Chalabianloo, D. Ekiz et al., “Continuous stress detection using wearable sensors in real life: Algorithmic programming contest case study,” Sensors, vol. 19, 2019.

[52] “Myo gesture control armband — wearable technology by Thalmic Labs,” https://www.myo.com/.

Xiaying Wang received her B.Sc. and M.Sc. degrees in biomedical engineering from Politecnico di Milano, Italy, and ETH Zurich, Switzerland, in 2016 and 2018, respectively. She is currently pursuing a Ph.D. degree at the Integrated Systems Laboratory at ETH Zurich. Her research interests include biosignal processing, low power embedded systems, energy-efficient smart sensors, and machine learning on microcontrollers. She received the excellent paper award at the IEEE Healthcom conference in 2018 and she won the Ph.D. Fellowship funded by the Swiss Data Science Center in 2019.

Michele Magno received the master’s and Ph.D. degrees in electronic engineering from the University of Bologna, Bologna, Italy, in 2004 and 2010, respectively. He is currently a Senior Researcher and lecturer with ETH Zurich, Switzerland. He has authored more than 150 papers in international journals and conferences, a few of them awarded as best paper. His current research interests include wireless sensor networks, wearable devices, energy harvesting, low power management techniques, and extension of the lifetime of battery-operated devices.

Lukas Cavigelli received the B.Sc., M.Sc., and Ph.D. degrees in electrical engineering and information technology from ETH Zurich, Zurich, Switzerland, in 2012, 2014, and 2019, respectively. He has since been a postdoctoral researcher with ETH Zurich. His research interests include deep learning, computer vision, embedded systems, and low-power integrated circuit design. He has received the best paper award at the VLSI-SoC and ICDSC conferences in 2013 and 2017, the best student paper award at the Security+Defense conference in 2016, and the Donald O. Pederson best paper award (IEEE TCAD) in 2019.

Luca Benini is the Chair of Digital Circuits and Systems at ETH Zurich and a Full Professor at the University of Bologna. He has served as Chief Architect for the Platform2012 in STMicroelectronics, Grenoble. Dr. Benini’s research interests are in energy-efficient system and multi-core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks. He has published more than 1’000 papers in peer-reviewed international journals and conferences, four books, and several book chapters. He is a Fellow of the ACM and of the IEEE and a member of the Academia Europaea.