
FLASH: Fast Neural Architecture Search with Hardware Optimization
GUIHONG LI, The University of Texas at Austin, USA
SUMIT K. MANDAL, University of Wisconsin–Madison, USA
UMIT Y. OGRAS, University of Wisconsin–Madison, USA
RADU MARCULESCU, The University of Texas at Austin, USA

Neural architecture search (NAS) is a promising technique to design efficient and high-performance deep neural networks (DNNs). As the performance requirements of ML applications grow continuously, the hardware accelerators start playing a central role in DNN design. This trend makes NAS even more complicated and time-consuming for most real applications. This paper proposes FLASH, a very fast NAS methodology that co-optimizes the DNN accuracy and performance on a real hardware platform. As the main theoretical contribution, we first propose the NN-Degree, an analytical metric to quantify the topological characteristics of DNNs with skip connections (e.g., DenseNets, ResNets, Wide-ResNets, and MobileNets). The newly proposed NN-Degree allows us to do training-free NAS within one second and build an accuracy predictor by training as few as 25 samples out of a vast search space with more than 63 billion configurations. Second, by performing inference on the target hardware, we fine-tune and validate our analytical models to estimate the latency, area, and energy consumption of various DNN architectures while executing standard ML datasets. Third, we construct a hierarchical algorithm based on simplicial homology global optimization (SHGO) to optimize the model-architecture co-design process, while considering the area, latency, and energy consumption of the target hardware. We demonstrate that, compared to the state-of-the-art NAS approaches, our proposed hierarchical SHGO-based algorithm enables more than four orders of magnitude speedup (specifically, the execution time of the proposed algorithm is about 0.1 seconds). Finally, our experimental evaluations show that FLASH is easily transferable to different hardware architectures, thus enabling us to do NAS on a Raspberry Pi-3B processor in less than 3 seconds.

CCS Concepts: • Computing methodologies → Artificial intelligence; Computer vision; • Computer systems organization → Embedded systems.

Additional Key Words and Phrases: Neural Networks, Network Science, Hardware Optimization, Neural Architecture Search, Model-Architecture Co-design, Resource-constrained Devices

ACM Reference Format:
Guihong Li, Sumit K. Mandal, Umit Y. Ogras, and Radu Marculescu. 2021. FLASH: Fast Neural Architecture Search with Hardware Optimization. 1, 1 (August 2021), 25 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction
During the past decade, deep learning (DL) has led to significant breakthroughs in many areas, such as image classification and natural language processing [6, 21, 25]. However, the large model sizes and computational complexity of existing DL models limit the deployment of DL on resource-constrained devices and its large-scale adoption in edge computing. Multiple model compression techniques, such as network pruning [20], quantization [13], and knowledge distillation [22], have been proposed to compress and deploy such complex models on resource-constrained devices without sacrificing the test accuracy. However, these techniques require a significant amount of manual tuning. Hence,

Authors' addresses: Guihong Li, [email protected], The University of Texas at Austin, Austin, Texas, USA; Sumit K. Mandal, [email protected], University of Wisconsin–Madison, Madison, Wisconsin, USA; Umit Y. Ogras, [email protected], University of Wisconsin–Madison, Madison, Wisconsin, USA; Radu Marculescu, [email protected], The University of Texas at Austin, Austin, Texas, USA.

2021. XXXX-XXXX/2021/8-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn


neural architecture search (NAS) has been proposed to automatically design neural architectures with reduced model sizes [2, 17, 31, 32, 60].

NAS is an optimization problem with specific targets (e.g., high classification accuracy) over a set of possible candidate architectures. The set of candidate architectures defines the (typically vast) search space, while the optimizer defines the search algorithm. Recent breakthroughs in NAS can simplify the tricky (and error-prone) ad-hoc architecture design process [31, 42]. Moreover, the networks obtained via NAS have higher test accuracy and significantly fewer parameters than the hand-designed networks [32, 44]. These advantages of NAS have attracted significant attention from researchers and engineers alike [55]. However, most of the existing NAS approaches do not explicitly consider the hardware constraints (e.g., latency and energy consumption). Consequently, the resulting neural networks still cannot be deployed on real devices.

To address this drawback, recent studies propose hardware-aware NAS, which incorporates the hardware constraints of networks during the search process [27]. Nevertheless, current approaches are time-consuming since they involve training the candidate networks and a tedious search process [56]. To accelerate NAS, recent NAS approaches rely on graph neural networks (GNNs) to estimate the accuracy of a given network [9, 33, 40, 54]. However, training a GNN-based accuracy predictor is still time-consuming (on the order of tens of minutes [12] to hours [36] on GPU clusters). Therefore, adapting existing NAS approaches to different hardware architectures is challenging due to their intensive computation and execution time requirements.

To alleviate the computation cost of current NAS approaches, we propose to analyze the NAS problem from a network topology perspective. This idea is motivated by observing that the tediousness and complexity of current NAS approaches stem from the lack of understanding of what actually contributes to a neural network's accuracy. Indeed, the innovations on the topology of neural architectures, especially the introduction of skip connections, have achieved great success in many applications [21, 25]. This is because, in general, the network topology (or structure) strongly influences the phenomena taking place over it [39]. For instance, how closely the social network users are interconnected directly affects how fast information propagates through the network [3]. Similarly, a DNN architecture can be seen as a network of connected neurons. As discussed in [5], the topology of deep networks has a significant impact on how effectively the gradients can propagate through the network and thus on the test performance of neural networks. These observations motivate us to take an approach from network science to quantify the topological properties of neural networks and thereby accelerate NAS.

From an application perspective, the performance and energy efficiency of DNN accelerators are other critical metrics besides the test accuracy. In-memory computing (IMC)-based architectures have recently emerged as a promising technique to construct high-performance and energy-efficient hardware accelerators for DNNs. IMC-based architectures can store all the weights on-chip, hence removing the latency occurring from off-chip memory accesses. However, IMC-based architectures face the challenge of a tremendous increase in on-chip communication volume. While most of the state-of-the-art neural networks adopt skip connections in order to improve their performance [21, 25, 45], the wide usage of skip connections requires large amounts of data transfer across multiple layers, thus causing a significant communication overhead. Prior work on IMC-based DNN accelerators proposed bus-based network-on-chip (NoC) [10] or cmesh-based NoC [46] for communication between multiple layers. However, both bus-based and cmesh-based on-chip communication significantly increase the area, latency, and energy consumption of the hardware; hence, they do not offer a promising solution for future accelerators.

Starting from these overarching ideas, this paper proposes FLASH – a fast neural architecture search with hardware optimization – to address the drawbacks of current NAS techniques. FLASH delivers a neural architecture that is co-optimized with respect to accuracy and hardware performance.


Specifically, by analyzing the topological properties of neural architectures from a network science perspective, we propose a new topology-based metric, namely, the NN-Degree. We show that the NN-Degree can indicate the test performance of a given architecture. This makes our proposed NAS training-free during the search process and accelerates NAS by orders of magnitude compared to state-of-the-art approaches. Then, we demonstrate that NN-Degree enables a lightweight accuracy predictor with only three parameters. Moreover, to improve the on-chip communication efficiency, we adopt a mesh-NoC for the IMC-based hardware. Based on the communication-optimized hardware architecture, we measure the hardware performance for a subset of neural networks from the NAS search space. Then, we construct analytical models for the area, latency, and energy consumption of a neural network based on our optimized target hardware platform. Unlike existing neural network-based and black-box style searching algorithms [27], the proposed NAS methodology enables searching across the entire search space via a mathematically rigorous and time-efficient optimization algorithm. Consequently, our experimental evaluations show that FLASH significantly pushes forward the NAS frontier by enabling NAS in less than 0.1 seconds on a 20-core Intel Xeon CPU. Finally, we demonstrate that FLASH can be readily transferred to other hardware platforms (e.g., Raspberry Pi) only by fine-tuning the hardware performance models.

Overall, this paper makes the following contributions:

• We propose a new topology-based analytical metric (NN-Degree) to quantify the topological characteristics of DNNs with skip connections. We demonstrate that the NN-Degree enables a training-free NAS within seconds. Moreover, we use the NN-Degree metric to build a new lightweight (three-parameter) accuracy predictor by training as few as 25 samples out of a vast search space with more than 63 billion configurations. Without any significant loss in accuracy, our proposed accuracy predictor requires 6.88× fewer samples and provides a 65.79× reduction of the fine-tuning time cost compared to existing GNN/GCN-based approaches [54].

• We construct analytical models to estimate the latency, area, and energy consumption of various DNN architectures. We show that our proposed analytical models are applicable to multiple hardware architectures and achieve a high accuracy with less than one second of fine-tuning time cost.

• We design a hierarchical simplicial homology global optimization (SHGO)-based algorithm to search for the optimal architecture. Our proposed hierarchical SHGO-based algorithm enables 27729× faster (less than 0.1 seconds) NAS compared to an RL-based baseline approach.

• We demonstrate that our methodology enables NAS on a Raspberry Pi 3B with less than 3 seconds of computational time. To the best of our knowledge, this is the first work showing NAS running directly on edge devices with such low computational requirements.

The rest of the paper is organized as follows. In Section 2, we discuss related work and background information. In Section 3, we formulate the optimization problem, then describe the new analytical models and search algorithm. Our experimental results are presented in Section 4. Finally, Section 5 concludes the paper with remarks on our main contributions and future research directions.

2 Related Work and Background Information
Hardware-aware NAS: Hardware accelerators for DNNs have recently become popular due to the high-performance demand of multiple applications [4, 15, 35]; they can significantly reduce the latency and energy associated with DNN inference. The hardware performance (e.g., latency, energy, and area) of accelerators varies with DNN properties (e.g., number of layers, parameters, etc.); therefore, hardware performance is also a crucial factor to consider during NAS.

Several recent studies consider hardware performance for NAS. Authors in [14] introduce a growing and pruning strategy that automatically maximizes the test accuracy and minimizes the


FLOPs of neural architectures during training. A platform-aware NAS targeting mobile devices is proposed in [50]; the objective is to maximize the model accuracy with an upper bound on latency. Authors in [56] create a latency-aware loss function to perform differentiable NAS. The latency of DNNs is estimated through a lookup table which consists of the latency of each operation/layer. However, both of these studies consider latency as the only metric for hardware performance. Authors in [37] propose a hardware-aware NAS framework to design convolutional neural networks. Specifically, by building analytical latency, power, and memory models, they create a hardware-aware optimization methodology to search for the optimal architecture that meets the hardware budgets. Authors in [27] consider latency, energy, and area as metrics for hardware performance while performing NAS. Also, a reinforcement learning (RL)-based controller is adopted to tune the network architecture and device parameters. The resulting network is retrained to evaluate the model accuracy. There are two major drawbacks of this approach. First, RL is a slow-converging process that prohibits fast exploration of the design space. Second, retraining the network further exacerbates the search time, leading to hundreds of GPU hours needed for real applications [60]. Furthermore, most existing hardware-aware NAS approaches explicitly optimize the architectures for a specific hardware platform [8, 30, 56]. Hence, if we switch to some new hardware, we need to repeat the entire NAS process, which is very time-consuming under the existing NAS frameworks [8, 30, 56]. The demand for reducing the overhead of adaptation to new hardware motivates us to improve the transferability of the hardware-aware NAS methodology.

Accuracy Predictor-based NAS: Several approaches perform NAS by estimating the accuracy of the network [9, 33, 40, 54]. These approaches first train a graph neural network (GNN), or a graph convolution network (GCN), to estimate the network accuracy while exploring the search space. During the searching process, the test accuracy of the sample networks is obtained from the estimator instead of doing regular training. Although the NAS process is significantly accelerated by estimating the accuracy, the training cost of the accuracy predictor itself remains a bottleneck. GNNs require many training samples to achieve high accuracy, thus involving a significant overhead for training the candidate networks from the search space. Therefore, using accuracy predictors to do NAS still suffers from excessive computation and time requirements.

Time-efficient NAS: To reduce the time cost of training candidate networks, authors in [42, 49] introduced the weight sharing mechanism (WS-NAS). Specifically, candidate networks are generated by randomly sampling part of a large network (supernet). Hence, candidate networks share the weights of the supernet and update these weights during training. By reusing these trained weights instead of training from scratch, WS-NAS significantly improves the time efficiency of NAS. However, the accuracy of the models obtained via WS-NAS is typically far below that of models trained from scratch. Several optimization techniques have been proposed to fill the accuracy gap between the shared weights and stand-alone training [7, 58]. For example, authors in [7] propose a progressive shrinking algorithm to train the supernet. However, in many cases, the resulting networks still need some fine-tuning epochs to get the final architecture. To further accelerate the NAS process, some works propose differentiable NAS [8, 32]. The differentiable NAS approaches search for the optimal architecture by learning the optimal architecture parameters during the training process. Hence, differentiable NAS only needs to train the supernet once, thus reducing the training time significantly. Nevertheless, due to the significantly large number of parameters of the supernet, differentiable NAS requires a high volume of GPU memory. In order to further improve the time efficiency of NAS, several approaches have been proposed to do training-free NAS [1, 11]. These approaches leverage some training-free proxy that indicates the test performance of a given architecture; hence, the training time is eliminated from the entire NAS process. However, these methods usually use gradient-based information to build the proxy [1, 11]. Therefore, in order to calculate the gradients, GPUs are still


necessary for the backward propagation process. To totally decouple the NAS process from using GPU platforms, our work proposes a GPU-free proxy to do training-free NAS. We provide more details in Section 4.3.

Skip connections and Network Science: Currently, both networks obtained by manual design and by NAS have shown that long-range links (i.e., skip connections) are crucial for getting higher accuracy [21, 25, 32, 45]. Overall, there are two commonly used types of skip connections in neural networks. First, we have the DenseNet-type skip connections (DTSC), which concatenate previous layers' outputs as the input for the next layer [25]. To study the topological properties and enlarge the search space, we do not use the original DenseNets [25], which contain all-to-all connections. Instead, we consider a generalized version where we vary the number of skip connections by randomly selecting only some channels for concatenation, as shown in Fig. 1(a). The other type of skip connections is the addition-type skip connections (ATSC), which consist of links that bypass several layers to be directly added to the output of later layers (see Fig. 1(b)) [21].

In network science, a small-world network is defined as a highly clustered network that shows a small distance (typically logarithmic in the number of network nodes) between any two nodes inside the network [53]. Considering the skip connections in neural networks, we propose to use the small-world network concept to analyze networks with both short- and long-range (or skip) links. Indeed, small-world networks can be decomposed into: (i) a lattice network G accounting for short-range links; (ii) a random network R accounting for long-range links (see Fig. 1(c)). The co-existence of a rich set of short- and long-range links leads to both a high degree of clustering and a short average path length (logarithmic with network size). We use the small-world network to model and analyze the topological properties of neural networks in Section 3.

Average Degree: The average degree of a network determines the average number of connections a node has, i.e., the total number of edges divided by the total number of nodes. The average degree and degree distribution (i.e., distribution of node degrees) are important topological characteristics that directly affect how information flows through a network [3]. Indeed, small-world network theory reveals that the average degree of a network has a significant impact on the network's average path length and clustering behavior [53]. Therefore, we investigate the performance gains due to these topological properties by using network science.

Fig. 1. Modeling a CNN as a network in network science: Each channel is modeled as a node; each convolution kernel/filter is modeled as a link/connection. (a) Illustration of a single cell with DenseNet-type skip connections (DTSC). (b) Illustration of a single cell with Addition-type skip connections (ATSC). (c) Decomposition of a network cell with skip connections into a Lattice Network G and a Random Network R.


Fig. 2. Overview of the proposed approach. Stage 1 (red box): we build the hardware performance models (i.e., latency L, energy E, and area A) and the accuracy predictor by randomly sampling candidate networks from the search space to evaluate their hardware characteristics (latency L, energy E, and area A) and test accuracy θ. Stage 2 (blue box): search for the optimal network architecture given the multi-objective function f(L, E, A, θ).

3 Proposed Methodology
3.1 Overview of New NAS Approach

The proposed NAS framework is a two-stage process, as illustrated in Fig. 2: (i) We first quantify the topological characteristics of neural networks with the newly proposed NN-Degree metric. Then, we randomly select a few networks and train them to fine-tune the accuracy predictor based on the network topology. We also build analytical models to estimate the latency, energy, and area of given neural architectures. (ii) Based on the accuracy predictor and analytical performance models from the first stage, we use a simplicial homology global optimization (SHGO)-based algorithm in a hierarchical fashion to search for the optimal network architecture.

3.2 Problem Formulation of hardware-aware NAS

The overall target of the hardware-aware NAS approach is to find the network architecture that gives the highest test accuracy while achieving small area, low latency, and low energy consumption when deployed on the target hardware. In practice, there are constraints (budgets) on the hardware performance and test accuracy. For example, battery-based devices have very constrained energy capacity [52]. Hence, there is an upper bound for the energy consumption of the neural architecture. To summarize, the NAS problem can be expressed as:

    max f_obj = θ / (A × L × E)
    subject to: θ ≥ θ_M, A ≤ A_M, L ≤ L_M, E ≤ E_M    (1)

where θ_M, A_M, L_M, and E_M are the constraints on the test accuracy, area, latency, and energy consumption, respectively. We summarize the symbols (and their meaning) used in this part in Table 1.

3.3 NN-Degree and Training-free NAS

This section first introduces our idea of modeling a CNN based on network science [53]. To this end, we define a group of consecutive layers with the same width (i.e., number of output channels, w_c) as a cell; then we break the entire network into multiple cells and denote the number of cells as N_c. Similar to MobileNet-v2 [45], we also adopt a width multiplier (w_m) to scale the width of each cell. Moreover, following most of the mainstream CNN architectures, we assume that each cell inside a CNN has the same number of layers (d_c). Furthermore, as shown in Fig. 1, we consider each channel of the feature map as a node in a network and consider each convolution filter/kernel as an undirected link. These notations are summarized in Table 2.


Table 1. Symbols and their corresponding definition/meaning used in our Problem Formulation.

Symbol   Definition
f_obj    Objective function of NAS
θ        Test accuracy of a given network
A        Chip area
L        Inference latency of a given network
E        Inference energy consumption of a given network
θ_M      Constraint on test accuracy for NAS
A_M      Constraint on area for NAS
L_M      Constraint on inference latency for NAS
E_M      Constraint on inference energy consumption for NAS

Table 2. Symbols and their corresponding definition/meaning used in our NN-Degree based analytical accuracy predictor.

Symbol           Definition
g                NN-Degree (new metric we propose)
g_G              NN-Degree of the lattice network (short-range connections)
g_R              NN-Degree of the random network (long-range or skip connections)
N_c              Number of cells
w_c              Number of output channels per layer within cell c (i.e., the width of cell c)
d_c              Number of layers within cell c (i.e., the depth of cell c)
SC_c             Number of skip connections within cell c
a_θ, b_θ, c_θ    Learnable parameters for the accuracy predictor

Combining the concept of small-world networks in Section 2 and our modeling of a CNN, we decompose a network cell with skip connections into a lattice network G and a random network R (see Fig. 1(c)).

Proposed Metrics: Our key objective is two-fold: (i) Quantify which topological characteristics of DNN architectures affect their performance, and (ii) Exploit such properties to accurately predict the test accuracy of a given architecture. To this end, we propose a new analytical metric called NN-Degree, as defined below.

Definition of NN-Degree: Given a DNN with N_c cells, d_c layers per cell, the width of each cell w_c, and the number of skip connections of each cell SC_c, the NN-Degree metric is defined as the sum of the average degree of each cell:

    g = Σ_{c=1}^{N_c} ( w_c + SC_c / (w_c × d_c) )    (2)

Intuition: The average degree of a given DNN cell is the sum of the average degrees from the lattice network G and the random network R. Given a cell with d_c convolutional layers and w_c channels per layer, the number of nodes is w_c × d_c. Moreover, each convolutional layer has w_c × w_c filters (kernels) accounting for the short-range connections; hence, in the lattice network G, there are w_c × w_c × d_c connections (total). Using the above analysis, we can express the NN-Degree as follows:

    g = g_G + g_R
      = Σ_{c=1}^{N_c} (number of connections in G) / (number of nodes in cell c) + Σ_{c=1}^{N_c} (number of connections in R) / (number of nodes in cell c)
      = Σ_{c=1}^{N_c} (w_c × d_c × w_c) / (w_c × d_c) + Σ_{c=1}^{N_c} (number of skip connections) / (w_c × d_c)
      = Σ_{c=1}^{N_c} ( w_c + SC_c / (w_c × d_c) )    (3)

Discussion: The first term in Equation 3 (i.e., g_G) reflects the width of the network w_c. Many successful DNN architectures, such as DenseNets [25], Wide-ResNets [59], and MobileNets [45], have shown that wider networks can achieve a higher test performance. The second term (i.e., g_R) quantifies how densely the nodes are connected through the skip connections. As discussed in [51], networks with more skip connections have more forward/backward propagation paths and thus a better test performance. Based on the above analysis, we claim that a higher NN-Degree value should indicate networks with higher test performance. We verify this claim empirically in the experimental section. Next, we propose an accuracy predictor based only on the NN-Degree.

Accuracy Predictor: Given the NN-Degree (g) definition, we build the accuracy predictor by using a variant of logistic regression. Specifically, the test accuracy θ of a given architecture is:

    θ = 1 / ( a_θ + exp( b_θ × (1/g) + c_θ ) )    (4)

where a_θ, b_θ, and c_θ are the parameters that are fine-tuned with the accuracy and NN-Degree of sample networks from the search space. Section 4 shows that by using as few as 25 data samples (NN-Degree and corresponding accuracy values), we can generate an accurate predictor for a huge search space covering more than 63 billion configurations within 1 second on a 20-core Intel Xeon CPU.
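As a sketch of how the three parameters of Equation (4) could be fitted, the snippet below uses non-linear least squares via scipy.optimize.curve_fit; the (NN-Degree, accuracy) pairs are synthetic and the variable names are our own, so this only illustrates the fitting step, not the authors' actual code.

    # Sketch: fit the three-parameter predictor of Equation (4),
    #   theta = 1 / (a + exp(b * (1/g) + c)),
    # to a few (NN-Degree, accuracy) samples via non-linear least squares.
    # The sample data below is synthetic and for illustration only.
    import numpy as np
    from scipy.optimize import curve_fit

    def predicted_accuracy(g, a, b, c):
        return 1.0 / (a + np.exp(b / g + c))

    # Hypothetical (NN-Degree, test accuracy) pairs measured on trained sample networks.
    g_samples = np.array([150.0, 300.0, 600.0, 1200.0, 2400.0])
    acc_samples = np.array([0.905, 0.922, 0.934, 0.941, 0.945])

    (a_t, b_t, c_t), _ = curve_fit(predicted_accuracy, g_samples, acc_samples,
                                   p0=[1.0, 1.0, 0.0], maxfev=10000)
    print("fitted a, b, c:", a_t, b_t, c_t)
    print("predicted accuracy at g = 1000:", predicted_accuracy(1000.0, a_t, b_t, c_t))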

Training-free NAS: Section 4 shows that NN-Degree can indicate the test accuracy of a given architecture. Hence, one can use NN-Degree as a proxy of the test accuracy to enable training-free NAS. Section 4.3 demonstrates that we can do training-free NAS within 0.11 seconds on a 20-core CPU.

3.4 Overview of In-memory Computing (IMC)-based Hardware
Fig. 3 shows the IMC architecture considered in this work. We note that the proposed FLASH methodology is not specific to IMC-based hardware. We adopt an IMC architecture since it has been proven to achieve lower memory access latency [23]. Due to the high communication volume imposed by deeper and denser networks, communication between multiple tiles is crucial for hardware performance, as shown in [28, 34].

Our architecture consists of multiple tiles connected by network-on-chip (NoC) routers, as shown in Fig. 3(a). We use a mesh-based NoC due to its superior performance compared to bus-based architectures. Each tile consists of a fixed number of compute elements (CE), a rectified linear unit (ReLU), an I/O buffer, and an accumulation unit, as shown in Fig. 3(b).

Within each CE, there exist a fixed number of in-memory processing elements (imPE), a multiplexer, a switch, an analog-to-digital converter (ADC), a shift and add (S&A) circuit, and a local buffer [10], as shown in Fig. 3(c). The ADC precision is set to four bits to avoid any accuracy degradation. There is no digital-to-analog converter (DAC) used in the architecture; a sequential signaling technique is adopted to represent multi-bit inputs [41]. Each imPE consists of 256×256 IMC crossbars (the memory elements) based on ReRAM (1T1R) technology [10, 28, 34]. This work incorporates a sequential operation between DNN layers since a pipelined operation may cause pipeline bubbles during inference [43, 48].

Fig. 3. Details of the IMC hardware. (a) The architecture consists of multiple tiles connected via routers; (b) The structure of a tile. Each tile consists of multiple compute elements (CE), an I/O buffer, a ReLU unit, and an accumulation unit; (c) The structure of each CE. Each CE consists of multiple in-memory processing elements (imPE), local buffers, a switch, a multiplexer, an analog-to-digital converter (ADC), and a shift and add (S&A) circuit.

Table 3. Symbols and their corresponding definition used in our analytical area, latency, and energy models.

Symbol              Definition
N_c                 Number of cells
a_θ, b_θ, c_θ       Learnable parameters for the accuracy predictor
w_m                 Width multiplier
d_c                 Number of layers within cell c
w_c                 Width of cell c
SC_c                Number of skip connections within cell c
FLOP_c              Number of FLOPs of cell c
Comm_c              The amount of data transferred through the NoC inside cell c
N_T                 Total number of tiles of the chip
F_E                 Features for energy
Λ_comp, Λ_NoC       Weight vectors to estimate computation and NoC latency
N_i^r               Number of rows of imPE arrays of the i-th layer
N_i^c               Number of columns of imPE arrays of the i-th layer
K_i^x, K_i^y        Kernel size of the i-th layer
N_i^if, N_i^of      Number of input and output features of the i-th layer
(PE_x)_i, (PE_y)_i  Size of a single imPE of the i-th layer
T_i                 Number of tiles of the i-th layer
c                   Number of CEs in each tile
p                   Number of imPEs in each CE
A_T                 Area of a tile
E_T                 Energy consumption of a tile
F_comp, F_NoC       Features to estimate computation and NoC latency

3.5 Hardware Performance Modeling
This section describes the methodology for modeling hardware performance. We consider three metrics for hardware performance: area, latency, and energy consumption. We use customized versions of NeuroSim [10] for circuit simulation (computing fabric) and BookSim [26] for cycle-accurate NoC simulation (communication fabric). First, we describe the details of the simulator.

Input to the simulator: The inputs to the simulator include the DNN structure, technology node, and frequency of operation. In this work, we consider a layer-by-layer operation. Specifically, we simulate each DNN layer and add its performance at the end to obtain the total performance of the hardware for the DNN.

Simulation of computing fabric: Table 4 shows the parameters considered for the simulation of the computing fabric. At the start of the simulation, the number of in-memory computing tiles is computed. Then, the area and energy of one tile are computed through analytical models derived from HSPICE simulation. After that, the area and energy of one tile are multiplied by the total number of tiles to obtain the total area and energy of the computing fabric. The latency of the computing fabric is computed as a function of the workload (the DNN being executed). We note that the original version of NeuroSim considers point-to-point on-chip interconnects, while our proposed work uses a mesh-based NoC. Therefore, we skip the interconnect simulation in NeuroSim.

Simulation of communication fabric: We consider cycle-accurate simulation for the communication fabric, performed with BookSim. First, the number of tiles required for each layer is obtained from the simulation of the computing fabric. In this work, we assume that each tile is connected to a dedicated router of the NoC. A trace file is generated corresponding to the particular layer of the DNN. The trace file consists of the information of the source router, destination router, and timestamp when the packet is generated. The trace file is simulated through BookSim to obtain the latency to finish all the transactions between two layers. We also obtain the area and energy of the interconnect through BookSim. Table 4 shows the parameters considered for the interconnect simulator. More details of the simulator can be found in [29].

For hardware performance modeling, we first obtain the performance of the DNN through simulation; then the performance numbers are used to construct the performance models.

Analytical Area Model: An in-memory computing-based DNN accelerator consists of two major components: computation and communication. The computation unit consists of multiple tiles and peripheral circuits; the communication unit includes an NoC with routers and other network components (e.g., buffers, links). To estimate the total area, we first compute the number of rows (N_i^r) and number of columns (N_i^c) of imPEs required for the i-th layer of the DNN following Equation 5 and Equation 6:

    N_i^r = ⌈ (K_i^x × K_i^y × N_i^if) / (PE_x)_i ⌉    (5)

    N_i^c = ⌈ (N_i^of × N_bits) / (PE_y)_i ⌉    (6)

where all the symbols are defined in Table 3. Therefore, the total number of imPEs required for the i-th layer of the DNN is N_i^r × N_i^c. Each tile consists of c CEs, and each CE consists of p imPEs. Accordingly, each tile comprises c × p imPEs. Therefore, the total number of tiles required for the i-th layer of the DNN (T_i) is:

    T_i = ⌈ (N_i^r × N_i^c) / (c × p) ⌉    (7)

Hence, the total number of tiles (N_T) required for a given DNN is N_T = Σ_i T_i. As shown in Fig. 3(a), each tile is connected to the NoC routers for the on-chip communication. We assume that the total number of required routers is equal to the total number of tiles.
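The tile-count arithmetic of Equations (5)-(7) reduces to a few ceiling divisions; the sketch below shows how N_i^r, N_i^c, T_i, and N_T could be computed for a toy layer list. All per-layer parameters and hardware constants in it are assumed values for illustration, not FLASH's actual configuration.

    # Sketch of Equations (5)-(7): imPE rows/columns and tiles per layer, then N_T.
    # All layer parameters and hardware constants below are illustrative assumptions.
    from math import ceil

    def tiles_for_layer(kx, ky, n_if, n_of, n_bits, pe_x, pe_y, ces_per_tile, impes_per_ce):
        n_rows = ceil(kx * ky * n_if / pe_x)                          # Eq. (5)
        n_cols = ceil(n_of * n_bits / pe_y)                           # Eq. (6)
        return ceil(n_rows * n_cols / (ces_per_tile * impes_per_ce))  # Eq. (7)

    # Toy network: (kernel_x, kernel_y, input_channels, output_channels) per layer.
    layers = [(3, 3, 64, 64), (3, 3, 64, 128), (1, 1, 128, 128)]
    N_T = sum(tiles_for_layer(kx, ky, nif, nof, n_bits=8, pe_x=128, pe_y=128,
                              ces_per_tile=4, impes_per_ce=8)
              for kx, ky, nif, nof in layers)
    print("Total number of tiles N_T =", N_T)  # one NoC router per tile is assumed, as in the text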


Table 4. Parameters used for simulation of computation and communication fabric.

Circuit                              NoC
imPE array size       128 × 128      Bus width                 32
Cell levels           2 bit/cell     Routing algorithm         X–Y
Flash ADC resolution  4 bits         Number of router ports    5
Technology used       RRAM           Topology                  Mesh

Fig. 4. Layerwise hardware performance breakdown of a DNN with 3 cells (N_c = 3), 16 layers per cell (d_c = 16), and a total of 48 layers. (a) Latency breakdown layer by layer: the computation latency accounts for 37.9% of the total latency, while communication accounts for 62.1%. (b) Energy consumption breakdown layer by layer: the computation energy accounts for 96.1% of the total energy, while communication accounts for 3.9%.

Hence, the total chip area is expressed as follows:

    A = A_comp + A_NoC
      = (A_Tile^Tot + A_Periphery) + (A_Router^Tot + A_others)
      = N_T × A_T + N_T × A_R + (A_Periphery + A_others)
      = N_T × (A_T + A_R) + A_rest    (8)

where A_Tile^Tot is the area accounted for all tiles and A_Router^Tot is the total area accounted for all routers in the design. The area of a single tile is denoted by A_T; there are N_T tiles in the design, therefore A_Tile^Tot = N_T × A_T. The area of the peripheral circuit (A_Periphery) consists of the I/O interface, max pool unit, accumulation unit, and global buffer. The area of a single router is denoted by A_R; the number of routers is equal to the number of tiles (N_T), therefore A_Router^Tot = N_T × A_R. The area of the other components in the NoC (A_rest) comprises links and buffers.

Analytical Latency Model: Similar to area, the total latency consists of computation latency and communication latency, as shown in Fig. 4(a). To construct the analytical model of latency, we use the floating-point operations (FLOPs) of the network to represent the computational workload. We observe that the FLOPs of a given network are roughly proportional to the total number of convolution filters (kernels), which is the product of the number of layers and the square of the number of channels per layer (i.e., the width value). In the network search space we consider, the width is equivalently represented by the width multiplier w_m and the number of layers is N_c × d_c; hence, we express the number of FLOPs of a given network approximately as the product of the number of layers and the square of the width multiplier:

    FLOPs ∼ N_c d_c w_m^2    (9)


Moreover, the communication volume increases significantly due to the skip connections. To quantify the communication volume due to skip connections, we define Comm_c (the communication volume of a given network cell c) as follows:

    Comm_c = SC_c × (feature map size of each skip connection)

Combining the above analysis of computation latency and communication latency, we use a linear model to build our analytical latency model as follows:

    L = L_comp + L_NoC = Λ_comp^T F_comp + Λ_NoC^T F_NoC    (10)

where Λ_comp^T is a weight vector and F_comp = [w_m, d_c, N_c, N_c d_c w_m^2] is the vector of features with respect to the computation latency; Λ_NoC^T is another weight vector and F_NoC = [SC_c, Comm_c] is the vector of features corresponding to the NoC latency. We randomly sample some networks from the search space and measure their latency to fine-tune the values of Λ_comp^T and Λ_NoC^T.

Analytical Energy Model: We divide the total energy consumption into computation energy and communication energy, as shown in Fig. 4(b). Specifically, the entire computation process inside each tile consists of three steps:

• Read the input feature map from the I/O buffer to the CE;
• Perform computations in the CE and ReLU unit, then update the results in the accumulator;
• Write the output feature map to the I/O buffer.

Therefore, both the size of the feature map and the FLOPs contribute to the computation energy of a single cell. Moreover, the communication energy consumption is primarily determined by the communication volume, i.e., Comm_c. Hence, we use a linear combination of features to estimate the energy consumption of each tile E_T:

    E_T = Λ_E^T F_E    (11)

where Λ_E^T is a weight vector and F_E = [w_m, d_c, N_c, SC_c, Comm_c, FLOP_c, FM_c] are the features corresponding to the energy consumption of each tile. We use the measured energy consumption values of several sample networks to fine-tune the values of Λ_E^T. The total energy consumption (E) is the product of E_T and the number of tiles:

    E = Λ_E^T F_E N_T    (12)

We note that all the features used in both our accuracy predictor and our analytical hardware performance models are related to the network architecture only through the basic parameters {w_m, d_c, N_c, SC_c}. Therefore, the analytical hardware models are lightweight. We note that there exist no other lightweight analytical models for IMC platforms. Besides this, FLASH is general and can be applied to different hardware platforms. For a given hardware platform, the energy, latency, and area of the DNNs need to be collected first. Then the analytical hardware models need to be trained using the performance data.
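Since the latency and energy models of Equations (10)-(12) are linear in their feature vectors, fitting the weight vectors reduces to ordinary least squares. The sketch below shows this for the latency model; the sampled architectures and measured latencies are synthetic placeholders for values that would come from the simulator, and the helper names are our own.

    # Sketch: fit the linear latency model of Eq. (10) by least squares.
    # Features follow the text: F_comp = [w_m, d_c, N_c, N_c*d_c*w_m^2], F_NoC = [SC_c, Comm_c].
    # The sample measurements below are synthetic placeholders.
    import numpy as np

    def latency_features(w_m, d_c, N_c, SC_c, comm_c):
        # One row per sampled network: L = Λ_comp · F_comp + Λ_NoC · F_NoC.
        return [w_m, d_c, N_c, N_c * d_c * w_m ** 2, SC_c, comm_c]

    # Hypothetical sampled architectures (w_m, d_c, N_c, SC_c, Comm_c) and simulated latencies.
    samples = [(1, 8, 3, 200, 1.5e6), (2, 12, 3, 800, 6.0e6), (3, 16, 3, 2000, 2.4e7),
               (1, 16, 3, 400, 3.0e6), (2, 20, 3, 1200, 9.0e6), (3, 30, 3, 4000, 4.8e7)]
    latencies = np.array([310.0, 920.0, 2650.0, 520.0, 1450.0, 5200.0])  # e.g., microseconds

    X = np.array([latency_features(*s) for s in samples])
    weights, *_ = np.linalg.lstsq(X, latencies, rcond=None)   # concatenated [Λ_comp, Λ_NoC]
    print("fitted weight vector:", weights)
    print("predicted latency of the first sample:", X[0] @ weights)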

3.6 Optimal neural architecture search
Based on the above accuracy predictor and analytical hardware performance models, we perform the second stage of our NAS methodology, i.e., searching for the optimal neural architecture by considering both the test accuracy and the hardware performance on the target hardware. To this end, we use a modified version of the Simplicial Homology Global Optimization (SHGO) [18] algorithm to search for the optimum architecture. SHGO has mathematically rigorous convergence properties on non-linear objective functions and constraints and can solve derivative-free optimization problems. (A detailed discussion of SHGO is beyond the scope of this paper; more details are available in [18].)


Algorithm 1: Our hierarchical SHGO-based search algorithm
Input:
    Objective function: f_obj;
    Global search space: SP_global = [N_c^min, N_c^max] × [w_m^min, w_m^max] × [d_c^min, d_c^max] × [SC_c^min, SC_c^max];
    Search constraints: S_cons = {L_M, E_M, A_M, θ_M};
    Coarse-grain search step size: λ
Output:
    The optimal architecture {w_m*, N_c*, d_c*, SC_c*}
Search Process:
    Initialize the Candidate Architecture Set (CAS) as an empty set;
    Level 1: Fixed-w_m Search
    for w_m in [w_m^min, w_m^max] do
        Level 2: Coarse-grain Search
        Fix w_m; search for the optimum N_c^G, d_c^G, SC_c^G with large search step λ:
            N_c^G, d_c^G, SC_c^G = SHGO(f_obj, SP_global, S_cons, search step size = λ)
        Level 3: Fine-grain Search
        Within the neighbourhood of N_c^G, d_c^G, SC_c^G, search for the optimum N_c^L, d_c^L, SC_c^L:
            Local search space: SP_local = {N_c^G ± 2λ, d_c^G ± 2λ, SC_c^G ± 2λ}
            N_c^L, d_c^L, SC_c^L = SHGO(f_obj, SP_local, S_cons, search step size = 1)
        Add {w_m, N_c^L, d_c^L, SC_c^L} to CAS
    end
    Compare the candidate architectures in CAS and find the optimum {w_m*, N_c*, d_c*, SC_c*}
    Return {w_m*, N_c*, d_c*, SC_c*}

Moreover, the convergence of SHGO requires much fewer samples and less time than reinforcement learning approaches [27]. Hence, we use SHGO for our new hierarchical searching algorithm.

Specifically, as shown in Algorithm 1, to further accelerate the searching process, we propose a three-level SHGO-based algorithm instead of using the original SHGO algorithm. At the first level, we enumerate w_m in the search space. Usually, the range of w_m is much narrower than that of the other architecture parameters; hence, without fixing w_m, we cannot use a large search step size for the second-level coarse-grain search. At the second level, we use SHGO with a large search step size λ to search for a coarse optimum N_c^G, d_c^G, SC_c^G by fixing w_m. At the third level (fine-grain search), we use SHGO with the smallest search step size (i.e., 1) to search for the optimum N_c^L, d_c^L, SC_c^L values for a specific w_m, within the neighborhood of the coarse optimum N_c^G, d_c^G, SC_c^G, and add the result to the candidate set. After completing the three-level search, we compare all neural architectures in the candidate set and determine the (final) optimal architecture {w_m*, N_c*, d_c*, SC_c*}. To summarize, given the number of hyper-parameters M and the number of possible values of each hyper-parameter N, the complexity of our hierarchical SHGO-based NAS is roughly proportional to MN, i.e., O(MN).

Experimental results in Section 4 show that our proposed hierarchical search accelerates the

overall search process without any decrease in the performance of the obtained neural architecture. Moreover, our proposed hierarchical SHGO-based algorithm involves much less computational workload compared to the original (one-level) SHGO-based algorithm and RL-based approaches [27]; this even enables us to do NAS on a real Raspberry Pi-3B processor.
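The sketch below illustrates only the level-2 coarse search of Algorithm 1 for one fixed w_m, using SciPy's shgo implementation. The accuracy and hardware models, their parameters, the bounds, and the constraint threshold are all placeholder assumptions, and the integer step-size handling of the real algorithm is replaced by a continuous relaxation followed by rounding.

    # Simplified sketch of one level-2 coarse search (fixed w_m) using scipy.optimize.shgo.
    # All model functions, parameters, and bounds below are illustrative stand-ins, not the
    # FLASH implementation; integers are handled by a continuous relaxation plus rounding.
    import numpy as np
    from scipy.optimize import shgo

    def predicted_accuracy(g, a=1.05, b=50.0, c=-4.0):     # Eq. (4) with assumed parameters
        return 1.0 / (a + np.exp(b / g + c))

    def nn_degree(w_m, N_c, d_c, SC_c):                    # Eq. (2), identical cells assumed
        w_c = 16 * w_m
        return N_c * (w_c + SC_c / (w_c * d_c))

    def objective(x, w_m=2):
        N_c, d_c, SC_c = x
        theta = predicted_accuracy(nn_degree(w_m, N_c, d_c, SC_c))
        # Placeholder analytical hardware models standing in for Eqs. (8)-(12).
        area = 1.0 + 0.01 * N_c * d_c
        latency = 1.0 + 0.02 * N_c * d_c * w_m ** 2
        energy = 1.0 + 1e-4 * SC_c
        return -theta / (area * latency * energy)          # maximize f_obj -> minimize -f_obj

    bounds = [(1, 4), (5, 30), (64, 4096)]                 # (N_c, d_c, SC_c) ranges, illustrative
    constraints = [{"type": "ineq",                        # example accuracy budget: theta >= 0.90
                    "fun": lambda x: predicted_accuracy(nn_degree(2, *x)) - 0.90}]
    result = shgo(objective, bounds, constraints=constraints)
    print("coarse optimum (rounded):", np.round(result.x), " f_obj =", -result.fun)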


4 Experimental Results
4.1 Experimental setup
Dataset: Existing NAS approaches show that the test accuracy of CNNs on the CIFAR-10 dataset can indicate the test accuracy on other datasets, such as ImageNet [16]. Hence, similar to most NAS approaches, we use CIFAR-10 as the primary dataset. Moreover, we also evaluate our framework on CIFAR-100 and Tiny-ImageNet (a downscaled version of the ImageNet dataset with 64x64 resolution and 200 classes [15]; for more details, please check: http://cs231n.stanford.edu/tiny-imagenet-200.zip) to demonstrate the generality of our proposed NN-Degree metric and accuracy predictor.

Training Hyper-parameters: We train each of the selected neural networks five times with PyTorch and use the mean test accuracy of these five runs as the final result. All networks are trained for 200 epochs with the SGD optimizer and a momentum of 0.9. We set the initial learning rate to 0.1 and use the Cosine Annealing algorithm as the learning rate scheduler.
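A minimal PyTorch sketch of the training recipe stated above (SGD with momentum 0.9, initial learning rate 0.1, cosine-annealing schedule, 200 epochs); the model and data loader are placeholders, and unstated details such as weight decay and data augmentation are omitted.

    # Sketch of the training recipe described above; model and train_loader are placeholders.
    import torch
    import torch.nn as nn

    def train(model: nn.Module, train_loader, epochs: int = 200, device: str = "cuda"):
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
        for _ in range(epochs):
            model.train()
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
            scheduler.step()
        return model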

Search Space: DenseNets are more efficient in terms of model size and computation workload than ResNets while achieving the same test accuracy [25]. Moreover, DenseNets have many more skip connections; this provides us with more flexibility for exploration compared to networks with addition-type skip connections (ResNets, Wide-ResNets, and MobileNets). Hence, in our experiments, we explore CNNs with DenseNet-type skip connections.

To enlarge the search space, we generate a generalized version of standard DenseNets by randomly selecting channels for concatenation. Specifically, for a given cell c, we define t_c as the maximum number of skip connections that each layer can have; thus, we use t_c to control the topological properties of the CNNs. Given the definition of t_c, layer i can receive DenseNet-type skip connections (DTSC) from a maximum of t_c channels from previous layers within the same cell; that is, we randomly select min{w_c(i − 1), t_c} channels from layers 0, 1, ..., (i − 2), and concatenate them at layer i − 1. The concatenated channels then pass through a convolutional layer to generate the output of layer i (s_i). Similar to recent NAS research [32], we select links randomly because random architectures are often as competitive as carefully designed ones. If the skip connections encompass all-to-all connections, this results in the original DenseNet architecture [25]. An important advantage of the above setup is that we can control the number of DTSC (using t_c) to cover a vast search space with a large number of candidate DNNs.

Like standard DenseNets, we can generalize this setup to contain multiple (N_c) cells of width w_c and depth d_c; DTSC are present only within a cell and not across cells. Furthermore, we increase the width (i.e., the number of output channels per layer) by a factor of 2 and halve the height and width of the feature map cell by cell, following the standard practice [47]. After several cells (groups) of convolutional layers, the final feature map is average-pooled and passed through a fully-connected layer to generate the logits. The width of each cell is controlled using a width multiplier, w_m (like in Wide-ResNets [59]). The base number of channels of each cell is [16, 32, 64]. For w_m = 3, the cells will have [48, 96, 192] channels per layer. To summarize, we control the values {w_m, N_c, d_c, t_c} to sample candidate architectures from the entire search space.

Fig. 5 illustrates a sample CNN similar to the candidate architectures in our search space (small values of w_c and d_c are used for clarity). This CNN consists of three cells, each containing d_c = 4 convolutional layers. The three cells have a width (i.e., the number of channels per layer) of 2, 3, and 4, respectively. We denote the network width as w_c = [2, 3, 4]. Finally, the maximum number of channels that can supply skip connections is given by t_c = [2, 5, 6]. That is, the first cell can have a maximum of two skip connection candidates per layer (i.e., previous channels that can supply skip connections), the second cell can have a maximum of five skip connection candidates per layer, and so on. Moreover, as mentioned before, we randomly choose min{w_c(i − 1), t_c} channels for skip connections at each layer. The inset of Fig. 5 shows, for a specific layer, how skip connections are created by concatenating the feature maps from previous layers.

In practice, we use three cells for the CIFAR-10 dataset, i.e., N_c = 3. We constrain 1 ≤ w_m ≤ 3 and 5 ≤ d_c ≤ 30. We also constrain the t_c of each cell: 5 ≤ t_1, 2t_1 ≤ t_2, and 2t_2 ≤ t_3 for the three cells, respectively. In this way, we can balance the number of skip connections across the cells. Moreover, the maximum number of skip connections that a layer can have is the product of the width of the cell (w_c) and d_c − 2, which happens for the last layer in a cell concatenating all of the output channels except those of the second-to-last layer. Hence, the upper bound of t_c, for each cell, is 16w_m(d_c − 2), 32w_m(d_c − 2), and 64w_m(d_c − 2), respectively. Therefore, the size of the overall search space is:

    Σ_{w_m=1}^{3} Σ_{d_c=5}^{30} Σ_{t_1=5}^{16 w_m (d_c − 2)} Σ_{t_2=2 t_1}^{32 w_m (d_c − 2)} ( 64 w_m (d_c − 2) − 2 t_2 + 1 ) = 6.39 × 10^10

Hardware Platform: The training of the sample neural architectures from the search space is conducted on an Nvidia GTX-1080Ti GPU. We use an Intel Xeon 6230, a 20-core CPU, to simulate the hardware performance of multiple candidate networks and to fine-tune the accuracy predictor and the analytical hardware models. Finally, we use the same 20-core CPU to conduct the NAS process.

4.2 Accuracy Predictor

We first derive the NN-Degree (g) for the neural architectures in our search space. Based on

Equation 2, we substitute SC_c with the actual number of skip connections in a cell as follows:

$$g = \sum_{c=1}^{N_c} \left( w_c + \frac{SC_c}{w_c \times d_c} \right) = \sum_{c=1}^{N_c} \left( w_c + \frac{\sum_{i=2}^{d_c-1} \min\{(i-1)\,w_c,\ t_c\}}{d_c} \right) \qquad (13)$$
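As a concrete illustration of Equation 13, the helper below evaluates the NN-Degree of a configuration given its per-cell width, depth, and skip-connection budget; it is a minimal re-implementation of the formula as reconstructed above, not the authors' code.

```python
def nn_degree(widths, depths, t_budgets):
    """NN-Degree (Eq. 13): for each cell c, add the cell width w_c plus the
    average number of skip-connection channels per layer, then sum over cells.

    widths, depths, t_budgets: per-cell lists of w_c, d_c, t_c.
    """
    g = 0.0
    for w_c, d_c, t_c in zip(widths, depths, t_budgets):
        skip_channels = sum(min((i - 1) * w_c, t_c) for i in range(2, d_c))
        g += w_c + skip_channels / d_c
    return g

# Toy network of Fig. 5: w_c = [2, 3, 4], d_c = 4 per cell, t_c = [2, 5, 6].
print(nn_degree([2, 3, 4], [4, 4, 4], [2, 5, 6]))
```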

In Section 3, we argue that neural architectures with higher NN-Degree values tend to provide higher test accuracy. In Fig. 6(a), we plot the test accuracy vs. NN-Degree of 60 neural networks randomly sampled from the search space for the CIFAR-10 dataset; our proposed network-topology-based metric, NN-Degree, indicates the test accuracy of these networks.

[Figure 5 appears here. Its panels show a three-cell example network (Cell 1, 2, 3 with w_c = 2, 3, 4, d_c = 4, and t_c = 2, 5, 6), followed by average pooling, a fully-connected layer, and softmax producing the logits. The inset annotations read: for layer i = 2, #DTSC = min{(i-1)w_c, t_c} = 3; for layer i = 3, #DTSC = min{(i-1)w_c, t_c} = 5; if a channel is selected, it contributes long-range links to all output channels of the current layer.]

Fig. 5. An example of candidate neural architectures from our search space. (The values of w_c, d_c, and t_c are only for illustration and do not represent the real search space.) Not all skip connections are shown in the figure, for simplicity. The upper inset shows the contribution from all skip and short-range links to layer i = 2: the feature maps of the randomly selected channels are concatenated as the input of the current layer i = 2 (similar to DenseNets [25]). At each layer in a given cell, the maximum number of channels contributing to skip connections is controlled by t_c.


Table 5. Our NN-Degree based accuracy predictor for neural architecture search vs. existing predictors implemented by graph-based neural networks. We calculate the improvement ratio for each metric by considering the best among all existing approaches in this table. ('-' denotes that the corresponding results are not reported or not applicable.)

Accuracy Estimation Technique | SS Size | % of FLASH SS | # Training Samples | Ratio (×) w.r.t. FLASH | RMSE (%) | Training Time (s) | Ratio (×) w.r.t. FLASH
GNN+MLP [40] | 4.2 × 10^5 | 6.6 × 10^-4 % | 3.8 × 10^5 | 15250 | - | - | -
GNN [33] | 4.2 × 10^5 | 6.6 × 10^-4 % | 3.0 × 10^5 | 11862 | 0.05 | - | -
GCN [9] | 1.6 × 10^4 | 2.5 × 10^-5 % | 1.0 × 10^3 | 40 | >1.8 | - | -
GCN [54] | 4.2 × 10^5 | 6.6 × 10^-4 % | 1.7 × 10^2 | 6.88 | 1.4 | 25 | 66
FLASH (NN-Degree + Logistic Regression) | 6.4 × 10^10 | 100% | 2.5 × 10^1 | 1 | 0.152 | 0.38 | 1

Furthermore, Fig. 6(b) and Fig. 6(c) also show the test accuracy vs. NN-Degree of 20 networks on the CIFAR-100 dataset and 27 networks on Tiny-ImageNet, respectively, randomly sampled from the search space. Clearly, our proposed metric NN-Degree predicts the test accuracy of neural networks on these two datasets as well. Indeed, the results confirm empirically our claim in Section 3, i.e., networks with higher NN-Degree values have better test accuracy.

Next, we use the proposed NN-Degree to build the analytical accuracy predictor. We train as few as 25 sample architectures randomly sampled from the entire search space and record their test accuracy and NN-Degree on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Then, we fine-tune our NN-Degree-based accuracy predictor described by Equation 7. As shown in Fig. 7(a), Fig. 7(b), and Fig. 7(c), our accuracy predictor achieves very high performance on all these datasets while using surprisingly few samples and only three parameters.
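Since Equation 7 is not reproduced in this section, the sketch below assumes a generic three-parameter logistic curve mapping NN-Degree to accuracy purely to illustrate the fine-tuning step; the functional form and the sample values are placeholders, not the paper's exact predictor or data.

```python
# Hedged sketch: assumes a three-parameter logistic form
# acc(g) = a / (1 + exp(-b * (g - c))) for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def predictor(g, a, b, c):
    """Three-parameter accuracy predictor as a function of NN-Degree g."""
    return a / (1.0 + np.exp(-b * (g - c)))

# These arrays would hold the ~25 trained samples; the values below are
# hypothetical placeholders in the range of Fig. 6(a).
nn_degrees = np.array([250.0, 400.0, 600.0, 800.0, 1000.0, 1200.0])
test_accuracies = np.array([94.1, 94.8, 95.6, 96.1, 96.5, 96.8])

params, _ = curve_fit(predictor, nn_degrees, test_accuracies,
                      p0=[97.0, 0.003, -1000.0], maxfev=10000)
rmse = np.sqrt(np.mean((predictor(nn_degrees, *params) - test_accuracies) ** 2))
print(params, rmse)
```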

[Figure 6 appears here: three scatter plots of test accuracy (%) vs. NN-Degree, titled "Test Accuracy vs. NN-Degree" on (a) CIFAR-10, (b) CIFAR-100, and (c) Tiny-ImageNet.]

Fig. 6. We randomly select multiple networks from the search space, then train and test their accuracy on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. (a) Real test accuracy vs. NN-Degree: networks with higher NN-Degree values have a higher test accuracy on the CIFAR-10 dataset. (b) Real test accuracy vs. NN-Degree: networks with higher NN-Degree values have a higher test accuracy on the CIFAR-100 dataset. (c) Real test accuracy vs. NN-Degree: networks with higher NN-Degree values have a higher test accuracy on the Tiny-ImageNet dataset.

[Figure 7 appears here: three scatter plots of predicted vs. measured test accuracy (%), titled "Accuracy Predictor" on (a) CIFAR-10, (b) CIFAR-100, and (c) Tiny-ImageNet.]

Fig. 7. (a) Predictions of our NN-Degree-based accuracy predictor vs. real test accuracy on the CIFAR-10 dataset. (b) Predictions of our NN-Degree-based accuracy predictor vs. real test accuracy on the CIFAR-100 dataset. (c) Predictions of our NN-Degree-based accuracy predictor vs. real test accuracy on the Tiny-ImageNet dataset. The red dotted lines in these figures show a very good correlation between the predicted and measured values.


[Figure 8 appears here: a flowchart of the proposed training-free NAS approach. Stage 1 performs inference on the target hardware to generate the hardware performance models (latency L, energy E, area A). Stage 2 initializes S = 0 and g_D* = 0, repeatedly samples a network D from the search space, estimates its hardware performance (L'_D, E'_D, A'_D), keeps it as D* if it satisfies L'_D ≤ L_M, E'_D ≤ E_M, A'_D ≤ A_M and g_D ≥ g_D*, increments S, and stops once S ≥ T_S, after which D* is trained. Symbols: D = neural architecture; g = NN-Degree; g_D, g_D* = NN-Degree of networks D and D*; S = number of samples; T_S = threshold of S; L_M, E_M, A_M = constraints on latency, energy, and area.]

Fig. 8. Overview of the proposed training-free NAS approach. Stage 1 (red box): we build hardware (HW) performance models by randomly sampling candidate networks from the search space to evaluate the hardware characteristics (latency L, energy E, and area A). Stage 2 (blue box): we search for the optimal network architecture under the hardware performance constraints (i.e., L_M, E_M, and A_M); we randomly choose some architectures and use the HW performance models to estimate their hardware performance. Then, we select the neural architecture D* with the highest NN-Degree which meets the HW performance constraints. Finally, we train the obtained architecture D* to get the optimal neural architecture.

We also compare our NN-Degree-based accuracy predictor with the current state-of-the-art approaches. As shown in Table 5, most existing approaches use graph-based neural networks to make predictions [9, 33, 40, 54]. However, graph-based neural networks require much more training data, and they are far more complicated in terms of computation and model structure than classical methods like logistic regression. Due to the significant reduction in model complexity, our predictor requires 6.88× fewer training samples, although it covers a much larger search space (1.5 × 10^5 larger than the existing work). Moreover, our NN-Degree-based predictor has only three parameters to update; hence, it needs 66× less fine-tuning time than the existing approaches. Finally, besides such low model complexity and fast training, our predictor achieves a very small RMSE (0.152%) as well.

During the search, we use the accuracy predictor to directly predict the accuracy of sample architectures instead of performing time-consuming training. The high precision and low complexity of our proposed accuracy predictor also enable us to adopt very fast optimization methods during the search stage. Furthermore, because our proposed metric NN-Degree predicts the test performance of a given architecture, we can use NN-Degree as a proxy for the test accuracy and thus perform NAS without the time-consuming training process. This training-free property allows us to quickly compare the accuracy of candidate architectures and thus accelerate the entire NAS.

4.3 NN-Degree based Training-free NAS

To conduct the training-free NAS, we reformulate the problem described by Equation 1 as follows:

$$\max \theta, \quad \text{subject to: } \mathcal{A} \le \mathcal{A}_M,\ \mathcal{L} \le \mathcal{L}_M,\ \mathcal{E} \le \mathcal{E}_M \qquad (14)$$
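The sketch below illustrates the second stage of this search as described in Fig. 8: sample candidates, discard those whose estimated hardware metrics violate the constraints of Equation 14, and keep the feasible candidate with the highest NN-Degree. The sampler, NN-Degree evaluator, and hardware-model callables are the illustrative placeholders introduced earlier, not the authors' implementation.

```python
def training_free_nas(n_samples, limits, estimate_hw, sample_fn, degree_fn):
    """Stage 2 of Fig. 8: pick the feasible candidate with the highest NN-Degree.

    limits: dict with keys 'area', 'latency', 'energy' (A_M, L_M, E_M).
    estimate_hw: callable mapping a candidate config to a dict with the same
                 keys, standing in for the fine-tuned analytical hardware models.
    sample_fn / degree_fn: candidate sampler and NN-Degree evaluator.
    """
    best, best_g = None, float("-inf")
    for s in range(n_samples):
        cand = sample_fn(seed=s)
        hw = estimate_hw(cand)
        # Discard candidates that violate any hardware constraint (Eq. 14).
        if any(hw[k] > limits[k] for k in limits):
            continue
        g = degree_fn(cand)
        if g > best_g:
            best, best_g = cand, g
    return best, best_g
```

In FLASH, the sampler would draw {w_m, N_c, d_c, t_c} tuples, the hardware estimator would be the fine-tuned analytical models, and only the returned architecture D* would then be trained.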


Table 6. Our NN-Degree based training-free NAS (FLASH) and several representative time-efficient NAS approaches on the CIFAR-10 dataset. We select the optimal architecture with the highest NN-Degree value among 20,000 randomly sampled architectures on a 20-core CPU.

Method | Search Method | #Params | Search Cost | Training needed | Test error (%)
ENAS [42] | RL + weight sharing | 4.6M | 12 GPU hours | Yes | 2.89
SNAS [57] | gradient-based | 2.8M | 36 GPU hours | Yes | 2.85
DARTS-v1 [32] | gradient-based | 3.3M | 1.5 GPU hours | Yes | 3.0
DARTS-v2 [32] | gradient-based | 3.3M | 4 GPU hours | Yes | 2.76
ProxylessNAS [8] | gradient-based | 5.7M | NA | Yes | 2.08
Zero-Cost [1] | Proxy-based | NA | NA | Yes | 5.78
TE-NAS [11] | Proxy-based | 3.8M | 1.2 GPU hours | No | 2.63
FLASH | NN-Degree based | 3.8M | 0.11 seconds | No | 3.13

To maximize θ, we can search for the network with the maximal NN-Degree value, which eliminates the training time of candidate architectures. In Fig. 8, we show how we use the NN-Degree to perform training-free NAS. During the first stage, we profile a few networks on the target hardware and fine-tune our hardware performance models. During the second stage, we randomly sample candidate architectures and select those that meet the hardware performance constraints. We use the fine-tuned analytical models to estimate the hardware performance instead of running real inference, which improves the time efficiency of the entire NAS. After that, we select the architecture with the highest NN-Degree value that meets the hardware performance constraints. We note that the NAS process itself is training-free (hence lightweight), as only the final solution D* needs to be trained.

To evaluate the performance of our training-free NAS framework, we randomly sample 20,000 candidate architectures from the search space and select the one with the highest NN-Degree value as the obtained/optimal architecture. Specifically, it takes only 0.11 seconds to evaluate the NN-Degree of these 20,000 samples on a 20-core CPU and obtain the optimal architecture (no GPU needed). As shown in Table 6, the optimal architecture among these 20,000 samples achieves test performance comparable to the representative time-efficient NAS approaches, but with much lower time cost and compute requirements.

4.4 Analytical hardware performance models

Our experiments show that using 180 samples offers a good balance between the analytical models' accuracy and the number of fine-tuning samples. Hence, we randomly select 180 neural architectures from the search space to build our analytical hardware performance models.

Fig. 9. Performance of our analytical hardware models on ImageNet classification networks: (a) Predicted values by our analytical area model vs. measured area. (b) Predicted values by our analytical latency model vs. measured latency. (c) Predicted values by our analytical energy model vs. measured energy consumption. The red lines demonstrate that our proposed models generalize well for networks evaluated on ImageNet-scale datasets.


Table 7. Summary of the performance of our proposed analytical models for Area, Latency, and Energy.

Model | #Features | Mean Error (%) | Max Error (%) | Fine-tuning Time (s)
Area | 2 | 0.1 | 0.2 | 0.49
Latency | 9 | 3.0 | 20.8 | 0.52
Energy | 16 | 3.7 | 24.4 | 0.56

Table 8. Estimation error with different ML models for ImageNet with IMC as target hardware platform.

 | SVM | Random Forest (Max. Depth = 16) | Analytical Models (Proposed)
Latency Est. Error (%) | 58.98 | 8.23 | 6.7
Energy Est. Error (%) | 78.49 | 11.01 | 3.5
Area Est. Error (%) | 36.99 | 13.37 | 1.7

Next, we perform inference with these 180 selected networks on our simulator [29] to obtain their area, latency, and energy consumption. After obtaining the hardware performance of the 180 sample networks, we fine-tune the parameters of our proposed analytical area, latency, and energy models discussed in Section 3. To evaluate the fine-tuned models, we randomly select another 540 sample architectures from the search space, conduct inference, and obtain their hardware performance.

Table 7 summarizes the performance of our analytical models. The mean estimation error is always less than 4%. Fig. 9 shows the hardware performance estimated by our analytical models for the ImageNet dataset. We observe that the estimates coincide with the values measured from simulation. Our analytical models thus provide very accurate predictions of hardware performance at a time cost of less than 1 second on a 20-core CPU. This high accuracy and low computational workload enable us to adopt these analytical models directly to accelerate the search stage instead of conducting real inference.

Comparison with other machine learning models: Table 8 compares the estimation error of SVM, of a random forest with a maximum tree depth of 16, and of the proposed analytical hardware models for the ImageNet dataset. A maximum tree depth of 16 is chosen for the random forest since it provides the best accuracy among random forest models. We observe that our proposed analytical hardware models achieve the smallest error among the three modeling techniques. SVM performs poorly since it tries to fit the data with a hyper-plane, and no such plane may exist given the complex relationship between the features and the performance of the hardware platform.
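The exact analytical expressions from Section 3 are not reproduced in this excerpt; as a hedged sketch of the fine-tuning step, the snippet below fits the coefficients of a model assumed to be linear in a vector of hand-crafted architecture features on the profiled samples (e.g., the 180 networks) and evaluates the mean relative error on held-out samples (e.g., the additional 540 networks). The feature extraction itself is a placeholder.

```python
import numpy as np

def fit_hw_model(features_train, measured_train):
    """Least-squares fit of the model coefficients on profiled samples."""
    X = np.asarray(features_train, dtype=float)
    y = np.asarray(measured_train, dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def mean_relative_error(coeffs, features_test, measured_test):
    """Evaluate the fitted model on held-out samples, in percent."""
    X = np.asarray(features_test, dtype=float)
    y = np.asarray(measured_test, dtype=float)
    pred = X @ coeffs
    return float(np.mean(np.abs(pred - y) / y)) * 100.0
```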

Fig. 10. Performance comparison between our mesh-NoC and the cmesh-NoC [46] on CIFAR-10 classification networks for 16 different networks: (a) Our mesh-NoC needs much less area than the cmesh-NoC; (b) Our mesh-NoC has almost the same latency as the cmesh-NoC; (c) Our mesh-NoC consumes much less energy than the cmesh-NoC.


Fig. 11. Performance comparison between our mesh-NoC and the cmesh-NoC [46] on ImageNet classification networks for 15 different networks: (a) Our mesh-NoC needs much less area than the cmesh-NoC; (b) Our mesh-NoC has almost the same latency as the cmesh-NoC; (c) Our mesh-NoC consumes much less energy than the cmesh-NoC.

4.5 On-chip communication optimization

As shown in Fig. 10 and Fig. 11, we compare the NoC performance (area, energy, and latency) of our FLASH mesh-NoC with the cmesh-NoC [46] for networks randomly selected from the search space for the CIFAR-10 and ImageNet datasets (16 and 15 networks, respectively). We observe that the mesh-NoC occupies on average only 37% of the area and consumes only 41% of the energy of the cmesh-NoC. Since the cmesh-NoC uses extra links and repeaters to connect diagonal routers, the area and energy of the cmesh-NoC are significantly higher than those of the mesh-NoC. The additional links and routers in the cmesh-NoC result in lower hop counts than the mesh-NoC. However, the lower hop count reduces the latency only at low congestion. As the congestion in the NoC increases, the latency of the cmesh-NoC becomes higher than that of the mesh-NoC due to the increased utilization of the additional links. This phenomenon is also demonstrated in [19]. Therefore, the communication latency with the cmesh-NoC is higher than with the mesh-NoC for most of the DNNs. The communication latency with the mesh-NoC is on average within 3% of the communication latency with the cmesh-NoC. Moreover, we observe that the average utilization of the queues in the mesh-NoC varies between 20%–40% for the ImageNet dataset. Furthermore, the maximum utilization of the queues ranges from 60% to 80%. Therefore, the mesh-NoC is heavily congested. Thus, our proposed communication optimization strategy outperforms the state-of-the-art approaches.

4.6 Hierarchical SHGO-based neural architecture search

After fine-tuning the NN-Degree-based accuracy predictor and the analytical hardware performance models, we use our proposed hierarchical SHGO-based search algorithm to perform the neural architecture search.

Baseline approach: Reinforcement Learning (RL) is widely used in NAS [24, 27, 61]; hence, we have implemented an RL-based NAS framework as a baseline. For this baseline, we consider the objective function in Equation 1. Specifically, we adopt a deep-Q network approach for the baseline-RL [38]. We construct four different controllers for the number of cells (N_c), cell depth (d_c), width multiplier (w_m), and number of long skip connections (SC_c). The training hyper-parameters for the baseline-RL are shown in Table 9. The baseline-RL approach estimates the optimal parameters (N_c, d_c, w_m, SC_c). We tune the baseline-RL approach to obtain the best possible results. We also implement a one-level SHGO algorithm (i.e., the original SHGO) as another baseline to show the efficiency of our hierarchical algorithm. We compare the baseline-RL approach with our proposed SHGO-based optimization approach.
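As a rough illustration of the SHGO building block (not the hierarchical two-level algorithm itself), the snippet below runs SciPy's simplicial homology global optimizer [18] over the four design parameters, maximizing the ratio of predicted accuracy to the product of predicted latency and energy as a simplified stand-in for the objective of Equation 1, whose exact form is not reproduced here. The predictor and hardware-model functions, the constraint bounds, and the SC_c range are placeholders.

```python
from scipy.optimize import shgo

# Placeholder models standing in for the fine-tuned accuracy predictor and the
# analytical hardware models (illustrative forms only, not from the paper).
def predict_accuracy(x): return 90.0 + 2.0 * x[2] + 0.001 * x[3]
def predict_latency(x):  return 1.0 + 0.05 * x[1] * x[2]
def predict_energy(x):   return 0.5 + 0.02 * x[1] * x[2] + 0.0001 * x[3]

LAT_MAX, EN_MAX = 5.0, 3.0   # example hardware constraints (L_M, E_M)

def negative_objective(x):
    # Minimize the negative of accuracy / (latency * energy).
    return -predict_accuracy(x) / (predict_latency(x) * predict_energy(x))

bounds = [(1, 3),       # N_c  (number of cells)
          (5, 30),      # d_c  (cell depth)
          (1, 3),       # w_m  (width multiplier)
          (0, 2000)]    # SC_c (long skip connections; illustrative range)

constraints = ({'type': 'ineq', 'fun': lambda x: LAT_MAX - predict_latency(x)},
               {'type': 'ineq', 'fun': lambda x: EN_MAX - predict_energy(x)})

result = shgo(negative_objective, bounds, constraints=constraints,
              n=128, sampling_method='sobol')
print(result.x, -result.fun)
```

Since N_c, d_c, and w_m are integers, a practical implementation would round the returned point or embed the rounding inside the objective; FLASH's hierarchical variant instead splits the optimization into nested SHGO calls, which is not shown here.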

As shown in Table 10, when there is no constraint on accuracy and hardware performance, our hierarchical SHGO-based algorithm brings negligible overhead compared to the one-level SHGO algorithm. Moreover, our hierarchical SHGO-based algorithm needs far fewer samples


Table 9. Parameters chosen for the baseline-RL approach.

Metric | Value
Number of layers | 3
Number of neurons in each layer | 20
Optimizer | ADAM
Learning rate | 0.001
Activation | softmax
Loss | MSE

[Figure 12 appears here: (a) "Performance of our Latency Model", predicted vs. measured latency (ms); (b) "Performance of our Energy Model", predicted vs. measured energy consumption (J), both for the RPi-3B.]

Fig. 12. (a) Predictions of our analytical latency models vs. measured values for RPi-3B. (b) Predictions of our analytical energy consumption models vs. measured values for RPi-3B. The red dotted lines in these two figures show a high correlation between predicted and measured values.

(144.93×) during the search process than RL-based methods. Our proposed search algorithm runs in as little as 0.07 seconds and is 27929× faster than the RL-based methods, while achieving the same quality of solution! As for searching with specific constraints, the training of the RL-based method does not even converge after training with 10000 samples. Furthermore, our hierarchical SHGO-based algorithm obtains a better-quality model with 7.03× fewer samples and 14.7× less search time compared to the one-level SHGO algorithm. These results show that our proposed hierarchical strategy further improves the efficiency of the original SHGO algorithm.

4.7 Case study: Raspberry Pi and Odroid MC1

As discussed in the previous sections, each component and stage of FLASH is very efficient in terms of both computation and time cost. To further demonstrate the efficiency of our FLASH methodology, we implement FLASH on two typical edge devices, namely the Raspberry Pi-3 Model-B (RPi-3B) and the Odroid MC1 (MC1).

Table 10. Comparison between RL-based search, one-level SHGO-based search, and our proposed hierarchical SHGO-based search. No constraint means that we do not set any bounds on the accuracy, area, latency, and energy consumption of the networks; we compare FLASH with RL when there are no constraints. For searching with constraints, we set the minimal accuracy to 95.8% (θ ≥ θ_M = 95.8%) as an example; we compare FLASH with one-level SHGO because RL does not converge. The quality of the model is calculated by the objective function in Equation 1 (higher is better).

Constraints involved? | Method | Search cost (#Samples) | Search Time (s) | Quality of obtained model (Eq. 1) | Converge?
No | RL | 10000 | 1955 | 20984 | Yes
No | one-level SHGO | 23 | 0.03 | 20984 | Yes
No | hierarchical SHGO (FLASH) | 69 | 0.07 | 20984 | Yes
No | Improvement | 144.93× | 27929× | 1× | -
Yes, θ ≥ θ_M | RL | >10000 | - | - | No
Yes, θ ≥ θ_M | one-level SHGO | 1195 | 3.82 | 10550 | Yes
Yes, θ ≥ θ_M | hierarchical SHGO (FLASH) | 170 | 0.26 | 11969 | Yes
Yes, θ ≥ θ_M | Improvement | 7.03× | 14.7× | 1.13× | -


Fig. 13. (a) Predictions of our analytical latency models vs. measured values for MC1. (b) Predictions of our analytical energy consumption models vs. measured values for MC1. The red dotted lines in these two figures show a very good correlation between the predicted and measured values.

Setup: The RPi-3B has an Arm Cortex-A53 quad-core processor with a nominal frequency of 1.2 GHz and 1 GB of RAM. Furthermore, we use the Odroid Smart Power 2 to measure voltage, current, and power. We use TensorFlow-Lite (TF-Lite) as the run-time framework on the RPi-3B. To achieve this, we first define the architecture of the models in TensorFlow (TF); then we convert the TF model into the TF-Lite format and generate the binary file deployed on the RPi-3B.
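A minimal sketch of this TF-to-TF-Lite conversion flow is shown below; the Keras model is a placeholder rather than one of the candidate architectures from the search space.

```python
# Minimal sketch of the TF -> TF-Lite conversion step; the Keras model below is
# a placeholder, not one of the candidate architectures.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# The resulting flat buffer is the binary deployed on the RPi-3B / MC1.
with open('candidate.tflite', 'wb') as f:
    f.write(tflite_model)
```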

The Odroid MC1 is powered by the Exynos 5422, a heterogeneous multiprocessor system-on-chip (MPSoC). This SoC consists of two clusters of ARM cores and a small GPU core. Besides the hardware platform itself, we use the same setup as for the RPi-3B.

Accuracy predictor and analytical hardware performance models: We adopt the same accuracy predictor used in Section 4.6. We only consider latency and energy consumption as the hardware performance metrics because the chip area is fixed. Hence, the objective function for searching on the RPi-3B and MC1 is:

π‘“π‘œπ‘ 𝑗 =π΄π‘π‘π‘’π‘Ÿπ‘Žπ‘π‘¦

πΏπ‘Žπ‘‘π‘’π‘›π‘π‘¦ Γ— πΈπ‘›π‘’π‘Ÿπ‘”π‘¦ (15)

To fine-tune the analytical latency and energy models, we randomly select 180 sample networks from the search space. We then convert them into the TF-Lite format and record their latency and energy consumption on the RPi-3B. Based on the recorded data, we update the parameters of the analytical latency and energy models. Fig. 12 and Fig. 13 show that our analytical hardware performance models almost coincide with the real performance of both the RPi-3B and the MC1.

Search Process on RPi-3B and MC1: We do not show results for RL-based methods because training RL models requires intensive computation resources; thus, they cannot be deployed on the RPi-3B or MC1. As shown in Table 11, for searching without any constraint, our hierarchical SHGO-based algorithm has only a minimal overhead compared with the basic (one-level) SHGO algorithm. Moreover, our hierarchical SHGO-based algorithm is faster than the one-level SHGO algorithm on the MC1.

Table 11. Comparison between one-level and hierarchical SHGO-based search on the RPi-3B and the Odroid MC1. For searching with constraints, we set the minimal accuracy to 96% (θ ≥ θ_M = 96%) as an example. The quality of the model is calculated by Equation 15 (higher is better).

Constraints involved? | Method | Search Cost (# Samples): RPi-3B / MC1 | Search time (s): RPi-3B / MC1 | Model Quality (Equation 15): RPi-3B / MC1
No | one-level SHGO | 112 / 113 | 1.68 / 0.71 | 4.74 / 4.13
No | hierarchical SHGO (FLASH) | 180 / 135 | 2.21 / 0.45 | 4.74 / 4.13
Yes, θ ≥ θ_M | one-level SHGO | 1309 / 1272 | 45.98 / 9.65 | 0.35 / 0.38
Yes, θ ≥ θ_M | hierarchical SHGO (FLASH) | 261 / 414 | 2.33 / 1.32 | 0.48 / 0.57
Yes, θ ≥ θ_M | Improvement | 5.01× / 3.07× | 19.73× / 20.5× | 1.37× / 1.51×


For searching with constraints, the hierarchical SHGO-based algorithm obtains a better-quality model with 5.01× fewer samples and 19.73× less search time on the RPi-3B; we achieve similar improvements on the MC1 as well. These results again prove the effectiveness of our hierarchical strategy. Overall, the total search times on the RPi-3B and MC1 are as short as 2.33 seconds and 1.32 seconds, respectively, on such resource-constrained edge devices. To the best of our knowledge, this is the first time a neural architecture search has been reported directly on edge devices.

5 Conclusions and Future Work

This paper has presented a very fast methodology, called FLASH, to improve the time efficiency of NAS. To this end, we have proposed a new topology-based metric, namely the NN-Degree. Using the NN-Degree, we have proposed an analytical accuracy predictor trained with as few as 25 samples out of a vast search space with more than 63 billion configurations. Our accuracy predictor achieves the same performance with 6.88× fewer samples and a 65.79× reduction in fine-tuning time compared to state-of-the-art approaches. We have also optimized the on-chip communication by designing a mesh-NoC for communication across multiple layers; based on the optimized hardware, we have built new analytical models to predict area, latency, and energy consumption.

Combining the accuracy predictor and the analytical hardware performance models, we have developed a hierarchical simplicial homology global optimization (SHGO)-based algorithm to optimize the co-design process while considering both the test accuracy and the area, latency, and energy figures of the target hardware. Finally, we have demonstrated that our newly proposed hierarchical SHGO-based algorithm enables 27929× faster (less than 0.1 seconds) NAS compared to state-of-the-art RL-based approaches. We have also shown that FLASH can be readily transferred to other hardware platforms by performing NAS on a Raspberry Pi-3B and an Odroid MC1 in less than 3 seconds. To the best of our knowledge, our work is the first to report NAS performed directly and efficiently on edge devices.

We note that there is no fundamental limitation to applying FLASH to other machine learning tasks. However, no IMC-based architectures are widely adopted yet for other machine learning tasks such as speech recognition or object segmentation. Therefore, the current work focuses on DNN inference and leaves the extension to other machine learning tasks as future work. Finally, we plan to incorporate more types of networks, such as ResNet and MobileNet-v2, as part of our future work.

6 Acknowledgments

This work was supported in part by the US National Science Foundation (NSF) grant CNS-2007284, and in part by Semiconductor Research Corporation (SRC) grants GRC 2939.001 and 3012.001.

References

[1] Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas Donald Lane. 2021. Zero-Cost Proxies for Lightweight NAS. In International Conference on Learning Representations.
[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2016. Designing Neural Network Architectures using Reinforcement Learning. arXiv preprint arXiv:1611.02167 (2016).
[3] Albert-László Barabási and Eric Bonabeau. 2003. Scale-free Networks. Scientific American 288, 5 (2003), 60–69.
[4] Hadjer Benmeziane et al. 2021. A Comprehensive Survey on Hardware-Aware Neural Architecture Search. arXiv preprint arXiv:2101.09336 (2021).
[5] Kartikeya Bhardwaj, Guihong Li, and Radu Marculescu. 2021. How Does Topology Influence Gradient Propagation and Model Performance of Deep Networks With DenseNet-Type Skip Connections?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Tom B Brown et al. 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165 (2020).
[7] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once-for-All: Train One Network and Specialize it for Efficient Deployment. In International Conference on Learning Representations.
[8] Han Cai, Ligeng Zhu, and Song Han. 2019. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In International Conference on Learning Representations.
[9] Thomas Chau, Łukasz Dudziak, Mohamed S Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas D Lane. 2020. BRP-NAS: Prediction-based NAS using GCNs. arXiv preprint arXiv:2007.08668 (2020).
[10] Pai-Yu Chen, Xiaochen Peng, and Shimeng Yu. 2018. NeuroSim: A Circuit-level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 12 (2018), 3067–3080.
[11] Wuyang Chen, Xinyu Gong, and Zhangyang Wang. 2021. Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective. In International Conference on Learning Representations.
[12] Wei-Lin Chiang et al. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 257–266.
[13] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016).
[14] Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. 2020. Grow and Prune Compact, Fast, and Accurate LSTMs. IEEE Trans. Comput. 69, 3 (2020), 441–452.
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 248–255.
[16] Xuanyi Dong and Yi Yang. 2020. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. arXiv preprint arXiv:2001.00326 (2020).
[17] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural Architecture Search: A Survey. The Journal of Machine Learning Research 20, 1 (2019), 1997–2017.
[18] Stefan C Endres, Carl Sandrock, and Walter W Focke. 2018. A Simplicial Homology Algorithm for Lipschitz Optimisation. Journal of Global Optimization 72, 2 (2018), 181–217.
[19] Boris Grot and Stephen W Keckler. 2008. Scalable On-Chip Interconnect Topologies. In 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects.
[20] Song Han, Huizi Mao, and William J Dally. 2015. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149 (2015).
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
[23] Mark Horowitz. 2014. 1.1 Computing's Energy Problem (and what we can do about it). In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). 10–14.
[24] Chi-Hung Hsu et al. 2018. MONAS: Multi-objective Neural Architecture Search using Reinforcement Learning. arXiv preprint arXiv:1806.10332 (2018).
[25] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[26] Nan Jiang et al. 2013. A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator. In IEEE ISPASS. 86–96.
[27] Weiwen Jiang et al. 2020. Device-circuit-architecture Co-exploration for Computing-in-memory Neural Accelerators. IEEE Trans. Comput. (2020).
[28] Gokul Krishnan et al. 2020. Interconnect-aware Area and Energy Optimization for In-memory Acceleration of DNNs. IEEE Design & Test 37, 6 (2020), 79–87.
[29] Gokul Krishnan et al. 2021. Interconnect-Centric Benchmarking of In-Memory Acceleration for DNNs. In 2021 China Semiconductor Technology International Conference (CSTIC). IEEE, 1–4.
[30] Yuhong Li et al. 2020. EDD: Efficient Differentiable DNN Architecture and Implementation Co-Search for Embedded AI Solutions. In Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference. IEEE Press, Article 130, 6 pages.
[31] Chenxi Liu et al. 2018. Progressive Neural Architecture Search. In Proceedings of the European Conference on Computer Vision (ECCV). 19–34.
[32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable Architecture Search. arXiv preprint arXiv:1806.09055 (2018).
[33] Jovita Lukasik, David Friede, Heiner Stuckenschmidt, and Margret Keuper. 2020. Neural Architecture Performance Prediction Using Graph Neural Networks. arXiv preprint arXiv:2010.10024 (2020).
[34] Sumit K Mandal et al. 2020. A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10, 3 (2020), 362–375.
[35] Christopher D Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
[36] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In ACM Special Interest Group on Data Communication. 270–288.
[37] Diana Marculescu, Dimitrios Stamoulis, and Ermao Cai. 2018. Hardware-Aware Machine Learning: Modeling and Optimization. In Proceedings of the International Conference on Computer-Aided Design (ICCAD '18).
[38] Volodymyr Mnih et al. 2013. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602 (2013).
[39] Mark Newman, Albert-László Barabási, and Duncan J Watts. 2006. The Structure and Dynamics of Networks. Princeton University Press.
[40] Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. 2020. A Generic Graph-based Neural Architecture Encoding Scheme for Predictor-based NAS. (2020).
[41] Xiaochen Peng et al. 2019. Inference Engine Benchmarking Across Technological Platforms from CMOS to RRAM. In Proceedings of the International Symposium on Memory Systems. 471–479.
[42] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient Neural Architecture Search via Parameters Sharing. In International Conference on Machine Learning. PMLR, 4095–4104.
[43] Ximing Qiao, Xiong Cao, Huanrui Yang, Linghao Song, and Hai Li. 2018. AtomLayer: A Universal ReRAM-based CNN Accelerator with Atomic Layer Computation. In IEEE/ACM DAC.
[44] Esteban Real et al. 2017. Large-scale Evolution of Image Classifiers. In International Conference on Machine Learning. PMLR, 2902–2911.
[45] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[46] Ali Shafiee et al. 2016. ISAAC: A Convolutional Neural Network Accelerator with in-situ Analog Arithmetic in Crossbars. ACM/IEEE ISCA (2016).
[47] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
[48] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. PipeLayer: A Pipelined ReRAM-based Accelerator for Deep Learning. In IEEE HPCA. 541–552.
[49] Dimitrios Stamoulis et al. 2019. Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours. arXiv preprint arXiv:1904.02877 (2019).
[50] Mingxing Tan et al. 2019. MnasNet: Platform-aware Neural Architecture Search for Mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2820–2828.
[51] Andreas Veit, Michael Wilber, and Serge Belongie. 2016. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. arXiv preprint arXiv:1605.06431 (2016).
[52] Siqi Wang, Anuj Pathania, and Tulika Mitra. 2020. Neural Network Inference on Mobile SoCs. IEEE Design & Test 37, 5 (2020), 50–57.
[53] Duncan J Watts and Steven H Strogatz. 1998. Collective Dynamics of 'Small-World' Networks. Nature 393, 6684 (1998), 440–442.
[54] Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. 2020. Neural Predictor for Neural Architecture Search. In European Conference on Computer Vision. Springer, 660–676.
[55] Martin Wistuba, Ambrish Rawat, and Tejaswini Pedapati. 2019. A Survey on Neural Architecture Search. arXiv preprint arXiv:1905.01392 (2019).
[56] Bichen Wu et al. 2019. FBNet: Hardware-aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10734–10742.
[57] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2019. SNAS: Stochastic Neural Architecture Search. In International Conference on Learning Representations.
[58] Jiahui Yu et al. 2020. BigNAS: Scaling up Neural Architecture Search with Big Single-Stage Models. In Computer Vision – ECCV 2020. 702–717.
[59] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. arXiv preprint arXiv:1605.07146 (2016).
[60] Barret Zoph and Quoc V Le. 2016. Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578 (2016).
[61] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 8697–8710.
