
EmotionNet Nano: An Efficient Deep Convolutional Neural Network Design for Real-time Facial Expression Recognition

James Ren Hou Lee, Linda Wang, and Alexander Wong
Department of Systems Design Engineering, University of Waterloo, Canada

Waterloo Artificial Intelligence Institute, Canada
DarwinAI Corp., Canada

{jrhlee, linda.wang, a28wong}@uwaterloo.ca

Abstract

While recent advances in deep learning have led to significant improvements in facial expression classification (FEC), a major challenge that remains a bottleneck for the widespread deployment of such systems is their high architectural and computational complexities. This is especially challenging given the operational requirements of various FEC applications, such as safety, marketing, learning, and assistive living, where real-time operation on low-cost embedded devices is desired. Motivated by this need for a compact, low latency, yet accurate system capable of performing FEC in real-time on low-cost embedded devices, this study proposes EmotionNet Nano, an efficient deep convolutional neural network created through a human-machine collaborative design strategy, where human experience is combined with machine meticulousness and speed in order to craft a deep neural network design catered towards real-time embedded usage. Two different variants of EmotionNet Nano are presented, each with a different trade-off between architectural and computational complexity and accuracy. Experimental results using the CK+ facial expression benchmark dataset demonstrate that the proposed EmotionNet Nano networks achieve accuracies comparable to state-of-the-art FEC networks while requiring significantly fewer parameters (e.g., 23× fewer compared to [26] at a higher accuracy). Furthermore, we demonstrate that the proposed EmotionNet Nano networks achieve real-time inference speeds (e.g., >25 FPS and >70 FPS at 15W and 30W, respectively) and high energy efficiency (e.g., >1.7 images/sec/watt at 15W) on an ARM embedded processor, further illustrating the efficacy of EmotionNet Nano for deployment on embedded devices.

Figure 1. Examples of six different expressions, taken from the CK+ dataset, that need to be distinguished via facial expression classification: A) anger, B) disgust, C) fear, D) happiness, E) sadness, and F) surprise.

1. Introduction

Facial expression classification (FEC) is an area in computer vision that has benefited significantly from the rapid advances in machine learning, which has enabled data collections comprising a diversity of facial expressions captured from different individuals to be leveraged to learn classifiers for differentiating between facial expression types. In particular, deep learning applied to FEC has led to significant improvements in accuracy under complex conditions, such as varying lighting, angle, or occlusion.

Even though the performance of deep learning-based FEC systems continues to rise, widespread deployment of such systems is limited, with one of the biggest hurdles being the high architectural and computational complexities of the deep neural networks that drive such systems. This hurdle is particularly limiting for real-time embedded scenarios, where low latency operation is required on low-cost embedded devices. For example, in the area of assistive technologies for improving quality of life, the majority


of individuals using such technologies are unwilling to carry large, bulky, and expensive devices with them during their daily lives, as that would be a significant hindrance limiting their ability to leverage the technologies in a seamless manner. As such, assistive devices must leverage small, low-cost embedded processors, yet provide low enough latency to enable real-time feedback to the user. Another example is in-car driver monitoring [12], where an FEC system would record the driver, determine their current mental state, and warn them if their awareness level is deteriorating. In cases such as these, a difference of a few milliseconds of processing is paramount for the safety of not only the user but also other drivers on the road. In applications in fields such as marketing or security, real-time processing is important to provide salespeople or security guards immediate feedback so that an appropriate response can be made as soon as possible. For those relying on software assistance for social purposes, information is required with no delay in order to keep a conversation alive and avoid discomfort for both parties.

Motivated by the desire to design deep neural network architectures catered for real-time embedded facial expression recognition, in this study we explore the efficacy of a human-machine collaborative design strategy that combines human experience and ingenuity with the raw speed and meticulousness of machine-driven design exploration, in order to find the optimal balance between accuracy and architectural and computational complexity. The resulting deep neural network architecture, which we call EmotionNet Nano, is specifically tailored for real-time embedded facial expression recognition and created via a two-phase design strategy. The first phase focuses on leveraging residual architecture design principles to capture the complex nuances of facial expressions, while the second phase employs machine-driven design exploration to generate the final tailor-made architecture design that achieves high architectural and computational efficiency while maintaining high performance. We present two variants of EmotionNet Nano, each with a different trade-off between accuracy and complexity, and evaluate both variants on the CK+ [17] benchmark dataset against state-of-the-art facial expression classification networks.

The paper is organized as follows. Section 2 discusses related work in the area of facial expression classification and efficient deep neural network architecture design. Section 3 presents in detail the methodology leveraged to design the proposed EmotionNet Nano. Section 4 presents in detail the network architecture of EmotionNet Nano and explores interesting characteristics observed in the overall design. Section 5 presents the experiments conducted to evaluate the efficacy of EmotionNet Nano in terms of accuracy, architectural complexity, speed, and energy efficiency. Section 6 provides a discussion on not only the performance

but also the social implications of EmotionNet Nano. Finally, conclusions are drawn and future directions are discussed in Section 7.

2. Related Work

A variety of deep neural network architectures have been

proposed for FEC, ranging from deep convolutional neural networks (DCNN) to recurrent neural networks (RNN) [6] to long short-term memory (LSTM) [24] networks, but those introduced in the literature have generally required significant architectural complexity and computational power in order to detect and interpret the nuances of human facial expressions. As an alternative to deep learning, strategies leveraging other machine learning techniques such as Support Vector Machines (SVM) [18] and hand-crafted features such as Local Binary Patterns (LBP) [8, 23], dense optical flow [1], Histogram of Oriented Gradients (HOG) [15], or the Facial Action Coding System [5] have also been explored in the literature, but have generally been shown to achieve lower accuracy when compared to deep learning-based approaches, which can better learn the subtle differences that exist between human facial expressions.

To mitigate the aforementioned hurdle and improve widespread adoption of powerful deep learning-driven approaches for FEC in real-world applications, a key direction worth exploring is the design of highly efficient deep neural network architectures tailored for the task of real-time embedded facial expression recognition. A number of strategies for designing highly efficient architectures have been explored. One strategy is reducing the depth of the neural network architecture [13] to reduce computational and architectural complexity; more specifically, neural networks with a depth of just five layers were leveraged to learn discriminating facial features. Another strategy is reducing the input resolution of the neural network architecture, with Shan et al. [23] showing that FEC can be performed even at low image resolutions of 14 × 19 pixels, which can further reduce the number of operations required for inference by a large margin. Despite the improved architectural or computational efficiencies gained by leveraging such efficient network design strategies, they typically lead to noticeable reductions in facial expression classification accuracy, and as such, alternative strategies that enable a better balance between accuracy, architectural complexity, and computational complexity are highly desired.

More recently, there has been a focus on human-driven design principles for efficient deep neural network architecture design, ranging from depth-wise separable convolutions [2] to Inception [25] macroarchitectures to residual connections [11]. Such design principles can substantially improve FEC performance while reducing architectural complexity [22]. However, despite the improvements gained in architectural efficiency, one challenge with


human-driven design principles is that it is quite time consuming and challenging for humans to hand-craft efficient neural network architectures tailored for specific applications such as FEC that possess a strong balance between high accuracy, fast inference speed, and a low memory footprint, primarily due to the sheer complexity of neural network behaviours under different architectural configurations.

In an attempt to address this challenge, neural architecture search (NAS) strategies have been introduced to automate the model architecture engineering process by finding the maximally performing network design from all possible network designs within a search space. However, given the infinitely large search space within which the optimal network architecture may exist, significant human effort is often required in designing the search space in a way that reduces it to a feasible size, as well as in defining a search strategy that can run within desired operational constraints and requirements in a reasonable amount of time. Therefore, a way to combine both human-driven design principles and machine-driven design exploration is highly desired and can lead to efficient architecture designs catered specifically to FEC.

3. Methods

In this study, we present EmotionNet Nano, a highly

efficient deep convolutional neural network architecture design for the task of real-time facial expression classification in embedded scenarios. EmotionNet Nano was designed using a human-machine collaborative strategy in order to leverage both human experience and the meticulousness of machines. The human-machine collaborative design strategy leveraged to create the proposed EmotionNet Nano network architecture design comprises two main design stages: i) principled network design prototyping, and ii) machine-driven design exploration.

3.1. Principled Network Design Prototyping

In the first design stage, an initial network design prototype, ϕ, was designed using human-driven design principles in order to guide the subsequent machine-driven design exploration stage. In this study, the initial network design prototype of EmotionNet Nano leveraged residual architecture design principles [11], as they were previously demonstrated to achieve strong performance on a variety of recognition tasks. More specifically, the presence of residual connections within a deep neural network architecture has been shown to provide a good solution to both the vanishing gradient and curse of dimensionality problems. Residual connections also enable networks to learn faster and more easily, with little additional cost to architectural or computational complexity. Additionally, as the network architecture depth increases, each consecutive layer should perform no worse

than its previous layer due to the identity mapping option. As a result, residual network architecture designs have been shown to work well for the problem of FEC [9, 28, 13]. In this study, the final aspects of the initial network design prototype, ϕ, consist of an average pooling operation followed by a fully connected softmax activation layer to produce the final expression classification results. The final macroarchitecture and microarchitecture designs of the individual modules and convolutional layers of the proposed EmotionNet Nano were left to the machine-driven design exploration stage to design in an automatic manner. To ensure a compact and efficient real-time model catered towards embedded devices, this second stage was guided by human-specified design requirements and constraints targeting embedded devices possessing limited computational and memory capabilities.
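As a concrete illustration of this design pattern, the following is a minimal Keras sketch of a residual block with an identity shortcut, followed by the average pooling and fully connected softmax head described above. This is not the generated EmotionNet Nano architecture; the input resolution, layer widths, and block count are illustrative placeholders only.

```python
# A minimal sketch (not the generated EmotionNet Nano design) of a residual
# block with an identity shortcut, followed by the average pooling and
# fully connected softmax head described above.
from tensorflow.keras import layers, Model

def residual_block(x, channels):
    # Identity shortcut: the block can perform no worse than passing x through.
    shortcut = x
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    if shortcut.shape[-1] != channels:
        # Project the shortcut with a 1x1 convolution only when widths differ,
        # preserving the identity mapping option wherever possible.
        shortcut = layers.Conv2D(channels, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

inputs = layers.Input(shape=(48, 48, 1))  # hypothetical input resolution
x = residual_block(inputs, 32)
x = residual_block(x, 64)
x = layers.GlobalAveragePooling2D()(x)              # average pooling operation
outputs = layers.Dense(7, activation="softmax")(x)  # 7 CK+ expression classes
prototype = Model(inputs, outputs)
```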

3.2. Machine-Driven Design Exploration

Following the initial human-driven network design prototyping stage, a machine-driven design exploration stage was employed to determine the macroarchitecture and microarchitecture designs at the individual module level to produce the final EmotionNet Nano. In order to determine the optimal network architecture based on a set of human-defined constraints, generative synthesis [27] was leveraged for the purpose of machine-driven design exploration. As defined in Equation 1, generative synthesis can be formulated as a constrained optimization problem, where the goal is to find a generator G that, given a set of seeds S, can generate networks {N_s | s ∈ S} that maximize a universal performance function U while also satisfying constraints defined in an indicator function 1_r(·),

G = max_G U(G(s))   subject to   1_r(G(s)) = 1, ∀s ∈ S        (1)

As such, given a human-defined indicator function 1_r(·) and an initial network design prototype ϕ, generative synthesis is guided towards learning generative machines that generate networks within the human-specified constraints.

An important factor in leveraging generative synthesis for machine-driven design exploration is defining the operational constraints and requirements based on the desired task and scenario in a quantitative manner via the indicator function 1_r(·). In this study, in order to learn a compact yet highly efficient facial expression classification network architecture, the indicator function 1_r(·) was set up such that: i) accuracy ≥ 92% on CK+ [17], and ii) network architecture complexity ≤ 1M parameters. These constraint values were chosen to explore how compact a network architecture for facial expression classification can be while still maintaining sufficient classification accuracy for use in real-time embedded scenarios. As such, we use the accuracy of Feng


& Ren [7] as the reference baseline for determining the accuracy constraint in the indicator function.
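For illustration, a minimal sketch of such an indicator function is shown below, assuming a Keras model and a hypothetical helper `evaluate_ck_accuracy` that runs the subject-independent 10-fold cross-validation protocol on CK+; the exact form used during generative synthesis is not specified here.

```python
# A minimal sketch of the indicator function 1_r(.): it returns 1 only when
# a candidate network satisfies both human-specified constraints.
# `evaluate_ck_accuracy` is a hypothetical helper that runs the
# subject-independent 10-fold cross-validation protocol on CK+.
def indicator(network) -> int:
    accuracy = evaluate_ck_accuracy(network)  # hypothetical evaluation helper
    params = network.count_params()           # Keras parameter count
    return 1 if (accuracy >= 0.92 and params <= 1_000_000) else 0
```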

4. EmotionNet Nano Architecture

The network architecture of the proposed EmotionNet

Nano is shown in Figure 2. A number of notable characteristics of the proposed EmotionNet Nano network architecture design are worth discussing, as they give insights into architectural mechanisms that strike a strong balance between complexity and accuracy.

4.1. Architectural heterogeneity

A notable characteristic of the architecture that allows the network to achieve high efficiency even with a low number of parameters is its macroarchitecture and microarchitecture heterogeneity. Unlike hand-crafted architecture designs, the macroarchitecture and microarchitecture designs within the EmotionNet Nano network architecture, as generated via machine-driven design exploration, differ greatly from layer to layer. For instance, there is a mix of convolution layers with varying shapes and different numbers of channels per layer depending on the needs of the network. As shown in Figure 2, a greater number of channels is needed as the sizes of the feature maps decrease.

The benefit of high microarchitecture and macroarchitecture heterogeneity in the EmotionNet Nano network architecture is that it enables different parts of the network architecture to be tailored to achieve a very strong balance between architectural and computational complexity while maintaining model expressiveness in capturing necessary features. The architectural diversity in EmotionNet Nano demonstrates the advantage of leveraging a human-machine collaborative design strategy, as it would be difficult for a human designer or other design exploration methods to customize a network architecture to the same level of architectural granularity.

4.2. Selective long-range connectivity

Another notable characteristic of the EmotionNet Nano network architecture is that it exhibits selective long-range connectivity throughout. The use of long-range connectivity in a very selective manner enables a strong balance between model expressiveness, ease of training, and computational complexity. Most interesting and notable is the presence of two densely connected 1 × 1 convolution layers that take outputs from multiple 3 × 3 convolution layers as input, with their outputs connected farther down at later layers. Such a 1 × 1 convolution layer design provides dimensionality reduction while retaining salient features of the channels through channel mixing, thus further improving architectural and computational efficiency while maintaining strong model expressiveness.
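The following is a minimal Keras sketch of the general pattern described above, with placeholder layer widths rather than the actual EmotionNet Nano wiring: a densely connected 1 × 1 convolution mixes and compresses the concatenated outputs of 3 × 3 convolution layers, and its output is reused farther down the network as a long-range connection.

```python
# An illustrative sketch (placeholder widths, not the actual EmotionNet Nano
# wiring) of the 1x1 channel-mixing and long-range connectivity pattern.
from tensorflow.keras import layers

def mixed_long_range(x):
    a = layers.Conv2D(24, 3, padding="same", activation="relu")(x)
    b = layers.Conv2D(24, 3, padding="same", activation="relu")(a)
    # Densely connected 1x1 convolution: dimensionality reduction with
    # channel mixing over multiple 3x3 outputs.
    mixed = layers.Conv2D(16, 1, padding="same", activation="relu")(
        layers.Concatenate()([a, b]))
    c = layers.Conv2D(24, 3, padding="same", activation="relu")(mixed)
    # Long-range connection: `mixed` feeds a later layer directly.
    return layers.Concatenate()([c, mixed])
```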

5. Experimental Results

To evaluate the efficacy of the proposed EmotionNet

Nano, we examine the network complexity, computational cost, and classification accuracy against other facial expression classification networks on the CK+ [17] dataset, which is the most extensively used laboratory-controlled FEC benchmark dataset [16, 21].

5.1. Dataset

The Extended Cohn-Kanade (CK+) [17] dataset contains 593 video sequences from a total of 123 different subjects, ranging from 18 to 50 years of age with a variety of genders and heritages. Each video shows a facial shift from the neutral expression to a targeted peak expression, recorded at 30 frames per second (FPS) with a resolution of either 640×490 or 640×480 pixels. Of these videos, 327 are labelled with one of seven expression classes: anger, contempt, disgust, fear, happiness, sadness, and surprise. The CK+ database is widely regarded as the most extensively used laboratory-controlled FEC database available, and is used in the majority of facial expression classification methods [16, 21]. Figure 3 shows that the CK+ dataset has good diversity for each expression type, which is important from an evaluation perspective. However, as the CK+ dataset does not provide specific training, validation, and test set splits, a mixture of splitting techniques can be observed in the literature. For experimental consistency, we adopt the most common dataset creation strategy, where the last three frames of each sequence are extracted and labeled with the video label [16]. In this study, we performed subject-independent 10-fold cross-validation on the resulting 981 facial expression images.
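A minimal sketch of this dataset creation strategy is shown below, assuming a hypothetical list `sequences` of (subject_id, label, frames) tuples loaded from CK+; scikit-learn's GroupKFold is one way to enforce subject independence across folds.

```python
# A minimal sketch of the dataset creation strategy: keep the last three
# frames of each labelled CK+ sequence, then split folds by subject so no
# subject appears in both training and test sets. `sequences` is a
# hypothetical list of (subject_id, label, frames) tuples.
from sklearn.model_selection import GroupKFold

def build_samples(sequences):
    images, labels, subjects = [], [], []
    for subject_id, label, frames in sequences:
        for frame in frames[-3:]:  # last three (peak) frames per sequence
            images.append(frame)
            labels.append(label)
            subjects.append(subject_id)
    return images, labels, subjects

# Usage (assuming `sequences` has been loaded from CK+; 981 images result):
#   images, labels, subjects = build_samples(sequences)
#   cv = GroupKFold(n_splits=10)
#   for train_idx, test_idx in cv.split(images, labels, groups=subjects):
#       ...  # train and evaluate one fold
```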

5.2. Implementation Details

EmotionNet Nano was trained for 200 epochs with an initial learning rate of 1e−3, multiplied by 1e−1, 1e−2, 1e−3, and 0.5e−3 at epochs 81, 121, 161, and 181, respectively. Categorical cross-entropy loss was used with the Adam [14] optimizer. Data augmentation was applied to the inputs, including rotation, width and height shifts, zoom, and horizontal flips. Following this initial training, we leveraged a machine-driven exploration stage to fine-tune the network specifically for the task of FEC. Training was performed using a GeForce RTX 2080 Ti GPU. The Keras [3] library was leveraged for this study.
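A sketch of this training setup in Keras is shown below. The learning rate schedule, loss, and optimizer follow the description above, while the augmentation magnitudes, batch size, and the `model`, `x_train`/`y_train`, and `x_val`/`y_val` variables are illustrative assumptions, as the exact values are not specified here.

```python
# A sketch of the training setup, assuming `model`, `x_train`/`y_train`, and
# `x_val`/`y_val` are already defined. Augmentation magnitudes and batch
# size are illustrative placeholders.
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def lr_schedule(epoch):
    # Initial rate of 1e-3, scaled by 1e-1, 1e-2, 1e-3, and 0.5e-3
    # at epochs 81, 121, 161, and 181, respectively.
    lr = 1e-3
    if epoch >= 181:
        lr *= 0.5e-3
    elif epoch >= 161:
        lr *= 1e-3
    elif epoch >= 121:
        lr *= 1e-2
    elif epoch >= 81:
        lr *= 1e-1
    return lr

datagen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                             height_shift_range=0.1, zoom_range=0.1,
                             horizontal_flip=True)

model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(datagen.flow(x_train, y_train, batch_size=64), epochs=200,
          validation_data=(x_val, y_val),
          callbacks=[LearningRateScheduler(lr_schedule)])
```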

5.3. Performance Evaluation

Two variants of EmotionNet Nano were created to examine the different trade-offs between architectural and computational complexity and accuracy. In order to demonstrate the efficacy of the proposed models in a quantitative manner, we compare the performance of both variants


Figure 2. EmotionNet Nano architecture. The network architecture exhibits high macroarchitecture and microarchitecture heterogeneity, customized towards capturing deep facial features, as well as selective long-range connectivity throughout. The number of channels per layer shown is based on EmotionNet Nano-B.

Figure 3. Diversity of expressions in the CK+ dataset. Example faces for each expression type in CK+ are shown. Contempt is not included, as the relevant subjects did not give publication consent.

against state-of-the-art facial expression classification networks introduced in the literature, shown in Table 1. It can be observed that both EmotionNet Nano-A and Nano-B achieve strong classification accuracy, with EmotionNet Nano-A in particular achieving accuracy comparable to the highest-performing state-of-the-art networks that are more than an order of magnitude larger. While EmotionNet Nano-B has lower accuracy than the highest-performing networks, it is still able to achieve accuracy comparable to [7] while being three orders of magnitude smaller. A more detailed discussion of the performance comparison is provided in the next section; overall, it can be observed that both EmotionNet Nano variants provide the greatest balance between accuracy and complexity, making them well-suited for embedded scenarios.

The distribution of expressions in CK+ is unequal, which results in an unbalanced dataset for both training and testing. The effects of this are most prevalent when classifying the contempt or fear expressions, both of which are underrepresented in CK+ (e.g., there are only 18 examples of contempt, whereas there are 83 examples of surprise). Due to the nature of human facial expressions, similarities between expressions do exist, but the networks are generally able to learn the high-level distinguishing features that separate one expression from another. However, incorrect classifications can still occur, as shown in Figure 4, where a “disgust” expression is falsely predicted to be “anger.”

5.4. Speed and Energy Efficiency

We also perform a speed and energy efficiency analysis, shown in Table 2, to demonstrate the efficacy of EmotionNet Nano in real-time embedded scenarios. Here, an ARM v8.2 64-bit RISC embedded processor was used for evaluation. Referring to Table 2, both EmotionNet Nano variants are able to perform inference at >25 FPS and >70 FPS on the tested embedded processor at 15W and 30W respectively, which more than fulfills a real-time system


Table 1. Comparison of facial expression classification networks on the CK+ dataset. We report 10-fold cross-validation average accuracy on the CK+ dataset with 7 classes (anger, contempt, disgust, fear, happiness, sadness, and surprise).

Method                  Params (M)   Accuracy (%)
Ouellet [20]            58           94.4
Feng & Ren [7]          332          92.3
Wang & Gong [26]        5.4          97.2
Otberdout et al. [19]   11           98.4
EmotionNet Nano-A       0.232        97.6
EmotionNet Nano-B       0.136        92.7

Figure 4. Example expression predictions for faces in the CK+ dataset using EmotionNet Nano-A. Five of the faces are classified correctly, indicated in green, with an example of a misclassified expression (disgust) shown in red.

constraint. In terms of energy efficiency, both EmotionNet Nano variants demonstrated high power efficiency, with the Nano-B variant running at 5.29 images/sec/watt on the embedded processor.

Table 2. EmotionNet Nano Speed and Energy Efficiency. All metrics are computed on an ARM v8.2 64-bit RISC embedded processor at different power levels.

                       15W                      30W
Model                  FPS    images/s/watt     FPS    images/s/watt
EmotionNet Nano-A      25.8   1.72              70.1   2.34
EmotionNet Nano-B      32.8   2.19              72.9   2.43
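For reference, metrics of the kind reported in Table 2 can be computed with a simple measurement loop of the following form, where throughput is wall-clock frames per second and energy efficiency is throughput divided by the processor power level; `model` and `frames` are assumed to be a loaded Keras model and a batch of preprocessed face images.

```python
# A minimal sketch of computing FPS and images/sec/watt for a given model
# on a processor running at a fixed power level.
import time
import numpy as np

def benchmark(model, frames, power_watts):
    model.predict(frames[:1])                  # warm-up run
    start = time.perf_counter()
    for frame in frames:
        model.predict(frame[np.newaxis, ...])  # single-image inference
    elapsed = time.perf_counter() - start
    fps = len(frames) / elapsed
    return fps, fps / power_watts              # FPS and images/sec/watt
```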

6. Discussion

In this study, we explore the human-machine collaborative

design of a deep convolutional neural network architecture capable of performing facial expression classification

in real-time on embedded devices. It is important to note that other extremely fast deep convolutional neural network architectures exist, such as MicroExpNet [4], which is capable of processing 1851 FPS on an Intel i7 CPU, is less than 1 MB in size, and achieves 84.8% accuracy on the CK+ 8-class problem (the 7 facial expression classes plus neutral). Although a motivating result, a direct comparison cannot be made with EmotionNet Nano or the other facial expression classification networks evaluated in this study due to the different class sizes.

Compared against state-of-the-art facial expression classification network architectures tested on CK+ using the same seven expression classes (see Figure 1), both variants of the proposed EmotionNet Nano are at least an order of magnitude smaller yet provide comparable accuracy to state-of-the-art network architectures. For example, EmotionNet Nano-A is >23× smaller than [26], yet achieves higher accuracy by 0.4%. Furthermore, while EmotionNet Nano-A achieves an accuracy that is 0.8% lower than the top-performing [19], it possesses >47× fewer parameters. In the case of EmotionNet Nano-B, it achieved higher accuracy (by 0.4%) than [7] while having three orders of magnitude fewer parameters.

Looking at the experimental results around inference speed and energy efficiency on an embedded processor at different power levels (see Table 2), it can be observed that both variants of EmotionNet Nano achieved real-time performance and high energy efficiency. For example, EmotionNet Nano-A was able to exceed 25 FPS and 70 FPS at 15W and 30W, respectively, with energy efficiencies exceeding 1.7 images/s/watt and 2.34 images/s/watt at 15W and 30W, respectively. This demonstrates that the proposed EmotionNet Nano is well-suited for high-performance facial expression classification in real-time embedded scenarios. An interesting observation worth noting is that while the inference speed improvement of EmotionNet Nano-B over EmotionNet Nano-A exceeds 27% at 15W, there is only a speed improvement of 4% at 30W. As such, it can be seen that EmotionNet Nano-B is more suitable for low-power scenarios, while at higher power levels the use of EmotionNet Nano-A


is more appropriate given the significantly higher accuracy achieved.

6.1. Implications and Concerns

The existence of an efficient facial expression classification network capable of running in real-time on embedded devices can have an enormous impact in many fields, including safety, marketing, and assistive technologies. In terms of safety, driver monitoring and improved surveillance systems are both areas that benefit from higher computational efficiency, as it lowers the latency between event notifications and reduces the probability that a signal will be missed. With a real-time facial expression classification system in the marketing domain, companies will gain access to enhanced real-time feedback when demonstrating or promoting a product, either in front of live audiences or even in a storefront. The largest impact, however, is likely in the assistive technology sector, due to the increased accessibility that this efficiency provides. The majority of individuals do not have access to powerful computing devices, nor are they likely to be willing to carry a large and expensive system with them, as it would be considered an inconvenience to daily living.

As shown in this study, EmotionNet Nano can achieve accurate real-time performance on embedded devices at a low power budget, granting the user access to a facial expression classification system on their smartphone or similar edge device with embedded processors without rapid depletion of their battery. This can be extremely beneficial for tasks such as depression detection, empathetic tutoring, or ambient interfaces, and can also help individuals who suffer from Autistic Spectrum Disorder better infer emotional states from facial expressions during social interaction in the form of augmented reality (see Figure 5 for a visual illustration of how EmotionNet Nano can be used to aid in conveying emotional state via an augmented reality overlay).

Figure 5. Assistive technology for Autistic Spectrum Disorder. Example of how EmotionNet Nano can be leveraged to assist individuals with Autistic Spectrum Disorder to better infer emotional states from facial expressions during social interactions in the form of augmented reality.

Although EmotionNet Nano has many positive implications, there exist concerns that must be considered

before deployment. The first concern is privacy, as individuals may dislike being on camera, even if no data storage is taking place. Privacy concerns, especially ones centered around filming without consent, are likely to arise if these systems start to be used in public areas. The combination of facial expression classification together with facial recognition could result in unwanted targeted advertising, even though this could be seen as a positive outcome by some. Additionally, wrong classifications could result in unintended implications. When assisting a user in an ambient interface or expression interpretation task, a misclassified expression could result in a negative experience with major consequences. For example, predicting “sad” or “angry” expressions as “happy” could influence the user to behave in the wrong manner. These concerns and issues are all worth further exploration and investigation to ensure that such systems are used in a responsible manner.

7. Conclusion

In this study, we introduced EmotionNet Nano, a highly

efficient deep convolutional neural network design tailored for facial expression classification in real-time embedded scenarios, created by leveraging a human-machine collaborative design strategy. By leveraging a combination of human-driven design principles and machine-driven design exploration, the EmotionNet Nano architecture design possesses several interesting characteristics (e.g., architectural heterogeneity and selective long-range connectivity) that make it tailored for real-time embedded usage. Two variants of the proposed EmotionNet Nano network architecture design were presented, both of which achieve a strong balance between architectural complexity and accuracy while illustrating the performance trade-offs at that scale. Using the CK+ dataset, we show that the proposed EmotionNet Nano can achieve accuracy comparable to state-of-the-art facial expression classification networks (at 97.6%) while possessing a significantly more efficient architecture design (just 232K parameters). Furthermore, we demonstrated that EmotionNet Nano can achieve real-time inference speeds on an embedded processor at different power levels, further illustrating its suitability for real-time embedded scenarios.

Future work involves incorporating temporal information into the proposed EmotionNet Nano design when classifying video sequences. Facial expressions are highly dynamic and transient in nature [10], meaning that information about the previous expression is valuable when predicting the current expression. Therefore, the retention of temporal information can lead to increased performance, at the expense of computational complexity. Investigating this trade-off between computational complexity and improved performance when leveraging temporal information would be worthwhile.


References

[1] Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, and Cha Zhang. Emotion recognition in the wild from videos using images. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 433–436. ACM, 2016.
[2] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[3] Francois Chollet et al. Keras. https://keras.io, 2015.
[4] Ilke Cugu, Eren Sener, and Emre Akbas. MicroExpNet: An extremely small and fast model for expression recognition from frontal face images. arXiv preprint arXiv:1711.07011, 2017.
[5] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: Manual. 1978.
[6] Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 445–450. ACM, 2016.
[7] Duo Feng and Fuji Ren. Dynamic facial expression recognition based on two-stream CNN with LBP-TOP. In 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), pages 355–359. IEEE, 2018.
[8] SL Happy, Anjith George, and Aurobinda Routray. A real time facial expression classification system using local binary patterns. In 2012 4th International Conference on Intelligent Human Computer Interaction (IHCI), pages 1–5. IEEE, 2012.
[9] Behzad Hasani and Mohammad H Mahoor. Facial expression recognition using enhanced deep 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 30–40, 2017.
[10] Behzad Hasani and Mohammad H Mahoor. Facial expression recognition using enhanced deep 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 30–40, 2017.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Mira Jeong and Byoung Chul Ko. Driver's facial expression recognition in real-time for safe driving. Sensors, 18(12):4270, 2018.
[13] Pooya Khorrami, Thomas Paine, and Thomas Huang. Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 19–27, 2015.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Pranav Kumar, SL Happy, and Aurobinda Routray. A real-time robust facial expression recognition system using HOG features. In 2016 International Conference on Computing, Analytics and Security Trends (CAST), pages 289–293. IEEE, 2016.
[16] Shan Li and Weihong Deng. Deep facial expression recognition: A survey. arXiv preprint arXiv:1804.08348, 2018.
[17] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pages 94–101. IEEE, 2010.
[18] Philipp Michel and Rana El Kaliouby. Real time facial expression recognition in video using support vector machines. In Proceedings of the 5th International Conference on Multimodal Interfaces, pages 258–264. ACM, 2003.
[19] Naima Otberdout, Anis Kacem, Mohamed Daoudi, Lahoucine Ballihi, and Stefano Berretti. Automatic analysis of facial expressions based on deep covariance trajectories. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[20] Sebastien Ouellet. Real-time emotion recognition for gaming using deep convolutional network features. arXiv preprint arXiv:1408.3750, 2014.
[21] Maja Pantic, Michel Valstar, Ron Rademaker, and Ludo Maat. Web-based database for facial expression analysis. In 2005 IEEE International Conference on Multimedia and Expo, pages 5–pp. IEEE, 2005.
[22] Christopher Pramerdorfer and Martin Kampel. Facial expression recognition using convolutional neural networks: state of the art. arXiv preprint arXiv:1612.02903, 2016.
[23] Caifeng Shan, Shaogang Gong, and Peter W McOwan. Recognizing facial expressions at low resolution. In IEEE Conference on Advanced Video and Signal Based Surveillance, 2005, pages 330–335. IEEE, 2005.
[24] Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu. LSTM for dynamic emotion and group emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 451–457. ACM, 2016.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[26] Guan Wang and Jun Gong. Facial expression recognition based on improved LeNet-5 CNN. In 2019 Chinese Control and Decision Conference (CCDC), pages 5655–5660. IEEE, 2019.
[27] Alexander Wong, Mohammad Javad Shafiee, Brendan Chwyl, and Francis Li. FermiNets: Learning generative machines to generate efficient neural networks via generative synthesis. arXiv preprint arXiv:1809.05989, 2018.
[28] Yitao Zhou, Fuji Ren, Shun Nishide, and Xin Kang. Facial sentiment classification based on ResNet-18 model. In 2019 International Conference on Electronic Engineering and Informatics (EEI), pages 463–466. IEEE, 2019.