
Communication-Efficient Hierarchical Federated Learning for IoT Heterogeneous Systems with Imbalanced Data

Alaa Awad Abdellatif*, Naram Mhaisen*, Amr Mohamed*, Aiman Erbad†, Mohsen Guizani*, Zaher Dawy+, and Wassim Nasreddine‡
*College of Engineering, Qatar University
†College of Science and Engineering, Hamad Bin Khalifa University
+Electrical and Computer Engineering Department, American University of Beirut
‡Department of Neurology, American University of Beirut
E-mail: {alaa.abdellatif, aerbad}@ieee.org, {naram, amrm, mguizani}@qu.edu.qa, {zd03, wn13}@aub.edu.lb

Abstract—Federated learning (FL) is a distributed learning methodology that allows multiple nodes to cooperatively train a deep learning model without sharing their local data. It is a promising solution for telemonitoring systems that demand intensive data collection from different locations while maintaining strict privacy constraints. Due to privacy concerns and critical communication bottlenecks, it can become impractical to send the FL updated models to a centralized server. Thus, this paper studies the potential of hierarchical FL in IoT heterogeneous systems and proposes an optimized solution for user assignment and resource allocation over multiple edge nodes. In particular, this work focuses on a generic class of machine learning models that are trained using gradient-descent-based schemes, while considering the practical constraint of non-uniformly distributed data across different users. We evaluate the proposed system using two real-world datasets and show that it outperforms state-of-the-art FL solutions. In particular, our numerical results highlight the effectiveness of our approach: it yields a 4-6% increase in classification accuracy with respect to hierarchical FL schemes that use distance-based user assignment. Furthermore, the proposed approach significantly accelerates FL training and reduces communication overhead, providing a 75-85% reduction in the number of communication rounds between edge nodes and the centralized server for the same model accuracy.

Index Terms—Distributed deep learning, edge computing, non-IID data, Internet of Things (IoT), intelligent health systems.

1 INTRODUCTION

The rapid evolution of Artificial Intelligence (AI), the Internet of Things (IoT), and big data is paving the path towards a plethora of interactive applications that can inspire substantial transformations in industrial services. However, these new trends come with several challenges due to latency and reliability constraints, and the need to process and transmit the enormous amounts of data generated by IoT devices with guaranteed quality of service. Emerging AI technologies, such as deep learning, can turn the vision of building real-time interactive systems into reality [1], [2]. For instance, in light of the recent pandemic, deep learning has proven its crucial role in analyzing and understanding outbreak evolution. Indeed, efficient deep learning techniques can provide intelligent healthcare services, such as event detection and characterization, real-time remote monitoring, and identification of patients with high mortality risks. However, deep learning relies on the availability of large datasets, provided by medical and IoT devices, for training and improving deep learning models [3]. This poses a major challenge to maintaining data privacy, since medical devices typically collect private data. Hence, it is commonly impractical to forward this data to a centralized entity for conducting the training [4]. Centralized processing of massive amounts of data puts users' privacy at high risk, while imposing an enormous load on wireless networks. Thus, we envision that bringing the intelligence close to the IoT devices, using Multi-access Edge Computing (MEC) [5] along with Federated Learning (FL) [6], is key to supporting data privacy while jointly training large deep learning models.

FL is a promising solution for training deep learning models without storing the data at a centralized location. It allows multiple entities/users to jointly train a deep learning model on their combined data, without revealing their data to a centralized entity. This privacy-preserving cooperative learning is carried out in three main steps: (i) all participating users receive the latest global model W from the centralized server (also called a broker); (ii) they train the received model using their local data; and (iii) they upload their locally trained models W_i back to the centralized server, where they are aggregated to form an updated global model. These steps are repeated until a certain convergence criterion is met.
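To make these three steps concrete, the following minimal Python sketch implements one such round with dataset-size-weighted averaging (FedAvg-style). The `local_update` trainer is a hypothetical stand-in for each user's actual gradient-descent pass; all names here are ours, not the paper's.

```python
import numpy as np

def local_update(weights, dataset, lr=0.01):
    """Hypothetical local trainer: a real client would compute the
    gradient of its loss on `dataset`; a random vector stands in for it."""
    grad = np.random.randn(*weights.shape) * 0.01
    return weights - lr * grad

def fl_round(global_w, user_datasets):
    """One FL round: (i) broadcast, (ii) local training, (iii) aggregation
    weighted by local dataset sizes."""
    local_models = [local_update(global_w.copy(), d) for d in user_datasets]
    sizes = np.array([len(d) for d in user_datasets], dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, local_models))

# Example: 4 users, a 10-parameter model, 3 rounds.
w = np.zeros(10)
data = [np.arange(n) for n in (50, 80, 20, 100)]
for _ in range(3):
    w = fl_round(w, data)
```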

[Fig. 1. Hierarchical federated learning: architecture and data flow. (1) Edge nodes download W from the centralized server; (2) EUs download W_j from their assigned edge node; (3) EUs compute W_i independently; (4) EUs upload W_i to the edge node; (5) the edge node averages the W_i to generate W_j; (6) edge nodes upload W_j to the centralized server for global averaging.]

Following the FL protocol, the local devices (or users) never transfer their local data, since only the updated models are communicated. However, a major challenge in centralized FL is the massive communication overhead that arises from the End Users (EUs) sending their full model updates to the centralized server at each iteration. Such updates are of the same size as the trained model, which can be in the range of gigabytes for deep learning models with millions of parameters [7]. Moreover, the frequent exchange of updated models with the centralized server increases the FL convergence time, especially with non-Independent and Identically Distributed (non-IID) data [8]. Accordingly, in large-scale networks, centralized FL can generate high communication overheads (in terms of latency, network bandwidth, and energy consumption), which renders FL unproductive in resource-constrained environments.

To address the above challenges, we propose an optimized EU assignment and resource allocation scheme over a hierarchical federated learning architecture (see Figure 1), in which multiple edge servers are employed to reduce the overhead resulting from information exchange between EUs and the centralized server. In the proposed hierarchical architecture, EUs are assigned to different edge nodes not only based on their locations (as in state-of-the-art solutions), but also based on their data distributions and communication constraints. The EUs send their local updates to the selected edge node (e.g., a small-cell base station) for aggregation. The edge node then computes the aggregate model and transfers it back to its assigned EUs, which update their models accordingly. After a specific number of iterations, the edge nodes forward their model updates to the centralized server to maintain a global model. In this context, we propose a novel EU assignment and resource allocation scheme that aims to minimize the FL convergence time, and hence the consumed resources, by optimally assigning EUs with different data distributions to the available edge nodes.

Without loss of generality, this paper considers, as a case study, applying the proposed hierarchical FL in large-scale intelligent healthcare systems. These systems include thousands of participants who need continuous monitoring at the same time, especially in case of pandemics. This entails major challenges and puts a significant load on different healthcare entities. Thus, the strategic solution is to move a large number of patients with mild symptoms to remote monitoring or home care. Deep learning models can enable home care services by acquiring and processing patients' data to identify the patients' states. However, collecting and forwarding such private data from the patients to a central entity comes at the risk of violating the patients' privacy. Thus, we envision that leveraging the proposed hierarchical FL within large-scale intelligent healthcare systems is key to enabling remote monitoring. In this context, the main contributions of our work can be summarized as:

1) Designing a distributed learning system with hierarchical FL for supporting intelligent healthcare systems. In the hierarchical FL setting, the EUs collaborate to train a deep learning model, without exchanging raw data, by connecting to the local edge nodes in their communication range, which are in turn connected to a centralized server.

2) Studying the effect of globally imbalanced data on the accuracy and convergence time of deep learning models trained by hierarchical FL. This includes formulating an optimization problem for EU assignment and resource allocation based on the EUs' data distributions, while considering latency, bandwidth, and energy consumption constraints.

3) Solving the formulated problem using an efficient algorithm that considers both the data distributions of different EUs and the wireless environment conditions.

4) Evaluating the performance of the proposed approach via comprehensive experiments on real-world datasets, which confirm that our approach provides near-optimal performance for different data distributions and system configurations.

The rest of the paper is organized as follows. Section 2 presents the related work on single-layer FL and hierarchical FL. Section 3 introduces the main challenges tackled in this paper and presents the considered intelligent healthcare system architecture with hierarchical FL. Section 4 first introduces the mathematical modeling of hierarchical FL, and then presents the communication and computation latency as well as the energy consumption analysis of the proposed hierarchical FL framework. Section 5 presents the formulated optimization framework along with the solution for EU assignment and resource allocation over hierarchical FL. The performance evaluation of our system is then discussed in Section 6. Finally, the paper is concluded in Section 7.

2 RELATED WORK

The FL schemes proposed in the literature can be broadly classified, based on the system architecture, into two categories: single-layer FL and hierarchical FL.

2.1 Single-layer Federated Learning

The concept of FL was first proposed in [9], with its efficiency demonstrated through experiments on different datasets. The model presented in [9] considered single-layer FL, where the users exchange their updated models with a centralized server that aggregates them and forms an updated global model at a fixed frequency. This was followed by several extensions, which can be categorized into:

• analyzing the convergence of distributed gradient descent and federated learning algorithms from a theoretical perspective, and optimizing the learning process given computation and communication resource budgets [6], [8], [10], [11];

• selecting the users participating in the synchronization process of FL approaches [12]–[15];

• developing communication-efficient techniques to reduce the amount of exchanged data in FL systems by adopting various sparsification and compression techniques [4], [16], [17].

The effect of non-IID data on the performance of FL was studied in [10]. It was shown, theoretically and empirically, that highly skewed non-IID data can significantly reduce the accuracy of the global learning model, by up to 55%. As a solution to enhance training on non-IID data, the authors proposed to globally share a small subset of data between all users; combining this data with the local data of each user makes the latter less biased or skewed. However, exchanging data between different users is not always feasible due to privacy constraints and communication overhead. In [8], the authors analyzed the convergence rate of Federated Averaging (FedAvg) on non-IID data for strongly convex and smooth problems. In [6], the authors studied the adaptation of the global aggregation frequency for FL under a fixed resource budget. They analyzed the convergence bound of gradient-descent-based FL on non-IID data from a theoretical perspective, and then used this convergence bound to build a control algorithm that adapts the frequency of global aggregation in real time to minimize the learning loss under a fixed resource budget. In [11], the convergence of FL within an Unmanned Aerial Vehicle (UAV) swarm was analyzed, and a joint power allocation and scheduling problem was formulated to optimize the convergence rate of FL while considering the energy consumption and delay requirements imposed by the swarm's control system.

It is shown in [8] that requiring all users to participate in the FL process forces the central server to wait for stragglers, i.e., users with low-quality wireless links that can significantly slow down the FL process, which makes FL unrealistic. Thus, to mitigate the impact of stragglers, the authors in [12] proposed a method to select a subset of users for the FL synchronization (or aggregation) process in a resource-constrained environment, and demonstrated the advantages of such a technique for improving the FL learning speed. This work was extended in [13], where a reinforcement-learning-based control scheme was proposed to accelerate the FL process by actively selecting, in each communication round, the subset of users that best counterbalances the bias introduced by non-IID data. In [14], a realistic wireless network model was considered to study the convergence time of FL; then, given the limited wireless resources, a joint learning, resource allocation, and user selection optimization problem was formulated to minimize the FL convergence time and training loss. In [15], a joint optimization framework for sample selection and user selection was studied to balance model accuracy and cost. However, the distribution distance between different users was optimized in this framework by adjusting the local batch size, which may lead to the under-utilization of data at strongly skewed users.

Alternatively, sparsification schemes have been studied to reduce the entropy of the exchanged data (i.e., the models' updates) in FL communications. The authors in [16] presented an approach that accelerates distributed stochastic gradient descent by exchanging sparse updates instead of dense ones. Specifically, they fixed the sparsity rate by communicating only the fraction of entries with the largest magnitude in each gradient. In [4], the authors proposed a sparse ternary compression scheme designed to meet the requirements of the FL environment. The proposed scheme compresses both the upstream and downstream communications of FL by leveraging sparsification, ternarization, error accumulation, and optimal Golomb encoding. This study demonstrated the effect of communication compression and data distributions on the obtained performance; however, it considered neither wireless resource allocation nor a hierarchical FL architecture. In [17], the FedAvg scheme was adjusted to use a distributed form of Adam optimization, along with sparsification and quantization, to obtain a communication-efficient FedAvg.

2.2 Hierarchical Federated Learning

Few studies have so far addressed the problem of non-IID data in hierarchical FL architectures. For instance, the authors in [18] extended the work in [6] to analytically prove the convergence of the hierarchical federated averaging algorithm. This work was further extended in [19] by considering probabilistic user selection to avoid the impact of stragglers. In [20], a self-balancing FL framework was proposed, along with two strategies to prevent training bias caused by imbalanced data distributions. The first strategy performs data augmentation before training the model in order to alleviate global imbalance. The second strategy exploits mediators (which can be considered edge nodes) to reschedule the training of the users based on the distribution distances between the mediators. In [21], the effect of skewed data in hierarchical FL was studied and compared to centralized FL; this work identified the major parameters that affect the learning performance of hierarchical FL. However, it ignored resource allocation and wireless communication constraints, such as bandwidth, energy consumption, and latency. Although the aforementioned studies have advanced the area of hierarchical FL and user selection from the theoretical perspective, there remain interesting open research directions related to optimized hierarchical FL with wireless resource allocation under practical conditions.

Novelty. Our work is the first to address the problem of EU assignment to different edge nodes in hierarchical FL architectures while considering the users' data distributions, wireless communication characteristics, and resource allocation. Specifically, unlike other studies, we consider the EUs' dataset characteristics and computational capabilities, in addition to practical communication constraints including wireless channel quality, transmission energy consumption, and communication latency. Furthermore, the proposed hierarchical FL framework is incorporated into an intelligent healthcare system and evaluated using two real-world health datasets to provide efficient remote monitoring services.

3 HIERARCHICAL FEDERATED LEARNING

In this section, we first highlight the key challenges of using FL in large-scale healthcare systems; we then present our Intelligent-Care (I-Care) system architecture, which will be leveraged to address these challenges.

3.1 Challenges of FL in healthcare systems

Before proceeding with the discussion of the proposed framework, we highlight the uniqueness of FL compared to other distributed training schemes. FL enables a large number of EUs to jointly learn a model without sharing their local data. Hence, the distribution of both training data and computational resources is fixed and tied to the learning environment. However, to efficiently leverage FL within large-scale healthcare systems, the following challenges have to be adequately addressed. We remark that these challenges are present in most FL scenarios; however, they are particularly amplified in health-related use cases.

Imbalanced and non-IID data: Given the heterogeneity of healthcare systems, the distribution and size of the acquired datasets at each EU/patient vary significantly. Typically, the training data in FL is collected by diverse IoT devices near/attached to the EUs, which results in a non-IID data distribution across EUs. FedAvg, the leading FL algorithm for non-IID data distributions, suffers from the large number of communication rounds needed between the EUs and the centralized server, especially in the case of imbalanced data [8].

Large number of participants: Given the evolution of healthcare systems and the increasing number of chronically ill and elderly people, most hospitals are required to serve hundreds of patients daily (especially in case of a pandemic, such as the recent COVID-19 outbreak). This puts a significant load on different health entities. A promising solution to meet such demand is to move a large number of patients with mild conditions to home care, while still monitoring them remotely. This leads to a wide extension of the FL environment for large-scale healthcare systems.

Limited resources: Given the large number of EUs participating in FL, the availability of the needed computational and communication resources at the EUs and network nodes can be challenging. The traffic generated by model updates grows linearly with the number of participating EUs; hence, distributing the available bandwidth among EUs in a communication-efficient way is not an easy task. Moreover, the heterogeneity of the EUs in terms of computational capabilities and energy availability (i.e., battery level) introduces an extra constraint when optimizing latency and energy consumption.

3.2 I-Care system architecture

To support intelligent healthcare services while addressing the above challenges, we consider the I-Care system architecture presented in Figure 2. The proposed architecture stretches from the EU layer, where data is generated/collected, to the centralized server (or cloud) layer. It includes the following major components:

EU layer: It is considered that each patient has a combination of IoT devices that enable real-time monitoring and ubiquitous sensing of his/her health conditions and activities within the smart assisted environment. Examples include body area sensor networks (i.e., implantable or wearable sensors that measure different bio-signals and vital signs), IP cameras, and external medical and non-medical devices. These IoT devices are connected to a local hub, which is responsible for: 1) gathering the health-related information, 2) processing these data, and 3) training the local learning model at the EU level. In our architecture, an EU can be a patient, an in-home healthcare environment, or a hospital with several patients. In all cases, FL allows for cooperative training (at the patient or hospital level) without sharing any privacy-sensitive data.

Edge node layer: This layer is an intermediate layer between the EUs and the centralized server that aims to reduce the massive communication overhead arising from sending full EU model updates to the centralized server. In the proposed architecture, the EUs are assigned to different edge nodes, such that each group of EUs runs, with its serving edge node, a synchronous distributed gradient descent learning algorithm, also referred to as FedSGD [22]. (FedSGD is a gradient-based algorithm that is equivalent to federated learning when the global aggregation occurs at every step, where every local step is a full deterministic gradient.) Under this learning model, an EU i updates its local model W_i based on its local dataset D_i. The EUs' models are then aggregated at the assigned edge node j to form the edge model W_j. Finally, all edge nodes synchronize their models at the centralized server through the FedAvg learning scheme, and the process is repeated until convergence is attained.

Centralized server layer: This layer takes advantage of powerful computing resources to aggregate, average, and update the global learning model. It then sends the global model back to all participants to complete the FL process.

4 DISTRIBUTED RESOURCES OPTIMIZATION

In this section, we first introduce the mathematical modeling that defines the problem of hierarchical FL with imbalanced and non-IID datasets. We then present the communication and computation latency as well as the energy consumption analysis of the proposed hierarchical FL framework. Table 1 lists the main mathematical symbols used in the following sections.

[Fig. 2. The proposed I-Care system architecture: patients, homes, and hospitals (EUs) send local updates to the edge servers, which exchange global updates with the centralized server.]

TABLE 1
List of symbols used throughout the paper.

$\mathcal{N}$: a set of $N$ edge nodes.
$\mathcal{M}$: a set of $M$ EUs.
$\mathcal{D}_i$: the local dataset stored at EU $i$.
$\mathcal{C}$: the set of all possible data classes.
$M_j$: number of EUs assigned to edge $j$.
$\sigma_j$: the proportion of edge $j$'s dataset to the global dataset.
$|W_i|$: updated model length in bits.
$r^u_{i,j}$: upload data rate of EU $i$ to edge $j$.
$r^d_{j,i}$: download data rate of EU $i$ from edge $j$.
$T^c_i$: computation time for updating the local model at EU $i$.
$L_{ij}$: transmission delay from EU $i$ to edge $j$.
$B_{ij}$: allocated bandwidth from edge $j$ to EU $i$.
$E_{ij}$: energy consumed by EU $i$ to upload its updates to edge $j$.
$w_f$: federated weights.
$w_c$: virtual central weights.
$D_{KL}$: Kullback–Leibler divergence (KLD).
$\lambda_{ij}$: assignment indicator of EU $i$ to edge $j$.

4.1 Hierarchical FL problem formulation

We consider the hierarchical FL system presented in Figure 2, which consists of one centralized server, a set $\mathcal{N}$ of $N$ edge nodes, and a set $\mathcal{M}$ of $M$ EUs. Each participating EU $i$ stores a local dataset $\mathcal{D}_i$ of size $D_i$. Hence, the virtual dataset at an edge node $j$ is $\mathcal{D}^{(j)} = \cup_{i \in \mathcal{M}_j} \mathcal{D}^{(i,j)}$, and the virtual global dataset at the centralized server is $\mathcal{D} = \cup_{j \in \mathcal{N}} \mathcal{D}^{(j)}$. Note that we use the term virtual dataset to clarify the analysis, although these datasets are not physically stored on the edge nodes or the centralized server.

The local dataset $\mathcal{D}_i$ at EU $i$ consists of the collection of data samples $\{x_i, y_i\}_{i=1}^{D_i}$, where $x_i$ denotes the data acquired by EU $i$ (i.e., the features), and $y_i \in \mathcal{C}$ is the associated label, such that $\mathcal{C}$ is the set of all possible labels or data classes. Typically, the main task in any learning model is to find the model parameters $w$ (i.e., weights) that minimize the loss function $L(w)$ using the dataset $\mathcal{D}_i$. We use the widely adopted cross-entropy loss function:

$$L(w) = \mathbb{E}_{y \sim p(y),\, x \sim q(x|y)}\left[-\mathbb{1}_{y=k} \log d_k(x;w)\right] = \sum_{k=1}^{C} -p(y=k)\, \mathbb{E}_{x \sim q(x|y=k)}\left[\log d_k(x;w)\right], \qquad (1)$$

where $x$ is the feature vector of a data point, $C$ is the number of classes in the learning problem, $p(\cdot)$ is the global class distribution, $q(\cdot|\cdot)$ is the likelihood function, and $d_k(x;w)$ is the probability of the $k$-th class for input $x$ under parameters $w$. Then, the objective of the overall learning process is to reach a set of parameters $w$ that minimizes the loss function across the virtual global dataset:

$$\underset{w}{\text{Minimize}} \;\; \sum_{k=1}^{C} -p(y=k)\, \mathbb{E}_{x \sim q(x|y=k)}\left[\log d_k(x;w)\right]. \qquad (2)$$

The loss function in (1) is minimized using the regular gradient descent update:

$$w_c^t = w_c^{t-1} - \delta \nabla_{w_c} L(w_c). \qquad (3)$$

We refer to the parameters in (3) as the centralized model. Following the proposed hierarchical FL, each group of EUs is associated with a specific edge node. Then, in the first step of the hierarchical FL procedure, each EU calculates its learning parameters $w_i$, optimized on its dataset $\mathcal{D}_i$ using FedSGD. Since the virtual global data is distributed across different EUs, each EU $i$ locally calculates the loss function over its own dataset to obtain its learning parameters $w_i$:

$$w_i^t = w_i^{t-1} - \delta \nabla_{w_i} L(w_i), \qquad (4)$$

where

$$L(w_i) = \mathbb{E}_{y \sim p_i(y),\, x \sim q(x|y)}\left[-\mathbb{1}_{y=k} \log d_k(x;w_i)\right] = \sum_{k=1}^{C} -p_i(k)\, \mathbb{E}_{x \sim q(x|y=k)}\left[\log d_k(x;w_i)\right]. \qquad (5)$$

In (5), $p_i(\cdot)$ is the class distribution of EU $i$. In the hierarchical FL setting, the synchronization (i.e., aggregation) of the EUs' parameters (weights) is performed periodically by taking the weighted average of the EUs' parameters with respect to their local dataset sizes. The EUs then use the averaged parameters until the next aggregation round. Hence, the second step in the hierarchical FL procedure is to synchronize the local weights across all EUs belonging to a specific edge node every $T'$ local gradient steps. The parameters of an edge node $j$ at the $a$-th edge aggregation are:

$$w_j^a = \sum_{i=1}^{M_j} \sigma_{i,j}\, w_i^{a \times T'}, \qquad (6)$$

where

$$\sigma_{i,j} = \frac{|\mathcal{D}^{(i,j)}|}{|\cup_i \mathcal{D}^{(i,j)}|}. \qquad (7)$$

In (6), $M_j$ refers to the number of EUs associated with edge node $j$. Similarly, the learning models are synchronized across all edge nodes every $(T' \times T)$ steps, where $T$ is the centralized aggregation frequency. Hence, at the $b$-th centralized aggregation, the learning parameters are averaged across all edge nodes to obtain the federated weights:

$$w_f^b = \sum_{j=1}^{N} \sigma_j\, w_j^{b \times T}, \qquad (8)$$

where

$$\sigma_j = \frac{|\mathcal{D}^{(j)}|}{|\cup_j \mathcal{D}^{(j)}|}. \qquad (9)$$

Finally, the updated model at the centralized server (i.e., the federated weights) is sent back to the edge nodes as well as the EUs, and the process is repeated until convergence is attained. We remark here that the proposed framework considers the synchronous FL approach, which has shown its efficiency compared to asynchronous approaches [6], [9], [23]. In synchronous FL, the computation steps are synchronized among all EUs, meaning that all EUs have to finish updating their local models before proceeding with the communication step (forwarding their models to the edge nodes). Asynchronous FL [23] is an alternative in which the edge nodes operate in an asynchronous manner. Its main advantage is the ability to fully exploit the available computational resources at each EU by performing more gradient descent iterations at the powerful (or faster) EUs [6]. However, in the case of imbalanced and/or non-IID datasets, the aggregated models will be biased towards the faster EUs, which may lead to considerable performance degradation. On the contrary, the performance of synchronous FL may be limited by the performance of the worst EU, i.e., the user with the largest computation and communication delay. However, this limitation can be addressed using the proposed solution, which optimizes the resources allocated to different EUs so as to guarantee the best possible performance for all EUs (as will be shown in the next section).

4.2 Computation and Communication Latency Analysis

In the proposed hierarchical FL framework, we consider both the communication and computation time when optimizing the performance. In the local update step of our hierarchical FL (i.e., from the EUs to the edge nodes), we denote the minimum upload rate of EU $i$ to edge $j$ by $r^u_{i,j}$. Hence, the maximum uplink communication latency of model aggregation at the edge nodes is

$$L^u = \max_i \left( \frac{|W_i|}{r^u_{i,j}} + \xi^u_i \right), \quad \forall i \in \mathcal{M}, \qquad (10)$$

where $|W_i|$ is the total amount of bits that have to be uploaded and downloaded by every EU during training. In (10), $|W_i|/r^u_{i,j}$ and $\xi^u_i$ are, respectively, the transmission time and the access channel delay that EU $i$ expects to experience when transmitting $|W_i|$ bits to edge node $j$; in other words, it is the estimated end-to-end delay when using a given technology [24]. Similarly, if $r^d_{j,i}$ is the minimum download rate from edge node $j$ to EU $i$, the maximum downlink latency of receiving the updates at the EUs is

$$L^d = \max_j \left( \frac{|W_i|}{r^d_{j,i}} + \xi^d_j \right), \quad \forall j \in \mathcal{N}. \qquad (11)$$

The computation time per local iteration at EU $i$ depends on the size of the FL model, the number of CPU cycles needed to execute one sample of data, denoted by $\psi_i$, and the size $D_i$ of the local dataset at the EU [25]. Given that all samples $\{x_i, y_i\}_{i=1}^{D_i}$ have the same size (i.e., number of bits), the number of CPU cycles needed for EU $i$ to run one local iteration is $\psi_i \cdot D_i$. The total number of iterations needed to update the local model at EU $i$ is upper bounded by $O(\log(1/\varepsilon))$, where $\varepsilon$ is the local accuracy. This upper bound applies to a wide range of iterative algorithms, such as gradient descent, coordinate descent, and stochastic dual coordinate descent [26]. Moreover, it is proven in [27] that whatever algorithm is used in the computation phase, FL performance will not be affected as long as the convergence time of that algorithm is upper bounded by $O(\log(1/\varepsilon))$. Hence, the computation time for updating the local model at EU $i$ can be written as $T^c_i = f(|W_i|, \upsilon, \varepsilon, \psi_i, D_i, f_i)$, where $f_i$ is the CPU-cycle frequency and $\upsilon$ is a constant that depends on the dataset size and the conditions of the local problem [26]. We remark here that our scheme focuses on the delay over the links between the EUs and the edge nodes, since the delay over the links between the edge nodes and the centralized server can be considered constant for different edge node-EU associations.
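The latency terms above are straightforward to compute; the short Python sketch below evaluates (10), (11), and a simple computation-time model. The linear form of `local_compute_time` is our illustrative assumption, since the paper leaves $f(\cdot)$ generic.

```python
def uplink_latency(model_bits, up_rates, access_delays):
    """Eq. (10): worst-case EU-to-edge upload latency over all EUs."""
    return max(model_bits / r + xi for r, xi in zip(up_rates, access_delays))

def downlink_latency(model_bits, down_rates, access_delays):
    """Eq. (11): worst-case edge-to-EU download latency over all edges."""
    return max(model_bits / r + xi for r, xi in zip(down_rates, access_delays))

def local_compute_time(psi, num_samples, cpu_freq, iterations=20):
    """Assumed linear instance of T_i^c: cycles per local iteration
    (psi * num_samples) times the iteration count, over the CPU frequency."""
    return iterations * psi * num_samples / cpu_freq
```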

4.3 Transmission Energy Consumption Analysis

In the considered hierarchical architecture, it is assumed that an EU can be an IoT device (e.g., a smartphone) with limited battery capacity. Hence, it is important to consider the energy the EU consumes to transfer its local updates to the edge. The transmission energy consumption at EU $i$ is mainly a function of its wireless channel state, transmission rate, and allocated bandwidth. The wireless channel between an EU $i$ and an edge node $j$ is characterized by the received signal-to-noise ratio (SNR), denoted by $\gamma_{ij}$, which is defined as

$$\gamma_{ij} = \frac{P^r_{ij} \cdot |h_{ij}|^2}{N_0 \cdot B_{ij}}, \qquad (12)$$

where $P^r_{ij}$ is the received power at the edge node, $|h_{ij}|$ is the fading channel magnitude, $N_0$ is the noise spectral density, and $B_{ij}$ is the allocated bandwidth. We also consider a deterministic path-loss model where the EU and the edge node are separated by a distance $d_{ij}$; hence, the received power $P^r_{ij}$ is attenuated with respect to the transmission power $P^t_{ij}$ following $P^r_{ij} = P^t_{ij} \cdot \omega \cdot d_{ij}^{-\alpha}$, where $\alpha$ is the path-loss exponent (with $2 \le \alpha \le 6$) and $\omega$ is a constant that depends on the wavelength and the transmitter/receiver antenna gains [28]. Given the SNR $\gamma_{ij}$, the transmission rate of EU $i$ to edge node $j$ is defined as

$$r^u_{i,j} = B_{ij} \log_2(1 + \vartheta \gamma_{ij}), \qquad (13)$$

where $\vartheta = -1.5/\log(5 \cdot \mathrm{BER})$, as in [29], and BER is the target bit error rate. Substituting (12) into (13), the transmit power can be written as

$$P^t_{ij} = \frac{N_0 \cdot B_{ij}}{g_{ij}} \left( 2^{r_{ij}/B_{ij}} - 1 \right), \qquad (14)$$

where $g_{ij}$ is the channel gain, defined as

$$g_{ij} = \vartheta \cdot \omega \cdot d_{ij}^{-\alpha} \cdot |h_{ij}|^2. \qquad (15)$$

Accordingly, the total energy consumed by EU $i$ to send an update of length $|W_i|$ bits to edge node $j$ is

$$E_{ij} = \frac{P^t_{ij} \cdot |W_i|}{r_{ij}} = \frac{|W_i| \cdot N_0 \cdot B_{ij}}{r_{ij} \cdot g_{ij}} \left( 2^{r_{ij}/B_{ij}} - 1 \right). \qquad (16)$$

We remark that (16) accounts for the energy consumed by transmission, while ignoring the energy consumed by processing, since the latter is normally notably smaller and is affected by neither the EU assignment nor the wireless resource allocation.
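Under the stated channel model, (14) and (16) can be evaluated directly; the following minimal Python sketch does so (the function name and the example numbers are illustrative assumptions, not values from the paper).

```python
def tx_energy(model_bits, rate, bandwidth, gain, n0=4e-21):
    """Eq. (14) + Eq. (16): energy EU i spends uploading `model_bits`
    bits to edge j. `rate` in bit/s, `bandwidth` in Hz, `n0` in W/Hz."""
    p_tx = (n0 * bandwidth / gain) * (2 ** (rate / bandwidth) - 1)  # Eq. (14)
    return p_tx * model_bits / rate                                  # Eq. (16)

# Example: a 10-Mbit update at 2 Mb/s over 1 MHz with channel gain 1e-7.
energy_joules = tx_energy(10e6, 2e6, 1e6, 1e-7)
```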

5 JOINT EU ASSIGNMENT AND RESOURCE ALLOCATION

Although leveraging hierarchical FL avoids data exchange between different EUs, the FL procedure can still drain the system's resources. Thus, it is crucial to optimize the amount of resources used during the learning process, to avoid overloading the system while keeping the accuracy and convergence time at an acceptable level. This paper paves the way for optimizing hierarchical FL performance in resource-constrained heterogeneous systems by answering the following questions:

• To which edge node should the participating EUs be allocated?

• How can we optimally leverage the available amount of resources while adequately training the learning model?

These questions are particularly important in our system, since we consider a realistic scenario of distributed nodes and IoT devices, where: (i) the global data distribution is imbalanced, and (ii) the EUs' resources as well as the edge computing resources are not as abundant as at the centralized server. Herein, "resources" refers to the computation and communication resources, including time, energy, and wireless network bandwidth.

5.1 Problem Formulation

Typically, FL performance is measured by its convergence behavior, which depends on the divergence between the federated weights $w^b_f$ and the central weights $w^t_c$. Herein, the central weights represent the learning parameters calculated by centralized gradient descent over the aggregated dataset of all EUs. This divergence represents the deviation caused by distributing the datasets over different EUs while performing distributed learning. Hence, it is always desirable to attain minimal divergence in order to produce a model close to the conventional centralized solution.

The convergence bound of FedAvg and gradient-descent-based FL on non-IID data has been studied in [8], [10]. Such convergence analyses depict two important facts:

• there is always a trade-off between communication efficiency and convergence rate, and
• heterogeneity of the data slows down the convergence.

By extending the analysis in [10] to hierarchical FL, the upper bound on the divergence between the federated weights and the virtual central weights can be approximated as

$$\|w_f - w_c\| \propto \sum_{j=1}^{N} \sigma_j \times \|D^{(j)}\|_1, \qquad (17)$$

where $\sigma_j$ is the proportion of edge $j$'s dataset to the global dataset, and $\|D^{(j)}\|_1$ is the class-distribution distance between edge $j$'s dataset and the global aggregated virtual dataset. This upper bound can be obtained by adjusting the analysis in [10] to the hierarchical FL setting and writing the bound as a function of $\|D^{(j)}\|_1$. Hence, in order to minimize the divergence between the federated weights and the virtual central weights while accelerating FL convergence, it is crucial to maintain a balanced data distribution (similar to the global virtual dataset) at the edge nodes by uniformly distributing the acquired data across all edge nodes. This fact has also been proved mathematically in [20].

Based on the previous theoretical analysis, we formulate an optimization problem that considers the imbalanced data distribution and the wireless communication resources to dynamically adapt the assignment of EUs to edge nodes in real time, while minimizing the learning loss. Our problem aims at optimally allocating EUs to different edge nodes so as to obtain a balanced data distribution at the edge nodes, hence accelerating FL convergence. To achieve this, we opt to minimize the Kullback–Leibler divergence (KLD) of the data distributions [30] in order to tackle the global imbalanced-data problem. The KLD, also called relative entropy, measures how a probability distribution $H(c_k)$ differs from a second, reference probability distribution $Q(c_k)$ over the same random variable $c_k$. Let $H_j(c_k)$ be the data class distribution at edge node $j$ and $Q(c_k)$ the uniform distribution of a discrete random variable $C$ with possible values $\mathcal{C} = \{c_1, c_2, \cdots, c_K\}$. The KLD is defined as

$$D_{KL}(H_j \| Q) = \sum_{k=1}^{K} H_j(c_k) \log \frac{H_j(c_k)}{Q(c_k)}, \qquad (18)$$

where $\sum_{k=1}^{K} H_j(c_k) = 1$ and $\sum_{k=1}^{K} Q(c_k) = 1$, as well as $H_j(c_k) > 0$ and $Q(c_k) > 0$ for any $c_k \in \mathcal{C}$, such that $\mathcal{C}$ is the set of all possible classes available at all EUs. Herein, $H_j(c_k)$ represents the actual data distribution at the edge, while $Q(c_k)$ represents the ideal data distribution at the edge nodes (i.e., the uniform distribution, or the global virtual dataset distribution). A KLD of zero indicates that the data classes are uniformly distributed over all edge nodes, i.e., the acquired data at the edge nodes is IID.
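For example, the per-edge KLD in (18) against a uniform reference can be computed directly from class counts, as in this short Python sketch (the Table 2 counts are used purely as sample input).

```python
import numpy as np

def kld_to_uniform(class_counts):
    """Eq. (18): KL divergence between an edge's empirical class
    distribution and the uniform distribution (all counts must be > 0)."""
    h = np.asarray(class_counts, dtype=float)
    h /= h.sum()
    q = 1.0 / h.size
    return float(np.sum(h * np.log(h / q)))

# Edge 0 of the Seizure setup (Table 2) holds [1459, 25, 25] samples:
print(kld_to_uniform([1459, 25, 25]))  # strongly skewed -> large KLD
```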

Thus, the main objective of our optimization problem is to optimally allocate the $M$ EUs to the available $N$ edge nodes such that the KLD is minimized. However, given the wireless communication constraints, distributing the EUs over the edge nodes is not an easy task, especially when considering the non-IID data distribution across EUs. Our optimization problem is thus formulated as follows:

$$\textbf{P1:} \quad \min_{\lambda_{ij},\, B_{ij}} \;\; \sum_{j=1}^{N} D_{KL}(H_j \| Q) \qquad (19)$$

subject to

$$T^c_i + \sum_{j=1}^{N} \lambda_{ij} \cdot L_{ij} \le T^m, \quad \forall i \in \mathcal{M}, \qquad (20)$$

$$\sum_{j=1}^{N} \lambda_{ij} \cdot E_{ij} \le E^m_i, \quad \forall i \in \mathcal{M}, \qquad (21)$$

$$\sum_{i=1}^{M} \lambda_{ij} \cdot B_{ij} \le B^m_j, \quad \forall j \in \mathcal{N}, \qquad (22)$$

$$\sum_{j=1}^{N} \lambda_{ij} = 1, \quad \forall i \in \mathcal{M}, \qquad (23)$$

$$\lambda_{ij} \in \{0, 1\}, \quad \forall i \in \mathcal{M},\ \forall j \in \mathcal{N}, \qquad (24)$$

where $\lambda_{ij}$ is the edge assignment indicator, such that $\lambda_{ij} = 1$ when EU $i$ is allocated to edge node $j$. The constraint in (20) ensures that the transmitted updates from all EUs are received at the edge nodes within a maximum delay $T^m$. This constraint considers the computational delay $T^c_i$ at each EU, in addition to the transmission delay $L_{ij} = |W_i|/r^u_{i,j} + \xi^u_i$ from EU $i$ to edge $j$. The constraint in (21) captures the energy budget at the battery-operated EUs: it ensures that the energy consumed by EU $i$ to send its local updates to edge $j$ does not exceed the maximum transmission energy $E^m_i$. The constraint in (22) is the network capacity constraint, where $B_{ij}$ is the maximum fraction of the bandwidth $B^m_j$ that EU $i$ can use to communicate with edge $j$. We remark that $B_{ij}$ depends on the number of EUs communicating with the edge, and it is communicated by the edge nodes to the EUs. The constraints in (23) and (24) ensure that each EU is connected to exactly one edge node.

The optimization variables in (19) are the $\lambda_{ij}$'s and $B_{ij}$'s, i.e., each EU needs to be assigned to an edge node, and each edge node needs to allocate bandwidth to its assigned EUs. The proposed optimization problem assigns the EUs to the available edge nodes such that the edge nodes' data distributions become as close as possible to the uniform distribution. In other words, our optimization problem adjusts the EU allocation at the edge nodes in order to obtain a balanced data distribution over all edge nodes (by optimizing the KLD). From a practical perspective, this balanced data (or class) distribution accelerates the learning process and promotes swift convergence (as will be shown in our results).

Looking at the problem formulation in (19), one can see that it is an integer programming problem, which is NP-complete [31]. Moreover, the well-known approaches of converting the problem into a convex problem or a Geometric Program (GP) do not work in this case, due to the constraints in (23) and (24) and the non-linearity of the objective function (by the definition of the KLD in (18)). Thus, we propose the following methodology to solve this problem.

5.2 Proposed Solution

To solve the formulated problem in (19), we first reformulate the NP-complete optimization problem to have a linear objective function; we then relax it to a Linear Program (LP) by removing the bandwidth constraint, and solve the LP. After that, we round the LP solution to one that satisfies the original integer constraint. Finally, given the LP solution, we optimize the bandwidth allocation to the EUs such that the KLD minimization obtained by the LP solution is maintained. In what follows, we illustrate the proposed solution in detail.

First, we rewrite the objective function as follows:

$$Z = \min_{\lambda_{ij},\, B_{ij}} \sum_{j=1}^{N} D_{KL}(H_j \| Q) = \min_{\lambda_{ij},\, B_{ij}} \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \log \frac{H_j}{Q} = \min_{\lambda_{ij},\, B_{ij}} \left[ \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \log H_j - \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \log Q \right]. \qquad (25)$$

For the sake of brevity, we write $H_j$ for $H_j(c_k)$ and $Q$ for $Q(c_k)$. Given that $Q$ represents the uniform distribution, i.e., $Q$ is fixed, we have

$$Z = \min_{\lambda_{ij},\, B_{ij}} \left[ \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \log H_j - \log Q \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \right] = \min_{\lambda_{ij},\, B_{ij}} \left[ \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \log H_j - N \log Q \right], \qquad (26)$$

since $\sum_{k=1}^{K} H_j = 1$ for each edge node $j$, so the double sum over $H_j$ equals $N$.

Given that the term $N \log Q$ is constant with respect to $\lambda_{ij}$ and $B_{ij}$, the objective function can be written as

$$Z = \min_{\lambda_{ij},\, B_{ij}} \sum_{j=1}^{N} \sum_{k=1}^{K} H_j \log H_j = \max_{\lambda_{ij},\, B_{ij}} \sum_{j=1}^{N} \chi_j(C). \qquad (27)$$

The objective function in (27) is thus equivalent to maximizing the information entropy (or Shannon entropy) $\chi_j$ at the edge nodes [32], where $\chi_j(C) = -\sum_{k=1}^{K} H_j(c_k) \log H_j(c_k)$. It can easily be shown that the maximum entropy is achieved only by the uniform distribution, which can be attained by distributing all data classes equally over all edge nodes. This implies that the $H_j$'s should have the same value for all $j \in \mathcal{N}$. However, $H_j$ depends on: (i) the $\lambda_{ij}$'s, i.e., the EUs allocated to edge node $j$, and (ii) the $c_{ik}$'s, i.e., the data classes held by those EUs. Hence, $H_j(c_k)$ can be defined as

$$H_j(c_k) = \frac{\sum_{i=1}^{M} \lambda_{ij} \cdot c_{ik}}{\sum_{k=1}^{K} \sum_{i=1}^{M} \lambda_{ij} \cdot c_{ik}}. \qquad (28)$$
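The equivalence claimed in (27) follows in one line from (18): for a uniform reference $Q(c_k) = 1/K$,

$$D_{KL}(H_j \| Q) = \sum_{k=1}^{K} H_j(c_k) \log H_j(c_k) + \log K = \log K - \chi_j(C),$$

so minimizing the per-edge KLD and maximizing the per-edge entropy differ only by the constant $\log K$.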

Given that the maximum entropy is achieved by the uniform distribution, our objective function can be written as

$$Z = \min_{\lambda_{ij},\, B_{ij}} \sum_{k=1}^{K} \sum_{\{j, j'\} \in S} \left| \sum_{i=1}^{M} \lambda_{ij} \cdot c_{ik} - \sum_{i=1}^{M} \lambda_{ij'} \cdot c_{ik} \right|, \qquad (29)$$

where $S$ is the set of all pairs of edge nodes selected from $\mathcal{N}$ without replacement. For example, with 3 edge nodes, $S = \{\{1, 2\}, \{1, 3\}, \{2, 3\}\}$.

After simplifying the objective function, we are still left with the non-linear constraint in (20), due to the dependence of the transmission rate on the bandwidth, as well as the integer constraint in (24). Thus, to convert the formulated optimization problem into an LP that can be solved using many well-known techniques [33], we reformulate the problem as follows:

$$\textbf{P2:} \quad \min_{\lambda_{ij}} \;\; \sum_{k=1}^{K} \sum_{\{j, j'\} \in S} \left| \sum_{i=1}^{M} \lambda_{ij} \cdot c_{ik} - \sum_{i=1}^{M} \lambda_{ij'} \cdot c_{ik} \right| \qquad (30)$$

subject to

$$T^c_i + \sum_{j=1}^{N} \lambda_{ij} \cdot L_{ij} \le T^m, \quad \forall i \in \mathcal{M}, \qquad (31)$$

$$\sum_{j=1}^{N} \lambda_{ij} \cdot E_{ij} \le E^m_i, \quad \forall i \in \mathcal{M}, \qquad (32)$$

$$\sum_{j=1}^{N} \lambda_{ij} = 1, \quad \forall i \in \mathcal{M}, \qquad (33)$$

$$0 \le \lambda_{ij} \le 1, \quad \forall i \in \mathcal{M},\ \forall j \in \mathcal{N}. \qquad (34)$$
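The relaxed problem P2 is a convex program (an LP once the absolute values are expanded with auxiliary variables). A minimal sketch using the cvxpy modeling library, which performs that expansion automatically, could look as follows; the paper does not specify a solver, all names are ours, and equal per-EU bandwidth is assumed when precomputing the delay matrix L, as in the first step of Algorithm 1.

```python
import cvxpy as cp

def solve_relaxed_assignment(c, L, T_c, E, T_max, E_max):
    """LP relaxation of P2. c[i, k]: class-k sample count at EU i;
    L[i, j]: EU i -> edge j delay; T_c[i]: local compute time;
    E[i, j]: upload energy; E_max[i]: EU i's energy budget.
    Returns fractional lambda[i, j] for the SCA/DCA rounding step."""
    M, N = L.shape
    lam = cp.Variable((M, N), nonneg=True)
    edge_counts = lam.T @ c                     # (N x K) per-edge class counts
    pairs = [(a, b) for a in range(N) for b in range(a + 1, N)]
    objective = sum(cp.sum(cp.abs(edge_counts[a] - edge_counts[b]))
                    for a, b in pairs)          # Eq. (30)
    constraints = [
        cp.sum(lam, axis=1) == 1,                              # Eq. (33)
        lam <= 1,                                              # Eq. (34)
        T_c + cp.sum(cp.multiply(lam, L), axis=1) <= T_max,    # Eq. (31)
        cp.sum(cp.multiply(lam, E), axis=1) <= E_max,          # Eq. (32)
    ]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return lam.value
```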

Now, to solve the original problem in (19) and obtain the best edge node assignment with bandwidth allocation for all EUs, we propose an efficient EU Assignment and Resource Allocation (EARA) algorithm (see Algorithm 1). The proposed EARA algorithm solves our problem in two steps. Initially, it is assumed that each edge node has enough bandwidth to allocate equally to all associated EUs. Then, in the first step, the problem in (30) is solved to obtain the $\lambda_{ij}$'s. Since (30) relaxes the integer constraint on $\lambda_{ij}$ (while assuming equal bandwidth allocation for all EUs), the obtained $\lambda_{ij}$ is rounded to 0 or 1 such that the original constraints in (23) and (24) are maintained. We remark here that the proposed EARA algorithm enhances the hierarchical FL performance by giving the EUs two configuration options: Single Connectivity Allocation (SCA) and Dual Connectivity Allocation (DCA). The former assigns each EU to exactly one edge node, while the latter allows an EU to connect to one or two edge nodes simultaneously, leveraging Dual Connectivity (DC)² and multicast transmission [34], [35]. We remark that DC is one of the substantial technologies adopted by 5G to increase network throughput and mobility robustness by enabling users to connect simultaneously to macro and small-cell base stations [36]. In addition, multicast transmission provides significant bandwidth savings by sending a duplicate model update from one EU to two edge nodes on the same resource share. It is also important to note that the DCA configuration represents an extension to our problem formulation, since it relaxes the constraint in (33) to allow an EU to connect to one or two edge nodes simultaneously.

2. In Release 12 of the 3GPP LTE specifications, DC was initially designed to enable users to connect to two cells at the same time, preferably in heterogeneous networks [34]. The non-standalone deployment of 5G has since extended the concept of DC to multiple radio technologies, in particular for connecting 5G cells to a 4G core network.

Using the SCA configuration, we set $\lambda^*_{ij} = 1$, where

$$\lambda^*_{ij} = \arg\max_{j \in \mathcal{N}} (\lambda_{ij}), \qquad (35)$$

while setting the other $\lambda_{ij}$'s, $\forall j \in \mathcal{N} \setminus \{\lambda^*_{ij}\}$, to 0. In the DCA configuration, we consider the two largest values of the $\lambda_{ij}$'s, i.e., $\lambda^1_{ij} = \arg\max_{j \in \mathcal{N}} (\lambda_{ij})$ and $\lambda^2_{ij} = \arg\max_{j \in \mathcal{N} \setminus \{\lambda^1_{ij}\}} (\lambda_{ij})$, such that: (i) $\lambda^1_{ij}$ is set to 1, and (ii) $\lambda^2_{ij}$ is set to 1 only if $\lambda^2_{ij} > \nu$, while setting the other $\lambda_{ij}$'s, $\forall j \in \mathcal{N} \setminus \{\lambda^1_{ij}, \lambda^2_{ij}\}$, to 0. Herein, $\nu$ is a predefined threshold that represents the EU's ability to connect to two edge nodes.
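The rounding rule above is simple to express in code; a short Python sketch follows (the value of $\nu$ is chosen arbitrarily here, since the paper leaves it as a tunable threshold).

```python
import numpy as np

def round_assignment(lam, nu=0.3, dca=False):
    """Round the relaxed LP solution per the SCA/DCA rules: keep the
    largest lambda per EU; under DCA also keep the runner-up edge
    whenever its relaxed weight exceeds the threshold nu."""
    rounded = np.zeros_like(lam)
    for i in range(lam.shape[0]):
        order = np.argsort(lam[i])[::-1]   # edges by decreasing lambda
        rounded[i, order[0]] = 1           # lambda*_ij (SCA)
        if dca and lam[i, order[1]] > nu:
            rounded[i, order[1]] = 1       # second edge (DCA)
    return rounded
```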

In the second step, the available bandwidth $B^m_j$ of an edge node $j$ is allocated to the associated EUs based on their importance. Herein, the importance of an EU refers to its weight (or role) in minimizing the KLD. For instance, EUs whose data classes differ from those already available at edge node $j$ are weighted more than others, in order to maintain the KLD minimization. In this context, edge $j$ ranks all $M_j$ EUs assigned to it after solving (30). It then allocates the minimum bandwidth required by the most important EU to satisfy its latency constraint in (20). This bandwidth allocation process is repeated until all EUs receive their bandwidth allocation or the edge node exhausts its available bandwidth budget, i.e., the original constraint in (22) is maintained.

6 PERFORMANCE EVALUATION

In this section, we first present the simulation environment used to derive our results. We then assess the performance of the proposed hierarchical FL framework against state-of-the-art techniques. In particular, our results assess the performance of the proposed EARA algorithm using two real-world healthcare datasets, measuring the improvement in learning speed and accuracy.

6.1 Environment setup

For our performance evaluation, we use two datasets, namely the Heartbeat dataset and the Seizure dataset. The Heartbeat dataset [37] includes groups of heartbeat signals, i.e., electrocardiogram (ECG) signals, derived from the MIT-BIH Arrhythmia dataset [38] for heartbeat classification. These signals represent the normal case as well as cases affected by different arrhythmias and myocardial infarction; hence, they are used to explore heartbeat classification with a deep neural network architecture. The Seizure dataset comprises electroencephalogram (EEG) measurements, i.e., mixtures of waveforms recorded over time that reflect the electrical activity of the brain, obtained from electrodes placed on the scalp, typically in patients with epilepsy. It consists of EEG recordings from such patients, acquired on a Nicolet machine at the American University of Beirut Medical Center (AUBMC) using the international 10-20 system for electrode placement.

Algorithm 1 EU Assignment and Resource Allocation (EARA) Algorithm

1: Input: B_ij = B_f, where B_f is the fixed bandwidth assigned to EU i at edge j.
2: At the centralized server: (EU assignment)
3: Solve the problem in (30) to obtain λ_ij.
4: if SCA is ON then
5:   Set λ*_ij = 1, while setting λ_ij = 0, ∀j ∈ N \ {λ*_ij}.
6: end if
7: if DCA is ON then
8:   Set λ¹_ij = 1.
9:   if λ²_ij > ν then
10:    λ²_ij = 1.
11:  else
12:    λ²_ij = 0.
13:  end if
14:  Set λ_ij = 0, ∀j ∈ N \ {λ¹_ij, λ²_ij}.
15: end if
16: Send the values of the λ_ij's to all edge nodes.
17: At the edge: (bandwidth allocation)
18: Receive the λ_ij's from the centralized server.
19: Sort the assigned EUs by their weights in descending order.
20: for i = 1 → M_j do
21:   Calculate the minimum B_ij that satisfies the constraint in (20).
22:   if Σ_{i=1}^{M_j} λ_ij · B_ij ≥ B^m_j then
23:     Break (the available bandwidth is consumed).
24:   end if
25: end for
26: return λ_ij, B_ij.

TABLE 2
Initial edge-level distribution for the Seizure dataset (instances per class).

Edge  class 0  class 1  class 2
0     1459     25       25
1     25       1160     25
2     25       25       1238

The occurrence, number, and timing of seizure(s) in each EEG recording were annotated by an experienced electroencephalographer using established criteria, which can be summarized as an abrupt change in the frequency or amplitude of the waveforms that exhibits evolution in time or space and lasts for 10 seconds or more [39].

For the distributed settings, we consider different numbers of EUs, with the data distributed randomly among the EUs such that a non-IID data distribution across EUs is maintained. For the Seizure dataset, the number of possible classes is 3; hence, we consider 3 edge nodes with 13 users. These users are initially assigned to the edge nodes such that the edge-level distributions in Table 2 are obtained. For the Heartbeat dataset, the number of classes is 5; hence, we consider 5 edge nodes with 18 users, initially assigned to the edge nodes according to the distribution illustrated in Table 3.

TABLE 3
Initial edge-level distribution for the Heartbeat dataset (instances per class, ×10³).

Edge  class 0  class 1  class 2  class 3  class 4
0     10       10       0        0        0
1     0        0        10       10       0
2     10       0        0        0        10
3     0        10       10       0        0
4     0        0        0        10       10

Models architecture: For the Heartbeat dataset, we use the model presented in [40], which expects 1 input channel and outputs probabilities for 5 classes. For the Seizure dataset, we use a similar model, adapted to accommodate the 19 input channels and the 3 output classes. Our models' details can be found in [41].

The training-related parameters are the same for both datasets: the number of local epochs is 1, the local batch size is 10, and the learning rate is 0.001, with the ADAM optimizer. For the centralized baseline, the batch size is 50 for the Heartbeat dataset and 30 for the Seizure dataset. This is because the local batch size should be multiplied by the number of edge nodes, so that at each communication round the same number of data points is used by both the central and federated models [10].
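For reference, these hyperparameters can be gathered into a single configuration object; the dictionary layout below is ours, while the values are those reported above.

```python
# Hyperparameters reported in the paper; dictionary layout is ours.
TRAIN_CONFIG = {
    "local_epochs": 1,
    "local_batch_size": 10,
    "learning_rate": 1e-3,
    "optimizer": "adam",
    # Centralized baseline: local batch size x number of edge nodes.
    "central_batch_size": {"heartbeat": 10 * 5, "seizure": 10 * 3},
}
```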

6.2 Simulation results

In what follows, we compare the proposed EARA algorithm, with its two configuration options SCA and DCA, against state-of-the-art solutions such as [18], [42], hereinafter referred to as Distance-Based Allocation (DBA). The DBA scheme considers the same hierarchical FL architecture as in Figure 2, while assigning EUs to the nearest edge node based on the distance between the EUs and the edge nodes. Furthermore, we compare our EARA algorithm against a baseline approach, namely centralized learning, which represents the benchmark for FL performance since it assumes that all EUs' data is collected at a centralized server for training.

The first aspect we investigate is the influence of EU dropping on the obtained classification accuracy. To this end, Figure 3 presents the classification accuracy as a function of the communication rounds between the edge nodes and the centralized server, considering the DBA scheme. In this figure, we consider different values of the Users Participating Percentage (UPP), which represents the percentage of EUs whose updates the edge nodes receive during the FL training process. We also consider two extreme cases, namely, Single Class Dropping (SCD) and Dual Classes Dropping (DCD). The former refers to missing all samples of one data class because its EUs drop out, while the latter refers to missing the samples of two data classes. We observe that the DBA scheme performs significantly worse than centralized learning, even when all EUs' updates are considered, i.e., UPP=1. An intuitive explanation is that the non-IID data on different edge nodes increases the bias in the pattern learned by the centralized FL server, which results in low prediction accuracy. This confirms the empirical observations in [8] that heterogeneity of the data harms the accuracy and slows down the convergence. More interestingly, decreasing the number of participating EUs in the training yields a substantial reduction in the obtained accuracy, especially when the dropping of EUs leads to missing the samples of one or two data classes. Thus, it is crucial to address the heterogeneity of the data at different edge nodes, in order to reduce the effect of non-IID data, while retaining all EUs with important/unique data. These results are the main motivation behind the proposed EARA scheme.

Fig. 3. Effect of varying UPP on the obtained classification accuracy for the DBA scheme, using the Heartbeat dataset. Both panels plot classification accuracy (%) versus communication rounds for centralized learning and DBA, with (a) UPP=0.25 and (b) UPP=0.5, including the SCD and DCD cases.

The second aspect we are interested in is how the proposed EARA algorithm influences the performance of hierarchical FL. Figure 4 compares the proposed EARA algorithm, with its two configuration options, against the DBA approach. This figure depicts the total KLD over all edge nodes (i.e., Σ_{j=1}^{N} D_KL(H_j || Q)) while varying the distance between EUs and edge nodes. It is clear that our EARA algorithm consistently outperforms the DBA approach: DBA yields the highest KLD, the proposed EARA algorithm decreases the local data imbalance, and the dual-connectivity option (EARA-DCA) obtains the lowest KLD. However, as the distance between EUs and edge nodes increases, the energy consumption grows significantly for the EUs that are assigned to an edge node other than the nearest one. At large distances this violates the energy constraint in (21), so the EUs are assigned to the nearest edge nodes, and the EARA performance converges to that of DBA. In short, Figure 4 suggests that the edge nodes can achieve better-balanced data when more EUs participate in training or when the EUs are assigned uniformly to the edge nodes.

Fig. 4. KLD variations with increasing distance between EUs and edge nodes, for different EUs assignment strategies using the Heartbeat dataset, considering: (a) 3 edge nodes and 13 EUs, (b) 5 edge nodes and 18 EUs.
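The KLD metric of Figure 4 follows directly from the edge-level class histograms; a minimal sketch, assuming H_j is each edge's empirical class distribution and Q is the uniform reference:

    import numpy as np

    def total_kld(edge_histograms):
        # Sum over j of D_KL(H_j || Q), with Q uniform over the classes and
        # the convention 0 * log 0 = 0 for empty classes.
        total = 0.0
        for counts in edge_histograms:
            h = np.asarray(counts, dtype=float)
            h /= h.sum()                     # empirical distribution H_j
            q = 1.0 / len(h)                 # uniform reference Q
            nz = h > 0
            total += float(np.sum(h[nz] * np.log(h[nz] / q)))
        return total

    # Example: the initial Seizure split of Table 2.
    print(total_kld([[1459, 25, 25], [25, 1160, 25], [25, 25, 1238]]))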

The better performance of the proposed EARA algorithm in decreasing the local data imbalance leads to a significant enhancement in the obtained classification accuracy, as shown in Figure 5. Each marker therein corresponds to an EUs assignment strategy, and its x- and y-coordinates correspond, respectively, to the communication rounds and the achieved classification accuracy. This figure highlights how the performance of our algorithm approaches the centralized benchmark solution, while reducing the communication rounds between the edge nodes and the centralized server by 75-85% compared to the DBA approach, for the same model accuracy. This is because the adopted EUs assignment strategy maintains balanced data distributions at the different edge nodes. We highlight here that obtaining a balanced, or less skewed, data distribution is important to avoid weight divergence of the learning model. In contrast, the DBA approach converges more slowly and saturates at a lower accuracy than our EARA algorithm, due to the weight divergence at the edge nodes affected by their locally skewed data distributions. More interestingly, leveraging the EARA algorithm with the dual-connectivity option (EARA-DCA) allows duplicate updates to be sent to two edge nodes concurrently, further reducing the effect of imbalanced data, which leads to outperforming even the centralized learning benchmark. This resembles the well-known machine learning observation that accuracy improves as the scale of the training expands.

Fig. 5. Classification accuracy as a function of the communication rounds, for different EUs assignment strategies: (a) using the Heartbeat dataset, (b) using the Seizure dataset. The annotations mark the 75% and 85% reductions in communication rounds achieved by EARA relative to DBA.

Figure 5 also highlights that the differences in the obtained accuracy under different EUs assignment schemes are most evident on the Heartbeat dataset. This is because the Heartbeat dataset poses a more challenging classification problem than the Seizure dataset; hence, the imbalanced data has a major effect on the obtained performance. Moreover, we remark that our focus in this paper is to minimize the gap between the hierarchical FL performance and the centralized learning performance, rather than to obtain the most accurate learning model. Hence, the adopted deep learning models, described earlier, are sufficient for this purpose.

Fig. 6. Communication traffic per EU for different EUs assignment strategies, using the Heartbeat dataset.

Finally, in Figure 6, we assess the communication traffic load per EU for our EARA approach compared to the DBA approach, while achieving 90% classification accuracy on the Heartbeat dataset. The figure shows that the proposed approach is communication-efficient: EARA-SCA reduces the communication traffic by 50% compared to DBA. EARA-DCA significantly reduces the communication rounds, and hence reduces the communication traffic by 73% compared to DBA for EUs that are assigned to one edge node, i.e., single-connectivity EUs (SC EUs). For EUs that are assigned to two edge nodes, i.e., dual-connectivity EUs (DC EUs), the communication traffic increases by 3% compared to EARA-SCA, while still generating 47% less communication traffic than DBA. Herein, the communication traffic is calculated for the deep learning models described earlier, with 14,789 parameters, assuming that each parameter is represented by 4 bytes, as in the PyTorch tutorial [43]. Moreover, we remark that, by leveraging recent 5G technologies, two edge nodes (such as a macro-cell base station and a small-cell base station) can simultaneously transmit data to one EU with the help of millimeter-wave (mmWave) massive MIMO [44]. This can significantly decrease the overhead resulting from dual connectivity.
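These traffic figures follow from back-of-the-envelope arithmetic; a sketch, where the round counts are illustrative placeholders rather than the exact values behind Figure 6:

    # One model update: 14,789 parameters x 4 bytes x 8 bits ~ 0.47 Mb.
    PARAMS, BYTES_PER_PARAM = 14_789, 4
    UPDATE_MB = PARAMS * BYTES_PER_PARAM * 8 / 1e6

    def traffic_mb(rounds, copies_per_round=1):
        # Uplink traffic of one EU over training; a DC EU sends duplicate
        # updates to two edge nodes per round, i.e., copies_per_round = 2.
        return rounds * copies_per_round * UPDATE_MB

    print(traffic_mb(100), traffic_mb(30, copies_per_round=2))  # illustrative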

7 CONCLUSION

In this paper, we propose an efficient user assignment and resource allocation scheme for hierarchical FL. The proposed scheme leverages the massive data generated by IoT devices for training deep learning models, while effectively addressing the challenges and requirements posed by data privacy and resource-constrained environments. In particular, the proposed scheme renders the distribution of the EUs' data at each edge node close to the uniform distribution, which significantly reduces the communication rounds between the edge nodes and the centralized server. The proposed scheme significantly alleviates the impact of local imbalanced and non-IID data, and reduces the KLD compared to state-of-the-art solutions that rely on distance-based user assignment. Hence, it accelerates the learning process and reduces the communication traffic by decreasing the number of global aggregation rounds at the centralized server. Our results show that the proposed approach can reduce the communication rounds between the edge nodes and the centralized server by 75-85% compared to state-of-the-art solutions, while preserving the same model accuracy.

Future research directions include developing online algorithms that select a subset of outstanding users to accelerate the hierarchical FL process while minimizing the communication overhead. Moreover, the tight synchronization needed among nodes in FL and the reliance on a centralized server remain challenging problems that require further research to make FL techniques suitable for highly dynamic scenarios such as vehicular applications.

ACKNOWLEDGEMENT

This work was made possible by NPRP grant # NPRP12S-0305-190231 from the Qatar National Research Fund (a member of Qatar Foundation). The findings achieved herein are solely the responsibility of the authors.

REFERENCES

[1] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2018.

[2] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller, Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer International Publishing, 2019.

[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[4] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, "Robust and communication-efficient federated learning from non-i.i.d. data," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9, pp. 3400–3413, 2020.

[5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.

[6] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.

[7] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.

[8] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," arXiv:1907.02189, 2019.

[9] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[10] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv:1806.00582 [cs, stat], June 2018. [Online]. Available: http://arxiv.org/abs/1806.00582

[11] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, "Federated learning in the sky: Joint power allocation and scheduling with UAV swarms," in ICC 2020 - 2020 IEEE International Conference on Communications (ICC), 2020, pp. 1–6.

[12] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), 2019, pp. 1–7.

[13] H. Wang, Z. Kaplan, D. Niu, and B. Li, "Optimizing federated learning on non-IID data with reinforcement learning," in IEEE INFOCOM, 2020, p. 10.

[14] M. Chen, H. Vincent Poor, W. Saad, and S. Cui, "Convergence time optimization for federated learning over wireless networks," IEEE Transactions on Wireless Communications, pp. 1–1, 2020.

[15] C. Feng, Y. Wang, Z. Zhao, T. Q. S. Quek, and M. Peng, "Joint optimization of data sampling and user selection for federated learning in the mobile edge computing systems," in IEEE International Conference on Communications Workshops (ICC Workshops), 2020, pp. 1–6.

[16] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," arXiv:1704.05021, 2017. [Online]. Available: https://arxiv.org/abs/1704.05021

[17] J. Mills, J. Hu, and G. Min, "Communication-efficient federated learning for wireless edge intelligence in IoT," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 5986–5994, 2020.

[18] L. Liu, J. Zhang, S. H. Song, and K. B. Letaief, "Client-edge-cloud hierarchical federated learning," in IEEE International Conference on Communications (ICC), 2020, pp. 1–6.

[19] W. Wu, L. He, W. Lin, and R. Mao, "Accelerating federated learning over reliability-agnostic clients in mobile edge computing systems," arXiv:2007.14374 [cs, eess], July 2020. [Online]. Available: http://arxiv.org/abs/2007.14374

[20] M. Duan, D. Liu, X. Chen, R. Liu, Y. Tan, and L. Liang, "Self-balancing federated learning with global imbalanced data in mobile systems," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 1, pp. 59–71, 2021.

[21] N. Mhaisen, A. Awad, A. Mohamed, A. Erbad, and M. Guizani, "Analysis and optimal edge assignment for hierarchical federated learning on non-IID data," arXiv:2012.05622, Dec. 2020. [Online]. Available: https://arxiv.org/abs/2012.05622

[22] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.

[23] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, "Revisiting distributed synchronous SGD," in International Conference on Learning Representations Workshop Track, 2016. [Online]. Available: https://arxiv.org/abs/1604.00981

[24] A. A. Abdellatif, A. Mohamed, and C.-F. Chiasserini, "User-centric networks selection with adaptive data compression for smart health," IEEE Systems Journal, vol. 12, no. 4, pp. 3618–3628, 2018.

[25] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 2019, pp. 1387–1395.

[26] J. Konečný, Z. Qu, and P. Richtárik, "Semi-stochastic coordinate descent," Optimization Methods and Software, vol. 32, no. 5, pp. 993–1005, 2017.

[27] C. Ma, J. Konečný, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, "Distributed optimization with arbitrary local solvers," Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, 2017.

[28] T. S. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. Prentice Hall, 2001.

[29] G. J. Foschini and J. Salz, Digital Communications over Fading Radio Channels, 2nd ed. Bell Systems Technical, 1983.

[30] S. Kullback, Information Theory and Statistics, ser. A Wiley Publication in Mathematical Statistics. Dover Publications, 1997. [Online]. Available: https://books.google.com.qa/books?id=05LwShwkhFYC

[31] S. Boyd and L. Vandenberghe, Convex Optimization, 1st ed. Cambridge University Press, 2003.

[32] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley & Sons, 2006.

[33] J. Matoušek and B. Gärtner, Understanding and Using Linear Programming, 1st ed. Springer, Berlin, Heidelberg, 2007.

[34] J. F. Monserrat, F. Bouchmal, D. Martin-Sacristan, and O. Carrasco, "Multi-radio dual connectivity for 5G small cells interworking," IEEE Communications Standards Magazine, vol. 4, no. 3, pp. 30–36, 2020.

[35] Y. Sun, Z. Chen, M. Tao, and H. Liu, "Bandwidth gain from mobile edge computing and caching in wireless multicast systems," IEEE Transactions on Wireless Communications, vol. 19, no. 6, pp. 3992–4007, 2020.

[36] L. Sharma, B. B. Kumar, and S. Wu, "Performance analysis and adaptive DRX scheme for dual connectivity," IEEE Internet of Things Journal, vol. 6, no. 6, pp. 10289–10304, 2019.

[37] "ECG heartbeat categorization dataset: Segmented and preprocessed ECG signals for heartbeat classification," https://www.kaggle.com/shayanfazeli/heartbeat, accessed: 2021-01-13.

[38] G. B. Moody and R. G. Mark, "The impact of the MIT-BIH arrhythmia database," IEEE Engineering in Medicine and Biology Magazine, vol. 20, no. 3, pp. 45–50, 2001.

[39] D. L. Schomer and F. H. L. da Silva, Niedermeyer's Electroencephalography: Basic Principles, Clinical Applications, and Related Fields, 6th ed. Lippincott Williams & Wilkins, 2011.

[40] "An application of deep convolutional networks for the classification of ECG signal," https://github.com/eddymina/, accessed: 2021-01-13.

[41] "Datasets models," https://github.com/Naram-m/Datasets-Models, accessed: 2021-01-13.

[42] M. S. H. Abad, E. Ozfatura, D. Gündüz, and O. Ercetin, "Hierarchical federated learning across heterogeneous cellular networks," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8866–8870.

[43] "Training a classifier - PyTorch Tutorials 1.6.0 documentation," https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html.

[44] Y. Yang, X. Deng, D. He, Y. You, and R. Song, "Machine learning inspired codeword selection for dual connectivity in 5G user-centric ultra-dense networks," IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 8284–8288, 2019.