
Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees

Constantin Philippenko, Aymeric Dieuleveut
CMAP, École Polytechnique, Institut Polytechnique de Paris
[firstname].[lastname]@polytechnique.edu

Abstract

We introduce a framework – Artemis – to tackle the problem of learning in a distributed or federated setting with communication constraints and device partial participation. Several workers (randomly sampled) perform the optimization process using a central server to aggregate their computations. To alleviate the communication cost, Artemis allows compressing the information sent in both directions (from the workers to the server and conversely), combined with a memory mechanism. It improves on existing algorithms that only consider unidirectional compression (to the server), or use very strong assumptions on the compression operator, and often do not take into account device partial participation. We provide fast rates of convergence (linear up to a threshold) under weak assumptions on the stochastic gradients (noise variance bounded only at the optimal point) in the non-i.i.d. setting, highlight the impact of memory for unidirectional and bidirectional compression, and analyze Polyak-Ruppert averaging. We use convergence in distribution to obtain a lower bound on the asymptotic variance that highlights practical limits of compression. Finally, we provide experimental results to demonstrate the validity of our analysis.

1 Introduction

In modern large scale machine learning applications, optimization has to be processed in a distributed fashion, using a potentially large number N of workers. In the data-parallel framework, each worker only accesses a fraction of the data: new challenges have arisen, especially when communication constraints between the workers are present, or when the distributions of the observations on each node are different.

In this paper, we focus on first-order methods, especially Stochastic Gradient Descent [Bottou, 1999; Robbins & Monro, 1951] in a centralized framework: a central machine aggregates the computation of the N workers in a synchronized way. This applies to both the distributed [e.g. Li et al., 2014] and the federated learning [introduced in Konečný et al., 2016; McMahan et al., 2017] settings.

Formally, we consider a number of features d ∈ N*, and a convex cost function F : R^d → R. We want to solve the following convex optimization problem:

    min_{w ∈ R^d} F(w)   with   F(w) = (1/N) ∑_{i=1}^{N} F_i(w) ,    (1)

where (F_i)_{i=1}^{N} is a local risk function for the model w on worker i. In particular, in the classical supervised machine learning framework, we fix a loss ℓ and access, on a worker i, n_i observations (z^i_k)_{1≤k≤n_i} following a distribution D_i. In this framework, F_i can be either the (weighted) local empirical risk, w ↦ (1/n_i) ∑_{k=1}^{n_i} ℓ(w, z^i_k), or the expected risk w ↦ E_{z∼D_i}[ℓ(w, z)]. At each iteration of the algorithm, each worker can get an unbiased oracle on the gradient of the function F_i (typically either by choosing uniformly an observation in its dataset or in a streaming fashion, getting a new observation at each step).
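As an illustration of such an oracle, here is a minimal sketch (ours, not from the paper) of a worker's unbiased local gradient oracle in the least-squares case; the names `local_stochastic_gradient`, `X_i`, `y_i` denote a hypothetical local dataset of worker i:

```python
import numpy as np

def local_stochastic_gradient(w, X_i, y_i, rng):
    """Unbiased oracle of the local gradient of F_i(w) for least-squares:
    sample one observation uniformly from worker i's local data."""
    k = rng.integers(len(y_i))            # uniform index in the local dataset
    x, y = X_i[k], y_i[k]
    return (x @ w - y) * x                # gradient of (1/2) * (<x, w> - y)^2
```

In the streaming setting, the observation would instead be drawn fresh from D_i at each call.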

Our goal is to reduce the amount of information exchanged between workers, to accelerate the learning process, limit the bandwidth usage, and reduce energy consumption. Indeed, the communication cost has been identified as an important bottleneck in distributed settings [e.g. Strom, 2015]. In their overview of the federated learning framework, Kairouz et al. [2019] also underline in Section 3.5 two possible directions to reduce this cost: 1) compressing communication from workers to the central server (uplink); 2) compressing the downlink communication.

Most of the papers considering the problem of reducing the communication cost [Alistarh et al., 2017; Agarwal et al., 2018; Wu et al., 2018; Karimireddy et al., 2019; Mishchenko et al., 2019; Horváth et al., 2019; Li et al., 2020; Horváth & Richtárik, 2020] only focus on compressing the message sent from the workers to the central node. This direction has the highest potential to reduce the total runtime given that the bandwidth for upload is generally more limited than for download, and that for some regimes with a large number of workers,


the downlink communication, that corresponds to a “one-to-N” communication, may not be the bottleneck compared to the “N-to-one” uplink.

Nevertheless, there are several reasons to also consider downlink compression. First, the difference between upload and download speeds is not significant enough to justify ignoring the impact of the downlink direction (see Appendix B for an analysis of bandwidth). If we consider for instance a small number N of workers training a very heavy model – the size of Deep Learning models generally exceeds hundreds of MB [Dean et al., 2012; Huang et al., 2019] – the training speed will be limited by the exchange time of the updates, thus using downlink compression is key to accelerating the process. Secondly, in a different framework in which a network of smartphones collaborates to train a large scale model in a federated framework, participants in the training would not be eager to download hundreds of MB for each update on their phone. Here again, compression appears to be necessary. To encompass all situations, our framework implements compression in either or both directions, with possibly different compression levels.

Bidirectional compression (i.e. compressing both uplink and downlink) raises new challenges. In the downlink step, if we compress the model, the quantity compressed does not tend to zero. Consequently, the compression error significantly hinders convergence. To circumvent this problem we compress the gradient, that may asymptotically approach zero. Double compression has been recently considered by Tang et al. [2019]; Zheng et al. [2019]; Liu et al. [2020]; Yu et al. [2019]. The most recent work, Dore, defined by Liu et al. [2020], analyzed a double compression approach, combining error compensation, a memory mechanism and model compression, with a uniform bound on the gradient variance. In this work, we provide new results on Dore-like algorithms, considering a framework without error-feedback, using tighter assumptions, and quantifying precisely the impact of data heterogeneity on the convergence.

Moreover, we focus on a heterogeneous setting: the data distribution depends on each worker (thus non i.i.d.). We explicitly control the differences between distributions. In such a setting, the local gradient at the optimal point ∇F_i(w_*) may not vanish: to get a vanishing compression error, we introduce a “memory” process [Mishchenko et al., 2019].

Finally, we encompass in Artemis Partial Participation (PP) settings, in which only some workers contribute to each step. Several challenges arise with PP, because of both heterogeneity and downlink compression. We propose a new algorithm in that setting, which improves on the approaches proposed by Sattler et al. [2019]; Tang et al. [2019] for bidirectional compression.

Assumptions made on the gradient oracle directly influence the convergence rate of the algorithm: in this paper, we neither assume that the gradients are uniformly bounded [as in Zheng et al., 2019] nor that their variance is uniformly bounded [as in Alistarh et al., 2017; Mishchenko et al., 2019; Liu et al., 2020; Tang et al., 2019; Horváth et al., 2019]: instead, we only assume that the variance is bounded by a constant σ²_* at the optimal point w_*, and provide linear convergence rates up to a threshold proportional to σ²_* (as in [Dieuleveut et al., 2019; Gower et al., 2019] for non-distributed optimization). This is a fundamental difference, as the variance bound at the optimal point can be orders of magnitude smaller than the uniform bound used in previous work: this is striking when all loss functions have the same critical point, and thus the noise at the optimal point is null! This happens for example in the interpolation regime, which has recently gained importance in the machine learning community [Belkin et al., 2019]. As the empirical risk at the optimal point is null or very close to zero, so are all the loss functions with respect to one example. This is often the case in deep learning [e.g., Zhang et al., 2017] or in large dimension regression [Mei & Montanari, 2019].

Overall, we make the following contributions:

1. We describe a framework – Artemis – that encompasses 6 algorithms (with or without up/down compression, with or without memory) and that is extended to partial participation of devices. We provide and analyze in Theorem 1 a fast rate of convergence – exponential convergence up to a threshold proportional to σ²_* –, obtaining tighter bounds than in [Alistarh et al., 2017; Mishchenko et al., 2019; Liu et al., 2020].

2. We explicitly tackle heterogeneity using Assumption 4, proving that the limit variance of Artemis with memory is independent from the difference between distributions (as for SGD). This is the first theoretical guarantee for double compression that explicitly quantifies the impact of non-i.i.d. data.

3. We propose a novel algorithm in the case of partial device participation, leveraging the memory to improve convergence in all compression frameworks.

4. In the non strongly-convex case, we prove the convergence using Polyak-Ruppert averaging in Theorem 2.

5. We prove convergence in distribution of the iterates, and subsequently provide a lower bound on the asymptotic variance. This sheds light on the limits of (double) compression, which results in an increase of the algorithm’s variance, and can thus only accelerate the learning process for early iterations and up to a “moderate” accuracy. Interestingly, this “moderate” accuracy has to be understood with respect to the reduced noise σ²_*.

Furthermore, we support our analysis with various experiments illustrating the behavior of our new algorithm, and we provide the code to reproduce our experiments. See this repository. In Table 1, we highlight the main features and assumptions of Artemis compared to recent algorithms using compression.

The rest of the paper is organized as follows: in Section 2 we introduce the framework of Artemis. In Subsection 2.1 we describe the assumptions, and we review related work in Subsection 2.2. We then give the theoretical results in Section 3, extend the results to device sampling in Section 4, present experiments in Section 5, and finally conclude in Section 6.

2 Problem statement

We consider the problem described in Equation (1). In the convex case, we assume that there exists at least one optimal point, which we denote w_*; we also denote h^i_* = ∇F_i(w_*), for i in ⟦1, N⟧. We use ‖·‖ to denote the Euclidean norm. To solve this problem, we rely on a stochastic gradient descent (SGD) algorithm.

A stochastic gradient g^i_{k+1} is provided at iteration k in N to the device i in ⟦1, N⟧. This function is then evaluated at point w_k: to alleviate notation, we will use g^i_{k+1} = g^i_{k+1}(w_k) and g^i_{k+1,*} = g^i_{k+1}(w_*) to denote the stochastic gradient vectors at points w_k and w_* on device i. In the classical centralized framework (without compression nor partial participation of devices), SGD corresponds to:

    w_{k+1} = w_k − γ (1/N) ∑_{i=1}^{N} g^i_{k+1} ,    (2)

where γ is the learning rate. Here, we first consider the full participation case.

However, computing such a sequence would require the nodes to send either the gradient g^i_{k+1} or the updated local model to the central server (uplink communication), and the central server to broadcast back either the averaged gradient g_{k+1} or the updated global model (downlink communication). Here, in order to reduce the communication cost, we perform a bidirectional compression. More precisely, we combine two main tools: 1) an unbiased compression operator C : R^d → R^d that reduces the number of bits exchanged, and 2) a memory process that reduces the size of the signal to compress, and consequently the error [Mishchenko et al., 2019; Liu et al., 2020]. That is, instead of directly compressing the gradient, we first approximate it by the memory term and, afterwards, we compress the difference. As a consequence, the compressed term tends in expectation to zero, and the error of compression is reduced. Following Tang et al. [2019], we always broadcast gradients and never models. To distinguish the two compression operations, we denote by C_up and C_dwn the compression operators for the uplink and downlink directions, respectively. At each iteration, we thus have the following steps:

1. First, each active local node sends to the central server a compression of the gradient differences: ∆^i_k = C_up(g^i_{k+1} − h^i_k), and updates its memory term h^i_{k+1} = h^i_k + α∆^i_k, with α ∈ R*. The server recovers the approximated gradients’ values by adding the received term to the memories kept on its side.

2. Then, the central server sends back the compression of the sum of compressed gradients: Ω_{k+1} = C_dwn((1/N) ∑_{i=1}^{N} (∆^i_k + h^i_k)). No memory mechanism needs to be used, as the sum of gradients tends to zero in the absence of regularization.

The update is thus given by:

    ∀i ∈ ⟦1, N⟧ ,  ∆^i_k = C_up(g^i_{k+1} − h^i_k)
    Ω_{k+1} = C_dwn((1/N) ∑_{i=1}^{N} (∆^i_k + h^i_k))
    w_{k+1} = w_k − γ Ω_{k+1} .    (3)

Constants γ, α ∈ R* are learning rates for, respectively, the iterate sequence and the memory sequence. The adaptation of the framework in the case of device sampling is developed in Section 4. We provide the pseudo-code of Artemis in Algorithm 1 and, to ease the understanding of the Artemis implementation, we give a visual illustration of the algorithm in Figure 1.

Remark 1. Remark that we have used in Algorithm 1 the true value of p in the update (to get an unbiased estimator of the gradient), but it is obviously possible to use an estimated value p̂ in Equation (3): indeed, it is exactly equivalent to multiplying the step size γ by a factor p/p̂, thus it neither changes the practical implementation nor the theoretical analysis.
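To make the two-step scheme concrete, here is a minimal NumPy sketch (ours, not the authors' reference implementation) of one iteration of update (3) in the full participation case; `artemis_step`, `stoch_grads` and the other names are hypothetical:

```python
import numpy as np

def artemis_step(w, stoch_grads, h, C_up, C_dwn, gamma, alpha):
    """One iteration of update (3), full participation: uplink compression of
    gradient differences, downlink compression of the aggregate, memory updates.

    stoch_grads : list of stochastic gradients g^i_{k+1}(w_k), one per worker
    h           : list of memories h^i_k (identical copies on workers and server)
    C_up, C_dwn : unbiased compression operators (Assumption 5)
    """
    N = len(stoch_grads)
    # Uplink: each worker compresses the difference between its gradient and its memory.
    deltas = [C_up(stoch_grads[i] - h[i]) for i in range(N)]
    # The server reconstructs the averaged gradient using the memories it also holds.
    g_hat = sum(deltas[i] + h[i] for i in range(N)) / N
    # Downlink: the server compresses the aggregate and broadcasts it.
    omega = C_dwn(g_hat)
    # Memories are updated identically on the workers and on the server.
    for i in range(N):
        h[i] = h[i] + alpha * deltas[i]
    # Model update, performed by the server and by every worker receiving omega.
    return w - gamma * omega, h
```

With C_up = C_dwn = identity and α = 0, this reduces to the classical SGD update (2).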

As a summary, the Artemis framework encompasses in particular these four algorithms: the variant with unidirectional compression (ω_C^dwn = 0), without or with memory (α = 0 or α ≠ 0), recovers QSGD defined by Alistarh et al. [2017] and Diana proposed by Mishchenko et al. [2019]. The variant using bidirectional compression (ω_C^dwn ≠ 0) without memory (α = 0) is called Bi-QSGD. The last and most effective variant combines bidirectional compression with memory and is the one we refer to as Artemis if no precision is given. It corresponds to a simplified version of Dore without error-feedback, but this additional mechanism did not lead to any theoretical improvement [Remark 2 in Sec. 4.1., Liu et al., 2020].


Table 1: Comparison of frameworks for the main algorithms handling (bidirectional) compression. By “non i.i.d.”, we mean that the theoretical framework encompasses and explicitly quantifies the impact of data heterogeneity on convergence (Assumption 4); e.g., Dore does not assume i.i.d. workers but does not quantify differences between distributions. References: see Alistarh et al. [2017] for QSGD, Mishchenko et al. [2019] for Diana, Horváth & Richtárik [2020] for [HR20], Liu et al. [2020] for Dore and Tang et al. [2019] for DoubleSqueeze.

                     QSGD       Diana       [HR20]      Dore       DoubleSqueeze   Dist-EF-SGD   Artemis (new)
Data                 i.i.d.     non i.i.d.  non i.i.d.  i.i.d.     i.i.d.          i.i.d.        non i.i.d.
Bounded variance     Uniformly  Uniformly   Uniformly   Uniformly  Uniformly       Uniformly     At optimal point
Compression          One-way    One-way     One-way     Two-way    Two-way         Two-way       Two-way
Error compensation   –          –           –           ✓          ✓               ✓             –
Memory               –          ✓           –           ✓          –               –             ✓
Device sampling      –          –           ✓           –          –               –             ✓

Algorithm 1: Artemis – set α > 0 to use memory.
Input: Mini-batch size b, learning rates α, γ > 0, initial model w_0 ∈ R^d, operators C_up and C_dwn, M_1 and M_2 the sizes of the full/compressed gradients.
Initialization: Local memory: ∀i ∈ ⟦1, N⟧, h^i_0 = 0 (kept on both the central server and device i). Index of last participation: k_i = 0.
Output: Model w_K.
for k = 0, 1, 2, . . . , K do
    Randomly sample a set of devices S_k
    for each device i ∈ S_k do
        Catching up.
        If k − k_i > ⌊M_1/M_2⌋, send the model w_k
        Else send (Ω_j)_{j=k_i+1}^{k} and update the local model: ∀j ∈ ⟦k_i + 1, k⟧, w_j = w_{j−1} − γ Ω_{j,S_{j−1}}
        Update the index of its last participation: k_i = k
        Local training.
        Compute the stochastic gradient g^i_{k+1} = g^i_{k+1}(w_k) (with mini-batch)
        Set ∆^i_k = g^i_{k+1} − h^i_k, compress it: ∆̂^i_k = C_up(∆^i_k)
        Update the memory term: h^i_{k+1} = h^i_k + α ∆̂^i_k
        Send ∆̂^i_k to the central server
    Compute g_{k+1,S_k} = h_k + (1/(pN)) ∑_{i∈S_k} ∆̂^i_k
    Update the central memory: h_{k+1} = h_k + α (1/N) ∑_{i∈S_k} ∆̂^i_k
    Back compression: Ω_{k+1,S_k} = C_dwn(g_{k+1,S_k})
    Broadcast Ω_{k+1,S_k} to all workers.
    Update the model on the central server: w_{k+1} = w_k − γ Ω_{k+1,S_k}

Figure 1: Visual illustration of Artemis bidirectional compression at iteration k ∈ N. w_k is the model’s parameter, C_up and C_dwn are the compression operators, γ is the step size, α is the learning rate which simulates a memory mechanism of past iterates and allows the compressed values to tend to zero.


Remark 2 (Local steps). An obvious independent direction to reduce communication is to increase the number of steps performed before communication. This is the spirit of Local-SGD [Stich, 2019]. It is an interesting extension to incorporate this into our framework, which we do not consider here in order to focus on the compression insights.

In the following section, we present and discuss the assumptions on the function F, the data distribution and the compression operator.

2.1 Assumptions

We make classical assumptions on F : R^d → R.

Assumption 1 (Strong convexity). F is µ-strongly convex, that is, for all vectors w, v in R^d:

    F(v) ≥ F(w) + (v − w)^T ∇F(w) + (µ/2) ‖v − w‖² .

Note that we do not need each F_i to be strongly convex, but only F. Also remark that we only use this inequality for v = w_* in the proofs of Theorems 1 and 2.

Below, we introduce cocoercivity [see Zhu & Marcotte, 1996, for more details about this hypothesis]. This assumption implies that all (F_i)_{i∈⟦1,N⟧} are L-smooth.

Assumption 2 (Cocoercivity of stochastic gradients (in quadratic mean)). We suppose that for all k in N, the stochastic gradient functions (g^i_k)_{i∈⟦1,N⟧} are L-cocoercive in quadratic mean. That is, for all k in N, i in ⟦1, N⟧ and for all vectors w_1, w_2 in R^d, we have:

    E[‖g^i_k(w_1) − g^i_k(w_2)‖²] ≤ L ⟨∇F_i(w_1) − ∇F_i(w_2) | w_1 − w_2⟩ .

E.g., this is true under the much stronger assumption that the stochastic gradient functions (g^i_k)_{i∈⟦1,N⟧} are almost surely L-cocoercive, i.e.:

    ‖g^i_k(w_1) − g^i_k(w_2)‖² ≤ L ⟨g^i_k(w_1) − g^i_k(w_2) | w_1 − w_2⟩ .

Next, we present the assumption on the stochastic gradient’s noise. Again, we highlight that the noise is only controlled at the optimal point. To carefully control the noise processes (gradient oracle, uplink and downlink compression), we introduce three filtrations (H_k, G_k, F_k)_{k≥0}, such that w_k is H_k-measurable for any k ∈ N. Detailed definitions are given in Appendix A.2.

Assumption 3 (Noise over stochastic gradients computation). The noise over stochastic gradients at the global optimal point, for a mini-batch of size b, is bounded: there exists a constant σ_* ∈ R, s.t. for all k in N, for all i in ⟦1, N⟧, we have a.s.:

    E[‖g^i_{k+1,*} − ∇F_i(w_*)‖² | H_k] ≤ σ²_* / b .

The constant σ²_* is null, e.g., if we use deterministic (batch) gradients, or in the interpolation regime for i.i.d. observations, as discussed in the Introduction. As we have also incorporated here a mini-batch parameter, this reduces the variance by a factor b.

Unlike Diana [Mishchenko et al., 2019; Li et al., 2020], Dore [Liu et al., 2020], Dist-EF-SGD [Zheng et al., 2019] or Double-Squeeze [Tang et al., 2019], we assume that the variance of the noise is bounded only at the optimal point w_* and not at any point w in R^d. It results that if the variance is null (σ²_* = 0) at the optimal point, we obtain a linear convergence, while previous results obtain this rate solely if the variance is null at any point (i.e. only for deterministic GD). Also remark that Assumptions 2 and 3 both stand for the simplest Least-Squares Regression (LSR) setting, while the uniform bound on the gradient’s variance does not. Next, we give the assumption that links the distributions on the different machines.

Assumption 4 (Bounded gradient at w_*). There exists a constant B ∈ R_+, s.t.: (1/N) ∑_{i=1}^{N} ‖∇F_i(w_*)‖² = B² .

This assumption is used to quantify how different the distributions are on the different machines. In the streaming i.i.d. setting – D_1 = · · · = D_N and F_1 = · · · = F_N – the assumption is satisfied with B = 0. Combining Assumptions 3 and 4 results in an upper bound on the averaged squared norm of the stochastic gradients at w_*: for all k in N, a.s., (1/N) ∑_{i=1}^{N} E[‖g^i_{k+1,*}‖² | H_k] ≤ σ²_*/b + B².

Finally, compression operators can be classified into two main categories: quantization [as in Alistarh et al., 2017; Seide et al., 2014; Zhou et al., 2018; Wen et al., 2017; Reisizadeh et al., 2020; Horváth et al., 2019] and sparsification [as in Stich et al., 2018; Aji & Heafield, 2017; Alistarh et al., 2018; Khirirat et al., 2020]. The theoretical guarantees provided in this paper do not rely on a particular kind of compression, as we only consider the following assumption on the compression operators C_up and C_dwn:

Assumption 5. There exist constants ω_C^up, ω_C^dwn ∈ R*_+, such that the compression operators C_up and C_dwn verify the two following properties for all ∆ in R^d:

    E[C_up/dwn(∆)] = ∆ ,
    E[‖C_up/dwn(∆) − ∆‖²] ≤ ω_C^{up/dwn} ‖∆‖² .

In other words, the compression operators are unbiased and their variances are bounded. Note that Horváth & Richtárik [2020] have shown that using an unbiased operator leads to better performance. Unlike us, Tang et al. [2019] assume a uniformly bounded compression error, which is a much more restrictive assumption.
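As an illustration (this is not the s-quantization scheme used in the paper's experiments), a random sparsification operator satisfying Assumption 5 can be sketched as follows; the function name and the parameter p are ours:

```python
import numpy as np

def random_sparsification(x, p, rng=None):
    """Unbiased random sparsification: keep each coordinate with probability p and
    rescale it by 1/p.  Then E[C(x)] = x and E||C(x) - x||^2 = ((1 - p)/p) ||x||^2,
    so Assumption 5 holds with omega_C = (1 - p)/p (smaller p = harsher compression)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < p          # one Bernoulli(p) draw per coordinate
    return np.where(mask, x / p, 0.0)
```

Such an operator could be passed as C_up or C_dwn in the iteration sketched in Section 2 above.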


We now provide additional details on related papers dealing with compression. Also note that ω_C^{up/dwn} can be considered as parameters of the algorithm, as the compression levels can be chosen.

2.2 Related work on compression

Quantization is a common method for compression and is used in various algorithms. For instance, Seide et al. [2014] were among the first to propose to quantize each gradient component to either −1 or 1. This approach has been extended in Karimireddy et al. [2019]. Alistarh et al. [2017] define a new algorithm – QSGD – which, instead of sending gradients, broadcasts their quantized version, getting robust results with this approach. On top of gradient compression, Wu et al. [2018] add an error compensation mechanism which accumulates quantization errors and corrects the gradient computation at each iteration. Diana [introduced in Mishchenko et al., 2019] introduces a “memory” term in place of accumulating errors. Li et al. [2020] extend this algorithm and improve its convergence by using an accelerated gradient descent. Reisizadeh et al. [2020] combine unidirectional quantization with device sampling, leading to a framework closer to Federated Learning settings where devices can easily be switched off. In the same perspective, Horváth & Richtárik [2020] detail results that also consider device partial participation. Tang et al. [2019] are the first to suggest a bidirectional compression scheme for a decentralized network. For both uplink and downlink, the method consists in sending a compression of gradients combined with an error compensation. Bidirectional compression is later developed only in Yu et al. [2019]; Liu et al. [2020]; Zheng et al. [2019]. Instead of compressing gradients, Yu et al. [2019] choose to compress models. This approach is enhanced by Liu et al. [2020], who combine model compression with a memory mechanism and an error compensation drawing from Mishchenko et al. [2019]. Both Tang et al. [2019] and Zheng et al. [2019] compress gradients without using a memory mechanism. However, as proved in the following section, memory is key to reducing the asymptotic variance in the heterogeneous case.

We now provide theoretical results about the convergence of bidirectional compression.

3 Theoretical results

In this section, we present our main theoretical results on the convergence of Artemis and its variants. For the sake of clarity, the most complete and tightest versions of the theorems are given in the Appendices, and simplified versions are provided here. The main linear convergence rates are given in Theorem 1.

Table 2: Details on constants C and E defined in Theorem 1. C = 0 for α = 0; see Th. S6 for α ≠ 0.

α        E
α = 0    (ω_C^dwn + 1) ((ω_C^up + 1) σ²_*/b + ω_C^up B²)
α ≠ 0    (σ²_*/b) ((2ω_C^up + 1)(ω_C^dwn + 1) + 4α²C(ω_C^up + 1) − 2αC)

In Theorem 2 we show that Artemis combined with Polyak-Ruppert averaging reaches a sub-linear convergence rate. In this section, we denote δ²_0 = ‖w_0 − w_*‖².

Theorem 1 (Convergence of Artemis). Under Assumptions 1 to 5, for a step size γ satisfying the conditions in Table 3, for a learning rate α and for any k in N, the mean squared distance to w_* decreases at a linear rate up to a constant of the order of E:

    E[‖w_k − w_*‖²] ≤ (1 − γµ)^k (δ²_0 + 2Cγ²B²) + 2γE/(µN) ,

for constants C and E depending on the variant (independent of k) given in Table 2 or in the appendix. Variants with α ≠ 0 require α ∈ [1/(2(ω_C^up + 1)), α_max]; the upper bound α_max is given in Theorem S6.

This theorem is derived from Theorems S5 and S6, which are respectively proved in Appendices E.1 and E.2. We can make the following remarks:

1. Linear convergence. The convergence rate given in Theorem 1 can be decomposed into two terms: a bias term, forgotten at linear speed (1 − γµ)^k, and a residual variance term which corresponds to the saturation level of the algorithm. The rate of convergence (1 − γµ) does not depend on the variant of the algorithm. However, the variance and initial bias do vary.

2. Bias term. The initial bias always depends on ‖w_0 − w_*‖², and when using memory (i.e. α ≠ 0) it also depends on the difference between distributions (constant B²).

3. Variance term and memory. On the other hand, the variance depends a) on both σ²_*/b and the distributions’ difference B² without memory, b) only on the gradients’ variance at the optimum σ²_*/b with memory. Similar theorems in the related literature [Liu et al., 2020; Alistarh et al., 2017; Mishchenko et al., 2019; Yu et al., 2019; Tang et al., 2019; Zheng et al., 2019] systematically had a worse bound for the variance term, depending on a uniform bound of the noise variance or under much stronger conditions on the compression operator. This paper and [Liu et al., 2020] are also the first to give a linear convergence up to a threshold for bidirectional compression.


Table 3: Upper bound γ_max on the step size γ to guarantee convergence. For unidirectional compression (resp. no compression), ω_C^dwn = 0 (resp. ω_C^{up/dwn} = 0, recovering classical rates for SGD).

Memory            α = 0                          α ≠ 0
N ≫ ω_C^up        1 / ((ω_C^dwn + 1)L)           1 / (2(ω_C^dwn + 1)L)
N ≈ ω_C^up        1 / (3(ω_C^dwn + 1)L)          1 / (5(ω_C^dwn + 1)L)
ω_C^up ≫ N        N / (2ω_C^up(ω_C^dwn + 1)L)    N / (4ω_C^up(ω_C^dwn + 1)L)

4. Impact of memory. To the best of our knowledge, this is the first work on double compression that explicitly tackles the non-i.i.d. case. We prove that memory makes the saturation threshold independent of B² for Artemis.

5. Variance term. The variance term increases by a factor proportional to ω_C^up for unidirectional compression, and proportional to ω_C^up × ω_C^dwn for bidirectional. This is the counterpart of compression, each compression resulting in a multiplicative factor on the noise. A similar increase in the variance appears in [Mishchenko et al., 2019] and [Liu et al., 2020]. The noise level is attenuated by the number of devices N, to which it is inversely proportional.

6. Link with classical SGD. For the variant of Artemis with α = 0, if ω_C^{up/dwn} = 0 (i.e. no compression) we recover the SGD results: convergence does not depend on B², but only on the noise’s variance (as made explicit below).
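To make this explicit, plugging ω_C^up = ω_C^dwn = 0 and α = 0 into Table 2 gives C = 0 and E = σ²_*/b, so the bound of Theorem 1 specializes to the usual SGD-type rate (a direct substitution, written in LaTeX):

\[
\mathbb{E}\big[\|w_k - w_*\|^2\big] \;\le\; (1-\gamma\mu)^k\,\delta_0^2 \;+\; \frac{2\gamma\sigma_*^2}{\mu N b}.
\]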

Conclusion: Overall, it appears that Artemis is able to efficiently accelerate the learning during the first iterations, enjoying the same linear rate as SGD with a lower communication complexity, but it saturates at a higher level, proportional to σ²_* and independent of B².

The range of acceptable learning rates is an important feature for first order algorithms. In Table 3, we summarize the upper bound γ_max on γ to guarantee a (1 − γµ) convergence of Artemis. These bounds are derived from Theorems S5 and S6, in three main asymptotic regimes: N ≫ ω_C^up, N ≈ ω_C^up and ω_C^up ≫ N. Using bidirectional compression impacts γ_max by a factor ω_C^dwn + 1 in comparison to unidirectional compression. For unidirectional compression, if the number of machines is at least of the order of ω_C^up, then γ_max nearly corresponds to γ_max for vanilla (serial) SGD.

We now provide a convergence guarantee for the averaged iterate without strong convexity.

Theorem 2 (Convergence of Artemis with Polyak-Ruppert averaging). Under Assumptions 2 to 6 (convex case), with constants C and E as in Theorem 1 (see Table 2 for details), after running K in N iterations, for a learning rate γ = min(√(Nδ²_0/(2EK)); γ_max), with γ_max as in Table 3, we have a sublinear convergence rate for the Polyak-Ruppert averaged iterate w̄_K = (1/K) ∑_{k=0}^{K} w_k, with ε_F(w̄_K) = F(w̄_K) − F(w_*):

    ε_F(w̄_K) ≤ 2 max(√(2δ²_0 E/(NK)); δ²_0/(γ_max K)) + 2γ_max C B²/K .

This theorem is proved in Appendix E.3. Several comments can be made on this theorem:

1. Importance of averaging. This is the first theorem given for averaging with double compression. In the context of convex optimization, averaging has been shown to be optimal [Rakhlin et al., 2012].

2. Speed of convergence, if σ_* = 0, B ≠ 0, K → ∞. For α ≠ 0, E = 0, while for α = 0, E ∝ B². Memory thus accelerates the convergence from a rate O(K^{-1/2}) to O(K^{-1}).

3. Speed of convergence, general case. More generally, we always get a K^{-1/2} sublinear speed of convergence, and a faster rate K^{-1} when using memory and if E ≤ δ²_0 N/(2Kγ²_max) – i.e. in the context of a low noise σ²_*, as E ∝ σ²_*. Again, it appears that bi-compression is mostly useful in low-σ²_* regimes or during the first iterations: intuitively, for a fixed communication budget, while bi-compression allows to perform min(ω_C^up, ω_C^dwn)-times more iterations, this is no longer beneficial if the convergence rate is dominated by √(2δ²_0 E/(NK)), as E increases by a factor ω_C^up × ω_C^dwn.

4. Memoryless case, impact of minibatch. In the variant of Artemis without memory, the asymptotic convergence rate is √(2δ²_0 E/(NK)) with the constant E ∝ σ²_*/b + B²: interestingly, it appears that in the case of non-i.i.d. data (B² > 0), the convergence rate saturates when the size of the mini-batch increases: large mini-batches do not help. On the contrary, with memory, the variance is, as classically, reduced by a factor proportional to the size of the batch, without saturation.

The increase in the variance (in item 3) is not an artifact of the proof: we prove the existence of a limit distribution for the iterates of Artemis, and analyze its variance. More precisely, we show a linear rate of convergence for the distribution Θ_k of w_k (launched from w_0), w.r.t. the Wasserstein distance W_2 [Villani, 2009]: this gives us a lower bound on the asymptotic variance. Here, we further assume that the compression operator is stochastic sparsification [Wen et al., 2017].

Theorem 3 (Convergence in distribution and lower bound on the variance). Under Assumptions 1 to 5 (full participation setting), for γ, α, E given in Theorem 1 and Table 3:

1. There exists a limit distribution π_{γ,v}, depending on the variant v of the algorithm, s.t. for any k ≥ 1, W_2(Θ_k, π_{γ,v}) ≤ (1 − γµ)^k C_0, with C_0 a constant.

2. When k goes to ∞, the second order moment E[‖w_k − w_*‖²] converges to E_{w∼π_{γ,v}}[‖w − w_*‖²], which is lower bounded by Ω(γE/(µN)), as in Theorem 1, as γ → 0, with E depending on the variant.

Interpretation. The second point (2.) means that the upper bound on the saturation level provided in Theorem 1 is tight w.r.t. σ²_*, ω_C^up, ω_C^dwn, B², N and γ. Especially, it proves that there is indeed a quadratic increase in the variance w.r.t. ω_C^up and ω_C^dwn when using bidirectional compression (which is itself rather intuitive). Altogether, these three theorems prove that bidirectional compression can become strictly worse than usual stochastic gradient descent in high precision regimes, a fact of major importance in practice and barely (if ever) even mentioned in previous literature. To the best of our knowledge, only Mayekar & Tyagi [2020] give a lower bound on the asymptotic variance for algorithms using compression. Their result is more general, i.e., valid for any algorithm using unidirectional compression, but weaker (the worst case on the oracle does not highlight the importance of the noise at the optimal point and is incompatible with linear rates).

Proof and assumptions. For the second point, this theorem also naturally requires Assumptions 3 to 5 to be “tight”: that is, e.g. Var(g^i_{k+1,*}) ≥ Ω(σ²_*/b); more details and the proof, inspired from Dieuleveut et al. [2019], are given in Appendix E.4. Extension to other types of compression turns out to be surprisingly non-simple; it is thus out of the scope of this paper and a promising direction.

4 Partial Participation

In this section we extend our work to the Partial Participation (PP) setting, considering Assumption 6.

Assumption 6. At each round k in N, each device has a probability p of participating, independently from the other workers, i.e., there exists a sequence (B^i_k)_{k,i} of i.i.d. Bernoulli random variables B(p), such that for any k and i, B^i_k marks if device i is active at step k. We denote S_k = {i ∈ ⟦1, N⟧ : B^i_k = 1} the set of active devices at round k, and N_k = card(S_k) the number of active workers.

There are two approaches to extend the update rule. The first one (we will refer to it as PP1) is the most intuitive. It consists in keeping all memories (h^i_k)_{1≤i≤N} on the central server. This way, we can reconstruct at each iteration k in N and for each device i in ⟦1, N⟧ the stochastic gradient g^i_{k+1} as ∆^i_k + h^i_k. The update equation is w_{k+1} = w_k − γ C_dwn((1/(pN)) ∑_{i∈S_k} (∆^i_k + h^i_k)), and the memories are updated as on the local workers. If both compression levels are set to 0, this approach recovers classical SGD with PP: w_{k+1} = w_k − (γ/(pN)) ∑_{i∈S_k} g^i_k(w_k).

It also corresponds to the proposition of both Sattleret al. [2019] and Tang et al. [2019, v2 on arxiv for thedistributed case]. However, it has two important draw-backs. First, we have to store N additional memorieson the central server, which may have a huge cost. Sec-ondly, this method saturates, even with deterministicgradients σ2

unif = 0 and no compression ωup/dwnC = 0

(see Figure 5). Indeed, the partial participation in-duces an additive noise term, in other words the vari-ance of the noise at the optimum point is not null asVar( 1

Np

∑i∈Sk ∇Fi(w∗)) = 1−p

N2p

∑Ni=1 ‖∇Fi(w∗)‖

2=

(1−p)B2

Np . This happens for all compression regimes inArtemis, even for SGD.
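For completeness, this identity follows from the independence of the Bernoulli variables B^i_k of Assumption 6 and from Assumption 4, writing Var(X) for E‖X − E[X]‖² (a short derivation, in LaTeX):

\[
\operatorname{Var}\Big(\frac{1}{Np}\sum_{i\in S_k}\nabla F_i(w_*)\Big)
= \frac{1}{N^2p^2}\sum_{i=1}^{N}\operatorname{Var}(B^i_k)\,\|\nabla F_i(w_*)\|^2
= \frac{p(1-p)}{N^2p^2}\sum_{i=1}^{N}\|\nabla F_i(w_*)\|^2
= \frac{(1-p)B^2}{Np}.
\]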

We consider a novel approach (denoted PP2) that leverages the full potential of the memory, solving simultaneously the convergence issue and the need for additional memory resources. At each iteration k, this approach keeps a single memory h_k (instead of N memories) on the central server. This memory is updated at each step: h_{k+1} = h_k + (α/N) ∑_{i∈S_k} ∆^i_k, and the update equation becomes: w_{k+1} = w_k − γ C_dwn(h_k + (1/(pN)) ∑_{i∈S_k} ∆^i_k).

This difference is far from being insignificant. Indeed, in order to reconstruct the broadcast signal, we use the memory built on all devices during the previous iterations, even if device i in ⟦1, N⟧ was not active! The impact of this approach is major, as it follows that the algorithms, even when using bidirectional compression, can be faster than classical SGD. In this setting, SGD with memory (Artemis with no compression and memory) is the benchmark: see Figure 6.
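A minimal server-side sketch of the PP2 update described above (names such as `pp2_server_step` and `received_deltas` are ours; this is only a schematic view, not the authors' implementation):

```python
def pp2_server_step(w, h_central, received_deltas, p, N, C_dwn, gamma, alpha):
    """PP2 server update: a single central memory h_k is used to reconstruct the
    signal, including the contribution of workers inactive at this round.

    received_deltas : dict {i: Delta_i_k} for the active workers i in S_k
    p               : participation probability of Assumption 6
    """
    sum_deltas = sum(received_deltas.values())        # sum over active workers only
    g_hat = h_central + sum_deltas / (p * N)          # debiased by 1/(pN)
    omega = C_dwn(g_hat)                              # downlink compression, then broadcast
    h_central = h_central + alpha * sum_deltas / N    # single central memory update
    return w - gamma * omega, h_central
```

The worker-side steps (compressing g^i_{k+1} − h^i_k and updating the local memory) are as in Algorithm 1.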

Theorem 4 (Artemis with partial participation). Under the same assumptions and constraints on γ and α, when considering Assumption 6 (partial participation), Theorem 1 is still valid for PP2 with memory, with:

    E = (ω_C^dwn + 1) (2(ω_C^up + 1)/p − 1) σ²_*/b + (2/p) C (2α²(ω_C^up + 1) − α) σ²_*/b .

The most important observation is that we again recover a linear convergence rate if σ_* = 0. By contrast, even with memory and without compression, for PP1 there is an extra B²(1 − p)/(Np) term.

Remark 3 (Impact of downlink compression). With both of these approaches, we still need to synchronize the model in order to compute the stochastic gradient on the same up-to-date model for each worker. Thus, a newly active worker must catch up on the sequence of missed updates (Ω_k)_k. Of course, if a device has been inactive for too many iterations, we send the full model instead of the sequence of missed updates. More precisely, suppose that the number of bits needed to send the full (resp. the compressed) model is M_1 (resp. M_2); then we keep exactly the ⌊M_1/M_2⌋ last updates in memory, together with the index of the last participation of each device. At each round, if the number of missed updates is higher than the ⌊M_1/M_2⌋ kept updates, we send the full model; otherwise we send the sequence of missed updates. Thus, this need for synchronization does not require additional computational resources. This mechanism is described in Algorithm 1 and does not interfere with the theoretical analysis.
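The catching-up rule of Remark 3 can be sketched as follows (a hypothetical helper; `catch_up_payload` and its arguments are our names):

```python
def catch_up_payload(k, k_i, kept_updates, full_model, M1, M2):
    """Decide what the server sends to a worker that reconnects at round k.

    k_i          : last round at which this worker participated
    kept_updates : the floor(M1/M2) most recent broadcast updates, most recent last
    M1, M2       : number of bits of the full model and of one compressed update
    """
    budget = M1 // M2
    missed = k - k_i
    if missed > budget:
        # Too many missed updates: sending the full model is cheaper.
        return ("model", full_model)
    # Otherwise send the sequence of missed updates Omega_j, j = k_i+1, ..., k.
    return ("updates", kept_updates[-missed:] if missed > 0 else [])
```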

5 Experiments

In this section, we illustrate our theoretical guarantees on both synthetic and real datasets. Because the main goal of this section is to confirm the theoretical findings in Theorems 1 to 4, and to underline the impact of the memory, we focus on four of the algorithms covered by our framework: Artemis with bidirectional compression (simply denoted Artemis), then QSGD, Diana, Bi-QSGD, and also usual SGD without any compression. We also compare Artemis with Double-Squeeze, Dore, FedSGD and FedPAQ [see Reisizadeh et al., 2020] in the Appendix (see Figure S15). We also perform experiments with optimized learning rates (see Figure S18).

In all experiments, we display the excess error F(w_t) − F(w_*) w.r.t. the number of iterations t or the number of communicated bits. We use a 1-quantization scheme (defined in Appendix A.1) for all experiments; with s = 1, this is the most drastic compression. Curves are averaged over 5 runs, and we plot error bars on all figures. To compute the error bars, we use the standard deviation of the logarithmic difference between the loss function at iteration k and the objective loss, that is, we take the standard deviation of log_10(F(w_k) − F(w_*)). We then plot the curve ± this standard deviation.

We first consider two simple synthetic datasets: one for least-squares regression (with the same distribution over each machine), and one for logistic regression (with varying distributions across machines); more details on the way data is generated are given in Appendix C. We use N = 20 devices, each holding 200 points of dimension d = 20, and run the algorithms over 100 epochs.

We then consider two real-world datasets to illustrate the theorems on real data and in higher dimension: superconduct [see Hamidieh, 2018, with 21 263 points and 81 features] and quantum [see Caruana et al., 2004, with 50 000 points and 65 features], with N = 20 workers. To simulate non-i.i.d. and unbalanced workers, we split the datasets into heterogeneous groups, using a Gaussian mixture clustering on TSNE representations (defined by Maaten & Hinton [2008]). Thus, the distribution and the number of points held by each worker largely differ between devices, see Figures 2 and S5.

Figure 2: TSNE representation and clusters for quantum.

Convergence. Figure 3a presents the convergence of each algorithm w.r.t. the number of iterations k. During the first iterations, all algorithms make fast progress. However, because σ²_* ≠ 0, all algorithms saturate, and the saturation level is higher for double compression (Artemis, Bi-QSGD) than for simple compression (Diana, QSGD) or for SGD. This corroborates the findings in Theorems 1 and 3.

Complexity. In Figures 4 to 6, the loss is plotted w.r.t. the theoretical number of bits exchanged after k iterations for the quantum and superconduct datasets. This confirms that double compression should be the method of choice to achieve a reasonable precision (w.r.t. σ_*): for high precision, a simple method like SGD results in a lower complexity.

Linear convergence under null variance at the optimum. To highlight the significance of our new condition on the noise, we compare σ²_* ≠ 0 and σ²_* = 0 in Figure 3. Saturation is observed in Figure 3a, but if we consider a situation in which σ²_* = 0, and where the uniform bound on the gradient’s variance is not null (as opposed to the experiments in Liu et al. [2020], who consider batch gradient descent), a linear convergence rate is observed. This illustrates that our new condition is sufficient to reach a linear convergence. Comparing Figure 3a with Figure S8a sheds light on the fact that the saturation level (before which double compression is indeed beneficial) is truly proportional to the noise variance at the optimal point, i.e. σ²_*. And when σ²_* = 0, bidirectional compression is much more effective than the other methods (see Figure S8 in Appendix C.1.1).

Heterogeneity and real datasets. While in Figure 3a, data is i.i.d. on the machines, and Artemis is thus not expected to outperform Bi-QSGD (the difference between the two being the memory), in Figures 3b and 4 to 6 we use non-i.i.d. data distributed in an unbalanced manner. None of the previous papers on compression directly illustrated the impact of heterogeneity on simple examples, nor compared it with i.i.d. situations.

Figure 3: (a) LSR (i.i.d.): σ²_* ≠ 0; (b) LR (non-i.i.d.): σ_* = 0. Left: illustration of the saturation when σ_* ≠ 0 and data is i.i.d.; right: illustration of the memory benefits when σ_* = 0 but data is non-i.i.d.

Figure 4: (a) Superconduct; (b) Quantum. Real datasets (non-i.i.d.): σ_* ≠ 0, N = 20 workers, p = 1, b > 1 (1000 iter.). X-axis in # bits.

Figure 5: (a) Superconduct; (b) Quantum. Partial participation – PP1 (non-i.i.d.): σ_* = 0 (same experiments in stochastic regime: Figure S13), N = 20 workers, p = 0.5, b > 1 (1000 iter.). X-axis in # bits.

Figure 6: (a) Superconduct; (b) Quantum. Partial participation – PP2 (non-i.i.d.): σ_* = 0 (same experiments in stochastic regime: Figure S14), N = 20 workers, p = 0.5, b > 1 (1000 iter.). X-axis in # bits.

Partial participation. In Figures 5 and 6 we run experiments with only half of the devices active (randomly sampled) at each iteration, with the two approaches (PP1 and PP2) described in Section 4, in a full gradient regime (the same experiments in the stochastic regime are given in Appendix C.2.1). These two figures emphasize the failure of the first approach PP1 compared to our new algorithm: only in this last case does Artemis achieve a linear convergence. They also stress the key role of the memory in this setting. In Figure 6, SGD with memory (and even Artemis, which uses bidirectional compression) is better than SGD without memory.

6 Conclusion

We propose Artemis, a framework using bidirectional compression to reduce the number of bits needed to perform distributed or federated learning. On top of compression, Artemis includes a memory mechanism which improves convergence over non-i.i.d. data. As partial participation is a classical setting, our framework also handles this use-case. We provide three tight theorems giving guarantees of a fast convergence (linear up to a threshold), highlighting the impact of memory, analyzing Polyak-Ruppert averaging and obtaining lower bounds by studying the convergence in distribution of our algorithm. Altogether, this improves the understanding of compression combined with a memory mechanism and sheds light on challenges ahead.

Acknowledgments

We would like to thank Richard Vidal and Laeticia Kameni from Accenture Labs (Sophia Antipolis, France) and Eric Moulines from École Polytechnique for interesting discussions. This research was supported by the SCAI: Statistics and Computation for AI ANR Chair of research and teaching in artificial intelligence, and by Accenture Labs (Sophia Antipolis, France).

Bibliography

Speedtest Global Index – Monthly comparisons of internet speeds from around the world.

Agarwal, N., Suresh, A. T., Yu, F. X. X., Kumar, S., and McMahan, B. cpSGD: Communication-efficient and differentially-private distributed SGD. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7564–7575. Curran Associates, Inc., 2018.

Aji, A. F. and Heafield, K. Sparse Communication for Distributed Gradient Descent. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 440–445, 2017. doi: 10.18653/v1/D17-1045. arXiv: 1704.05021.


Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. Advances in Neural Information Processing Systems, 30:1709–1720, 2017.

Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. The Convergence of Sparsified Gradient Methods. Advances in Neural Information Processing Systems, 31:5973–5983, 2018.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

Bottou, L. On-line learning and stochastic approximations. 1999. doi: 10.1017/CBO9780511569920.003.

Caruana, R., Joachims, T., and Backstrom, L. KDD-Cup 2004: results and analysis. ACM SIGKDD Explorations Newsletter, 6(2):95–108, December 2004. ISSN 1931-0145. doi: 10.1145/1046456.1046470.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q., and Ng, A. Large Scale Distributed Deep Networks. Advances in Neural Information Processing Systems, 25, 2012.

Dieuleveut, A., Durmus, A., and Bach, F. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Annals of Statistics, 2019.

Elias, P. Universal codeword sets and representations of the integers, September 1975.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. SGD: General Analysis and Improved Rates. In International Conference on Machine Learning, pp. 5200–5209. PMLR, May 2019. ISSN: 2640-3498.

Hamidieh, K. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346–354, November 2018. ISSN 0927-0256. doi: 10.1016/j.commatsci.2018.07.052.

Horváth, S. and Richtárik, P. A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning. arXiv:2006.11077 [cs, stat], June 2020. arXiv: 2006.11077.

Horváth, S., Kovalev, D., Mishchenko, K., Stich, S., and Richtárik, P. Stochastic Distributed Learning with Gradient Quantization and Variance Reduction. arXiv:1904.05115 [math], April 2019. arXiv: 1904.05115.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F. d., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs, stat], December 2019. arXiv: 1912.04977.

Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. In International Conference on Machine Learning, pp. 3252–3261. PMLR, May 2019. ISSN: 2640-3498.

Khirirat, S., Magnússon, S., Aytekin, A., and Johansson, M. Communication Efficient Sparsification for Large Scale Machine Learning. arXiv:2003.06377 [math, stat], March 2020. arXiv: 2003.06377.

Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv:1610.02527 [cs], October 2016. arXiv: 1610.02527.

Lannelongue, L., Grealey, J., and Inouye, M. Green Algorithms: Quantifying the carbon emissions of computation. arXiv:2007.07610 [cs], October 2020. arXiv: 2007.07610.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation, OSDI’14, pp. 583–598, USA, October 2014. USENIX Association. ISBN 978-1-931971-16-4.

Li, Z., Kovalev, D., Qian, X., and Richtárik, P. Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization. In International Conference on Machine Learning, pp. 5895–5904. PMLR, November 2020. ISSN: 2640-3498.

Liu, X., Li, Y., Tang, J., and Yan, M. A Double Residual Compression Algorithm for Efficient Distributed Learning. In International Conference on Artificial Intelligence and Statistics, pp. 133–143, June 2020. ISSN: 1938-7228.

Maaten, L. v. d. and Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008. ISSN 1533-7928.

Mayekar, P. and Tyagi, H. RATQ: A Universal Fixed-Length Quantizer for Stochastic Optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1399–1409. PMLR, June 2020. ISSN: 2640-3498.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, April 2017. ISSN: 2640-3498.

Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

Meyn, S. and Tweedie, R. Markov Chains and Stochastic Stability. Cambridge University Press, New York, NY, USA, 2nd edition, 2009. ISBN 9780521731829.

Mishchenko, K., Gorbunov, E., Takáč, M., and Richtárik, P. Distributed Learning with Compressed Gradient Differences. arXiv:1901.09269 [cs, math, stat], June 2019.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer US, 2004. ISBN 978-1-4020-7553-7. doi: 10.1007/978-1-4419-8853-9.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. ICML, 2012.

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021–2031. PMLR, June 2020. ISSN: 2640-3498.

Robbins, H. and Monro, S. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3):400–407, September 1951. ISSN 0003-4851. doi: 10.1214/aoms/1177729586.

Sattler, F., Wiedemann, S., Müller, K.-R., and Samek, W. Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2019. ISSN 2162-2388. doi: 10.1109/TNNLS.2019.2944481.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. pp. 1058–1062, January 2014.

Stich, S. U. Local SGD Converges Fast and Communicates Little. ICLR, May 2019. arXiv: 1805.09767.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with Memory. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4447–4458. Curran Associates, Inc., 2018.

Strom, N. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

Tang, H., Yu, C., Lian, X., Zhang, T., and Liu, J. DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression. In International Conference on Machine Learning, pp. 6155–6165. PMLR, May 2019. ISSN: 2640-3498.

Villani, C. Optimal transport: old and new. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3. URL http://opac.inria.fr/record=b1129524.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 1509–1519. Curran Associates, Inc., 2017.

Wu, J., Huang, W., Huang, J., and Zhang, T. Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization. In International Conference on Machine Learning, pp. 5325–5333. PMLR, July 2018. ISSN: 2640-3498.

Yu, Y., Wu, J., and Huang, L. Double Quantization for Communication-Efficient Distributed Optimization. In Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F. d., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 4438–4449. Curran Associates, Inc., 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

Zheng, S., Huang, Z., and Kwok, J. Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback. In Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F. d., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 11450–11460. Curran Associates, Inc., 2019.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv:1606.06160 [cs], February 2018.

Zhu, D. L. and Marcotte, P. Co-Coercivity and Its Role In the Convergence of Iterative Schemes For Solving Variational Inequalities, March 1996.

Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees.

Supplementary material

In this appendix, we provide additional details on our work. In Appendix A, we give more details on Artemis: we describe the s-quantization scheme used in our experiments and we define the filtrations used in the following demonstrations. Secondly, in Appendix B, we analyze at a finer level the bandwidth speeds across the world, to get a better intuition of the state of worldwide internet usage. Thirdly, in Appendix C, we present the detailed framework of our experiments and give further illustrations of our theorems. In Appendix D, we gather a few technical results and introduce the lemmas required in the proofs of the main results. Those proofs are finally given in Appendix E. More precisely, Theorem 1 follows from Theorems S5 and S6, which are proved in Appendices E.1 and E.2, while Theorems 2 and 3 are respectively proved in Appendices E.3 and E.4.

Contents

A Additional details about the Artemis framework
  A.1 Quantization scheme
  A.2 Filtrations

B Bandwidth speed

C Experiments
  C.1 Synthetic dataset
  C.2 Real dataset: superconduct and quantum
  C.3 CPU usage and Carbon footprint

D Technical Results
  D.1 Useful identities and inequalities
  D.2 Lemmas for proof of convergence

E Proofs of Theorems
  E.1 Proof of main Theorem for Artemis - variant without memory
  E.2 Proof of main Theorem for Artemis - variant with memory
  E.3 Proof of Theorem 2 - Polyak-Ruppert averaging
  E.4 Proof of Theorem 3 - convergence in distribution

A Additional details about the Artemis framework

The aim of this section is twofold. First, we provide supplementary details about the quantization scheme used in our work; we also explain (based on Alistarh et al. [2017]) how quantization combined with the Elias code [see Elias, 1975] reduces the number of bits required to send information. Secondly, we define the filtrations used in our proofs and give their resulting properties.

A.1 Quantization scheme

In the following, we define the s-quantization operator $\mathcal{C}_s$ used in our experiments. After giving its definition, we explain [based on Alistarh et al., 2017] how it helps to reduce the number of bits to broadcast.

Definition 1 (s-quantization operator). Given $\Delta \in \mathbb{R}^d$, the s-quantization operator $\mathcal{C}_s$ is defined by:
$$\mathcal{C}_s(\Delta) := \operatorname{sign}(\Delta) \times \|\Delta\|_2 \times \frac{\psi}{s},$$
where $\psi \in \mathbb{R}^d$ is a random vector with $j$-th element defined as:
$$\psi_j := \begin{cases} l + 1 & \text{with probability } s\,\frac{|\Delta_j|}{\|\Delta\|_2} - l, \\ l & \text{otherwise,} \end{cases}$$
where the level $l$ is such that $\frac{|\Delta_j|}{\|\Delta\|_2} \in \big[\frac{l}{s}, \frac{l+1}{s}\big]$.

The s-quantization scheme verifies Assumption 5 with $\omega_{\mathcal{C}} = \min(d/s^2, \sqrt{d}/s)$. The proof can be found in [Alistarh et al., 2017, see Appendix A.1].
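For concreteness, here is a minimal NumPy sketch of this operator (the function name is ours, not taken from the Artemis code base):

```python
import numpy as np

def s_quantization(delta: np.ndarray, s: int, rng=np.random.default_rng()) -> np.ndarray:
    """Unbiased s-level quantization of a vector, following Definition 1."""
    norm = np.linalg.norm(delta)
    if norm == 0.0:
        return np.zeros_like(delta)
    ratio = s * np.abs(delta) / norm      # lies in [0, s]
    level = np.floor(ratio)               # lower level l such that ratio is in [l, l+1]
    proba_up = ratio - level              # P(psi_j = l + 1)
    psi = level + (rng.random(delta.shape) < proba_up)
    return np.sign(delta) * norm * psi / s
```

Since $\mathbb{E}[\psi_j] = s|\Delta_j| / \|\Delta\|_2$, the operator is unbiased; with $s = 1$, each coordinate is described by its sign and a single bit indicating whether $\psi_j = 1$, plus the norm sent once in full precision.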

Now, for any vector $v \in \mathbb{R}^d$, we are in possession of the tuple $(\|v\|_2, \phi, \psi)$, where $\phi$ is the vector of signs of $(v_j)_{j=1}^d$ and $\psi$ is the vector of integer values $(\psi_j)_{j=1}^d$. To broadcast the quantized value, we use the Elias encoding [Elias, 1975]. Using this encoding scheme, it can be shown (Theorem 3.2 of Alistarh et al. [2017]) that:

Proposition S1. For any vector $v$, the number of bits needed to communicate $\mathcal{C}_s(v)$ is upper bounded by:
$$\left(3 + \left(\frac{3}{2} + o(1)\right)\log\left(\frac{2(s^2 + d)}{s(s + \sqrt{d})}\right)\right) s(s + \sqrt{d}) + 32.$$

The final goal of using memory for compression is to quantize vectors with $s = 1$. It means that we will employ $O(\sqrt{d}\log d)$ bits per iteration instead of $32d$, which reduces the number of bits used per iteration by a factor $\frac{\sqrt{d}}{\log d}$. Now, in a FL setting, at each iteration we have a double communication (device to the main server, main server to the device) for each of the $N$ devices. It means that, at each iteration, we need to communicate $2 \times N \times 32d$ bits if compression is not used. With a single compression process like in Mishchenko et al. [2019]; Li et al. [2020]; Wu et al. [2018]; Agarwal et al. [2018]; Alistarh et al. [2017], we need to broadcast
$$O\big(32Nd + N\sqrt{d}\log d\big) = O\Big(Nd\Big(1 + \frac{\log d}{\sqrt{d}}\Big)\Big) = O(Nd).$$
But with a bidirectional compression, we only need to broadcast $O\big(2N\sqrt{d}\log d\big)$.
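These orders of magnitude can be made concrete with a back-of-the-envelope computation; the sketch below (function and constants are ours, and the compressed cost is only kept at the order $\sqrt{d}\log d$) compares the three regimes:

```python
import math

def bits_per_iteration(N: int, d: int, compress_up=False, compress_down=False) -> float:
    """Rough per-iteration communication budget in bits: 32-bit floats when a direction
    is not compressed, and the O(sqrt(d) log d) cost of 1-quantization + Elias coding
    when it is (orders of magnitude only, constants ignored)."""
    compressed = math.sqrt(d) * math.log2(d)
    up = N * (compressed if compress_up else 32 * d)
    down = N * (compressed if compress_down else 32 * d)
    return up + down

N, d = 100, 10_000
print(bits_per_iteration(N, d))                                        # no compression
print(bits_per_iteration(N, d, compress_up=True))                      # unidirectional
print(bits_per_iteration(N, d, compress_up=True, compress_down=True))  # bidirectional
```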

Time complexity analysis of simple vs double compression for the 1-quantization scheme. Using quantization with $s = 1$, and then the Elias code [defined in Elias, 1975] to communicate between servers, reduces the number of bits to send from $O(Nd)$ to $O(N\sqrt{d}\log(d))$, for each direction. Getting an estimation of the total time complexity is difficult and inevitably dependent on the considered application. Indeed, as highlighted by Figure S2, download and upload speeds are always different. The biggest measured difference between upload and download is found in Europe for mobile broadband; their ratio is around 3.5.

Denoting $v_d$ and $v_u$ the download and upload speeds (in bits per second), we typically have $v_d = \rho v_u$ with $1 < \rho < 3.5$.

Then, for unidirectional compression, each iteration takes
$$O\Big(\frac{Nd}{v_d} + \frac{N\sqrt{d}\log(d)}{v_u}\Big) \approx O\Big(\frac{Nd}{\rho v_u}\Big) \text{ seconds,}$$
while for a bidirectional one it takes only
$$O\Big(\frac{N\sqrt{d}\log(d)}{v_d} + \frac{N\sqrt{d}\log(d)}{v_u}\Big) \approx O\Big(\frac{N\sqrt{d}\log(d)}{v_u}\Big) \text{ seconds.}$$

In other words, unless $\rho$ is really large (which is not the case in practice, as stressed by Figure S2), double compression reduces the global time complexity by several orders of magnitude, and bidirectional compression is by far superior to unidirectional compression.
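As a companion to this estimate, a small sketch (function, constants and the 10 Mbps upload speed are our own assumptions; costs are kept at the orders of magnitude used above) gives per-iteration times for both schemes:

```python
import math

def seconds_per_iteration(N: int, d: int, v_u: float, rho: float, bidirectional: bool) -> float:
    """Rough per-iteration communication time: the uplink is always 1-quantized here,
    the downlink is either raw 32-bit floats (unidirectional) or also 1-quantized."""
    v_d = rho * v_u                                   # download is rho times faster
    compressed = N * math.sqrt(d) * math.log2(d)      # bits for a compressed direction
    uncompressed = 32 * N * d                         # bits for a raw direction
    return compressed / v_u + (compressed if bidirectional else uncompressed) / v_d

N, d, v_u = 100, 10_000, 10e6                         # 10 Mbps upload, rho = 3.5
print(seconds_per_iteration(N, d, v_u, 3.5, bidirectional=False))
print(seconds_per_iteration(N, d, v_u, 3.5, bidirectional=True))
```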

A.2 Filtrations

In this section, we provide some explanations about filtrations (in particular a rigorous definition) and how they are used in the proofs of Theorems 2, 3, S5 and S6. We recall that we denote by $\omega_{\mathcal{C}}^{\mathrm{up}}$ and $\omega_{\mathcal{C}}^{\mathrm{dwn}}$ the variance factors for respectively the uplink and downlink compression.

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space, with $\Omega$ a sample space, $\mathcal{A}$ an event space and $\mathbb{P}$ a probability measure. We recall that the $\sigma$-algebra generated by a random variable $X: \Omega \to \mathbb{R}^m$ is
$$\sigma(X) = \big\{ X^{-1}(A) \,:\, A \in \mathcal{B}(\mathbb{R}^m) \big\},$$
where $\mathcal{B}(\mathbb{R}^m)$ is the Borel $\sigma$-algebra of $\mathbb{R}^m$.

$$w_k \xrightarrow{\;\xi^i_{k+1}\;} g^i_{k+1} \xrightarrow{\;\varepsilon^i_{k+1}\;} \widehat{g}^i_{k+1} \xrightarrow{\;B^i_k\;} \widehat{g}_{k+1, S_k} = \frac{1}{pN}\sum_{i \in S_k} \widehat{g}^i_{k+1} \xrightarrow{\;\varepsilon_{k+1}\;} \Omega_{k+1, S_k} = \mathcal{C}_{\mathrm{dwn}}\big(\widehat{g}_{k+1, S_k}\big)$$

Figure S1: The sequence of successive additive noises in the algorithm.

Furthermore, we recall that a filtration of $(\Omega, \mathcal{F}, \mathbb{P})$ is defined as an increasing sequence $(\mathcal{F}_n)_{n\in\mathbb{N}}$ of $\sigma$-algebras:
$$\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}.$$

The randomness in our algorithm comes from four sources:

1. Stochastic gradients. This noise corresponds to the stochastic gradient computation on device $i$ at step $k$. We have:
$$\forall k \in \mathbb{N}, \; \forall i \in \llbracket 1, N \rrbracket, \quad g^i_{k+1} = \nabla F_i(w_k) + \xi^i_{k+1}(w_k), \quad \text{with } \mathbb{V}(\xi^i_{k+1}) \text{ bounded.}$$

2. Uplink compression. This noise corresponds to the uplink compression, when the local gradients are compressed. Let $k \in \mathbb{N}$ and $i \in \llbracket 1, N \rrbracket$; suppose we want to compress $\Delta^i_k \in \mathbb{R}^d$, then the associated noise is $\varepsilon^i_{k+1}$, with $\mathbb{V}\big(\varepsilon^i_{k+1}(\Delta^i_k)\big) \le \omega^{\mathrm{up}}_{\mathcal{C}} \|\Delta^i_k\|^2$, where $\omega^{\mathrm{up}}_{\mathcal{C}} \in \mathbb{R}^*$ is defined by the uplink compression scheme (see Assumption 5). It follows that:
$$\forall k \in \mathbb{N}, \; \forall i \in \llbracket 1, N \rrbracket, \quad \widehat{\Delta}^i_k = \Delta^i_k + \varepsilon^i_{k+1}(\Delta^i_k) \iff \widehat{g}^i_{k+1} = g^i_{k+1} + \varepsilon^i_{k+1}(\Delta^i_k).$$

3. Downlink compression. This noise corresponds to the downlink compression, when the aggregated value is compressed before being broadcast back to the devices. Let $k \in \mathbb{N}$; suppose we want to compress $\widehat{g}_{k+1,S_k} \in \mathbb{R}^d$, then the associated noise is $\varepsilon_{k+1}(\widehat{g}_{k+1,S_k})$, with $\mathbb{V}(\varepsilon_{k+1}) \le \omega^{\mathrm{dwn}}_{\mathcal{C}} \|\widehat{g}_{k+1,S_k}\|^2$. We have:
$$\forall k \in \mathbb{N}, \quad \Omega_{k+1,S_k} = \mathcal{C}_{\mathrm{dwn}}\big(\widehat{g}_{k+1,S_k}\big) = \widehat{g}_{k+1,S_k} + \varepsilon_{k+1}\big(\widehat{g}_{k+1,S_k}\big).$$

4. Random sampling. This randomness corresponds to the partial participation of each device. We recall that, according to Assumption 6, each device has a probability $p$ of being active. For $k$ in $\mathbb{N}$ and $i$ in $\llbracket 1, N \rrbracket$, we denote by $B^i_k \sim \mathcal{B}(p)$ the Bernoulli random variable indicating whether device $i$ is active at step $k$.

This "succession of noises" in the algorithm is illustrated in Figure S1. In order to handle these four sources of randomness, we define five sequences of nested $\sigma$-algebras.

Definition 2. We denote by $(\mathcal{F}_k)_{k\in\mathbb{N}}$ the filtration associated with the stochastic gradient computation noise, $(\mathcal{G}_k)_{k\in\mathbb{N}}$ the filtration associated with the uplink compression noise, $(\mathcal{H}_k)_{k\in\mathbb{N}}$ the filtration associated with the downlink compression noise, and $(\mathcal{B}_k)_{k\in\mathbb{N}}$ the filtration associated with the random device participation. For $k \in \mathbb{N}^*$, we define:
$$\begin{aligned}
\mathcal{F}_k &= \sigma\big(\Gamma_{k-1}, (\xi^i_k)_{i=1}^N\big), \\
\mathcal{G}_k &= \sigma\big(\Gamma_{k-1}, (\xi^i_k)_{i=1}^N, (\varepsilon^i_k)_{i=1}^N\big), \\
\mathcal{H}_k &= \sigma\big(\Gamma_{k-1}, (\xi^i_k)_{i=1}^N, (\varepsilon^i_k)_{i=1}^N, \varepsilon_k\big), \\
\mathcal{B}_{k-1} &= \sigma\big((B^i_{k-1})_{i=1}^N\big), \\
\mathcal{I}_k &= \sigma\big(\mathcal{H}_k \cup \mathcal{B}_{k-1}\big),
\end{aligned}$$
with
$$\Gamma_k = \big((\xi^i_t)_{i\in\llbracket 1,N\rrbracket}, (\varepsilon^i_t)_{i\in\llbracket 1,N\rrbracket}, \varepsilon_t, (B^i_{t-1})_{i=1}^N\big)_{t\in\llbracket 1,k\rrbracket} \quad \text{and} \quad \Gamma_0 = \emptyset.$$

We can make the following observations for all k ≥ 1:

• From these definitions, it follows that our sequences are nested:
$$\mathcal{F}_1 \subset \mathcal{G}_1 \subset \mathcal{H}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{H}_K.$$
However, $(\mathcal{B}_k)_{k\in\mathbb{N}}$ is independent of the other filtrations.

• $\mathcal{I}_k = \sigma(\mathcal{H}_k \cup \mathcal{B}_{k-1}) = \sigma(\Gamma_k)$, and the aim is to express the expectation w.r.t. all the randomness, i.e. $\mathcal{I}_k$.

• $w_k$ is $\mathcal{I}_k$-measurable.

• $g_{k+1}(w_k)$ is $\mathcal{F}_{k+1}$-measurable.

• $\widehat{g}_{k+1}(w_k)$ is $\mathcal{G}_{k+1}$-measurable.

• $B^i_{k-1}$ is $\mathcal{B}_{k-1}$-measurable.

• $g_{k+1,S_k}$, $\widehat{g}_{k+1,S_k}$ and $\Omega_{k+1,S_k}$ are respectively $\sigma(\mathcal{F}_{k+1} \cup \mathcal{B}_k)$-measurable, $\sigma(\mathcal{G}_{k+1} \cup \mathcal{B}_k)$-measurable and $\sigma(\mathcal{H}_{k+1} \cup \mathcal{B}_k)$-measurable. Note that $\mathcal{F}_{k+1}$ contains $\Gamma_k$, and thus all the $(B^i_{k-1})_{i=1}^N$, but does not contain all the $(B^i_k)_{i=1}^N$.

As a consequence, we have Propositions S2 to S8. Please note that, for the sake of clarity, Propositions S2 to S6 are stated without taking into account the random participation $S_k$. All these propositions remain identical when adding partial participation; the results only have to be expressed w.r.t. the active nodes $S_k$.

Proposition S2 below gives the expectation of the stochastic gradients conditionally on the $\sigma$-algebras $\mathcal{H}_k$ and $\mathcal{F}_{k+1}$.

Proposition S2 (Stochastic Expectation). Let $k \in \mathbb{N}$. Then, on each local device $i \in \llbracket 1, N\rrbracket$, we have almost surely (a.s.):
$$\mathbb{E}\big[g^i_{k+1} \,\big|\, \mathcal{F}_{k+1}\big] = g^i_{k+1} \qquad \text{and} \qquad \mathbb{E}\big[g^i_{k+1} \,\big|\, \mathcal{H}_k\big] = \nabla F_i(w_k),$$
which leads to:
$$\mathbb{E}\big[g_{k+1} \,\big|\, \mathcal{F}_{k+1}\big] = g_{k+1} \qquad \text{and} \qquad \mathbb{E}\big[g_{k+1} \,\big|\, \mathcal{H}_k\big] = \nabla F(w_k).$$

Proposition S3 gives the expectation of the uplink compression (information sent from the remote devices to the central server) conditionally on the $\sigma$-algebras $\mathcal{F}_{k+1}$ and $\mathcal{G}_{k+1}$.

Proposition S3 (Uplink Compression Expectation). Let $k \in \mathbb{N}$. Recall that $\widehat{g}^i_{k+1} = g^i_{k+1} + \varepsilon^i_{k+1}$; then, on each local device $i \in \llbracket 1, N\rrbracket$, we have a.s.:
$$\mathbb{E}\big[\widehat{g}^i_{k+1} \,\big|\, \mathcal{G}_{k+1}\big] = \widehat{g}^i_{k+1} \qquad \text{and} \qquad \mathbb{E}\big[\widehat{g}^i_{k+1} \,\big|\, \mathcal{F}_{k+1}\big] = g^i_{k+1},$$
which leads to
$$\mathbb{E}\big[\widehat{g}_{k+1} \,\big|\, \mathcal{G}_{k+1}\big] = \widehat{g}_{k+1} \qquad \text{and} \qquad \mathbb{E}\big[\widehat{g}_{k+1} \,\big|\, \mathcal{F}_{k+1}\big] = \mathbb{E}\Bigg[\frac{1}{N}\sum_{i=1}^N \widehat{g}^i_{k+1} \,\Bigg|\, \mathcal{F}_{k+1}\Bigg] = g_{k+1}.$$

From Assumption 5, it follows that the variance of the uplink compression can be bounded as expressed in Proposition S4.

Proposition S4 (Uplink Compression Variance). Let $k \in \mathbb{N}$ and $i \in \llbracket 1, N\rrbracket$. Recall that $\Delta^i_k = g^i_{k+1} - h^i_k$; using Assumption 5, the following holds a.s.:
$$\mathbb{E}\big[\|\widehat{\Delta}^i_k - \Delta^i_k\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \le \omega^{\mathrm{up}}_{\mathcal{C}} \|\Delta^i_k\|^2 \tag{S1}$$
$$\big(\iff \mathbb{E}\big[\|\widehat{g}^i_{k+1} - g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \le \omega^{\mathrm{up}}_{\mathcal{C}} \|g^i_{k+1}\|^2 \text{ when there is no memory}\big). \tag{S2}$$

Concerning the downlink compression (information sent from the central server to each node), Proposition S5 gives its expectation w.r.t. the $\sigma$-algebras $\mathcal{G}_{k+1}$ and $\mathcal{H}_{k+1}$.

Proposition S5 (Downlink Compression Expectation). Let $k \in \mathbb{N}$; recall that $\Omega_{k+1} = \mathcal{C}_{\mathrm{dwn}}(\widehat{g}_{k+1}) = \widehat{g}_{k+1} + \varepsilon_{k+1}$. Then a.s.:
$$\mathbb{E}\big[\Omega_{k+1} \,\big|\, \mathcal{H}_{k+1}\big] = \Omega_{k+1} \qquad \text{and} \qquad \mathbb{E}\big[\Omega_{k+1} \,\big|\, \mathcal{G}_{k+1}\big] = \widehat{g}_{k+1}.$$

The next proposition states that the downlink compression can be bounded as in Proposition S4.

Proposition S6 (Downlink Compression Variance). Let $k \in \mathbb{N}$; using Assumption 5, the following holds a.s.:
$$\mathbb{E}\big[\|\Omega_{k+1} - \widehat{g}_{k+1}\|^2 \,\big|\, \mathcal{G}_{k+1}\big] \le \omega^{\mathrm{dwn}}_{\mathcal{C}} \|\widehat{g}_{k+1}\|^2.$$

Now, we give in Propositions S7 and S8 the expectation and the variance w.r.t. the device random sampling noise.

Proposition S7 (Expectation of device sampling). Let $k \in \mathbb{N}$ and denote $a_{k+1} = \frac{1}{N}\sum_{i=1}^N a^i_{k+1}$ and $a_{k+1,S_k} = \frac{1}{pN}\sum_{i\in S_k} a^i_{k+1}$, where $(a^i_{k+1})_{i=1}^N \in (\mathbb{R}^d)^N$ are $N$ random variables independent of each other and $\mathcal{J}_{k+1}$-measurable, for a $\sigma$-field $\mathcal{J}_{k+1}$ such that $(B^i_k)_{i=1}^N$ are independent of $\mathcal{J}_{k+1}$. We have a.s.:
$$\mathbb{E}\big[a_{k+1,S_k} \,\big|\, \mathcal{J}_{k+1}\big] = a_{k+1}.$$
The vector $a_{k+1}$ (resp. the $\sigma$-field $\mathcal{J}_{k+1}$) may represent various objects, for instance $g_{k+1}$, $\widehat{g}_{k+1}$, $\Omega_{k+1}$ (resp. $\mathcal{F}_{k+1}$, $\mathcal{G}_{k+1}$, $\mathcal{H}_{k+1}$).

Proof. For any $k \in \mathbb{N}^*$, we have:
$$\begin{aligned}
\mathbb{E}\big[a_{k+1,S_k} \,\big|\, \mathcal{J}_{k+1}\big]
&= \mathbb{E}\Bigg[\frac{1}{pN}\sum_{i\in S_k} a^i_{k+1} \,\Bigg|\, \mathcal{J}_{k+1}\Bigg]
= \mathbb{E}\Bigg[\frac{1}{pN}\sum_{i=1}^{N} a^i_{k+1} B^i_k \,\Bigg|\, \mathcal{J}_{k+1}\Bigg] \\
&= \frac{1}{pN}\sum_{i=1}^{N} \mathbb{E}\big[a^i_{k+1} B^i_k \,\big|\, \mathcal{J}_{k+1}\big] && \text{by linearity of the expectation,} \\
&= \frac{1}{pN}\sum_{i=1}^{N} a^i_{k+1}\, \mathbb{E}\big[B^i_k \,\big|\, \mathcal{J}_{k+1}\big] && \text{because } (a^i_{k+1})_{i=1}^N \text{ are } \mathcal{J}_{k+1}\text{-measurable,} \\
&= \frac{1}{N}\sum_{i=1}^{N} a^i_{k+1} = a_{k+1} && \text{because } (B^i_k)_{i=1}^N \text{ are independent of } \mathcal{J}_{k+1},
\end{aligned}$$
which allows to conclude.

Proposition S8 (Variance of device sampling). Let $k \in \mathbb{N}^*$; with the same notation as in Proposition S7, we have a.s.:
$$\mathbb{V}\big[a_{k+1,S_k} \,\big|\, \mathcal{J}_{k+1}\big] = \frac{1-p}{pN^2}\sum_{i=1}^{N}\|a^i_{k+1}\|^2.$$

Proof. Let $k \in \mathbb{N}^*$:
$$\begin{aligned}
\mathbb{V}\big[a_{k+1,S_k} \,\big|\, \mathcal{J}_{k+1}\big]
&= \mathbb{V}\Bigg[\frac{1}{pN}\sum_{i=1}^{N} a^i_{k+1} B^i_k \,\Bigg|\, \mathcal{J}_{k+1}\Bigg] \\
&= \frac{1}{p^2 N^2}\sum_{i=1}^{N} \mathbb{V}\big[a^i_{k+1} B^i_k \,\big|\, \mathcal{J}_{k+1}\big] && \text{because } (B^i_k)_{i=1}^N \text{ are independent,} \\
&= \frac{1}{p^2 N^2}\sum_{i=1}^{N} \|a^i_{k+1}\|^2\, \mathbb{V}\big[B^i_k \,\big|\, \mathcal{J}_{k+1}\big] && \text{because } (a^i_{k+1})_{i=1}^N \text{ are } \mathcal{J}_{k+1}\text{-measurable,} \\
&= \frac{1-p}{p N^2}\sum_{i=1}^{N} \|a^i_{k+1}\|^2 && \text{because } (B^i_k)_{i=1}^N \text{ are independent of } \mathcal{J}_{k+1}.
\end{aligned}$$
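As a sanity check (not part of the original analysis), the two identities of Propositions S7 and S8 can be verified numerically by Monte Carlo simulation; the sketch below assumes fixed vectors $a^i$ and Bernoulli($p$) participation, with names of our own choosing:

```python
import numpy as np

def sampling_moments(a: np.ndarray, p: float, n_trials: int = 200_000, seed: int = 0):
    """Monte Carlo estimate of the mean and total variance of a_{k+1, S_k} = (1/pN) sum_{i in S_k} a_i,
    where each device is active independently with probability p."""
    rng = np.random.default_rng(seed)
    N, d = a.shape
    B = (rng.random((n_trials, N)) < p).astype(float)   # Bernoulli participation masks
    estimates = (B @ a) / (p * N)                        # one a_{k+1, S_k} per trial
    mean = estimates.mean(axis=0)                        # should approach (1/N) sum_i a_i
    var = ((estimates - a.mean(axis=0)) ** 2).sum(axis=1).mean()
    return mean, var

a = np.random.default_rng(1).normal(size=(10, 5))        # N = 10 devices, d = 5
mean, var = sampling_moments(a, p=0.5)
print(np.allclose(mean, a.mean(axis=0), atol=1e-2))      # Proposition S7
print(var, (1 - 0.5) / (0.5 * 10**2) * (a**2).sum())     # Proposition S8
```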

Figure S2: Upload and download speeds for mobile and fixed broadband. Left axis: speeds (in Mbps); right axis: ratio (green bars). The dataset is gathered from Speedtest.net, see noa.

B Bandwidth speed

In a network configuration where download would be much faster than upload, bidirectional compression would present no benefit over unidirectional compression, as downlink communications would have a negligible cost. However, this is not the case in practice: to assess this point, we gathered broadband speeds, for both download and upload communications, for fixed broadband (cable, T1, DSL, ...) and mobile broadband (cellphones, smartphones, tablets, laptops, ...), from studies carried out in 2020 over the 6 continents by Speedtest.net [see noa]. Results are provided in Figure S2, comparing download and upload speeds. The ratios (averaged by continent) between download and upload speeds stand between 1 (in Asia, for fixed broadband) and 3.5 (in Europe, for mobile broadband): there is thus no apparent reason to simply disregard the downlink communication, and bidirectional compression is unavoidable to achieve a substantial speedup. More precisely, if we denote by $v_d$ and $v_u$ the download and upload speeds (in Mbits per second), we typically have $v_d = \rho v_u$ with $1 < \rho < 3.5$. Using quantization with $s = 1$ (see Appendix A.1), each iteration takes $O\big(\frac{Nd}{\rho v_u}\big)$ seconds with unidirectional compression, while it takes only $O\big(\frac{N\sqrt{d}\log(d)}{v_u}\big)$ seconds with bidirectional compression.

The dataset comes from a study carried out by Speedtest.net [see noa], which measured bandwidth speeds in 2020 across the six continents. In order to get a better understanding of this dataset, we illustrate the speed distributions in Figures S2, S3a, S3b and S4.

In Figures S3a, S3b and S4, unlike Figure S2, we do not aggregate data by countries of the same continent. This allows to analyse the download/upload speed ratio with the value proper to each country. Looking at Figures S3a, S3b and S4, it is noticeable that, worldwide, the ratio between download and upload speeds is between 1 and 5, and not between 1 and 3.5 as Figure S2 was suggesting, since there we were aggregating data by continent. Only nine countries in the world have a ratio higher than 5. In Europe: Malta, Belgium and Montenegro. In Asia: South Korea. In North America: Canada, Saint Vincent and the Grenadines, Panama and Costa Rica. In Africa: Western Sahara. The highest ratio, 7.7, is observed in Malta.

(a) Mobile broadband. (b) Fixed broadband.

Figure S3: Upload/download speed (in Mbps). Best seen in colors.

Figure S4: Distribution of the download/upload speed ratio by continent. Best seen in colors.

C Experiments

In this section, we provide additional details about our experiments. We recall that we use two kinds of datasets: 1) toy synthetic datasets and 2) real datasets: superconduct [see Hamidieh, 2018; 21,263 points, 81 features] and quantum [see Caruana et al., 2004; 50,000 points, 65 features]. The aim of using synthetic datasets is mainly to underline the properties resulting from Theorems 1 to 3.

We use the same 1-quantization scheme (defined in Appendix A.1; $s = 1$ is the most drastic compression) for both uplink and downlink, and thus we consider that $\omega^{\mathrm{up}}_{\mathcal{C}} = \omega^{\mathrm{dwn}}_{\mathcal{C}}$. For each figure, we plot the convergence w.r.t. the number of iterations $k$, or w.r.t. the theoretical number of bits exchanged after $k$ iterations. On the Y-axis we display $\log_{10}(F(w_k) - F(w_*))$, with $k$ in $\mathbb{N}$. All experiments have been run 5 times and averaged before displaying the curves. We plot error bars on all figures. To compute the error bars, we use the standard deviation across runs of the logarithmic difference between the loss at iteration $k$ and the objective loss, that is, the standard deviation of $\log_{10}(F(w_k) - F(w_*))$. We then plot the curve $\pm$ this standard deviation.

All the code is available on this repository: https://github.com/philipco/artemis-bidirectional-compression.
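For reference, a minimal sketch of this averaging and error-bar computation (array shapes and names are ours, not those of the repository):

```python
import numpy as np

def loss_curve_with_error_bars(losses_per_run: np.ndarray, optimal_loss: float):
    """losses_per_run has shape (n_runs, n_iterations) and contains F(w_k) for each run.
    Returns the mean and standard deviation across runs of log10(F(w_k) - F(w*))."""
    log_excess = np.log10(losses_per_run - optimal_loss)
    return log_excess.mean(axis=0), log_excess.std(axis=0)

# Example usage with 5 runs of 100 iterations (placeholder values):
runs = 1.0 + np.abs(np.random.default_rng(0).normal(size=(5, 100)))
mean_curve, err = loss_curve_with_error_bars(runs, optimal_loss=1.0)
# plt.plot(mean_curve); plt.fill_between(range(100), mean_curve - err, mean_curve + err)
```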

C.1 Synthetic dataset

We build two different synthetic datasets for the i.i.d. and non-i.i.d. cases. We use linear regression to tackle the i.i.d. case and logistic regression to handle the non-i.i.d. setting. As explained in Section 1, each worker $i$ holds $n_i$ observations $(z^i_j)_{1\le j\le n_i} = (x^i_j, y^i_j)_{1\le j\le n_i} = (X^i, Y^i)$ following a distribution $\mathcal{D}_i$.

We use $N = 10$ devices, each holding 200 points of dimension $d = 20$ for least-squares regression and $d = 2$ for logistic regression. We ran the algorithms over 100 epochs.

Choice of the step size for the synthetic dataset. For stochastic gradient descent, we use a step size $\gamma = \frac{1}{L\sqrt{k}}$, with $k$ the iteration index, and for batch gradient descent we choose $\gamma = \frac{1}{L}$.
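In code, this schedule is simply (a sketch; the function name is ours):

```python
def step_size(k: int, L: float, stochastic: bool = True) -> float:
    """Step size used for the synthetic experiments: 1/(L*sqrt(k)) for stochastic
    gradients, 1/L for full-batch gradients (k is the 1-based iteration index)."""
    return 1.0 / (L * k ** 0.5) if stochastic else 1.0 / L
```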

For the i.i.d. setting, we use a linear regression model without bias. For each worker $i$, inputs are generated from a normal distribution $(x^i_j)_{1\le j\le n_i} \sim \mathcal{N}(0, \Sigma)$, and then, for all $j$ in $\llbracket 1, n_i\rrbracket$, we have $y^i_j = \langle w \mid x^i_j \rangle + e^i_j$, with $e^i_j \sim \mathcal{N}(0, \lambda^2)$ and $w$ the true model.
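A minimal generator for this setting could look as follows (a sketch assuming $\Sigma = I_d$; names and defaults are ours):

```python
import numpy as np

def generate_iid_linear_data(N=10, n_i=200, d=20, lam=0.4 ** 0.5, seed=0):
    """Synthetic i.i.d. least-squares data: x ~ N(0, I_d), y = <w, x> + e, e ~ N(0, lam^2).
    Returns one (X, y) pair per worker and the true model w."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    data = []
    for _ in range(N):
        X = rng.normal(size=(n_i, d))                 # Sigma = I_d in this sketch
        y = X @ w_true + lam * rng.normal(size=n_i)   # lam = 0 gives sigma_* = 0
        data.append((X, y))
    return data, w_true
```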

(a) Superconduct dataset: 20 clusters. Each cluster has between 250 and 3900 points, with a median at 750 points.

(b) Quantum dataset: 20 clusters. Each cluster has between 900 and 10,500 points, with a median at 2300 points.

Figure S5: t-SNE representations. Best seen in colors.

To obtain $\sigma_* = 0$, it is enough to remove the noise $e^i_j$ by setting the variance $\lambda^2$ of the dataset distribution to 0. Indeed, using least-squares regression, for all $i$ in $\llbracket 1, N\rrbracket$, the cost function evaluated at a point $w$ is $F_i(w) = \frac{1}{2}\|{X^i}^T w - Y^i\|^2$. Thus the stochastic gradient $j \in \llbracket 1, n_i\rrbracket$ on device $i \in \llbracket 1, N\rrbracket$ is $g^i_j(w) = ({X^i_j}^T w - Y^i_j)\, X^i_j$. On the other hand, the true gradient is $\nabla F_i(w) = \mathbb{E}\big[X^i {X^i}^T\big](w - w_*)$. Computing the difference, we have, for every device $i$ in $\llbracket 1, N\rrbracket$ and every $j$ in $\llbracket 1, n_i\rrbracket$:
$$g^i_j(w) - \nabla F_i(w) = \underbrace{\big(X^i_j {X^i_j}^T - \mathbb{E}\big[X^i {X^i}^T\big]\big)(w - w_*)}_{\text{multiplicative noise, equal to } 0 \text{ at } w_*} + \underbrace{\big({X^i_j}^T w_* - Y^i_j\big)}_{\sim\, \mathcal{N}(0,\lambda^2)} X^i_j. \tag{S3}$$
This is why, if we set $\lambda = 0$ and evaluate eq. (S3) at $w_*$, we recover Assumption 3 with $\sigma_* = 0$, and as a consequence the stochastic noise at the optimum is removed. Remark that it remains a stochastic gradient descent, and the uniform bound on the gradient noise is not 0. We set $\lambda^2 = 0$ ($\Leftrightarrow \sigma^2_* = 0$) in Figure S8; otherwise, we set $\lambda^2 = 0.4$.

For the non-i.i.d. setting, we generate two different datasets based on a logistic model with two different parameters: $w^1 = (10, 10)$ and $w^2 = (10, -10)$. Thus the model is expected to converge to $w_* = (10, 0)$. We have two different data distributions, $x^1 \sim \mathcal{N}(0, \Sigma_1)$ and $x^2 \sim \mathcal{N}(0, \Sigma_2)$, and, for all $i$ in $\llbracket 1, N\rrbracket$ and all $k$ in $\llbracket 1, n_i\rrbracket$,
$$y^i_k = \mathcal{R}\Big(\mathrm{Sigm}\big(\big\langle w^{(i \bmod 2)+1} \,\big|\, x^{(i \bmod 2)+1}_k \big\rangle\big)\Big) \in \{-1, +1\}.$$
That is, half the machines use the first distribution $\mathcal{N}(0, \Sigma_1)$ for inputs and the model $w^1$, and the other half use the second distribution for inputs and the model $w^2$. Here, $\mathcal{R}$ is the Rademacher distribution and $\mathrm{Sigm}$ is the sigmoid function defined as $\mathrm{Sigm}: x \mapsto \frac{e^x}{1 + e^x}$. These two distributions are presented in Figure S6.
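A minimal generator for this non-i.i.d. setting could be sketched as follows (assuming $\Sigma_1 = \Sigma_2 = I_2$; names and defaults are ours):

```python
import numpy as np

def generate_non_iid_logistic_data(N=10, n_i=200, seed=0):
    """Non-i.i.d. synthetic logistic data: half of the workers draw labels from the
    model w1 = (10, 10), the other half from w2 = (10, -10)."""
    rng = np.random.default_rng(seed)
    models = [np.array([10.0, 10.0]), np.array([10.0, -10.0])]
    data = []
    for i in range(N):
        w = models[i % 2]
        X = rng.normal(size=(n_i, 2))                      # x ~ N(0, I_2) in this sketch
        logits = np.clip(X @ w, -30, 30)                   # clipped for numerical stability
        proba = 1.0 / (1.0 + np.exp(-logits))              # Sigm(<w | x>)
        y = np.where(rng.random(n_i) < proba, 1.0, -1.0)   # random sign, P(+1) = Sigm(.)
        data.append((X, y))
    return data
```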

C.1.1 Least-square regression

In this section, we present all figures generated using Least-Square regression. Figure S7 corresponds to Figure 3a.

As explained in the main part of the paper, in the case $\sigma_* \neq 0$ (Figure S7), algorithms using memory (i.e., Diana and Artemis) are not expected to outperform those without (i.e., QSGD and Bi-QSGD); on the contrary, they saturate at a higher level. However, as soon as the noise at the optimum is 0 (Figure S8), all algorithms (regardless of memory) converge at a linear rate, exactly like classical SGD.

C.1.2 Logistic regression

In this section, we present all the figures generated using a logistic regression model. Figure S9 corresponds to Figure 3b. Data is non-i.i.d., and we use a full-batch gradient descent to get $\sigma_* = 0$, in order to shed light on the impact of memory on convergence.

(a) Dataset 1 (b) Dataset 2

Figure S6: Data distribution for logistic regression to simulate non-i.i.d. data. Half of the devices hold the first dataset, and the other half the second one.

(a) LSR: $\sigma^2_* \neq 0$. (b) X-axis in # bits.

Figure S7: Synthetic dataset, Least-Squares Regression with noise ($\sigma_* \neq 0$). In a situation where data is i.i.d., the memory does not present much interest and has no impact on the convergence. Because $\sigma^2_* \neq 0$, all algorithms saturate, and the saturation level is higher for double compression (Artemis, Bi-QSGD) than for simple compression (Diana, QSGD) or for SGD. This corroborates the findings of Theorem 1 and Theorem 3.

(a) LSR: $\sigma^2_* = 0$. (b) X-axis in # bits.

Figure S8: Synthetic dataset, Least-Squares Regression without noise ($\sigma_* = 0$). Unsurprisingly, with i.i.d. data and $\sigma_* = 0$, the convergence of each algorithm is linear. Thus, in the i.i.d. setting, the impact of the memory is negligible, but this is not the case in the non-i.i.d. setting, as underlined by Figure S9.

(a) LR: $\sigma^2_* = 0$. (b) X-axis in # bits.

Figure S9: Synthetic dataset, Logistic Regression on non-i.i.d. data using a batch gradient descent (to get $\sigma_* = 0$). The benefit of memory is obvious: it makes the algorithms converge linearly, while algorithms without memory saturate at a high level. This stresses the importance of using memory in non-i.i.d. settings.

(a) LR: $\sigma^2_* = 0$. (b) X-axis in # bits.

Figure S10: Polyak-Ruppert averaging, synthetic dataset. Logistic Regression on non-i.i.d. data using a batch gradient descent (to get $\sigma_* = 0$) combined with Polyak-Ruppert averaging. The convergence is linear, as predicted by Theorem 2, because $\sigma_* = 0$. Best seen in colors.

(a) Batch size $b = 1$, $\gamma = \frac{1}{8L}$. (b) Batch size $b = 50$, $\gamma = \frac{1}{L}$.

Figure S11: Superconduct. LSR, $\sigma_* \neq 0$, non-i.i.d., X-axis in # bits. As the batch size $b$ increases, the noise is reduced, making the benefit of memory more obvious; this is a good illustration of Theorem 1 (see Table 2). Best seen in colors.

Figure S10 uses the same data and configuration as Figure S9, except that it is combined with Polyak-Ruppert averaging. Note that, in the absence of memory, the variance increases compared to algorithms using memory. To generate these figures, we did not take the optimal step size; had we done so, the bias-variance trade-off would be worse and algorithms using memory would outperform those without.

C.2 Real dataset: superconduct and quantum

In this section, we present details about the experiments run on real-life datasets: superconduct (from Hamidieh [2018]), on which we use least-squares regression, and quantum (from Caruana et al. [2004]), on which we use logistic regression.

In order to simulate non-i.i.d. data and to make the experiments closer to real-life usage, we split each dataset into heterogeneous groups using a Gaussian mixture clustering on t-SNE representations (defined by Maaten & Hinton [2008]). Thus, the data is highly non-i.i.d. and unbalanced over devices. We plot in Figure S5 the t-SNE representation of the two real datasets.
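A possible sketch of this splitting procedure with scikit-learn (function name and parameters are ours; the exact settings used in the repository may differ):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

def split_non_iid(X: np.ndarray, n_devices: int = 20, seed: int = 0):
    """Split a dataset into heterogeneous shards: embed with t-SNE in 2-D, cluster the
    embedding with a Gaussian mixture, and assign one cluster to each device."""
    embedding = TSNE(n_components=2, random_state=seed).fit_transform(X)   # can be slow on large datasets
    labels = GaussianMixture(n_components=n_devices, random_state=seed).fit_predict(embedding)
    return [np.where(labels == c)[0] for c in range(n_devices)]            # indices per device
```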

There are $N = 20$ devices (except for the experiments using device random sampling, which use $N = 100$ devices) for the superconduct and quantum datasets. For superconduct, there are between 250 and 3900 points per worker, with a median at 750; for quantum, there are between 900 and 10,500 points, with a median at 2300. On each figure, we indicate which step size $\gamma$ has been used.

Figures S11b and S12b correspond to Figure 4 and are compared to the case $b = 1$.

We observe that with a purely stochastic descent ($b = 1$), the memory mechanism does not really improve the convergence. But as soon as we increase the batch size (without however reaching a full-batch gradient, i.e., $\sigma_* \neq 0$), the benefit of the memory becomes obvious! In the case of the quantum dataset (see Figure S12), Artemis is not only better than Bi-QSGD, but also better than QSGD. That is to say, we manage to make an algorithm using bidirectional compression better than an algorithm using unidirectional compression.

The comparison of the algorithms' behavior for $b = 1$ and $b > 1$ is an excellent illustration of Theorem 1 and of the impact of the noise term $E$ (see Table 2). Indeed, as we increase the batch size, the noise $\sigma^2_*/b$ decreases and thus has less impact on the convergence. Furthermore, as the memory mechanism removes the terms in $B^2$, the variance is much lower, and it follows that the saturation level is also much lower.

C.2.1 Partial participation

In this section we provide additional experiments on partial participation in the stochastic regime. Only half ofthe devices (randomly sampled) participate at each round.

Figure S13 presents the first approach, PP1, which fails to properly converge. Our new algorithm, PP2, is shown in

(a) Batch size $b = 1$, $\gamma = \frac{1}{8L}$. (b) Batch size $b = 400$, $\gamma = \frac{1}{L}$.

Figure S12: Quantum. LR, non-i.i.d., $\sigma_* \neq 0$, X-axis in # bits. For $b = 1$, algorithms with bidirectional compression do not succeed in converging, but when the batch size is increased to 400, the advantage of memory becomes clear. This illustrates Theorem 1 (see Table 2). Best seen in colors.

(a) Superconduct ($N = 20$, $d = 82$). (b) Quantum ($N = 20$, $d = 66$). X-axis: communicated bits (non-i.i.d.); Y-axis: $\log_{10}(F(w_k) - F(w_*))$; curves: SGD, QSGD, Diana, BiQSGD, Artemis.

Figure S13: Partial participation, stochastic regime, PP1 (non-i.i.d.): $\sigma_* \neq 0$, $N = 20$ workers, $p = 0.5$, $b > 1$ (1000 iterations). With this variant, all algorithms saturate at a high level.

Figure S14. The convergence of QSGD and Bi-QSGD is unchanged, as there is no difference between the two approaches in the absence of memory. However, Artemis and SGD with memory show a much better convergence. As an example, SGD with memory matches the results of SGD in the case of full participation (Figures S11b and S12b).

The result in the full gradient regime is given in Section 5. In Figures 5 and 6, we can observe that our new algorithm PP2 converges linearly, unlike PP1.

C.2.2 Comparing Artemis with other existing algorithms

In Figure S15, we compare Artemis with other existing algorithms. We take $\gamma = 1/(2L)$ because otherwise FedSGD and FedPAQ diverge. These two algorithms show worse performance because they have not been designed for non-i.i.d. datasets.

(a) Superconduct ($N = 20$, $d = 82$). (b) Quantum ($N = 20$, $d = 66$). X-axis: communicated bits (non-i.i.d.); Y-axis: $\log_{10}(F(w_k) - F(w_*))$; curves: SGD, SGD with memory, QSGD, Artemis with $\omega^{\mathrm{dwn}}_{\mathcal{C}} = 0$, BiQSGD, Artemis.

Figure S14: Partial participation, stochastic regime, PP2 (non-i.i.d.): $\sigma_* \neq 0$, $N = 20$ workers, $p = 0.5$, $b > 1$ (1000 iterations). With this novel algorithm, we observe the significant impact of memory in the partial participation setting with non-i.i.d. data. Compared to Figure S13, the saturation level of the algorithms with $\alpha$ different from zero is much lower, and classical SGD is outperformed by its variant using the memory mechanism.

(a) Superconduct ($N = 20$, $d = 82$). (b) Quantum ($N = 20$, $d = 66$). X-axis: communicated bits (non-i.i.d.); Y-axis: $\log_{10}(F(w_k) - F(w_*))$; curves: SGD, FedAvg, FedPAQ, Diana, Artemis, Dore, DoubleSqueeze.

Figure S15: Artemis compared to other existing algorithms, $\gamma = 1/(2L)$. X-axis in # bits. We can observe that Double-Squeeze, which only uses error feedback, is outperformed by Artemis. Dore, which combines this mechanism with memory, behaves identically to Artemis. This underlines that, for unbiased compression operators, the improvement comes from the memory and not from the error feedback.

C.2.3 Optimized step size

In this section, we want to address the issue of the optimal step size. In Figure S16, we plot the minimal loss after 250 iterations for each of the 5 algorithms; we can see that algorithms with memory clearly outperform those without. Then, in Figure S17, we present the loss of Artemis after 250 iterations for various step sizes, ranging from $\frac{5}{L}$ down to $\frac{1}{16L}$: $\frac{5}{L}$, $\frac{2}{L}$, $\frac{1}{L}$, $\frac{1}{2L}$, $\frac{1}{4L}$, $\frac{1}{8L}$ and $\frac{1}{16L}$. This helps to understand which step size should be taken to obtain the best accuracy after $k \in \llbracket 1, 250\rrbracket$ iterations. Finally, in Figure S18, we plot the loss obtained with the optimal step size $\gamma_{\mathrm{opt}}$ of each algorithm (found with Figure S16) w.r.t. the number of communicated bits.

These last figures allow us to conclude on the significant impact of memory in non-i.i.d. settings, and to claim that bidirectional compression with memory is by far superior (up to a threshold) to the four other algorithms: SGD, QSGD, Diana and BiQSGD.

C.3 CPU usage and Carbon footprint

As part of a community effort to report the amount of computation performed, we estimated that overall our experiments ran for 220 to 270 hours end to end. We used an Intel(R) Xeon(R) CPU E5-2667 processor with 16 cores.

(a) Superconduct (b) Quantum

Figure S16: Searching for the optimal step size $\gamma_{\mathrm{opt}}$ for each algorithm. It is interesting to note that the memory allows to attain the lowest loss with a larger step size. Hence, the optimal step size is $\gamma_{\mathrm{opt}} = \frac{1}{L}$ for Artemis, but $\gamma_{\mathrm{opt}} = \frac{1}{2L}$ for BiQSGD. X-axis: step size value; Y-axis: minimal loss after running 250 iterations.

(a) Superconduct (b) Quantum

Figure S17: Loss w.r.t. the step size $\gamma$. We plot the loss of Artemis after 250 iterations for different step sizes. As stressed by Figure S16, after 250 iterations the best accuracy is indeed obtained for both datasets with $\gamma_{\mathrm{opt}} = \frac{1}{L}$ for Artemis, and the optimal step size decreases with the number of iterations (e.g., for quantum, it is $\frac{2}{L}$ before 100 iterations and $\frac{1}{2L}$ after 250 iterations).

(a) Superconduct (b) Quantum

Figure S18: Optimal step size. X-axis in # bits. We plot the loss of each algorithm obtained with the optimal $\gamma$ that attains the lowest error after 250 iterations (for instance $\gamma = \frac{1}{L}$ for Artemis, but $\gamma = \frac{2}{L}$ for SGD). For both the superconduct and quantum datasets, taking the optimal step size of each algorithm, Artemis is superior to the other variants w.r.t. both accuracy and number of bits.

The carbon emissions caused by this work were subsequently evaluated with the Green Algorithms tool built by Lannelongue et al. [2020]. It estimates that our computations generated 30 to 35 kg of CO2, requiring 100 to 125 kWh. For comparison, this corresponds to about 160 to 200 km by car. This is a relatively moderate impact, matching our goal of keeping the experiments for illustrative purposes.

D Technical Results

In this section, we introduce a few technical lemmas that will be used in the proofs. In Appendix D.1, we give four simple lemmas, while in Appendix D.2 we present the lemmas that will be invoked in Appendix E to prove Theorems S5 to S7.

Notation. Let $k \in \mathbb{N}$ and let $(a^i_{k+1})_{i=1}^N \in (\mathbb{R}^d)^N$ be random variables independent of each other and $\mathcal{J}_{k+1}$-measurable, for a $\sigma$-field $\mathcal{J}_{k+1}$ such that $(B^i_k)_{i=1}^N$ are independent of $\mathcal{J}_{k+1}$. Then, in all the following demonstrations, we denote $a_{k+1} = \frac{1}{N}\sum_{i=1}^N a^i_{k+1}$ and $a_{k+1,S_k} = \frac{1}{pN}\sum_{i\in S_k} a^i_{k+1}$.

The vector $a_{k+1}$ (resp. the $\sigma$-field $\mathcal{J}_{k+1}$) may represent various objects, for instance $g_{k+1}$, $\widehat{g}_{k+1}$, $\Omega_{k+1}$ (resp. $\mathcal{F}_{k+1}$, $\mathcal{G}_{k+1}$, $\mathcal{H}_{k+1}$).

Remark 4. We can add the following remarks on the assumptions.

• Assumption 6 can be extended to probabilities depending on the worker, $(p_i)_{i\in\llbracket 1,N\rrbracket}$.

• Assumption 5 in fact requires access to a sequence of i.i.d. compression operators $\mathcal{C}_{\mathrm{up/dwn},k}$ for $k \in \mathbb{N}$; for simplicity, we generally omit the index $k$.

• Assumption 4 in fact only requires that, for any $i \in \llbracket 1, N\rrbracket$, $\mathbb{E}\big[\|g^i_{k+1,*} - \nabla F_i(w_*)\|^2 \,\big|\, \mathcal{H}_k\big] \le \frac{(\sigma_*)^2_i}{b}$, and the results then hold for $\sigma_*^2 = \frac{1}{N}\sum_{i=1}^N (\sigma_*)^2_i$. In other words, the bound does not need to be uniform over the workers; only the average truly matters.

D.1 Useful identities and inequalities

Lemma S1. Let $N \in \mathbb{N}$ and $d \in \mathbb{N}$. For any sequence of vectors $(a_i)_{i=1}^N \in (\mathbb{R}^d)^N$, we have the following inequalities:
$$\Big\|\sum_{i=1}^N a_i\Big\|^2 \le \Big(\sum_{i=1}^N \|a_i\|\Big)^2 \le N \sum_{i=1}^N \|a_i\|^2.$$
The first part of the inequality corresponds to the triangle inequality, while the second part is Cauchy's inequality.

Lemma S2. Let $\alpha \in [0, 1]$ and $(x, y) \in (\mathbb{R}^d)^2$; then:
$$\|\alpha x + (1 - \alpha) y\|^2 = \alpha \|x\|^2 + (1 - \alpha)\|y\|^2 - \alpha(1 - \alpha)\|x - y\|^2.$$
This is the decomposition of the norm of a convex combination.

Lemma S3. Let $X$ be a random vector of $\mathbb{R}^d$; then, for any vector $x \in \mathbb{R}^d$:
$$\mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X - x\|^2 - \|\mathbb{E}X - x\|^2.$$
This equality is a generalization of the well-known decomposition of the variance (recovered with $x = 0$).

Lemma S4. If $F: \mathcal{X} \subset \mathbb{R}^d \to \mathbb{R}$ is strongly convex, then the following inequality holds:
$$\forall (x, y) \in (\mathbb{R}^d)^2, \quad \langle \nabla F(x) - \nabla F(y) \mid x - y \rangle \ge \mu \|x - y\|^2.$$
This inequality is a consequence of strong convexity and can be found in [Nesterov, 2004, equation 2.1.22].
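As a quick numerical sanity check (not part of the original material), Lemma S2, for instance, can be verified on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, alpha = rng.normal(size=5), rng.normal(size=5), 0.3
lhs = np.linalg.norm(alpha * x + (1 - alpha) * y) ** 2
rhs = (alpha * np.linalg.norm(x) ** 2 + (1 - alpha) * np.linalg.norm(y) ** 2
       - alpha * (1 - alpha) * np.linalg.norm(x - y) ** 2)
assert np.isclose(lhs, rhs)   # the identity of Lemma S2 holds exactly
```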

D.2 Lemmas for proof of convergence

Below are presented the technical lemmas needed to prove the contraction of the Lyapunov function for Theorems S5 and S6. In this section, we assume that Assumptions 1 to 6 hold. In Appendices D.2.1 and D.2.2, we separate the lemmas that are required only for the case without memory from those required only for the case with memory.

The first lemma is very simple and follows directly from the definition of $\Delta^i_k$. We recall that $\Delta^i_k$ is the difference between the computed gradient and the memory held on device $i$. It corresponds to the information which will be compressed and sent from device $i$ to the central server.

Lemma S5 (Bounding the compressed term). The squared norm of the compressed term sent by each node to the central server can be bounded as follows:
$$\forall k \in \mathbb{N}, \; \forall i \in \llbracket 1, N\rrbracket, \quad \|\Delta^i_k\|^2 \le 2\big(\|g^i_{k+1} - h^i_*\|^2 + \|h^i_k - h^i_*\|^2\big).$$

Proof. Let $k \in \mathbb{N}$ and $i \in \llbracket 1, N\rrbracket$; we have by definition:
$$\|\Delta^i_k\|^2 = \|g^i_{k+1} - h^i_k\|^2 = \|(g^i_{k+1} - h^i_*) + (h^i_* - h^i_k)\|^2.$$
Applying Lemma S1 gives the expected result.

Below, we exhibit a recursion over the memory term $h^i_k$ involving the stochastic gradients. This recursion will be used in Lemma S13. The existence of this recursion was first highlighted by Mishchenko et al. [2019].

Lemma S6 (Expectation of the memory term). The memory term $h^i_{k+1}$ can be expressed using a recursion involving the stochastic gradient $g^i_{k+1}$:
$$\forall k \in \mathbb{N}, \; \forall i \in \llbracket 1, N\rrbracket, \quad \mathbb{E}\big[h^i_{k+1} \,\big|\, \mathcal{F}_{k+1}\big] = (1 - \alpha) h^i_k + \alpha\, g^i_{k+1}.$$

Proof. Let $k \in \mathbb{N}$ and $i \in \llbracket 1, N\rrbracket$. We just need to decompose $h^i_{k+1}$ using its definition:
$$h^i_{k+1} = h^i_k + \alpha \widehat{\Delta}^i_k = h^i_k + \alpha\big(\widehat{g}^i_{k+1} - h^i_k\big) = (1 - \alpha) h^i_k + \alpha\, \widehat{g}^i_{k+1},$$
and considering that $\mathbb{E}\big[\widehat{g}^i_{k+1} \,\big|\, \mathcal{F}_{k+1}\big] = g^i_{k+1}$ (Proposition S3), the proof is completed.

In Lemma S7, we rewrite $\|g^i_{k+1}\|^2$ and $\|g^i_{k+1} - h^i_*\|^2$ so as to make appear:

1. the noise coming from stochasticity,

2. $\|g^i_{k+1} - g^i_{k+1,*}\|^2$, which is the term to which cocoercivity will later be applied (see Assumption 2).

Lemma S7 is required to correctly apply cocoercivity in Lemma S14.

Lemma S7 (Before using co-coercivity). Let $k \in \llbracket 0, K\rrbracket$ and $i \in \llbracket 1, N\rrbracket$. The noise in the stochastic gradients, as defined in Assumptions 3 and 4, can be controlled as follows:
$$\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big] \le \frac{2}{N}\sum_{i=1}^{N}\Big(\mathbb{E}\big[\|g^i_{k+1} - g^i_{k+1,*}\|^2 \,\big|\, \mathcal{H}_k\big] + \Big(\frac{\sigma^2_*}{b} + B^2\Big)\Big), \tag{S4}$$
$$\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{H}_k\big] \le \frac{2}{N}\sum_{i=1}^{N}\Big(\mathbb{E}\big[\|g^i_{k+1} - g^i_{k+1,*}\|^2 \,\big|\, \mathcal{H}_k\big] + \frac{\sigma^2_*}{b}\Big). \tag{S5}$$

Proof. Let $k \in \mathbb{N}$. For eq. (S4):
$$\|g^i_{k+1}\|^2 = \|g^i_{k+1} - g^i_{k+1,*} + g^i_{k+1,*}\|^2 \le 2\big(\|g^i_{k+1} - g^i_{k+1,*}\|^2 + \|g^i_{k+1,*}\|^2\big),$$
using the inequality of Lemma S1. Taking the expectation with regards to the filtration $\mathcal{H}_k$ and using Assumptions 3 and 4 gives the first result.

For eq. (S5), we use Lemma S1 and write:
$$\|g^i_{k+1} - h^i_*\|^2 = \|(g^i_{k+1} - g^i_{k+1,*}) + (g^i_{k+1,*} - \nabla F_i(w_*))\|^2 \le 2\big(\|g^i_{k+1} - g^i_{k+1,*}\|^2 + \|g^i_{k+1,*} - \nabla F_i(w_*)\|^2\big).$$
Taking the expectation, we have:
$$\begin{aligned}
\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{H}_k\big]
&\le 2\Big(\mathbb{E}\big[\|g^i_{k+1} - g^i_{k+1,*}\|^2 \,\big|\, \mathcal{H}_k\big] + \mathbb{E}\big[\|g^i_{k+1,*} - \nabla F_i(w_*)\|^2 \,\big|\, \mathcal{H}_k\big]\Big) \\
&\le 2\Big(\mathbb{E}\big[\|g^i_{k+1} - g^i_{k+1,*}\|^2 \,\big|\, \mathcal{H}_k\big] + \frac{\sigma^2_*}{b}\Big) \qquad \text{using Assumption 3.}
\end{aligned}$$

Demonstrating that the Lyapunov function is a contraction requires bounding $\|g_{k+1,S_k}\|^2$, which in turn requires controlling each term $\big(\|g^i_{k+1}\|^2\big)_{i=1}^N$ of the sum. This leads us to invoke the smoothness of $F$ (a consequence of Assumption 2).

Lemma S8. Regardless of whether we use memory, we have the following bound on the squared norm of the gradient, for all $k$ in $\mathbb{N}$:
$$\mathbb{E}\big[\|g_{k+1,S_k}\|^2 \,\big|\, \mathcal{I}_k\big] \le \frac{1}{pN^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{H}_k\big] + L\,\langle \nabla F(w_k) \mid w_k - w_* \rangle.$$

Proof. Let $k \in \mathbb{N}$:
$$\|g_{k+1,S_k}\|^2 = \Bigg\|\frac{1}{pN}\sum_{i\in S_k} g^i_{k+1}\Bigg\|^2 = \Bigg\|\frac{1}{pN}\sum_{i\in S_k}\big(g^i_{k+1} - \nabla F_i(w_k)\big) + \frac{1}{pN}\sum_{i\in S_k}\nabla F_i(w_k)\Bigg\|^2.$$

Now, taking the conditional expectation w.r.t. $\sigma(\mathcal{I}_k \cup \mathcal{B}_k)$ (including $\mathcal{B}_k$ in the $\sigma$-field allows to not consider the randomness associated with the device sampling) and expanding the squared norm:
$$\begin{aligned}
\mathbb{E}\big[\|g_{k+1,S_k}\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big]
&= \mathbb{E}\Bigg[\Bigg\|\frac{1}{pN}\sum_{i\in S_k}\big(g^i_{k+1} - \nabla F_i(w_k)\big)\Bigg\|^2 \,\Bigg|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\Bigg] \\
&\quad + 2\,\mathbb{E}\Bigg[\Bigg\langle \frac{1}{pN}\sum_{i\in S_k}\big(g^i_{k+1} - \nabla F_i(w_k)\big) \,\Bigg|\, \frac{1}{pN}\sum_{j\in S_k}\nabla F_j(w_k)\Bigg\rangle \,\Bigg|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\Bigg] \\
&\quad + \mathbb{E}\Bigg[\Bigg\|\frac{1}{pN}\sum_{i\in S_k}\nabla F_i(w_k)\Bigg\|^2 \,\Bigg|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\Bigg].
\end{aligned}$$
Moreover, for all $(i, j) \in \llbracket 1, N\rrbracket^2$, $\mathbb{E}\big[\langle g^i_{k+1} - \nabla F_i(w_k) \mid \nabla F_j(w_k)\rangle \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] = 0$, and $\nabla F_{S_k}(w_k) := \frac{1}{pN}\sum_{i\in S_k}\nabla F_i(w_k)$ is $\sigma(\mathcal{I}_k \cup \mathcal{B}_k)$-measurable, hence:
$$\mathbb{E}\big[\|g_{k+1,S_k}\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] \le \mathbb{E}\Bigg[\Bigg\|\frac{1}{pN}\sum_{i\in S_k}\big(g^i_{k+1} - \nabla F_i(w_k)\big)\Bigg\|^2 \,\Bigg|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\Bigg] + \|\nabla F_{S_k}(w_k)\|^2. \tag{S6}$$

To handle $\|\nabla F_{S_k}(w_k)\|^2$, we apply cocoercivity (see Assumption 2) and then take the expectation w.r.t. the $\sigma$-algebra $\mathcal{I}_k$:
$$\mathbb{E}\big[\|\nabla F_{S_k}(w_k)\|^2 \,\big|\, \mathcal{I}_k\big] \le L\,\big\langle \mathbb{E}\big[\nabla F_{S_k}(w_k) \,\big|\, \mathcal{I}_k\big] \,\big|\, w_k - w_*\big\rangle = L\,\langle \nabla F(w_k) \mid w_k - w_* \rangle.$$

Now, for the sake of clarity, we denote $\Pi = \big\|\frac{1}{pN}\sum_{i\in S_k}\big(g^i_{k+1} - \nabla F_i(w_k)\big)\big\|^2$; then:
$$\begin{aligned}
\mathbb{E}\big[\Pi \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big]
&= \frac{1}{p^2N^2}\sum_{i\in S_k}\mathbb{E}\big[\|g^i_{k+1} - \nabla F_i(w_k)\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] \\
&\quad + \frac{1}{p^2N^2}\sum_{\substack{i,j\in S_k \\ i\ne j}}\underbrace{\mathbb{E}\big[\big\langle g^i_{k+1} - \nabla F_i(w_k) \,\big|\, g^j_{k+1} - \nabla F_j(w_k)\big\rangle \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big]}_{=\,0 \text{ by independence of } (g^i_{k+1})_{i=1}^N} \\
&= \frac{1}{p^2N^2}\sum_{i\in S_k}\mathbb{E}\big[\|(g^i_{k+1} - \nabla F_i(w_*)) + (\nabla F_i(w_*) - \nabla F_i(w_k))\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big].
\end{aligned}$$
Developing the squared norm a second time, and using $\mathbb{E}\big[g^i_{k+1} \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] = \nabla F_i(w_k)$:
$$\begin{aligned}
\mathbb{E}\big[\Pi \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big]
&= \frac{1}{p^2N^2}\sum_{i\in S_k}\mathbb{E}\big[\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] \\
&\quad - \frac{2}{p^2N^2}\sum_{i\in S_k}\langle \nabla F_i(w_k) - \nabla F_i(w_*) \mid \nabla F_i(w_k) - \nabla F_i(w_*)\rangle
 + \frac{1}{p^2N^2}\sum_{i\in S_k}\|\nabla F_i(w_k) - \nabla F_i(w_*)\|^2 \\
&= \frac{1}{p^2N^2}\sum_{i\in S_k}\Big(\mathbb{E}\big[\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] - \|\nabla F_i(w_k) - \nabla F_i(w_*)\|^2\Big),
\end{aligned}$$
and applying cocoercivity (Assumption 2):
$$\mathbb{E}\big[\Pi \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] \le \frac{1}{p^2N^2}\sum_{i\in S_k}\Big(\mathbb{E}\big[\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\big|\, \sigma(\mathcal{I}_k \cup \mathcal{B}_k)\big] - L\,\langle \nabla F_i(w_k) - \nabla F_i(w_*) \mid w_k - w_*\rangle\Big). \tag{S7}$$

Now, we consider the randomness associated with the device sampling. Remember that, because $\mathcal{I}_k \subset \sigma(\mathcal{I}_k \cup \mathcal{B}_k)$, we have $\mathbb{E}[\Pi \mid \mathcal{I}_k] = \mathbb{E}\big[\mathbb{E}[\Pi \mid \sigma(\mathcal{I}_k \cup \mathcal{B}_k)] \,\big|\, \mathcal{I}_k\big]$. Thus, we now consider both terms of eq. (S7) w.r.t. the $\sigma$-field $\mathcal{I}_k$.

Handling the first term of eq. (S7):
$$\frac{1}{p^2N^2}\,\mathbb{E}\Bigg[\sum_{i\in S_k}\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\Bigg|\, \mathcal{I}_k\Bigg] = \frac{1}{p^2N^2}\sum_{i=1}^{N}\mathbb{E}\big[B^i_k\,\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\big|\, \mathcal{I}_k\big] = \frac{1}{pN^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\big|\, \mathcal{I}_k\big],$$
because $(B^i_k)_{i=1}^N$ are independent of $\mathcal{I}_k$ and of the stochastic gradients.

Handling the second term of eq. (S7):
$$L\,\Bigg\langle \frac{1}{p^2N^2}\sum_{i\in S_k}\big(\nabla F_i(w_k) - \nabla F_i(w_*)\big) \,\Bigg|\, w_k - w_*\Bigg\rangle = L\,\Bigg\langle \frac{1}{p^2N^2}\sum_{i=1}^{N}\big(\nabla F_i(w_k) - \nabla F_i(w_*)\big)B^i_k \,\Bigg|\, w_k - w_*\Bigg\rangle.$$
Taking the expectation w.r.t. the $\sigma$-algebra $\mathcal{I}_k$:
$$\mathbb{E}\Bigg[L\,\Bigg\langle \frac{1}{p^2N^2}\sum_{i\in S_k}\big(\nabla F_i(w_k) - \nabla F_i(w_*)\big) \,\Bigg|\, w_k - w_*\Bigg\rangle \,\Bigg|\, \mathcal{I}_k\Bigg] = L\,\Bigg\langle \frac{1}{pN^2}\sum_{i=1}^{N}\big(\nabla F_i(w_k) - \nabla F_i(w_*)\big) \,\Bigg|\, w_k - w_*\Bigg\rangle = \frac{L}{pN}\,\big\langle \nabla F(w_k) - \nabla F(w_*) \,\big|\, w_k - w_*\big\rangle.$$

Injecting this into eq. (S7), and since $\nabla F(w_*) = 0$:
$$\mathbb{E}\big[\Pi \,\big|\, \mathcal{I}_k\big] \le \frac{1}{pN^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - \nabla F_i(w_*)\|^2 \,\big|\, \mathcal{I}_k\big] - \frac{L}{pN}\,\langle \nabla F(w_k) \mid w_k - w_*\rangle.$$

Recall that we denote $h^i_* = \nabla F_i(w_*)$; returning to eq. (S6) and invoking again the bound on $\mathbb{E}\big[\|\nabla F_{S_k}(w_k)\|^2 \,\big|\, \mathcal{I}_k\big]$ obtained by cocoercivity:
$$\mathbb{E}\big[\|g_{k+1,S_k}\|^2 \,\big|\, \mathcal{I}_k\big] \le \frac{1}{pN^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{I}_k\big] + \Big(1 - \frac{1}{pN}\Big) L\,\langle \nabla F(w_k) \mid w_k - w_*\rangle,$$
which we simplify, using that $1 - \frac{1}{pN} \le 1$ and $\langle \nabla F(w_k) \mid w_k - w_*\rangle \ge 0$ (by convexity), into:
$$\mathbb{E}\big[\|g_{k+1,S_k}\|^2 \,\big|\, \mathcal{I}_k\big] \le \frac{1}{pN^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{I}_k\big] + L\,\langle \nabla F(w_k) \mid w_k - w_*\rangle.$$

D.2.1 Lemmas for the case without memory

In this subsection, we give lemmas that are used only to demonstrate Theorem S5 (i.e. without memory).

Lemma S9 helps to pass, for $k$ in $\mathbb{N}$, from $\widehat{g}_{k+1,S_k}$ to $(\widehat{g}^i_{k+1})_{i=1}^N$; this lemma will allow us to invoke Lemma S10.

Lemma S9. In the case without memory, we have the following identity for the squared norm of the compressed gradient (randomly sampled), for all $k$ in $\mathbb{N}$:
$$\mathbb{E}\big[\|\widehat{g}_{k+1,S_k}\|^2 \,\big|\, \mathcal{G}_{k+1}\big] = \frac{1-p}{pN^2}\sum_{i=1}^{N}\|\widehat{g}^i_{k+1}\|^2 + \|\widehat{g}_{k+1}\|^2.$$

Proof. The proof is quite straightforward using the expectation and the variance of $\widehat{g}_{k+1,S_k}$ computed in Propositions S7 and S8. We just need to decompose as follows to easily obtain the result:
$$\mathbb{E}\big[\|\widehat{g}_{k+1,S_k}\|^2 \,\big|\, \mathcal{G}_{k+1}\big] = \mathbb{V}\big[\widehat{g}_{k+1,S_k} \,\big|\, \mathcal{G}_{k+1}\big] + \big\|\mathbb{E}\big[\widehat{g}_{k+1,S_k} \,\big|\, \mathcal{G}_{k+1}\big]\big\|^2.$$

Lemma S10 is used to remove the uplink compression noise right after Lemma S9 has been applied.

Lemma S10 (Expectation of the squared norm of the compressed gradient when there is no memory). In the case without memory, we have the following bound on the squared norm of the compressed gradient, for all $k$ in $\mathbb{N}$:
$$\mathbb{E}\big[\|\widehat{g}_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big] \le \frac{\omega^{\mathrm{up}}_{\mathcal{C}}}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big] + \frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{H}_k\big] + L\,\langle \nabla F(w_k) \mid w_k - w_*\rangle.$$

Proof. Let $k$ in $\mathbb{N}$. First, we write:
$$\|\widehat{g}_{k+1}\|^2 = \|\widehat{g}_{k+1} - g_{k+1} + g_{k+1}\|^2 = \|\widehat{g}_{k+1} - g_{k+1}\|^2 + 2\,\langle \widehat{g}_{k+1} - g_{k+1} \mid g_{k+1}\rangle + \|g_{k+1}\|^2.$$
Taking the stochastic expectation (recall that $g_{k+1}$ is $\mathcal{F}_{k+1}$-measurable and that $\mathcal{H}_k \subset \mathcal{F}_{k+1}$):
$$\mathbb{E}\big[\mathbb{E}\big[\|\widehat{g}_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \,\big|\, \mathcal{H}_k\big] = \mathbb{E}\big[\mathbb{E}\big[\|\widehat{g}_{k+1} - g_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \,\big|\, \mathcal{H}_k\big] + 2\,\mathbb{E}\big[\mathbb{E}\big[\langle \widehat{g}_{k+1} - g_{k+1} \mid g_{k+1}\rangle \,\big|\, \mathcal{F}_{k+1}\big] \,\big|\, \mathcal{H}_k\big] + \mathbb{E}\big[\|g_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big]. \tag{S8}$$
We need to find a bound for each of the terms of eq. (S8). The last term is handled by Lemma S8, narrowing it down to the case $p = 1$.

It follows that we just need to bound $\|\widehat{g}_{k+1} - g_{k+1}\|^2$:
$$\begin{aligned}
\mathbb{E}\big[\|\widehat{g}_{k+1} - g_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big]
&= \mathbb{E}\big[\|\widehat{g}_{k+1} - \mathbb{E}[\widehat{g}_{k+1} \mid \mathcal{F}_{k+1}]\|^2 \,\big|\, \mathcal{F}_{k+1}\big]
= \mathbb{E}\Bigg[\Bigg\|\frac{1}{N}\sum_{i=1}^{N}\big(\widehat{g}^i_{k+1} - \mathbb{E}[\widehat{g}^i_{k+1} \mid \mathcal{F}_{k+1}]\big)\Bigg\|^2 \,\Bigg|\, \mathcal{F}_{k+1}\Bigg] \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|\widehat{g}^i_{k+1} - g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big]
+ \frac{1}{N^2}\sum_{i\ne j}\underbrace{\mathbb{E}\big[\big\langle \widehat{g}^i_{k+1} - g^i_{k+1} \,\big|\, \widehat{g}^j_{k+1} - g^j_{k+1}\big\rangle \,\big|\, \mathcal{F}_{k+1}\big]}_{=\,0 \text{ because the compressions of the } N \text{ devices are independent}} \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|\widehat{g}^i_{k+1} - g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big].
\end{aligned}$$
Combining with Proposition S4, we obtain:
$$\mathbb{E}\big[\|\widehat{g}_{k+1} - g_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \le \frac{\omega^{\mathrm{up}}_{\mathcal{C}}}{N^2}\sum_{i=1}^{N}\|g^i_{k+1}\|^2.$$
Summarizing, we have proved that:
$$\begin{cases}
\mathbb{E}\big[\|\widehat{g}_{k+1} - g_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \le \frac{\omega^{\mathrm{up}}_{\mathcal{C}}}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big], \\[4pt]
\mathbb{E}\big[\langle \widehat{g}_{k+1} - g_{k+1} \mid g_{k+1}\rangle \,\big|\, \mathcal{F}_{k+1}\big] = 0 \quad \text{(Proposition S3)}, \\[4pt]
\mathbb{E}\big[\|g_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big] \le \frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{H}_k\big] + L\,\langle \nabla F(w_k) \mid w_k - w_*\rangle \quad \text{(Lemma S8, with } p = 1\text{)}.
\end{cases}$$
Thus, we obtain from eq. (S8):
$$\mathbb{E}\big[\|\widehat{g}_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big] \le \frac{\omega^{\mathrm{up}}_{\mathcal{C}}}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1}\|^2 \,\big|\, \mathcal{H}_k\big] + \frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - h^i_*\|^2 \,\big|\, \mathcal{H}_k\big] + L\,\langle \nabla F(w_k) \mid w_k - w_*\rangle.$$

Lemma S11. In the case without memory, we have the following bound on the squared norm of the local compressed gradient, for all $k$ in $\mathbb{N}$ and all $i$ in $\llbracket 1, N\rrbracket$:
$$\mathbb{E}\big[\|\widehat{g}^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \le (\omega^{\mathrm{up}}_{\mathcal{C}} + 1)\,\|g^i_{k+1}\|^2.$$

Proof. Let $k$ in $\mathbb{N}$ and $i$ in $\llbracket 1, N\rrbracket$:
$$\begin{aligned}
\mathbb{E}\big[\|\widehat{g}^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big]
&= \mathbb{E}\big[\|\widehat{g}^i_{k+1} - g^i_{k+1} + g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] \\
&= \mathbb{E}\big[\|\widehat{g}^i_{k+1} - g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big] + 2\,\underbrace{\mathbb{E}\big[\langle \widehat{g}^i_{k+1} - g^i_{k+1} \mid g^i_{k+1}\rangle \,\big|\, \mathcal{F}_{k+1}\big]}_{=\,0} + \mathbb{E}\big[\|g^i_{k+1}\|^2 \,\big|\, \mathcal{F}_{k+1}\big].
\end{aligned}$$
To obtain the result, it remains to recall that $\|g^i_{k+1}\|^2$ is $\mathcal{F}_{k+1}$-measurable, and then to use Proposition S4 (eq. (S2)) to bound the first term by $\omega^{\mathrm{up}}_{\mathcal{C}}\|g^i_{k+1}\|^2$.

D.2.2 Lemmas for the case with memory

In this subsection, we give lemmas that are used only to demonstrate Theorems S6 and S7 (i.e. with memory).

In order to derive an upper bound on $\|w_{k+1} - w_*\|^2$, for $k$ in $\mathbb{N}$, we need to control $\|\widehat{g}_{k+1,S_k}\|^2$. This term is decomposed as a sum of three terms depending on:

1. the recursion over the memory term $(h^i_k)$,

2. the difference between the stochastic gradient at the current point and at the optimal point (later controlled by co-coercivity),

3. the noise coming from stochasticity.

Lemma S12. In the case with memory, we have the following upper bound on the squared norm of the compressed gradient, for all $k$ in $\mathbb{N}$:
$$\begin{aligned}
\mathbb{E}\big[\|\widehat{g}_{k+1,S_k}\|^2 \,\big|\, \mathcal{I}_k\big]
&\le 2\Big(\frac{2(\omega^{\mathrm{up}}_{\mathcal{C}} + 1)}{p} - 1\Big)\frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|g^i_{k+1} - g^i_{k+1,*}\|^2 \,\big|\, \mathcal{I}_k\big] \\
&\quad + \Big(\frac{2(\omega^{\mathrm{up}}_{\mathcal{C}} + 1)}{p} - 2\Big)\frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\big[\|h^i_k - h^i_*\|^2 \,\big|\, \mathcal{I}_k\big] + L\,\langle \nabla F(w_k) \mid w_k - w_*\rangle \\
&\quad + \frac{2\sigma^2_*}{Nb}\Big(\frac{2(\omega^{\mathrm{up}}_{\mathcal{C}} + 1)}{p} - 1\Big).
\end{aligned}$$

Proof. We take the expectation w.r.t. the σ-algebra Ik+1: Doing a bias-variance decomposition:E[‖gk+1,Sk‖

2∣∣∣ Ik+1

]= E

[‖gk+1,Sk − gk+1‖2

∣∣∣ Ik+1

]+ E

[‖gk+1‖2

∣∣∣ Ik+1

].

First term.

E[‖gk+1,Sk − gk+1‖2

∣∣∣ Ik+1

]= E

∥∥∥∥∥ 1

Np

∑i∈Sk

∆ik + hk −

1

N

N∑i=1

∆ik + hk

∥∥∥∥∥2∣∣∣∣∣∣ Ik+1

= E

∥∥∥∥∥ 1

Np

N∑i=1

∆ik(Bi − p)

∥∥∥∥∥2∣∣∣∣∣∣ Ik+1

=

1

N2p2

N∑i=1

E[(Bi − p)2

∣∣ Ik+1

] ∥∥∥∆ik

∥∥∥2

,

by independence of device sampling and because (∆ik)Ni=1 are Ik+1-measurable. Next:

E[‖gk+1,Sk − gk+1‖2

∣∣∣ Ik+1

]=

1

N2p2

N∑i=1

p(1− p)E[∥∥∥∆i

k

∥∥∥2∣∣∣∣ Ik+1

].

Now we take expectation w.r.t. the σ-algebra Ik. Because Ik ⊂ Fk+1, we have for all i in 1, · · · , N,E[∥∥∆i

k

∥∥2∣∣∣ Ik] = E

[E[∥∥∆i

k

∥∥2∣∣∣ Fk+1

] ∣∣∣ Ik] and we use Proposition S4:

E[‖gk+1,Sk − gk+1‖2

∣∣∣ Ik] =(ωupC + 1)(1− p)

N2p

N∑i=1

E[∥∥∆i

k

∥∥2∣∣∣ Ik] , with Lemma S5

=2(ωupC + 1)(1− p)

N2p

N∑i=1

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]+ E

[∥∥hik − hi∗∥∥2∣∣∣ Ik] .

Page 36: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Second term. Again, with a bias-variance decomposition:

E[‖gk+1‖2

∣∣∣ Ik+1

]= E

[‖gk+1‖2

∣∣∣ Ik+1

]+ E

[‖gk+1 − gk+1‖2

∣∣∣ Ik+1

].

We take expectation w.r.t. the σ-algebra Ik, thus the first term is handled with Lemma S8:

E[‖gk+1‖2

∣∣∣ Ik] ≤ 1

N2

N∑i=1

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]+ L 〈∇F (wk) | wk − w∗〉 .

Considering the second term, by independence of the “N” compressions and using as previously Proposition S4(because Ik ⊂ Fk+1), we have:

E[‖gk+1 − gk+1‖2

∣∣∣ Ik] =1

N2

N∑i=1

E[∥∥∥∆i

k −∆ik

∥∥∥2∣∣∣∣ Ik] ≤ ωup

CN2

N∑i=1

∥∥∆ik

∥∥2, and with Lemma S5

≤2ωupC

N2

N∑i=1

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]+ E

[∥∥hik − hi∗∥∥2∣∣∣ Ik] .

At the end:

E[‖gk+1‖2

∣∣∣ Ik] =2ωupC + 1

N2

N∑i=1

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]+

2ωupC

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ L 〈∇F (wk) | wk − w∗〉 .

We can combine the first and second term, and it follows that:

E[‖gk+1,Sk‖

2∣∣∣ Ik] ≤ (2(ωup

C + 1)(1− p)p

+ 2ωupC + 1

)1

N2

N∑i=1

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]

+

(2(ωupC + 1)(1− p)

p+ 2ωup

C

)1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ L 〈∇F (wk) | wk − w∗〉 ,

we can now apply Lemma S7 to conclude the proof:

E[‖gk+1,Sk‖

2∣∣∣ Ik] ≤ 2

(2(ωupC + 1)(1− p)

p+ 2ωup

C + 1

)1

N2

N∑i=1

E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Ik]

+

(2(ωupC + 1)(1− p)

p+ 2ωup

C

)1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ L 〈∇F (wk) | wk − w∗〉

+2σ∗Nb×(

2(ωupC + 1)(1− p)

p+ 2ωup

C + 1

),

and simplifying each coefficient gives the result.

To show that the Lyapunov function is a contraction, we need to find a bound for each terms. Bounding‖wk+1 − w∗‖2, for k in N, flows from update schema (see eq. (3)) decomposition. However the memory term∥∥hik+1 − hi∗

∥∥2 involved in the Lyapunov function doesn’t show up naturally.

The aim of Lemma S13 is precisely to provide a recursive bound over the memory term to highlight the contraction.Like Lemma S6, the following lemma comes from Mishchenko et al. [2019].

Page 37: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Lemma S13 (Recursive inequalities over memory term). Let k ∈ N and let i ∈ J1, NK. The memory term usedin the uplink broadcasting can be bounded using a recursion:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Ik] ≤ (1 + p(2α2ωup

C + 2α2 − 3α)))E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ 2p(2α2ωup

C + 2α2 − α)E[‖gk+1 − gk+1,∗‖2

∣∣∣ Ik]+ 2p

σ2∗b

(2α2(ωup

C + 1)− α).

Proof. The proof is done in two steps:

1. First, we povide a recursive bound on memory when the device is used for the update (i.e such that for k inN, for i in 1, ...N, Bik = 1)

2. Then, we generalize to the case with all i in J1, N, K regardless to if they are used at the round k.

First part. Let k ∈ N and let i ∈ J1, NK such that Bik = 1

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Fk+1

]=∥∥E [hik+1

∣∣ Fk+1

]− hi∗

∥∥2

+ E[∥∥hik+1 − E

[hik+1

∣∣ Fk+1

]∥∥2∣∣∣ Fk+1

]using Lemma S3 ,

and now with Lemma S6:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Fk+1

]=∥∥(1− α)hik + αgik+1 − hi∗

∥∥2+ E

[∥∥hik+1 − E[hik+1

∣∣ Fk+1

]∥∥2∣∣∣ Fk+1

].

Now recall that hik+1 = hik + α∆ik and E

[∆ik

∣∣∣ Fk+1

]= ∆i

k:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Fk+1

]=∥∥(1− α)(hik − hi∗) + α(gik+1 − hi∗)

∥∥2+ α2E

[∥∥∥∆ik −∆i

k

∥∥∥2∣∣∣∣ Fk+1

].

Using Lemma S2 of Appendix D.1 and Proposition S4:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Fk+1

]≤ (1− α)

∥∥hik − hi∗∥∥2+ α

∥∥gik+1 − hi∗∥∥2

− α(1− α)∥∥hik − gik+1

∥∥2+ α2ωup

C∥∥∆i

k

∥∥2.

Because hik − gik+1 = ∆ik:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Fk+1

]≤ (1− α)

∥∥hik − hi∗∥∥2+ α

∥∥gik+1 − hi∗∥∥2

+ α (α(ωupC + 1)− 1)

∥∥∆ik

∥∥2,

and using Lemma S5:

≤ (1− α)∥∥hik − hi∗∥∥2

+ α∥∥gik+1 − hi∗

∥∥2

+ 2α (α(ωupC + 1)− 1)

(∥∥hik − hi∗∥∥2+∥∥gk+1 − hi∗

∥∥2)

≤(1 + 2α2ωup

C + 2α2 − 3α) ∥∥hik − hi∗∥∥2

+ α(2αωupC + 2α− 1)

∥∥gk+1 − hi∗∥∥2

.

Finally using eq. (S5) of Lemma S7 and writing that:

E[∥∥gk+1 − hi∗

∥∥2∣∣∣ Hk] = E

[E[∥∥gk+1 − hi∗

∥∥2∣∣∣ Fk+1

] ∣∣∣ Hk] (because Hk ⊂ Fk+1) ,

Page 38: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

we have:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Hk] ≤ (1 + 2α2ωup

C + 2α2 − 3α︸ ︷︷ ︸=T1

)E[∥∥hik − hi∗∥∥2

∣∣∣ Hk]+ 2(2α2ωup

C + 2α2 − α︸ ︷︷ ︸T2

)E[‖gk+1 − gk+1,∗‖2

∣∣∣ Hk]

+ 2σ2∗b

(2α2(ωup

C + 1)− α)

︸ ︷︷ ︸T3

,

which conclude the first part of the proof. Now we take the general case with ∀i ∈ J1, NK, Bik = 0 or 1.

Second part. Let k ∈ N and let i ∈ J1, NK.

To resume, if the device participate to the iteration k, we have

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Hk] ≤ (1 + T1)E

[∥∥hik − hi∗∥∥2∣∣∣ Hk]+ 2T2E

[‖gk+1 − gk+1,∗‖2

∣∣∣ Hk]+ 2T3 ,

otherwise:E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Hk] = E

[∥∥hik − hi∗∥∥2∣∣∣ Hk] .

In other words, for all i in J1, NK:

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Hk] ≤ (1 + T1)BikE

[∥∥hik − hi∗∥∥2∣∣∣ Hk]+ 2T2B

ikE[‖gk+1 − gk+1,∗‖2

∣∣∣ Hk]+ 2T3Bik

+ (1−Bik)E[∥∥hik − hi∗∥∥2

∣∣∣ Hk]≤ (1 + T1B

ik)E

[∥∥hik − hi∗∥∥2∣∣∣ Hk]+ 2T2B

ikE[‖gk+1 − gk+1,∗‖2

∣∣∣ Hk]+ 2T3Bik

Taking expectation w.r.t σ-algebra Ik gives the result.

After successfully invoking all previous lemmas, we will finally be able to use co-coercivity. Lemma S14 showshow Assumption 2 is used to do it. After this stage, proof will be continued by applying strong-convexity of F .

Lemma S14 (Applying co-coercivity). This lemma shows how to apply co-coercivity on stochastic gradients.

∀k ∈ N ,1

N

N∑i=1

E[∥∥gik+1 − gk+1,∗

∥∥2∣∣∣ Hk] ≤ L 〈∇F (wk) | wk − w∗〉 .

Proof. Let k ∈ N.

1

N

N∑i=1

E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Hk] ≤ 1

N

N∑i=1

L⟨E[gik+1 − gik+1,∗

∣∣ Hk] ∣∣ wk − w∗⟩ using Assumption 2,

≤ L

⟨1

N

N∑i=1

E[gik+1 − gik+1,∗

∣∣ Hk]∣∣∣∣∣ wk − w∗

≤ L

⟨1

N

N∑i=1

∇Fi(wk)−∇Fi(w∗)

∣∣∣∣∣ wk − w∗⟩.

Page 39: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

E Proofs of Theorems

In this section we give demonstrations of all our theorems, that is to say, first the proofs of Theorems S5 and S6from which flow Theorem 1. Their demonstration sketch is drawn from Mishchenko et al. [2019]. And in a secondtime, we give a complete demonstration of theorems stated in the main paper: Theorems 2 and 3.

For the sake of demonstration, we define a Lyapunov function Vk [as in Mishchenko et al., 2019; Liu et al., 2020],with k in J1,KK:

Vk = ‖wk − w∗‖2 + 2γ2C1

N

N∑i=1

∥∥hik − hi∗∥∥2.

The Lyapunov function is defined combining two terms:

1. the distance from parameter wk to optimal parameter w∗

2. The memory term, the distance between the next element prediction hik and the true gradient hi∗ = ∇Fi(w∗).

The aim is to proof that this function is a (1− γµ) contraction for each variant of Artemis, and also when usingPolyak-Ruppert averaging. To show that it’s a contraction, we need three stages:

1. we develop the update schema defined in eq. (3) to get a first bound on ‖wk − w∗‖2

2. we find a recurrence over the memory term∥∥hik − hi∗∥∥2

3. and finally we combines the two equations to obtain the expected contraction using co-coercivity and strongconvexity.

E.1 Proof of main Theorem for Artemis - variant without memory

Theorem S5 (Unidirectional or bidirectional compression without memory). Considering that Assumptions 1to 6 hold. Taking γ such that

γ ≤ pN

L(ωdwnC + 1) (pN + 2(ωup

C + 1)),

then running Artemis with α = 0 (i.e without memory), we have for all k in N:

E ‖wk+1 − w∗‖2 ≤ (1− γµ)k+1 ‖w0 − w∗‖2 + 2γE

µpN,

with E = (ωdwnC + 1)

((ωupC + 1)

σ2∗b

+ (ωupC + 1− p)B2

). In the case of unidirectional compression (resp. no

compression), we have ωdwnC = 0 (resp. ωup/dwn

C = 0).

Proof. In the case of variant of Artemis with α = 0, we don’t have any memory term, thus p = 0 and we don’tneed to use the Lyapunov function.

Let k in N, we start by writing that by definition of eq. (3):

‖wk+1 − w∗‖2 = ‖wk − γΩk+1,Sk − w∗‖2

= ‖wk − w∗‖2 − 2γ 〈Ωk+1,Sk | wk − w∗〉+ γ2 ‖Ωk+1,Sk‖2,

with Ωk+1,Sk = Cdwn

(1N∑Ni=1 g

ik+1

). First, we have E [Ωk+1,Sk | σ(Gk+1 ∪ Bk)] = gk+1,Sk ; secondly considering

that:

E[‖Ωk+1,Sk‖

2∣∣∣ σ(Gk+1 ∪ Bk)

]= V(Ωk+1,Sk) + ‖E [Ωk+1,Sk | σ(Gk+1 ∪ Bk)]‖2 = (ωdwn

C + 1) ‖gk+1,Sk‖2,

Page 40: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

it leads to:

E[‖wk+1 − w∗‖2

∣∣∣ σ(Gk+1 ∪ Bk)]

= E[‖wk − w∗‖2

∣∣∣ σ(Gk+1 ∪ Bk)]− 2γ 〈gk+1,Sk | wk − w∗〉

+ γ2(ωdwnC + 1) ‖gk+1,Sk‖

2.

Now, if we take expectation w.r.t σ-algebra Ik ⊂ σ(Gk+1 ∪Bk) (the inclusion is true because Gk+1 contains Bk−1);with use of Propositions S2, S3 and S7 and Lemma S9 we obtain :

E[‖wk+1 − w∗‖2

∣∣∣ Ik] = E[‖wk − w∗‖2

∣∣∣ Ik] (S9)

− 2γ 〈∇F (wk) | wk − w∗〉

+ γ2(ωdwnC + 1)

(1− p)pN2

N∑i=0

E[∥∥gik+1

∥∥2∣∣∣ Ik]

+ γ2(ωdwnC + 1)E

[‖gk+1‖2

∣∣∣ Ik] .For sake of clarity we temporarily note: Π =

(1− p)pN2

∑Ni=0 E

[∥∥gik+1

∥∥2∣∣∣ Ik]+ E

[‖gk+1‖2

∣∣∣ Ik]. Recall that infact, Π is equal to E

[‖gk+1,Sk‖

2∣∣∣ Ik].

Using Lemmas S10 and S11, we have:

E [Π | Ik] ≤(1− p)(ωup

C + 1)

pN2

N∑i=0

E[∥∥gik+1

∥∥2∣∣∣ Ik]

+ωupCN2

N∑i=0

E[∥∥gik+1

∥∥2∣∣∣ Ik]+

1

N

N∑i=0

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]

+ L 〈∇F (wk) | wk − w∗〉 .

≤(ωupC + 1− p)pN2

N∑i=0

E[∥∥gik+1

∥∥2∣∣∣ Ik]+

1

N

N∑i=0

E[∥∥gik+1 − hi∗

∥∥2∣∣∣ Ik]

+ L 〈∇F (wk) | wk − w∗〉 .

Lets introducing the noise at optimal point w∗ with the two equations of Lemma S7:

E [Π | Ik] ≤(ωupC + 1− p)pN2

N∑i=1

2

(E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Ik]+ (

σ2∗b

+B2)

)

+1

N2

N∑i=1

2

(E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Ik]+

σ2∗b

)+ L 〈∇F (wk) | wk − w∗〉 .

≤2(ωupC + 1)

pN2

N∑i=1

E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Ik]

+ L 〈∇F (wk) | wk − w∗〉+

2×(

(ωupC + 1)

σ2∗b

+ ωupC + 1− p)B2

)pN

.

Page 41: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Invoking cocoercivity (Assumption 2):

E [Π | Ik] ≤2(ωupC + 1)

pN2

N∑i=1

E[L⟨gik+1 − gik+1,∗

∣∣ wk − w∗⟩ ∣∣ Ik]

+ L 〈∇F (wk) | wk − w∗〉+

2×(

(ωupC + 1)

σ2∗b

+ ωupC + 1− p)B2

)pN

≤2(ωupC + 1)L

pN〈∇F (wk) | wk − w∗〉

+ L 〈∇F (wk) | wk − w∗〉+

2×(

(ωupC + 1)

σ2∗b

+ ωupC + 1− p)B2

)pN

. (S10)

Finally, we can inject eq. (S10) in eq. (S9) to obtain:

E[‖wk+1 − w∗‖2

∣∣∣ Ik] ≤ ‖wk − w∗‖2 − 2γ

(1−

γL(ωdwnC + 1)(ωup

C + 1)

pN− γL(ωdwn

C + 1)

2

)〈∇F (wk) | wk − w∗〉

+

=E︷ ︸︸ ︷(ωdwnC + 1)

((ωupC + 1)

σ2∗b

+ (ωupC + 1− p)B2

)pN

.

(S11)

We need 1 − γL(ωdwnC +1)(ωup

C +1)

pN − γL(ωdwnC +1)2 ≥ 0 in order to further apply strong convexity. This condition is

equivalent to:

γ ≤ 2pN

L(ωdwnC + 1) (pN + 2(ωup

C + 1)).

Finally, using strong convexity of F (Assumption 1), we rewrite Equation (S11):

E[‖wk+1 − w∗‖2

∣∣∣ Ik] ≤ ‖wk − w∗‖2 − 2γµ

(1−

γL(ωdwnC + 1)(ωup

C + 1)

pN− γL(ωdwn

C + 1)

2

)‖wk − w∗‖2

+ 2γ2 E

pN, equivalent to:

≤(

1− 2γµ

(1−

γL(ωdwnC + 1)(ωup

C + 1)

pN− γL(ωdwn

C + 1)s

2

))‖wk − w∗‖2 + 2γ2 E

pN.

To guarantee a convergence in (1− γµ), we need:

1

2≥γL(ωdwn

C + 1)(ωupC + 1)

pN+γL(ωdwn

C + 1)

2

⇐⇒ γ ≤ pN

L(ωdwnC + 1) (pN + 2(ωup

C + 1)),

which is stronger than the condition obtained to correctly apply strong convexity. Then we are allowed to write:

E ‖wk+1 − w∗‖2 ≤ (1− γµ)E ‖wk − w∗‖2 + 2γ2 E

pN

⇐⇒ E ‖wk+1 − w∗‖2 ≤ (1− γµ)k+1E ‖w0 − w∗‖2 + 2γ2 E

pN× 1− (1− γµ)k+1

γµ

⇐⇒ E ‖wk+1 − w∗‖2 ≤ (1− γµ)k+1 ‖w0 − w∗‖2 + 2γE

µpN,

Page 42: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

and the proof is complete.

E.2 Proof of main Theorem for Artemis - variant with memory

Theorem S6 (Unidirectional or bidirectional compression with memory). Considering that Assumptions 1 to 6hold. We use w∗ to indicate the optimal parameter such that ∇F (w∗) = 0, and we note hi∗ = ∇Fi(w∗). We definethe Lyapunov function:

Vk = ‖wk − w∗‖2 + 2γ2C1

N

N∑i=1

∥∥hik − hi∗∥∥2.

We defined C ∈ R∗, such that:

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤ C ≤

N − γL(ωdwnC + 1)

(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

. (S12)

Then, using Artemis with a memory mechanism (α 6= 0), the convergence of the algorithm is guaranteed if:

12(ωupC + 1)

≤ α < min

32(ωupC + 1)

,

3N − γL(ωdwnC + 1)

(3N +

8(ωupC + 1)

p− 2

)2(ωupC + 1)(N − γL(ωdwn

C + 1)(N + 2))

γ < min

1

(ωdwnC + 1)

(1 +

2

Np

)L

, 3

(ωdwnC + 1)

(3 +

8(ωupC − 1)− 2p

Np

)L

,

N

(ωdwnC + 1)

(N +

4(ωupC + 1)

p− 2

)L

.

(S13)

And we have a bound for the Lyapunov function:

EVk+1 ≤ (1− γµ)k+1(‖w0 − w∗‖2 + 2Cγ2B2

)+ 2γ

E

µN,

with

E =σ2∗b

((ωdwnC + 1)

(2(ωupC + 1)

p− 1

)+ 2pC

(2α2(ωup

C + 1)− α))

.

In the case of unidirectional compression (resp. no compression), we have ωdwnC = 0 (resp. ωup/dwn

C = 0).

Proof. Let k ∈ J1,KK, by definition of the update schema in the partial-participation setting: wk+1 = wk −γΩk+1,Sk , with Ωk+1,Sk = Cdwn

(1pN

∑i∈Sk(∆i

k + hik)), thus:

‖wk+1 − w∗‖2 = ‖wk − w∗ + γΩk+1,Sk‖2

= ‖wk − w∗‖ − 2γ 〈Ωk+1,Sk | wk − w∗〉+ γ2 ‖Ωk+1,Sk‖2.

First, we have E [Ωk+1,Sk | σ(Gk+1 ∪ Bk)] = gk+1,Sk ; secondly considering that:

E[‖Ωk+1,Sk‖

2∣∣∣ σ(Gk+1 ∪ Bk)

]= V(Ωk+1,Sk) + ‖E [Ωk+1,Sk | σ(Gk+1 ∪ Bk)]‖2 = (ωdwn

C + 1) ‖gk+1,Sk‖2,

Page 43: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

it leads to:

E[‖wk+1 − w∗‖2

∣∣∣ σ(Gk+1 ∪ Bk)]

= E[‖wk − w∗‖2

∣∣∣ σ(Gk+1 ∪ Bk)]− 2γ 〈gk+1,Sk | wk − w∗〉

+ γ2(ωdwnC + 1) ‖gk+1,Sk‖

2.

we can invoke Lemma S12 with the σ-algebras Ik:

E[‖wk+1 − w∗‖2

∣∣∣ Ik] ≤ ‖wk − w∗‖2 − 2γE [〈gk+1,Sk | wk − w∗〉 | Ik]

+ 2γ2(ωdwnC + 1)

(2(ωupC + 1)

p− 1

)1

N2

N∑i=1

E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Ik]

+ γ2(ωdwnC + 1)

(2(ωupC + 1)

p− 2

)1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ γ2(ωdwn

C + 1)L 〈∇F (wk) | wk − w∗〉

+2σ∗Nb× γ2(ωdwn

C + 1)

(2(ωupC + 1)

p− 1

).

(S14)

Note that in the case of unidirectional compression, we have Ωk+1 = gk+1, and the steps above are morestraightforward. Recall that according to Lemma S13 (and taking the sum), we have:

1

N2

N∑i=1

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Ik]

≤(1 + p(2α2ωup

C + 2α2 − 3α)) 1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ 2p(2α2ωup

C + 2α2 − α)1

N2

N∑i=1

E[∥∥gik+1 − gik+1,∗

∥∥2∣∣∣ Ik]

+2p

N

σ2∗b

(2α2(ωup

C + 1)− α)

(S15)

With a linear combination (S14) + 2γ2C (S15):

E[‖wk+1 − w∗‖2

∣∣∣ Ik]+ 2γ2C1

N2

N∑i=1

E[∥∥hik+1 − hi∗

∥∥2∣∣∣ Ik]

≤ ‖wk − w∗‖2 − 2γE [〈gk+1,Sk | wk − w∗〉 | Ik]

+ 2γ2

((ωdwnC + 1)

(2(ωupC + 1)

p− 1

)+ 2pC(2α2ωup

C + 2α2 − α)

)︸ ︷︷ ︸

:=AC

× 1

N2

N∑i=1

E[∥∥gik+1 − gik+1,∗

∥∥ ∣∣ Ik]+ 2γ2C

(ωdwnC + 1

2C

(2(ωupC + 1)

p− 2

)+ 1 + p(2α2ωup

C + 2α2 − 3α)

)︸ ︷︷ ︸

:=DC

1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]

+ γ2(ωdwnC + 1)L 〈∇F (wk) | wk − w∗〉

+2γ2

N

(σ2∗b

((ωdwnC + 1)

(2(ωupC + 1)

p− 1

)+ 2pC

(2α2(ωup

C + 1)− α)))

︸ ︷︷ ︸:=E

.

Page 44: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Because Hk ⊂ Fk+1, because gk+1 is independent of Bk, and with Propositions S2 and S3:

E [gk+1 | Ik] = E [E [gk+1 | Fk+1] | Hk] = E [gk+1 | Hk] = ∇F (wk) .

We transform∥∥∥gik+1 − gik+1,∗

∥∥∥ applying co-coercivity (Lemma S14):

EVk+1 ≤ ‖wk − w∗‖2 − 2γ

(1− γL

(ωdwnC + 1

2+ACN

))〈∇F (wk) | wk − w∗〉

+ 2γ2CDC1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+2γ2

NE .

(S16)

Now, the goal is to apply strong convexity of F (Assumption 1) using the inequality presented in Lemma S4. Butthen we must have:

1− γL(ωdwnC + 1

2+ACN

)≥ 0 ,

However, in order to later obtain a convergence in (1− γµ), we will use a stronger condition and, instead, statethat we need:

γL

(ωdwnC + 1

2+ACN

)≤ 1

2

⇐⇒ AC ≤(1− γL(ωdwn

C + 1))N

2γL

⇐⇒(

(ωdwnC + 1)

(2(ωupC + 1)

p− 1

)+ 2pC(2α2ωup

C + 2α2 − α)

)≤ (1− γL(ωdwn

C + 1))N

2γL

⇐⇒ C ≤N − γL(ωdwn

C + 1)(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

.

This holds only if the numerator and the denominator are positive:N − γL(ωdwn

C + 1)(N +

4(ωupC +1)

p − 2)> 0⇐⇒ γ < N

(ωdwnC + 1)

(N +

4(ωupC + 1)

p− 2

)L

2α(ωupC + 1)− 1 ≤ 0⇐⇒ α ≥ 1

2(ωupC + 1)

.

Strong convexity is applied, and we obtain:

EVk+1 ≤(

1− 2γµ

(1− γL(ωdwn

C + 1)

2− γLAC

N

))‖wk − w∗‖2

+ 2γ2CDC1

N

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+ 2γ2 E

N.

(S17)

To guarantee a (1− γµ) convergence, constants must verify:

γL(ωdwnC + 1)2 − γLAC

N ≤ 12 (which is already verified)

DC ≤ 1− γµ⇐⇒ ωdwnC + 1

C

(ωupC + 1

p− 1

)≤ p(3α− 2α2ωup

C − 2α)− γµ

⇐⇒ C ≥(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))− γµ

.

Page 45: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

In the following we will consider that γµα = o

µ→0(1) which is possible because α is independent of µ (it depends

only of ωupC and ωdwn

C ) and it result to:

αp (3− 2α(ωupC + 1))− γµ ∼

µ→0αp (3− 2α(ωup

C + 1))

Thus, the condition on C becomes:

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤ C ,

which is correct only if α ≤ 32(ωupC + 1)

.

And we obtain the following conditions on C:

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤ C ≤

N − γL(ωdwnC + 1)

(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

.

It follows, that the above interval is not empty if:

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤N − γL(ωdwn

C + 1)(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

.

For sake of clarity we denote momentarily γ = (ωdwnC + 1)γL and Π =

ωupC +1

p , hence the below condition becomes:

8α(ωupC + 1)(Π− 1)γ − 4(Π− 1)γ ≤ 3N − 3γ (N + 2 + 4Π− 2))

− 2α(ωupC + 1)N + 2αγ(ωup

C + 1) (N + 4Π− 2)

⇐⇒ 2α(ωupC + 1)(N − γ(N + 2)) ≤ 3N − γ (3N + 8Π− 2)

And at the end, we obtain:

α ≤3N − γL(ωdwn

C + 1)

(3N +

8(ωupC + 1)

p− 2

)2(ωupC + 1)(N − γL(ωdwn

C + 1)(N + 2)).

Again, this implies two conditions on gamma:3N − γL(ωdwn

C + 1)(

3N + 8ωupC +1

p − 2)> 0⇐⇒ γ < 3

(ωdwnC + 1)

(3 +

8(ωupC − 1)− 2p

Np

)L

N − γL(ωdwnC + 1)(N + 2) > 0⇐⇒ γ < 1

(ωdwnC + 1)

(1 +

2

N

)L

.

The constant C exists, and from eq. (S17) we are allowed to write:

EVk+1 ≤ (1− γµ)EVk + 2γ2 E

N

⇐⇒ EVk+1 ≤ (1− γµ)k+1EV0 + 2γ2 E

N× 1− (1− γµ)k+1

γµ

=⇒ EVk+1 ≤ (1− γµ)k+1V0 + 2γE

µN.

Page 46: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Because V0 = E ‖w0 − w∗‖2 + 2γ2C 1N

∑Ni=0

∥∥hi∗∥∥2 ≤ ‖w0 − w∗‖2 + 2Cγ2B2 (using Assumption 4), we can write:

EVk+1 = (1− γµ)k+1(‖w0 − w∗‖2 + 2Cγ2B2

)+ 2γ

E

µN.

Thus, we highlighted that the Lyapunov function

Vk = ‖wk − w∗‖2 + 2γ2C1

N

N∑i=1

∥∥hik − hi∗∥∥2

is a (1− γµ) contraction if C is taken in a given interval, with γ and α satisfying some conditions. This guaranteethe convergence of the Artemis using version 1 or 2 with α 6= 0 (algorithm with uni-compression or bi-compressioncombined with a memory mechanism).

E.3 Proof of Theorem - Polyak-Ruppert averaging

Theorem S7 (Unidirectional or bidirectional compression using memory and averaging). We suppose now that Fis convex, thus µ = 0 and we consider that Assumptions 2 to 6 hold. We use w∗ to indicate the optimal parametersuch that ∇F (w∗) = 0, and we note hi∗ = ∇Fi(w∗). We define the Lyapunov function:

Vk = ‖wk − w∗‖2 + 2γ2C1

N

N∑i=1

∥∥hik − hi∗∥∥2.

We defined C ∈ R∗, such that:

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤ C ≤

N − γL(ωdwnC + 1)

(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

. (S18)

Then running variant of Artemis with α 6= 0, hence with a memory mechanism, and using Polyak-Ruppertaveraging, the convergence of the algorithm is guaranteed if:

12(ωupC + 1)

≤ α < min

32(ωupC + 1)

,

3N − γL(ωdwnC + 1)

(3N +

8(ωupC + 1)

p− 2

)2(ωupC + 1)(N − γL(ωdwn

C + 1)(N + 2))

γ < min

1

(ωdwnC + 1)

(1 +

2

Np

)L

, 3

(ωdwnC + 1)

(3 +

8(ωupC − 1)− 2p

Np

)L

,

N

(ωdwnC + 1)

(N +

4(ωupC + 1)

p− 2

)L

.

(S19)

And we have the following bound:

F

(1

K

K∑k=0

wk

)− F (w∗) ≤

‖w0 − w∗‖2 + 2Cγ2B2

γK+ 2γ

E

N, (S20)

with E = (ωdwnC + 1)

(2(ωupC +1)

p − 1)σ2∗b + 2pC

(2α2(ωup

C + 1)− α) σ2∗b .

Equation (S20) can be written as in Theorem 2 if we take γ = min

(√Nδ202EK ; γmax

), where γmax is the maximal

possible value of γ as precised by Equation (S19):

F

(1

K

K∑k=0

wk

)− F (w∗) ≤ 2 max

(√2δ2

0E

NK;

δ20

γmaxK

)+

2γmaxCB2

K

Page 47: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Proof. Starting from eq. (S16) from the proof of Theorem S6:

EVk+1 ≤ ‖wk − w∗‖2 − 2γ

(1− γL

(ωdwnC + 1

2+ACN

))〈∇F (wk) | wk − w∗〉

+ 2γ2CDC1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+2γ2

NE .

But this time, instead of applying strong convexity of F , we apply convexity (Assumption 1 but with µ = 0):

EVk+1 ≤ ‖wk − w∗‖2 − 2γ

(1− γL

(ωdwnC + 1

2+ACN

))(F (wk)− F (w∗))

+ 2γ2CDC1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]+2γ2

NE .

(S21)

As in Theorem S6, we want:

γL

(ωdwnC + 1

2+ACN

)≤ 1

2

⇐⇒ C ≤N − γL(ωdwn

C + 1)(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

.

(S22)

which holds only if the numerator and the denominator are positive:N − γL(ωdwn

C + 1)(N +

4(ωupC +1)

p − 2)> 0⇐⇒ γ < N

(ωdwnC + 1)

(N +

4(ωupC + 1)

p− 2

)L

2α(ωupC + 1)− 1 ≤ 0⇐⇒ α ≥ 1

2(ωupC + 1)

.

Returning to eq. (S21), taking benefit of eq. (S22) and passing F (wk)− F (w∗) on the left side gives:

γ(F (wk)− F (w∗)) ≤ ‖wk − w∗‖2 + 2γ2CDC1

N2

N∑i=1

E[∥∥hik − hi∗∥∥2

∣∣∣ Ik]− EVk+1 +2γ2

NE .

If DC ≤ 1, then:

γ(F (wk)− F (w∗)) ≤ EVk − EVk+1 +2γ2

NE ,

summing over all K iterations:

γ

(1

K

K∑k=0

F (wk)− F (w∗)

)≤ 1

K

K∑k=0

(EVk − EVk+1 + 2γ2 E

N

)≤ EV0 − EVK+1

K+ 2γ2 E

Nbecause E is independent of K.

Thus, by convexity:

F

(1

K

K∑k=0

wk

)− F (w∗) ≤

1

K

K∑k=0

F (wk)− F (w∗) ≤V0

γK+ 2γ

E

N.

Page 48: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Last step is to extract conditions over γ and α from requirement DC ≤ 1:

DC < 1⇐⇒ ωdwnC + 1

2C

(2(ωupC + 1)

p− 2

)< 3α− 2α2ωup

C − 2α⇐⇒ C >

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1)),

and the second inequality is correct only if α ≤ 32(ωupC + 1)

.

From this development follows the following conditions on p, which are equivalent to those obtain in Theorem S6

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤ C ≤

N − γL(ωdwnC + 1)

(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

.

This interval is not empty:

(ωdwnC + 1)

(ωupC + 1

p− 1

)αp (3− 2α(ωup

C + 1))≤N − γL(ωdwn

C + 1)(N +

4(ωupC +1)

p − 2)

4γLpα (2α(ωupC + 1)− 1)

⇐⇒ α ≤3N − γL(ωdwn

C + 1)

(3N +

8(ωupC + 1)

p− 2

)2(ωupC + 1)(N − γL(ωdwn

C + 1)(N + 2)).

Again, this implies two conditions on gamma:3N − γL(ωdwn

C + 1)(

3N + 8ωupC +1

p − 2)> 0⇐⇒ γ < 3

(ωdwnC + 1)

(3 +

8(ωupC − 1)− 2p

Np

)L

N − γL(ωdwnC + 1)(N + 2) > 0⇐⇒ γ < 1

(ωdwnC + 1)

(1 +

2

N

)L

.

which guarantees the existence of C and thus the validity of the above development.

As a conclusion:

F

(1

K

K∑k=0

wk

)− F (w∗) ≤

V0

γK+ 2γ

E

N≤ ‖w0 − w∗‖2 + 2Cγ2B2

γK+ 2γ

E

N.

≤ ‖w0 − w∗‖2

γK+ 2γ

(E

N+CB2

K

).

Next, our goal is to define the optimal step size γopt. With this aim, we bound 2γCB2

K by 2γmaxCB2

K . Thisleads to ignore this term when optimizing the step size and thus to obtain a simpler expression of γopt. Thisapproximation is relevant, because B2

K is “small”. And we obtain:

F

(1

K

K∑k=0

wk

)− F (w∗) ≤

‖w0 − w∗‖2

γK+ 2γ

E

N+ 2γmax

CB2

K.

This is valid for all variants of Artemis, with step-size in table 3 and E, p in Theorem 1. Subsequently, the“optimal” step size (at least the one minimizing the upper bound) is

γopt =

√‖w0 − w∗‖2N

2EK,

Page 49: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

resulting in a convergence rate as 2

√2‖w0−w∗‖2E

NK + 2γmaxCB2

K , if this step size is allowed. If√‖w0−w∗‖2N

2EK ≥ γmax(=⇒ 2γmaxE

N ≤ ‖w0−w∗‖2γmaxK

), then the bias term dominates and the upper bound is 2‖w0−w∗‖2

γmaxK+ 2γmaxCB

2

K . Overall,the convergence rate is given by:

F

(1

K

K∑k=0

wk

)− F (w∗) ≤ 2 max

√2 ‖w0 − w∗‖2ENK

;‖w0 − w∗‖2

γmaxK

+2γmaxCB

2

K.

E.4 Proof of Theorem 3 - convergence in distribution

In this section, we give the proof of Theorem 3 in a full participation setting. The theorem is decomposed intotwo main points, that are respectively derived from Propositions S9 and S10, given in Appendices E.4.2 and E.4.3.We first introduce a few notations in Appendix E.4.1.

We consider in this section the Stochastic Sparsification compression operator Cq, which is defined as follows:for any x ∈ Rd, Cq(x)

dist= 1

q (x1B1, . . . , xdBd), with (B1, . . . , Bd) ∼ B(q)⊗n i.i.d. Bernoullis with mean q. That is,each coordinate is independently assigned to 0 with probability 1− q or rescaled by a factor q−1 in order to getan unbiased operator.Lemma S15. This compression operator satisfies Assumption 5 with ωC = q−1 − 1.

Moreover, if I consider a random variable (B1, . . . , Bd) ∼ B(q)⊗n and define almost surely Cq(x)a.s.=

1q (x1B1, . . . , xdBd), then we also have that for any x, y ∈ Rd, Cq(x)− Cq(y) = Cq(x− y).

E.4.1 Background on distributions and Markov Chains

We consider Artemis iterates (wk, (hik)i∈J1,NK)k∈N ∈ Rd(1+N) with the following update equation:wk+1 = wk − γCdwn

(1N∑Ni=1 Cup

(gik+1 − hik

)+ hik

)∀i ∈ J1, NK, hik+1 = hik + αCup

(gik+1 − hik

) (S23)

We see the iterates, for a constant step size γ, as a homogeneous Markov chain, and denote Rγ,v the Markovkernel, which is the equivalent for continuous spaces of the transition matrix in finite state spaces. Let Rγ,v bethe Markov kernel on (Rd(1+N),B(Rd(1+N))) associated with the SGD iterates (wk, τ(hik)i∈J1,NK)k≥0 for a variantv of Artemis, as defined in Algorithm 1 and with τ a constant specified afterwards, where B(Rd(1+N)) is theBorel σ-field of Rd(1+N). Meyn & Tweedie [2009] provide an introduction to Markov chain theory. For readability,we now denote (hik)i for (hik)i∈J1,NK.

Definition 3. For any initial distribution ν0 on B(Rd(1+N)) and k ∈ N, ν0Rkγ,v denotes the distribution of

(wk, τ(hik)i) starting at (w0, τ(hi0)i) distributed according to ν0.

We can make the following comments:

1. Initial distribution. We consider deterministic initial points, i.e., (w0, τ(hi0)i) follows a Dirac at point(w0, τ(hi0)i). We denote this Dirac δw0

⊗⊗Ni=1δτhi0not= δw0

⊗ δτh10⊗ · · · ⊗ δτhN0 .

2. Notation in the main text: In the main text, for simplicity, we used Θk to denote the distribution ofwk when launched from (w0, τ(hi0)i). Thus Θk corresponds to the distribution of the projection on first dcoordinates of ((δw0

⊗⊗Ni=1δτhi0)Rkγ).

3. Case without memory: In the memory-less case, we have (hik)k∈N ≡ 0, and could restrict ourselves to aMarkov kernel on (Rd,B(Rd)).

For any variant v of Artemis, we prove that (wk, (hik)i)k≥0 admits a limit stationary distribution

Πγ,v = πγ,v,w ⊗ πγ,v,(h) (S24)

Page 50: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

and quantify the convergence of ((δw0 ⊗⊗Ni=1δτhi0)Rkγ)k≥0 to Πγ,v, in terms of Wasserstein metric W2.

Definition 4. For all probability measures ν and λ on B(Rd), such that∫Rd ‖w‖

2dν(w) < +∞ and∫

Rd ‖w‖2

dλ(w) ≤ +∞, define the squared Wasserstein distance of order 2 between λ and ν by

W22 (λ, ν) := inf

ξ∈Γ(λ,ν)

∫‖x− y‖2ξ(dx, dy), (S25)

where Γ(λ, ν) is the set of probability measures ξ on B(Rd × Rd) satisfying for all A ∈ B(Rd), ξ(A× Rd) = ν(A),ξ(Rd × A) = λ(A).

E.4.2 Proof of the first point in Theorem 3

We prove the following proposition:

Proposition S9. Under Assumptions 1 to 5, for Cp the Stochastic Sparsification compression operator, for anyvariant v of the algorithm, there exists a limit distribution Πγ,v, which is stationary, such that for any k in N, forany γ satisfying conditions given in Theorems S5 and S6:

W22 ((δw0

⊗⊗Ni=1δτhi0)Rkγ ,Πγ,v) ≤

(1− γµ)k∫

(w′,h′)∈Rd(1+N)

∥∥(w0, τ(hi0)i)− (w′, τ(hi)′i)∥∥2

dΠγ,v(w′, (hi)′i).

Point 1 in Theorem 3 is derived from the proposition above using πγ,v = πγ,v,w, with πγ,v,w as in Equation (S24),the limit distribution of the main iterates (wk)k∈N and the observation that:

W22 (Θk, πγ,v) ≤ W2

2 ((δw0 ⊗⊗Ni=1δτhi0)Rkγ,v,Πγ,v)

≤ (1− γµ)k∫

(w′,h′)∈Rd(1+N)

∥∥(w0, τ(hi0)i)− (w′, τ(hi)′i)∥∥2

dΠγ,v(w′, (hi)′i)

= (1− γµ)kC0.

The sketch of the proof is simple:

• We introduce a coupling of random variables following respectively νa0Rkγ,v and νb0Rkγ,v, and show that underthe assumptions given in the proposition:

W22 (νa0R

k+1γ,v , ν

b0R

k+1γ,v ) ≤ (1− γµ)W2

2 (νa0Rkγ,v, ν

b0R

kγ,v).

This proof follows the same line as the proof of Theorems S5 and S6.

• We deduce that ((δw0 ⊗ ⊗Ni=1δτhi0))Rkγ,v) is a Cauchy sequence in a Polish space, thus the existence andstability of the limit, we show that this limit is independent from (δw0

⊗⊗Ni=1δτhi0)) and conclude.

Proof. We consider two initial distributions νa0 and νb0 for (w0, τ(hi0)i) with finite second moment and γ > 0.Let (wa0 , τ(hi,a0 )i) and (wb0, τ(hi,b0 )i) be respectively distributed according to νa0 and νb0. Let (wak , τ(hi,ak )i)k≥0 and(wbk, τ(hi,bk )i)k≥0 the Artemis iterates, respectively starting from (wa0 , τ(hi,a0 )i) and (wb0, τ(hi,b0 )i), and sharing thesame sequence of noises, i.e.,

• built with the same gradient oracles gi,ak+1 = gi,bk+1 for all k ∈ N, i ∈ J1, NK.

• the compression operator used for both recursions is almost surely the same, for any iteration k, and bothuplink and downlink compression. We denote these operators Cdwn,k and Cup,k the compression operators atiteration k for respectively the uplink compression and downlink compression.

Page 51: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

We thus have the following updates, for any u ∈ a, b: wuk+1 = wuk − γCdwn,k

(1N∑Ni=1 Cup,k

(gik+1 − h

i,uk

)+ hi,uk

)∀i ∈ J1;nK hi,uk+1 = hi,uk + αCup,k

(gik+1 − h

i,uk

) (S26)

The proof is obtained by induction. For a k in N, let(

(wak , τ(hi,ak )i), (wbk, τ(hi,bk )i)

)be a coupling of random

variable in Γ(νa0Rkγ,v, ν

b0R

kγ,v) – as in Definition 4 –, that achieve the equality in the definition, i.e.,

W22 (νa0R

kγ,v, ν

b0R

kγ,v) = E

[∥∥∥(wak , τ(hi,ak )i)− (wbk, τ(hi,bk )i)∥∥∥2]. (S27)

Existence of such a couple is given by [Villani, 2009, theorem 4.1].

Then(

(wak+1, τ(hi,ak+1)i), (wbk+1, τ(hi,bk+1)i)

)obtained after one update from Equation (S26) belongs to

Γ(νa0Rk+1γ,v , ν

b0R

k+1γ,v ), and as a consequence:

W22 (νa0R

k+1γ,v , ν

b0R

k+1γ,v ) ≤ E

[∥∥∥(wak+1, τ(hi,ak+1)i)− (wbk+1, τ(hi,bk+1)i))∥∥∥2]

= E[∥∥wak+1 − wbk+1

∥∥2]

+ τ2N∑i=1

E[∥∥∥hi,ak+1 − h

i,bk+1

∥∥∥2]

= E[∥∥wak+1 − wbk+1

∥∥2]

+ 2γ2 C

N

N∑i=1

E[∥∥∥hi,ak+1 − h

i,bk+1

∥∥∥2],

with τ2 = 2γ2 CN , where C depends on the variant as in Theorem 1.

We now follow the proof of the previous theorems to control respectively E[∥∥wak+1 − wbk+1

∥∥2]

and

E[∥∥∥hi,ak+1 − h

i,bk+1

∥∥∥2].

First, following the proof of Equation (S14), we get, using the fact that the compression operator is randomsparsification, thus that C(x)− C(y) = C(x− y):

E[∥∥wak+1 − wbk+1

∥∥2 |Hk]≤∥∥wak − wbk∥∥2 − 2γ

⟨∇F (wak)−∇F (wbk)

∣∣ wak − wbk⟩+

2(2ωupC + 1)(ωdwn

C + 1)γ2

N2

N∑i=1

E[∥∥gik+1(wak)− gik+1(wbk)

∥∥2∣∣∣ Hk]

+2ωupC (ωdwn

C + 1)γ2

N2

N∑i=1

E[∥∥∥hi,ak − hi,bk ∥∥∥2

∣∣∣∣ Hk]+ γ2(ωdwn

C + 1)L⟨∇F (wak)−∇F (wbk)

∣∣ wak − wbk⟩.This expression is nearly the same as in Equation (S14), apart from the constant term depending on σ2

∗ thatdisappears.

Note that with a more general compression operator, for example for quantization, it is not possible to derivesuch a result.

Similarly, we control E[∥∥∥hi,ak+1 − h

i,bk+1

∥∥∥2]using the same line of proof as for Equation (S15), resulting in:

1

N2

N∑i=0

E[∥∥∥ha,ik+1 − h

b,ik+1

∥∥∥2∣∣∣∣ Hk] ≤ (1 + p

(2α2ωup

C + 2α2 − 3α)) 1

N2

N∑i=0

E[∥∥∥ha,ik − hb,ik ∥∥∥2

∣∣∣∣ Hk]

+ 2(2α2ωupC + 2α2 − α)

1

N2

N∑i=0

E[∥∥gik+1(wak)− gik+1(wbk)

∥∥2∣∣∣ Hk] .

Page 52: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Combining both equations, and using Assumptions 1 and 2 and Equation (S27) we get, under conditions on thelearning rates α, γ similar to the ones in Theorems S5 and S6, that

W22 (νa0R

k+1γ,v , ν

b0R

k+1γ,v ) ≤ (1− γµ)W2

2 (νa0Rkγ,v, ν

b0R

kγ,v).

And by induction:

W22 (νa0R

k+1γ,v , ν

b0R

k+1γ,v ) ≤ (1− γµ)k+1W2

2 (νa0 , νb0).

From the contraction above, it is easy to derive the existence of a unique stationnary limit distribution: we usePicard fixed point theorem, as in Dieuleveut et al. [2019]. This concludes the proof of Proposition S9.

E.4.3 Proof of the second point of Theorem 3

To prove the second point, we first detail the complementary assumptions mentioned in the text, then show theconvergence to the mean squared distance under the limit distribution, and finally give a lower bound on thisquantity.

Complementary assumptions.To prove the lower bound given by the second point, we need to assume that the constants given in the assumptionsare tight, in other words, that corresponding lower bounds exist in Assumptions 3 to 5.

Assumption 7 (Lower bound on noise over stochastic gradients computation). The noise over stochastic gradientsat optimal global point for a mini-batch of size b is lower bounded. In other words, there exists a constant σ∗ ∈ R,such that for all k in N, for all i in J1, NK , we have a.s:

E[‖gik+1,∗ −∇Fi(w∗)‖2

∣∣ Hk] ≥ σ2∗b.

Assumption 8 (Lower bound on local gradient at w∗). There exists a constant B ∈ R, s.t.:

1

N

N∑i=1

‖∇Fi(w∗)‖2 ≥ B2.

Assumption 9 (Lower bound on the compression operator’s variance). There exists a constant ωC ∈ R∗ suchthat the compression operators Cup and Cdwn verify the following property:

∀∆ ∈ Rd ,E[‖Cup ,dwn(∆)−∆‖2] = ωup ,dwnC ‖∆‖2 .

This last assumption is valid for Stochastic Sparsification.

Moreover, we also assume some extra regularity on the function. This restricts the regularity of the functionbeyond Assumption 2 and is a purely technical assumption in order to conduct the detailed asymptotic analysis.It is valid in practice for Least Squares or Logistic regression.

Assumption 10 (Regularity of the functions). The function F is also times continuously differentiable withsecond to fifth uniformly bounded derivatives: for all k ∈ 2, . . . , 5, supw∈Rd ‖F (k)(w)‖ <∞.

Convergence of moments.We first prove that E[‖wk − w∗‖2] converges to Ew∼πγ,v [‖w − w∗‖2] as k increases to ∞.

We have that the difference satisfies, for random variables wk and w following distributions δw0Rkγ,v and πγ,v,

Page 53: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

and coupled such that they achieve the equality in Equation (S25):

∆E,k : = E[‖wk − w∗‖2]− Ew∼πγ,v [‖w − w∗‖2]

= Ewk,w∼πγ,v[‖wk − w∗‖2 − ‖w − w∗‖2

]= Ewk,w∼πγ,v [(‖wk − w∗‖ − ‖w − w∗‖)(‖wk − w∗‖+ ‖w − w∗‖)]C.S≤(Ewk,w∼πγ,v

[(‖wk − w∗‖ − ‖w − w∗‖)2

]Ewk,w

[(‖wk − w∗‖+ ‖w − w∗‖)2

])1/2T.I.≤(Ewk,w∼πγ,v

[(‖wk − w‖)2

]Ewk,w∼πγ,v

[(‖wk − w∗‖+ ‖w − w∗‖)2

])1/2(i)≤(Ewk,w∼πγ,v

[(‖wk − w‖)2

]2L)1/2

(ii)≤(W2

2 (δw0Rkγ,v, πγ,v)2L

)1/2(iii)→ 0.

Where we have used Cauchy-Schwarz inequality at line C.S., triangular inequality at line T.I., the fact that themoments are bounded by a constant L at line (i), the fact that the distributions are coupled such that theyachieve the equality in Equation (S25) at line (ii), and finally Proposition S9 for the conclusion at line (iii).

Overall, this shows that the mean squared distance (i.e., saturation level) converges to the mean squared distanceunder the limit distribution.

Evaluation of Ew∼πγ,v [‖w − w∗‖2].In this section, we denote Ξk+1(wk, hk) the global noise, defined by

Ξk+1(wk, hk) = ∇F (wk)− Cdwn

(1

N

N∑i=1

Cup(gik+1(wk)− hik) + hik

),

such that wk+1 = wk − γ∇F (wk) + γΞk+1(wk, hk).

In the following, we denote a⊗2 := aaT the second order moment of a. We define Tr the trace operator andCov the covariance operator such that Cov(Ξ(w, h)) = E

[(Ξ(w, h))⊗2

], where the expectation is taken on the

randomness of both compressions and the gradient oracle. We make a final technical assumption on the regularityof the covariance matrix.Assumption 11. We assume that:

1. Cov(Ξ(w, h)) is continuously differentiable, and there exists constants C and C ′ such that for all w, h ∈Rd(1+N), maxo=1,2,3 Cov(o)(w, h) ≤ C + C ′||(w, h)− (w∗, h∗)||2.

2. (Ξ(w∗, h∗)) has finite order moments up to order 8.

Remark: with the Stochastic Sparsification operator, this assumption can directly be translated into anassumption on the moments and regularity of gik. Note that Point 2 in Assumption 11 is an extension ofAssumption 3 to higher order moments, but still at the optimal point. Under this assumption, we have thefollowing lemma:Lemma S16. Under Assumptions 1 to 5, 10 and 11, we have that

Eπγ,v[‖w − w∗‖2

]=γ→0

γTr(A Cov(Ξ(w∗, h∗))) +O(γ2), (S28)

with A := (F ′′(w∗)⊗ I + I ⊗ F ′′(w∗))−1.

The intuition of the proof is natural: using the stability of the limit distribution, we have that if we start fromthe stationary distribution, i.e., (w0, h0) ∼ Πγ,v, then (w1, h1) ∼ Πγ,v.

We can thus write:

Eπγ,v[(w − w∗)⊗2

]= E

[(w1 − w∗)⊗2

]= E

[(w0 − w∗ − γ∇F (w0) + γΞ(w0, h0))⊗2

].

Page 54: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Then, expanding the right hand side and using the fact that E[Ξ(w0, h0)|H0] = 0, then the fact thatE[(w1 − w∗)⊗2

]= E

[(w0 − w∗)⊗2

], and expanding the derivative of F around w∗ (this is where we require the

regularity assumption Assumption 10), we get that:

γ (F ′′(w∗)⊗ I + I ⊗ F ′′(w∗) +O(γ))Eπγ,v[(w − w∗)⊗2

]=γ→0

γ2E(w,h)∼Πγ,v

[Ξ(w, h)⊗2

].

Thus:

Eπγ,v[(w − w∗)⊗2

]=γ→0

γAE(w,h)∼Πγ,v

[Ξ(w, h)⊗2

]+O(γ2).

⇒ Eπγ,v[‖(w − w∗)‖2

]=γ→0

γTr(AE(w,h)∼Πγ,v

[Ξ(w, h)⊗2

])+O(γ2).

Finally, we use that E(w,h)∼Πγ,v [Cov(Ξ(w, h))] =γ→0

Cov(Ξ(w∗, h∗)) +O(γ) (which is derived from Assumption 11)

to get Lemma S16.

More formally, we can rely on Theorem 4 in Dieuleveut et al. [2019]: under Assumptions 1 to 5, 10 and 11, allassumptions required for the application of the theorem are verified and the result follows.

To conclude the proof, it only remains to control Cov(Ξ(w∗, h∗)). We have the following Lemma:

Lemma S17. Under Assumptions 7 to 9, we have that, for any variant v of the algorithm, with the constant Egiven in Theorem 1 depending on the variant:

Tr (Cov(Ξ(w∗, h∗))) = Ω

(γE

µN

). (S29)

Combining Lemmas S16 and S17 and using the observation that A is lower bounded by 12L independently of

γ,N, σ∗, B, we have proved the following proposition:

Proposition S10. Under Assumptions 1 to 5 and 7 to 11, we have that

E[‖wk − w∗‖2] →k→∞

Eπγ,v[‖w − w∗‖2

]=γ→0

Ω

(γE

µN

)+O(γ2), (S30)

where the constant in the Ω is independent of N, σ∗, γ, B (it depends only on the regularity of the operator A).

Before giving the proof, we make a couple of observations:

1. This shows that the upper bound on the limit mean squared error given in Theorem 1 is tight with respectto N, σ∗, γ, B. This underlines that the conditions on the problem that we have used are the correct ones tounderstand convergence.

2. The upper bound is possibly not tight with respect to µ, as is clear from the proof: the tight bound isactually Tr(ACov(Ξ(w∗, h∗))). Getting a tight upper bound involving the eigenvalue decomposition of Ainstead of only µ is an open direction.

3. In the memory-less case, h ≡ 0 and all the proof can be carried out analyzing only the distribution of theiterates (wk)k and not necessarily the couple (wk, (h

ik)i)k.

We now give the proof of Lemma S17.

Page 55: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Proof. With memory, we have the following:

Tr (Cov(Ξ(w∗, h∗))) = E

∥∥∥∥∥Cdwn

(1

N

N∑i=1

Cup(gi1(w∗)− hi∗) + hi∗

)∥∥∥∥∥2

(i)= (1 + ωdwn

C )E

∥∥∥∥∥ 1

N

N∑i=1

Cup(gi1(w∗)− hi∗) + hi∗

∥∥∥∥∥2

(ii)=

(1 + ωdwnC )

N2

N∑i=1

E[∥∥Cup(gi1(w∗)− hi∗)

∥∥2]

(iii)=

(1 + ωdwnC )

N2

N∑i=1

E[∥∥BikCup(gi1(w∗)− hi∗)

∥∥2]

(iv)=

(1 + ωdwnC )

N2

N∑i=1

(1 + ωupC )E

[∥∥gi1(w∗)− hi∗∥∥2]

(v)≥ (1 + ωdwn

C )

N(1 + ωup

C )σ2∗b.

At line (i) we use Assumption 9 for the downlink compression operator with constant ωdwnC . At line (ii) we use the

fact that∑Ni=1 h

i∗ = ∇F (w∗) = 0, the independence of the random variables Cup(gi1(w∗)− hi∗), Cup(gj1(w∗)− hj∗)

for i 6= j and the fact that they have 0 mean. Line (iii) makes appear the bernouilli variable Bik that mark if aworker is activate or not at round k. We use Assumption 9 for the uplink compression operator with constantωupC in line (iv); and finally Assumption 7 at line (v) to lower bound the variance of the gradients at the optimum.

This proof applies to both simple and double compression with ωdwnC = 0 or not.

Remark that for the variant 2 of Artemis, the constant E given in Theorem 1 has a factor α2C(ωC+1): combiningwith the value of C, this term is indeed of the order of (1 + ωdwn

C )(1 + ωupC ).

Without memory, we have the following computation:

Tr (Cov(Ξ(w∗, 0))) = E

∥∥∥∥∥Cdwn

(1

N

N∑i=1

Cup(gi1(w∗))

)∥∥∥∥∥2

(i)= (1 + ωdwn

C )E

∥∥∥∥∥ 1

N

N∑i=1

Cup(gi1(w∗))− hi∗

∥∥∥∥∥2

(ii)=

(1 + ωdwnC )

N2

N∑i=1

E[∥∥Cup(gi1(w∗))− hi∗

∥∥2]

(iii)=

(1 + ωdwnC )

N2

N∑i=1

E[∥∥Cup(gi1(w∗))− gi1(w∗)

∥∥2+∥∥gi1(w∗)− hi∗

∥∥2]

At line (i) we use Assumption 9 for the downlink compression operator with constant ωdwnC and the fact that∑N

i=1 hi∗ = ∇F (w∗) = 0, then at line (ii) the independence of the random variables Cup(gi1(w∗))− hi∗ with mean

0, then a Bias Variance decomposition at line (iii).

Tr (Cov(Ξ(w∗, 0)))(iv)=

(1 + ωdwnC )

N2

N∑i=1

E[ωupC∥∥(gi1(w∗))

∥∥2+∥∥gi1(w∗)− hi∗

∥∥2]

(v)=

(1 + ωdwnC )

N2

N∑i=1

E[ωupC

(∥∥gi1(w∗)− h∗i∥∥2

+ ‖h∗i ‖2)

+∥∥gi1(w∗)− hi∗

∥∥2]

(vi)=

(1 + ωdwnC )

N

((ωupC + 1)

σ2∗b

+ ωupC B

2

).

Page 56: Abstract arXiv:2006.14591v1 [cs.LG] 25 Jun 2020Table1: Comparisonofframeworksformainalgorithmshandling(bidirectional)compression. Bestseenincolors. QSGD [3] Diana [25] Dore [21] Double

Bidirectional compression in heterogeneous settings for distributed or federated learning

Next we use Assumption 9 for the uplink compression operator with constant ωupC at line (iv). Line (v) is another

Bias-Variance decomposition and we finally conclude by using Assumptions 7 and 8 at line (vi) and reorganizingterms.

We have showed the lower bound both with or without memory, which concludes the proof.