
PIPELINED BACKPROPAGATION AT SCALE: TRAINING LARGE MODELS WITHOUT BATCHES

Atli Kosson *1   Vitaliy Chiley *2   Abhinav Venigalla 2   Joel Hestness 2   Urs Köster 1

ABSTRACT

New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of alternative training algorithms. In this work we evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training algorithm that has significant hardware advantages. We introduce two methods, Spike Compensation and Linear Weight Prediction, that effectively mitigate the downsides caused by the asynchronicity of Pipelined Backpropagation and outperform existing techniques in our setting. We show that appropriate normalization and small batch sizes can also aid training. With our methods, fine-grained Pipelined Backpropagation using a batch size of one can match the accuracy of SGD for multiple networks trained on CIFAR-10 and ImageNet. Simple scaling rules allow the use of existing hyperparameters for traditional training without additional tuning.

1 INTRODUCTION

In recent years, the compute requirements for training state of the art deep neural networks have rapidly increased (Amodei & Hernandez, 2018). To manage the increased compute requirements, new and efficient hardware architectures are being developed for accelerating deep learning. Traditional deep learning accelerators rely on batch parallelism (sometimes called data parallelism) to hide memory bandwidth and latency issues. This can scale to large batches (Shallue et al., 2019) but can have considerable overheads that limit efficiency. New hardware architectures could support other, potentially more efficient, training techniques and may not be well suited for traditional methods. To guide the development of future hardware, it is therefore important to evaluate alternative training techniques and understand their advantages and limitations compared to traditional training.

One alternative to batch parallelism is pipeline parallelism (Figure 1), which divides the model into sequential segments we call pipeline stages. Each worker is assigned to one stage and groups of inputs called micro-batches proceed sequentially through the stages, similar to an assembly line. This form of parallelism has the advantage that each worker only performs a subset of the computation, which allows them to specialize.

Fine-grained pipeline parallelism assigns one layer of the network to each stage, maximizing opportunities for specialization.

*Equal contribution. 1 Work done while at Cerebras Systems. 2 Cerebras Systems, Inc. Correspondence to: Atli Kosson <[email protected]>, Vitaliy Chiley <[email protected]>.

Figure 1. Top: batch parallelism. Bottom: pipeline parallelism. We show four workers and a network that can be split into four sequential transformations F1, F2, F3, F4 with corresponding backward operations B1, B2, B3, B4. The steady state is shown and, for simplicity, F and B are shown taking the same time. The processing of four inputs is highlighted. In pipeline parallelism workers can specialize to perform a subset of the transformations.

Since each worker only processes one layer, it only needs to access a small set of weights which may fit in local memory. This eliminates bandwidth and latency issues associated with fetching the weights. For highly configurable architectures, the limited logic set required for each worker allows for an optimized allocation of compute resources between workers, increasing overall utilization. Zhang et al. (2019c) find that fine-grained pipelining can enable speedups of up to 3.5x in their setting. Li & Pedram (2017) and Chen et al. (2016) both report energy savings of up to 3x. Fine-grained pipelining can also enable efficient sparse processing, which Chen et al. (2019) show can result in up to a 42.5x and an 11.3x improvement in throughput and energy efficiency, respectively.

Pipeline parallel training commonly implements a form of mini-batch SGD (Huang et al., 2018). This is done by dividing the batch into micro-batches that are sequentially fed into the pipeline (filling it) and waiting for the resulting gradients of all micro-batches (draining the pipeline). When the pipeline is empty, the parameters are updated and the process is then repeated for the next batch. Filling and draining the pipeline for each update can significantly lower hardware utilization when the update size is small compared to the number of pipeline stages (Figure 2). Although it is possible to train with large batch sizes, this makes the hyperparameters harder to tune for a given compute budget (Shallue et al., 2019) and hyperparameters commonly found in literature are tuned for modest batch sizes.
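For intuition about the size of this overhead, here is a rough back-of-the-envelope estimate (ours, not a formula from the paper) that treats every micro-batch slot as taking the same time: with S pipeline stages and M micro-batches per optimizer step, each stage does useful work in about M of the roughly M + S − 1 slots of a step, so the fill/drain bubble dominates when M is small relative to S.

```python
# Rough utilization estimate for fill-and-drain pipelined SGD (our own
# simplification: equal-cost micro-batch slots, a fill/drain bubble of
# S - 1 slots per optimizer step). Not a formula from the paper.
def fill_drain_utilization(num_stages, micro_batches_per_step):
    return micro_batches_per_step / (micro_batches_per_step + num_stages - 1)

for m in (1, 8, 64):
    print(m, round(fill_drain_utilization(num_stages=50, micro_batches_per_step=m), 3))
# With 50 stages: M=1 -> 0.02, M=8 -> 0.14, M=64 -> 0.566
```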

Pipelined Backpropagation (PB) (Petrowski et al., 1993) is an asynchronous training technique that avoids fill and drain overhead by updating the weights without draining the pipeline first. This results in weight inconsistency, the use of different weights on the forward and backward passes for a given micro-batch. The weights used to produce a particular gradient may also have been updated when the gradient is applied, resulting in stale (or delayed) gradients. For these reasons PB resembles Asynchronous SGD (Lian et al., 2015) and is not equivalent to standard SGD. Fine-grained pipelining increases the number of pipeline stages and hence increases the weight inconsistency and delay. For a more detailed discussion of Pipelined Backpropagation and how it gives rise to weight inconsistency and gradient staleness, see Appendix A.

The formulation of Pipelined Backpropagation we use throughout this work performs an optimization step after every micro-batch. Unless otherwise specified, we use a micro-batch size of one and adjust the learning rate and momentum to keep the magnitude of a given weight update proportional to the batch size. This has several advantages. First, smaller micro-batches result in smaller weight updates and consequently reduce the effect of weight inconsistency and gradient delays in PB. Secondly, in pipeline parallelism the activation memory requirements have a quadratic dependence on the number of pipeline stages. Fine-grained pipeline parallelism with large micro-batches may not fit into memory. Small batch processing may pose an issue for traditional deep learning accelerators, such as GPUs, which commonly rely on large batch sizes to hide memory latency and amortize the bandwidth required for loading weights. New hardware architectures with different memory characteristics may not suffer from these issues, especially when the weights are stored in local memory. Finally, small batch training has the added benefit of stabilizing training, increasing generalization performance (Masters & Luschi, 2018), and easing hyperparameter tuning (Li et al., 2018).

Figure 2. Utilization of different pipeline parallel modes. Idle workers are depicted in red, fully utilized workers in green and partially utilized workers (only processing either the forward or backward pass while filling or draining the pipeline) in yellow. Top: Small batch size fill and drain SGD. Middle: Large batch size fill and drain SGD. Bottom: Pipelined Backpropagation. The red and blue lines show the forward and backward pass of a single sample. The grey lines show the delays for two of the stages.

1.1 Related Works

GPipe (Huang et al., 2018) and Megatron-LM (Shoeybi et al., 2019) combine data and pipeline parallel training but incur the fill and drain overhead of pipelined training. Draining the pipeline is required because of the locking nature of SGD (Jaderberg et al., 2017). Unlocking the forward, backward, and update operations is an active area of research (Jaderberg et al., 2017; Huo et al., 2018a;b; Xu et al., 2019; Belilovsky et al., 2019; Gaunt et al., 2017). Petrowski et al. (1993) propose Pipelined Backpropagation, which updates the weights without draining the pipeline to avoid the fill and drain overhead but introduces weight inconsistency and stale gradients.

Recent works have also explored mitigating these issues when training networks through a combination of data parallelism and PB (Chen et al., 2012; Harlap et al., 2018; Chen et al., 2018; Yang et al., 2019; Zhuang et al., 2019). PipeDream (Harlap et al., 2018) proposes weight stashing (WS) and vertical sync (VS). Weight stashing saves the weights used on the forward pass for use on the backward pass and vertical sync updates the weights at the same moment in time. Chen et al. (2018) show how weight stashing is ineffective in their setting and propose a form of weight prediction called SpecTrain to mitigate the effects of both stale gradients and inconsistent weights. PipeMare (Yang et al., 2019) applies discrepancy correction (a form of backward weight prediction) to address inconsistent weights and learning rate rescheduling (a new form of learning rate warmup) to help with stale gradients. Zhuang et al. (2019) propose Gradient Shrinking, which exponentially decays the gradients for each stage based on the delay.

Unlike prior work, we completely replace batch parallelism with fine-grained pipelined parallelism at batch size one. Compared to previous works this increases the number of pipeline stages and allows for greater worker specialization. We find that in our setting existing mitigation methods such as Weight Stashing, Gradient Shrinking and SpecTrain are insufficient while some others, such as Features Replay (Huo et al., 2018a), do not apply since each stage has a single layer. Our contributions are as follows:

• We explore the hardware aspects of small batch size, fine-grained, Pipelined Backpropagation for training deep neural networks and how Coarse-Grained Reconfigurable Arrays (Podobas et al., 2020) can be particularly well suited for this sort of training.

• We propose two methods, Spike Compensation and Linear Weight Prediction, to mitigate the drawbacks of PB: inconsistent weights and stale gradients. An analysis on a simplified model shows how they can counteract the effects of stale gradients and restore the benefits of momentum in the presence of delays. We show that our methods outperform existing techniques in the fine-grained small batch PB setting, without additional hyperparameters to tune.

• We show the importance of small batches and proper normalization for this type of training. Simple scaling rules enable the use of existing hyperparameters for traditional training without further tuning. With our methods, this makes fine-grained PB a drop-in replacement for mini-batch SGD training on standard image classification benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009).

2 FINE GRAINED PIPELINE PARALLELISM ON HARDWARE

Fine grained pipeline parallel training can allow workers to specialize, increasing utilization and accelerating neural network training. A Coarse-Grained Reconfigurable Array (CGRA) is an example of a hardware architecture that can significantly benefit from this. CGRAs are a grid of locally connected cores with fast interconnects and a distributed memory architecture. Pipeline parallelism can be performed by spatially distributing the layers over the compute array (Figure 3). Due to its distributed nature, the overall on-chip memory in a CGRA can be significantly larger than on traditional architectures, allowing workers to store weights locally. This enables persistent kernel execution, eliminating any kernel launch overheads, and overcomes bandwidth and latency issues for small batch processing. Using a small micro-batch size limits the memory required for storing activations, limiting the overall memory required.

Figure 3. Neural network (left) computation spatially distributed across a compute fabric (right) with many workers (w). Each worker is specialized and an appropriate amount of compute (workers) is allotted to each layer of the NN. The pipeline depth is defined by the network topology and not the compute architecture.

Standard layer sequential, large batch size training can be performed on CGRAs but may require adding off-chip memory. Layer sequential execution eliminates the benefits of persistent kernel execution and worker specialization. Distributing model weights and reducing gradients across all cores creates latency issues similar to those encountered in distributed training that can be difficult to mask. Fine-grained, small micro-batch size pipelining may therefore be the best way to achieve high utilization levels and has been shown to increase processing speed and energy efficiency. Architectures with distributed, low latency memory, such as (Sean Lie, 2020; Vassilieva, 2020; Kunle Olukotun, 2020; Nicol, 2017; Jia et al., 2019), could benefit from pipelined training approaches. Kunle Olukotun (2020) shows how this spatial distribution of layers can be used to accelerate neural networks on SambaNova's Reconfigurable Dataflow Architecture. Graphcore presents how pipelined training improves the IPU's efficiency (Graphcore, 2020), and Sean Lie (2020) and Vassilieva (2020) discuss how fine-grained pipelined parallelism can accelerate neural network training on the Cerebras Wafer-Scale Engine.

2.1 Simulating Pipelined Parallelism on GPUs

Since suitable hardware architectures are not widely available yet, we are not able to test our methods in a hardware setup where we expect speedups. GPUs are designed for layer sequential batch processing and are not efficient for pipeline parallel training with small batch sizes. However, they are readily available so we opt to use them for this exploratory work. In this section we describe how we simulate pipeline parallel training on GPUs. Appendix C discusses the GPU hardware limitations that prevent this mode of training from being efficient on GPUs. Therefore, our goal is only to make our experiments feasible on GPUs, not to speed up training compared to well tuned batch parallelism.

In particular we are interested in simulating fully pipeline parallel training on networks such as ResNet-50 with a maximal number of pipeline stages and no batch parallelism. Most modern deep learning frameworks are not well suited for such experimentation. To enable efficient simulation of fully pipeline parallel training, we built a mini-framework implemented in C++ using cuDNN (Chetlur et al., 2014) kernels, custom CUDA kernels, and Thrust.

Figure 4. Pipeline Stage. The network is divided into stages that correspond to pipeline stages. Each stage contains logic for the forward, backward, and gradient computation for the corresponding network component. The use of buffers enables parallel execution: one stage can compute a new output while another stage is using the previous output.

The network is split into structures we call stages that act as pipeline stages. Each stage manages all resources needed to compute the forward and backward passes for the corresponding part of the network (Figure 4). The forward, backward, and gradient computations within a stage are completely unlocked and run in parallel. In our experiments we sometimes group several network components together into a single stage. One example of this is grouping convolution, normalization, and ReLU into one stage. We use CUDA streams to run the stages in parallel. The framework also supports splitting the network over multiple GPUs and uses a different thread to launch the stages on each GPU.
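As a toy illustration of this buffered-stage structure (forward pass only, written in Python with threads and queues rather than the C++/cuDNN mini-framework described above; the Stage class and run_stage function are our own invention), each stage reads micro-batches from its input buffer, applies its transformation, and hands the result to the next stage, so different stages can work on different micro-batches at the same time:

```python
# Toy forward-only pipeline: one thread per stage, bounded queues as the
# per-stage input buffers. This is an illustration of the idea, not the
# authors' framework.
import queue
import threading

class Stage:
    def __init__(self, fwd):
        self.fwd = fwd                        # forward transform of this stage
        self.inputs = queue.Queue(maxsize=2)  # bounded buffer decouples stages

def run_stage(stage, next_stage=None, results=None):
    while True:
        x = stage.inputs.get()
        if x is None:                         # sentinel: shut the pipeline down
            if next_stage is not None:
                next_stage.inputs.put(None)
            break
        y = stage.fwd(x)
        if next_stage is not None:
            next_stage.inputs.put(y)          # hand off to the next stage
        else:
            results.append(y)                 # last stage collects outputs

# Example: a 3-stage pipeline computing ((x * 2) + 1) ** 2 per micro-batch.
stages = [Stage(lambda x: x * 2), Stage(lambda x: x + 1), Stage(lambda x: x ** 2)]
results = []
threads = [threading.Thread(target=run_stage,
                            args=(s,
                                  stages[i + 1] if i + 1 < len(stages) else None,
                                  results))
           for i, s in enumerate(stages)]
for t in threads:
    t.start()
for x in range(5):                            # feed five micro-batches
    stages[0].inputs.put(x)
stages[0].inputs.put(None)
for t in threads:
    t.join()
print(results)                                # [1, 9, 25, 49, 81]
```

The bounded queues play the role of the input buffers in Figure 4: a stage can produce a new output while its downstream neighbor is still consuming the previous one.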

3 METHODS

We introduce two compensation methods for Pipelined Backpropagation: Linear Weight Prediction and Spike Compensation. We formulate them for SGD with momentum¹ (SGDM), which we write as:

v_{t+1} = m v_t + g_t                                   (1)
w_{t+1} = w_t − η v_{t+1}                               (2)

where w_t are the weights (parameters) at time t, v_t is the velocity (sometimes called momentum), m is the momentum coefficient, η is the learning rate, and g_t represents the gradient estimate applied at time t. This estimate may be delayed, and could be calculated with inconsistent weights.

PB partitions the network into S stages. Each stage has its own delay, D_s for s ∈ [0, ..., S − 1], determined by the network architecture. We describe and analyze our methods for a constant delay, D, without modeling the pipeline or inconsistency. When we use the methods for PB we apply them to each stage separately, with the corresponding delay set to the number of steps between the forward and backward passes for that stage. To simplify notation we drop the superscript s representing the stage index. We write the gradient as a function of the weights alone, whereas in SGD the gradient may also depend on inputs or other data.

¹ Both methods require momentum and can be adapted for other momentum-based optimizers.

Figure 5. Left: Momentum exponentially smooths gradients over time so the contribution of each gradient to future weight updates (the impulse response) is an exponentially decaying function from the time it arrives. Middle: A delayed gradient has an impulse response shifted by the delay D. The dotted line shows the baseline without delay. Right: With spike compensation (SC_D) the impulse response has a spike (denoted with an arrow) and then matches the no-delay case. The size of the spike matches that of the missed updates compared to the baseline shown in light gray.

3.1 Spike Compensation

We introduce Spike Compensation (SC) to mitigate the effects of delayed gradients in PB. The method uses a modified weight update which increases the contribution of the latest gradient relative to the velocity. For a delay of D this can generally be written as:

g_t = G(w_{t−D})                                        (3)
v_{t+1} = m v_t + g_t                                   (4)
w_{t+1} = w_t − η · (a v_{t+1} + b g_t)                 (5)

where a and b are functions of the delay. We could absorb either a or b into η but use this form to keep η consistent with other methods. We refer to this form as generalized Spike Compensation (GSC). To reason about sensible choices for a and b we can look at the contribution of each gradient over time in the no-delay case vs the delay case (see Figure 5). When a gradient g is obtained with some delay D, this gradient would already have contributed to D weight updates in the no-delay case. The total contribution of the gradient so far would have been:

\sum_{t=0}^{D−1} m^t g = \frac{1 − m^D}{1 − m} g        (6)

This inspires our default choice of a and b for Spike Compensation, which we will refer to as SC_D:

a = m^D   and   b = \frac{1 − m^D}{1 − m}               (7)

For this choice, the missing weight update is applied immediately and the contribution of the gradient at later time steps will match that of the no-delay case. The total contribution of each gradient to the weights over the course of training is unchanged; this only changes how the gradients are applied over time. The modified weight update can equivalently be seen as approximating the velocity in the no-delay case with a v_{t+1} + b g_t. This uses the latest gradient to approximate the gradient terms in the velocity that have not been observed due to the delay. For a delay of zero, SC_D reduces to SGDM (2).
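A minimal NumPy sketch of this update follows. The function name and the convention of passing in an already delayed gradient are ours; in PB the delay D would be the per-stage delay, and the gradient would come out of the pipeline rather than from a buffer.

```python
# Spike Compensation step: equations (3)-(5) with the default coefficients of
# equation (7). Assumes m < 1.
import numpy as np

def sc_d_step(w, v, delayed_grad, lr, m, D):
    a = m ** D                               # weight on the velocity, eq. (7)
    b = (1.0 - m ** D) / (1.0 - m)           # spike: the D missed contributions
    v = m * v + delayed_grad                 # momentum accumulation, eq. (4)
    w = w - lr * (a * v + b * delayed_grad)  # modified weight update, eq. (5)
    return w, v
```

For D = 0 the coefficients reduce to a = 1 and b = 0, recovering the plain SGDM step (2).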

3.2 Linear Weight Prediction

Both the weight inconsistency and gradient delay arise from the fact that we cannot access the (future) weights used on the backward pass when we compute the forward pass. The goal of weight prediction is to approximate the backward weights on the forward pass. We want to approximate:

w_{t+D} = w_t − η \sum_{k=0}^{D−1} v_{t+k+1}            (8)

where D is the delay (number of update steps between the forward and backward passes). The future velocities are unknown but can be approximated by assuming a constant gradient g over the prediction horizon, i.e. the number of iterations over which the prediction is made. This gives:

v_{t+k+1} ≈ m^k v_{t+1} + g \sum_{i=0}^{k−1} m^i = m^k v_{t+1} + \frac{1 − m^k}{1 − m} g    (9)

which results in predicted weights:

w_{t+D} = w_t − η \frac{1 − m^D}{1 − m} v_{t+1} − \frac{η g}{1 − m} \left( D − \frac{1 − m^D}{1 − m} \right)    (10)

We have several good choices for g, including setting it to zero or estimating it based on recent gradients. In this work we focus on weight prediction where the direction of the velocity does not change, i.e. g is collinear with v_t. We refer to this as linear weight prediction (LWP). The approximation for the weights at time t and delay D can then be written in terms of past weights and velocities as:

ŵ(t, D, T) = w_{t−D} − η T v_{t−D} =: ŵ_v(t, D, T)      (11)

where T is a hyperparameter we call the horizon of the weight prediction. For SGDM without modifications, we can equivalently write the approximation in terms of the previous weights alone:

ŵ(t, D, T) = w_{t−D} + T · (w_{t−D} − w_{t−D−1}) =: ŵ_w(t, D, T)    (12)

When combined with Spike Compensation, or potentially when using other optimizers, the predictions given by equations (11) and (12) differ. When this is the case we refer to the two types as LWP_v (velocity form) and LWP_w (weight difference form), respectively. The update step is:

g_t = G(ŵ(t, D, T))                                     (13)
v_{t+1} = m v_t + g_t                                   (14)
w_{t+1} = w_t − η v_{t+1}                               (15)

In the rest of this paper we use LWP_D to denote LWP with our default choice of T = D. This is equivalent to choosing g = (1 − m) v_{t+1} in (10), which would result in a constant velocity. This form is closely related to the weight prediction used in SpecTrain (Chen et al., 2018), which extends the prediction horizon and also predicts weights on the backward pass (see Appendix F).
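The sketch below illustrates LWP_D (velocity form, default horizon T = D) with a simple deque standing in for the pipeline delay; the driver loop, grad_fn argument, and buffering convention are our own illustration rather than the paper's implementation.

```python
# Linear Weight Prediction: equations (11) and (13)-(15) with T = D.
from collections import deque
import numpy as np

def lwp_train(w, grad_fn, lr, m, D, steps, T=None):
    T = D if T is None else T            # LWP_D uses the default horizon T = D
    v = np.zeros_like(w)
    pending = deque()                    # gradients in flight, oldest first
    for _ in range(steps):
        w_hat = w - lr * T * v           # velocity-form prediction, eq. (11)
        pending.append(grad_fn(w_hat))   # forward pass at the predicted weights
        if len(pending) > D:             # the backward pass arrives D updates later
            g = pending.popleft()
            v = m * v + g                # eq. (14)
            w = w - lr * v               # eq. (15)
    return w
```

With D = 0 the prediction is exact (ŵ = w) and the loop reduces to plain SGDM.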

3.3 Combined Mitigation

Spike Compensation and weight prediction can be combined, resulting in the following update step:

g_t = G(ŵ(t, D, T))                                     (16)
v_{t+1} = m v_t + g_t                                   (17)
w_{t+1} = w_t − η · (a v_{t+1} + b g_t)                 (18)

where, as before, T is the horizon of the weight prediction and a and b are the coefficients for the Spike Compensation. When combined with Spike Compensation, ŵ_v(t, D, T) ≠ ŵ_w(t, D, T). In the combination, ŵ_w(t, D, T) can be interpreted as using Spike Compensation to approximate the velocity used in the weight prediction, which also corresponds to a different choice of g in equation (10) using the most recent gradient estimate.
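Continuing the sketch above, the combined update only changes how the delayed gradient is applied; the forward pass still uses the linear prediction. Again, the deque buffering and the example driver are our own illustration.

```python
# Combined mitigation: equations (16)-(18) with T = D and the SC_D coefficients
# a = m^D, b = (1 - m^D)/(1 - m) from equation (7).
from collections import deque
import numpy as np

def lwp_sc_train(w, grad_fn, lr, m, D, steps):
    v = np.zeros_like(w)
    a, b = m ** D, (1.0 - m ** D) / (1.0 - m)
    pending = deque()                     # gradients in flight, oldest first
    for _ in range(steps):
        w_hat = w - lr * D * v            # LWP^v_D prediction for the forward pass
        pending.append(grad_fn(w_hat))
        if len(pending) > D:
            g = pending.popleft()         # delayed gradient
            v = m * v + g                 # eq. (17)
            w = w - lr * (a * v + b * g)  # spike-compensated update, eq. (18)
    return w

# Example driver on a simple quadratic, where grad_fn(w) = w:
w_final = lwp_sc_train(np.array([1.0]), lambda w: w, lr=0.01, m=0.9, D=5, steps=500)
```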

3.4 Analysis for a Convex Quadratic

In this section we analyze the optimization of a convex quadratic loss with gradient delay. We find that delay causes effects that generally cannot be mitigated through hyperparameter tuning, but our methods:

• Improve convergence for large condition numbers
• Allow higher learning rates for large momentum values
• Restore the benefits of momentum for poorly conditioned losses

Although the loss surfaces of convex quadratics are much simpler than those of neural networks, they can be a useful tool to gain a high level understanding of various aspects of neural network optimization. For instance, Zhang et al. (2019a) accurately model the effect of batch size on neural network optimization through a noisy quadratic model. Venigalla et al. (2020), Giladi et al. (2019), and Yang et al. (2019) use quadratic models to analyze gradient delay mitigation methods for use in asynchronous deep network training. In general we find that the insights from the convex quadratic hold surprisingly well for neural networks, as shown in our experiments and the appendix.

Figure 6. These plots show the magnitude of the dominant root of the characteristic polynomials given in equations (20)-(23) as a function of the normalized rate ηλ and the momentum m. The two leftmost plots show zero delay baselines and the other plots use a delay of D = 1. The blacked out region has roots with magnitudes larger than one and is therefore unstable. For a delay of one, Nesterov momentum is equivalent to Spike Compensation, but for larger delays this does not hold and Nesterov is only marginally better than GDM.

We follow a similar approach as (O'Donoghue & Candes, 2015; Goh, 2017) and write the loss in terms of an eigenbasis of the quadratic as:

L(φ) = φ^T Λ φ,   Λ = diag(λ^{(1)}, ..., λ^{(N)})       (19)

where φ = [φ^{(1)}, ..., φ^{(N)}]^T correspond to the parameters being optimized and λ^{(1)} ≥ ... ≥ λ^{(N)} > 0 are the eigenvalues of the quadratic. As shown in e.g. Goh (2017), any positive definite quadratic can be written in this form through a coordinate transformation. Since Λ is diagonal, each coordinate of the gradient ∇_φ L(φ) = Λφ is independent of other coordinates. This allows us to analyze the convergence for each coordinate separately. For simplicity we assume that the gradient is deterministic. A similar analysis would hold for the expected values of φ if each gradient sample was assumed to be noisy but unbiased.
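To make the per-coordinate picture concrete, the short simulation below (our own setup, not the paper's code) runs SGDM on a single quadratic coordinate whose gradient λφ is delivered D steps late; settings that converge without delay can diverge once a delay is introduced.

```python
# SGDM (equations (1)-(2)) on one quadratic coordinate with gradient lambda*phi
# evaluated at the iterate from D steps ago.
def delayed_sgdm(lam, lr, m, D, steps=200, phi0=1.0):
    phi, v = phi0, 0.0
    history = [phi0] * (D + 1)          # past iterates, oldest first
    for _ in range(steps):
        g = lam * history[0]            # gradient of the D-steps-old iterate
        v = m * v + g
        phi = phi - lr * v
        history = history[1:] + [phi]
    return abs(phi)

print(delayed_sgdm(lam=1.0, lr=0.5, m=0.9, D=0))  # shrinks towards 0
print(delayed_sgdm(lam=1.0, lr=0.5, m=0.9, D=1))  # blows up with the same settings
```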

In Appendix H we derive the state transition equations for SGDM with delay and our methods. Since the gradient here is linear, and the coordinates are independent, inserting it into the transition equations results in a linear recurrence relation for each coordinate. For component φ^{(k)}, with associated eigenvalue λ = λ^{(k)}, the characteristic polynomial for the recurrence relation of each method is:

GDM:       p(z) = z^{τ+1} − (1 + m) z^τ + m z^{τ−1} − ηλ                              (20)

GSC:       p(z) = z^{τ+2} − (1 + m) z^{τ+1} + m z^τ + ηλ · (a + b) z − ηλ m b         (21)

LWP:       p(z) = z^{τ+2} − (1 + m) z^{τ+1} + m z^τ + ηλ · (1 + T) z − ηλ T           (22)

LWP_w+GSC: p(z) = z^{τ+3} − (1 + m) z^{τ+2} + m z^{τ+1} + ηλ · (a + b)(T + 1) z²
                  − ηλ · ((T + 1) m b + T · (a + b)) z + ηλ T m b                     (23)

where GDM stands for gradient descent with momentum, GSC is general Spike Compensation, LWP is linear weight prediction, z parameterizes the polynomials and the other symbols have the same meaning as in Section 3.3. Note that since the gradient is linear, GSC and LWP are equivalent for a certain choice of a, b and T as shown in Appendix H. Even though this is the case, the characteristic polynomial of the combination cannot be obtained from either method.

Linear recurrence relations have a well known solution in terms of the roots of the corresponding characteristic equation. The resulting sequence for component φ^{(i)}, corresponding to the characteristic polynomial p(z) with roots r_1, ..., r_n, can be written as:

φ^{(i)}_t = \sum_{k=1}^{n} q_k(t) r_k^t                  (24)

where q_k(t) is a polynomial. The order of the polynomial is one less than the multiplicity of the corresponding root r_k. The coefficients of the polynomials are determined by the initial conditions.

For our analysis we assume that all components start with some error and look at the rate of convergence in the limit t → ∞. A component φ^{(i)} converges to the optimal value of 0 if |r_max| = max_k(|r_k|) < 1. In the limit, the slowest term of equation (24) will dominate, so the error for this component, ε^{(i)}, will be:

ε^{(i)}_t = |φ^{(i)}_t − 0| ∝ |r_max|^t                  (25)

Figure 7. The half-life of the error as a function of the condition number when optimizing a convex quadratic with delay D. All methods improve the convergence rate; LWP^w_D+SC_D performs best.

The overall rate of convergence is determined by the slowest component. The slowest component depends on the roots of the characteristic polynomial. These can be difficult to determine analytically, so we turn to computational analysis. For a given delay, we can compute the roots of the characteristic polynomials (20)-(23), including |r_max|, as a function of the normalized rate ηλ and the momentum m. Figure 6 shows heatmaps of |r_max| for each method for a delay of one and our default values of a, b and T. Note that the region of stability is significantly reduced by the delay, especially for large momentum values. Our compensation methods counteract this, allowing larger learning rates to be used for high momentum values. SC_D in particular strictly increases the region of stability; the other methods slightly decrease it for small momentum coefficients.

Figure 6 also allows us to reason about more than a single component at a time. Let's assume that we have multiple components, a condition number κ = λ_1/λ_N and a dense spectrum of eigenvalues between λ_1 and λ_N. The same learning rate η and momentum m are used for all components. The overall convergence rate is determined by the component with the largest |r_max|. This corresponds to the largest value in a horizontal line segment between ηλ_N and ηλ_1 on the root heatmaps. With a log scale the line segment has a constant length determined by κ.

Figure 7 shows the convergence speed as a function of κ for the different methods. We measure the half-life −ln 2 / ln |r*|, where |r*| is obtained by finding the lowest max magnitude over all intervals of sufficient length. The methods improve the rate of convergence compared to the delayed baseline. The combination performs the best, which also holds for larger delays as shown in Figure 8.
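The half-life computation can be checked numerically for the delayed SGDM baseline. The snippet below forms the characteristic polynomial of the recurrence that follows directly from the update equations (1)-(2) with a delayed gradient (our own derivation, written in the same spirit as equation (20)) and converts the dominant root magnitude into the half-life −ln 2 / ln |r_max| plotted in Figure 7.

```python
# For g_t = lambda * phi_{t-D} the updates (1)-(2) give the recurrence
#   phi_{t+1} = (1 + m) phi_t - m phi_{t-1} - eta*lambda * phi_{t-D},
# whose characteristic polynomial we solve numerically (D >= 1 assumed).
import numpy as np

def half_life_delayed_sgdm(lr_lam, m, D):
    c = np.zeros(D + 2)                 # coefficients, highest power z^{D+1} first
    c[0] += 1.0                         # z^{D+1}
    c[1] += -(1.0 + m)                  # z^{D}
    c[2] += m                           # z^{D-1}
    c[-1] += lr_lam                     # constant term from the delayed gradient
    r_max = np.max(np.abs(np.roots(c)))
    return np.inf if r_max >= 1.0 else -np.log(2) / np.log(r_max)

print(half_life_delayed_sgdm(lr_lam=0.2, m=0.0, D=1))  # plain GD tolerates the delay
print(half_life_delayed_sgdm(lr_lam=0.2, m=0.9, D=1))  # high momentum becomes unstable
```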

As mentioned earlier, GSC and LWP can be equivalent for a convex quadratic. The fact that LWP_D slightly outperforms SC_D indicates that our selection of T = D is better than the selection of a and b as given in equation (7) in this case. Figure 9 shows the effect of different values of T. It shows that values close to T = 2D are optimal but do not outperform the combination LWP^w_D+SC_D. This seems to indicate that "overcompensating" for the delays, by predicting weights further out in LWP or equivalently by using larger spikes in SC, produces better optimization trajectories. The resulting root heatmaps resemble the ones for the no-delay Nesterov baseline (see LWP^w_D+SC_D in Figure 6; LWP with T = 2D looks similar). Note that adding Nesterov to the delay is not sufficient to get this effect. In Appendix I we show the effect of extended horizons for both the convex quadratic and a neural network.

Figure 8. The optimal half-life of the error for different delays when optimizing a convex quadratic with κ = 10³.

Figure 9. The effect of momentum and the horizon T for weight prediction on the optimal half-life when optimizing a convex quadratic with κ = 10³ for a delay D = 5.

Figure 9 also reveals that without mitigation (T = 0 is equal to GDM with delay), the optimal momentum is zero. In the no-delay case the optimal momentum is given by m = ((√κ − 1)/(√κ + 1))² (Zhang & Mitliagkas, 2017), which increases with the condition number. Our compensation methods restore the benefits of momentum for high condition numbers. Overall the combined mitigation performs the best. Extended horizons for LWP or the equivalent coefficients for GSC also outperform our default choice in this case but are unable to match the combination LWP^w_D+SC_D.


Figure 10. SGDM training of CIFAR-10 ResNet-20 with ON: validation curves (mean ± std.dev of 3 runs). Batch size 128 and batch size 1 both reach 92.2% final validation accuracy. Hyperparameter scaling rules described in Appendix D produce equivalent training.

Figure 11. Pipelined Backpropagation training of CIFAR-10 ResNet-20 using Online Normalization. Final validation accuracy decreases as batch size is increased.

Table 1. CIFAR-10 final validation accuracy (mean ± std.dev of 3 runs) for ResNet (RN) using Online Normalization.

NETWORK  STAGES  SGDM         PB
RN20     34      92.19±0.26   92.13±0.18
RN32     52      92.86±0.04   93.05±0.13
RN44     70      92.98±0.22   93.16±0.13
RN56     88      93.21±0.07   93.37±0.13
RN110    169     93.83±0.11   93.74±0.13

4 EXPERIMENTS

The goal of our experiments is to investigate the convergence properties of small batch, fine-grained Pipelined Backpropagation and compare it to standard mini-batch gradient descent. As mentioned before, the experiments are performed on GPUs, which do not benefit from pipelined training. Therefore, we do not compare wall-time to convergence, but Zhang et al. (2019c), Li & Pedram (2017), and Chen et al. (2016) have shown significant improvement in throughput and processing efficiency when pipelining neural network training on appropriate hardware. We experiment with two families of networks, VGG (Simonyan & Zisserman, 2014) and pre-activation ResNets (He et al., 2016b), on two commonly used image classification benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). We adopt the data preprocessing and hyperparameter settings for VGG and ResNet from Fu (2019) and Chiley et al. (2019) respectively. The delay created by PB is defined by the depth of the network architecture so we test various network depths. When applicable, each convolutional layer is combined with its associated activation function and normalization layer into a single pipeline stage. In our implementation the summation of a residual branch and a skip connection in residual networks also becomes a stage. In each row of Tables 1-4, the values within one standard error of the maximum accuracy are highlighted. Other details about our experimental setup can be found in Appendix L.

Figure 12. ImageNet ResNet-50 validation accuracy. Final accuracy: SGDM 75.7%, PB 75.1%, PB+LWP_D 75.2%, PB+SC_D 75.6%, PB+LWP^v_D+SC_D 75.8%.

4.1 Small Batch Sizes and Effective Normalization

The memory requirements of pipeline parallelism have a quadratic dependency on the number of stages. To keep the overall memory requirements reasonable, we use a batch size of one in our PB experiments unless otherwise stated. We take the learning rate and momentum used at a reference batch size and scale them according to the rules provided in Appendix D. The rules attempt to keep the impulse response of each gradient (the contribution to weight updates over time) similar at different batch sizes. Figure 10 shows training curves for ResNet-20 on CIFAR-10 at a batch size of 1 and 128 when using Stochastic Gradient Descent with Momentum (SGDM). The curves are near identical, suggesting that the hyperparameter scaling rules produce similar training trajectories when using different batch sizes. This enables a fair comparison between PB and SGDM even though different batch sizes are used. For our SGDM baselines, we use a batch size of 128 for CIFAR-10 and 32 for ImageNet since SGDM training at batch size 1 on GPUs is slow and expensive.
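The exact rules are given in Appendix D (not reproduced here). The sketch below is one natural scaling consistent with the description above and with Section 1 (update magnitude proportional to the batch size, impulse response matched per sample); treat it as our assumption rather than the paper's exact formula.

```python
# Hypothetical scaling from a reference batch size to a new one: keep the
# per-sample decay of the impulse response fixed via the momentum exponent, and
# keep each sample's total contribution to the weights fixed via the learning
# rate. This is our reading of the description, not Appendix D verbatim.
def scale_hyperparams(lr_ref, m_ref, batch_ref, batch_new):
    m_new = m_ref ** (batch_new / batch_ref)
    lr_new = lr_ref * (batch_new / batch_ref) * (1.0 - m_new) / (1.0 - m_ref)
    return lr_new, m_new

print(scale_hyperparams(lr_ref=0.1, m_ref=0.9, batch_ref=128, batch_new=1))
```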

In PB the delay (number of optimization steps) between the forward and backward passes for a given stage is determined by the network architecture. The effect of the delay depends on the total weight change that occurs over the course of the delay. When the learning rate and momentum are scaled as we do, the magnitude of the weight change depends on the number of samples processed between the forward and backward passes. As a result, decreasing the batch size produces smaller changes in the weights and therefore helps mitigate the effects of delay and weight inconsistency in PB. Figure 11 shows that PB performs best at batch size one.

Table 2. CIFAR-10 final validation accuracy (mean ± std.dev of 5 runs) for ResNet (RN) with group normalization and VGG training.

NETWORK  STAGES  SGDM        PB          PIPEDREAM     PB+LWP_D    PB+SC_D     PB+LWP^v_D+SC_D
VGG11    29      91.16±0.19  90.83±0.20  90.93±0.12    91.05±0.11  91.08±0.19  91.12±0.18
VGG13    33      92.57±0.15  92.59±0.15  92.30±0.24    92.51±0.11  92.38±0.27  92.56±0.14
VGG16    39      92.24±0.19  92.06±0.21  59.31±45.01²  92.22±0.24  92.45±0.30  92.38±0.27
RN20     34      90.63±0.31  90.44±0.24  90.36±0.06    90.68±0.30  90.80±0.29  90.92±0.25
RN32     52      91.68±0.23  91.46±0.09  91.40±0.28    91.66±0.10  91.55±0.14  92.04±0.13
RN44     70      92.19±0.14  91.71±0.25  91.72±0.14    92.00±0.14  92.13±0.16  92.16±0.26
RN56     88      92.39±0.20  91.89±0.40  91.82±0.19    92.31±0.14  92.33±0.16  92.48±0.11
RN110    169     92.77±0.22  91.81±0.15  91.92±0.33    92.76±0.05  92.28±0.29  92.41±0.16

Table 3. CIFAR-10 validation accuracy (mean ± std.dev of 5 runs) when tuning the learning rate (LR) for ResNet-20 with GN training. The learning rate shown is used for batch size 128 SGDM training and is adjusted for batch size one PB training.

LR      SGDM          PB          PB+LWP^v_D+SC_D
0.0125  88.76±0.45    88.77±0.22  89.32±0.26
0.025   89.88±0.32    89.55±0.35  90.06±0.23
0.05    90.47±0.22    90.10±0.40  90.80±0.37
0.1     90.63±0.31    90.44±0.24  90.92±0.25
0.2     90.69±0.25    90.22±0.11  90.89±0.28
0.4     89.54±0.32    88.82±0.32  89.93±0.20
0.8     69.16±33.08²  83.53±1.39  88.01±0.56

² Unstable training.

Batch normalization has been shown to perform poorly at small batch sizes (Singh & Krishnan, 2019). We experiment with two alternatives, Group Normalization (GN) from Wu & He (2018) and Online Normalization (ON) proposed by Chiley et al. (2019), which both work at batch size one. With ON we find that PB training does not cause a loss of accuracy when training ResNets on CIFAR-10 with a small batch size (Table 1). Group Normalization does not perform as well for CIFAR-10 in general; the baseline SGDM accuracy is significantly lower. Online Normalization and Batch Normalization may benefit from a regularization effect caused by cross-sample noise in the normalization statistics. The original Group Normalization work shows good performance on ImageNet where such a regularization effect might be less important. Aside from the baseline accuracy degradation, performing PB with GN also suffers from an additional degradation depending on the pipeline depth (Figure 12³, Table 2). The choice of normalization can therefore have a significant impact on the performance of PB.

³ Wu & He (2018) report an accuracy of 75.9%. They do this by extending and modifying the learning rate schedule we used, which we adopted from He et al. (2016a).

Figure 13. CIFAR-10 PB with Gradient Shrinking: validation accuracy (mean ± std.dev of 5 runs) for ResNet-44 with GN.

4.2 Mitigation Methods

Pipelined Backpropagation can suffer from a loss of accuracy or instability in some settings. This is more prominent in ResNets with GN and VGG networks, so we use them to measure the effect of our mitigation methods and compare them to other existing methods found in literature. For LWP and SC the default hyperparameters suggested by our convex quadratic analysis (Section 3.4) are used without further tuning. The results can potentially be improved with a hyperparameter search.

Table 3 shows a learning rate sweep for training ResNet-20 with GN on CIFAR-10 with both SGDM and PB. We find that PB suffers from a small accuracy degradation that cannot be tuned away by adjusting the learning rate. The original learning rate of 0.1 adopted from He et al. (2016a), and the appropriately scaled version for batch size one training, are optimal in both cases. We find that PB tolerates higher learning rates than SGDM. This could be an advantage of using small batches, which Li et al. (2018) suggest makes hyperparameters less sensitive than for larger batch sizes. Table 3 shows our methods improve convergence and that they are effective across a wide range of learning rates.

Table 4. CIFAR-10 (C10) validation accuracy (mean ± std.dev of five runs) and ImageNet (I1k) validation accuracy (single run) comparing SpecTrain and our methods for ResNet (RN) and VGG training.

NETWORK (DATASET)  SGDM        PB          PB+LWP^v_D+SC_D  SPECTRAIN
VGG13 (C10)        92.57±0.15  92.59±0.15  92.56±0.14       92.49±0.12
RN20 (C10)         90.63±0.31  90.44±0.24  90.92±0.25       90.93±0.09
RN56 (C10)         92.39±0.20  91.89±0.40  92.48±0.11       92.72±0.10
RN50 (I1k)         75.7        75.1        75.8             75.3

Table 2 and Figure 12 compare the accuracy of SGDM and PB training for several networks and different mitigation methods. PB suffers from an accuracy degradation depending on the network depth. Both Spike Compensation and Linear Weight Prediction help mitigate the loss of accuracy, but their combination generally performs the best as suggested by our analysis (Section 3.4). The combination recovers the full SGDM accuracy for ResNet-50 training and for every CIFAR-10 experiment except for the deepest network, ResNet-110. In that case it still closes most of the gap but LWP_D performs better and produces results competitive with SGDM. Overcompensating for the delays is thus helpful in most cases but may be less effective for the large delays encountered in the deepest networks. Training with large delays may be more sensitive to individual predictions and corrections, benefiting the comparatively more conservative LWP_D.

Like Chen et al. (2018), we find that the Weight Stashing method used in PipeDream (Harlap et al., 2018) to address weight inconsistency does not aid our training (Table 2). In Appendix E we show that the effects of weight inconsistency only become significant at large delays that are not encountered in our setting. Figure 13 shows that Gradient Shrinking (Zhuang et al., 2019) is also not effective for this type of training; no value of the shrinking factor is able to improve PB accuracy for RN44. Out of existing methods, SpecTrain (Chen et al., 2018) performs the best (Table 4). Similar to LWP^v_D+SC_D, it is able to recover or even improve accuracy on CIFAR-10. However, it is not able to recover accuracy for ResNet-50 training on ImageNet and has double the memory and compute overhead of our combined mitigation (Appendix G). Out of the mitigation methods tested, LWP^v_D+SC_D is thus the only one that is competitive with SGDM training in both the CIFAR-10 and ImageNet experiments.

5 CONCLUSION

Fine-grained Pipelined Backpropagation has several advantages for hardware compared to traditional training using batch parallel Stochastic Gradient Descent. As discussed in Sections 1 and 2 this can give large efficiency improvements for hardware architectures that can properly exploit these properties, such as Coarse-Grained Reconfigurable Arrays. However, traditional PB training can suffer from accuracy degradation and instability compared to standard training due to delayed gradients and weight inconsistency.

We show that small micro-batch sizes are crucial for making fine-grained Pipelined Backpropagation viable. Combined with an appropriate choice (or scaling) of hyperparameters, small batches reduce the negative effects of gradient delay and weight inconsistency. The use of small micro-batches also reduces the memory requirements that could otherwise be excessive. Unlike traditional training, fine-grained Pipelined Backpropagation can be efficient with small micro-batch sizes when combined with persistent kernels that do not need to amortize weight loading.

A good choice of normalization can also significantly aid Pipelined Backpropagation training. We experiment with two normalization options that work at batch size one, Online Normalization and Group Normalization. We observe that ON is significantly more robust to PB and helps stabilize training.

For cases where fine-grained, small batch, Pipelined Backpropagation suffers from an accuracy degradation, we present two new mitigation approaches, SC and LWP. We analyze their workings on a quadratic model, which suggests that the methods can increase stability, accelerate convergence, and restore the beneficial effects of momentum in the presence of gradient delays. The analysis also suggests that combining the methods and thus "overcompensating" for the delays can improve convergence. Our neural network experiments with PB confirm these advantages. We find that the combined mitigation outperforms existing mitigation strategies, allowing our PB training to match the reference accuracy on both ImageNet and CIFAR-10 with minimal overhead and without the need for additional hyperparameter tuning.

With our methods, PB is a promising alternative to traditional training. Future hardware architectures could reap significant efficiency gains from using small batch size, fine-grained Pipelined Backpropagation.


ACKNOWLEDGEMENTS

We are grateful to Vithursan Thangarasa, Ron Estrin, Natalia Vassilieva, and Dennis DeCoste for their feedback on the manuscript. We thank Min Xu for his help with the dataloader used in our GProp experiments and Chuan-Yung Tsai for insightful discussions.

REFERENCES

Amodei, D. and Hernandez, D. AI and compute, 2018. URL https://openai.com/blog/ai-and-compute/.

Avron, H., Druinsky, A., and Gupta, A. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. Journal of the ACM (JACM), 62(6):51, 2015.

Belilovsky, E., Eickenberg, M., and Oyallon, E. Decoupled greedy learning of CNNs. arXiv preprint arXiv:1901.08164, 2019.

Chen, C.-C., Yang, C.-L., and Cheng, H.-Y. Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. ArXiv, abs/1809.02839, 2018.

Chen, X., Eversole, A., Li, G., Yu, D., and Seide, F. Pipelined back-propagation for context-dependent deep neural networks. In Interspeech. ISCA, September 2012.

Chen, Y.-H., Emer, J., and Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM SIGARCH Computer Architecture News, 44(3):367–379, 2016.

Chen, Y.-H., Yang, T.-J., Emer, J., and Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 9(2):292–308, 2019.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

Chiley, V., Sharapov, I., Kosson, A., Köster, U., Reece, R., Samaniego de la Fuente, S., Subbiah, V., and James, M. Online normalization for training neural networks. In Advances in Neural Information Processing Systems 32, pp. 8431–8441. Curran Associates, Inc., 2019.

Chiley, V., Kosson, A., and Köster, U. Error compensation mechanism in online normalization, Apr 2020. URL https://www.cerebras.net/error-compensation-mechanism-in-online-normalization/.

Dauphin, Y. N. and Schoenholz, S. MetaInit: Initializing learning by learning to initialize. In Advances in Neural Information Processing Systems, pp. 12624–12636, 2019.

De, S. and Smith, S. L. Batch normalization biases deep residual networks towards shallow paths. arXiv preprint arXiv:2002.10444, 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

Diamos, G., Sengupta, S., Catanzaro, B., Chrzanowski, M., Coates, A., Elsen, E., Engel, J., Hannun, A., and Satheesh, S. Persistent RNNs: Stashing recurrent weights on-chip. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2024–2033, New York, New York, USA, 20–22 Jun 2016. PMLR.

Fu, C.-Y. pytorch-vgg-cifar10, May 2019. URL https://github.com/chengyangfu/pytorch-vgg-cifar10.

Gaunt, A. L., Johnson, M. A., Riechert, M., Tarlow, D., Tomioka, R., Vytiniotis, D., and Webster, S. AMPNet: Asynchronous model-parallel training for dynamic neural networks. arXiv preprint arXiv:1705.09786, 2017.

Giladi, N., Nacson, M. S., Hoffer, E., and Soudry, D. At stability's edge: How to adjust hyperparameters to preserve minima selection in asynchronous training of neural networks? arXiv preprint arXiv:1909.12340, 2019.

Goh, G. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006.

Graphcore. Graphcore documents, 2020. URL https://docs.graphcore.ai/projects/tf-model-parallelism/en/latest/pipelining.html.

Hakimi, I., Barkai, S., Gabel, M., and Schuster, A. Taming momentum in a distributed asynchronous environment. arXiv preprint arXiv:1907.11612, 2019.

Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., and Gibbons, P. B. PipeDream: Fast and efficient pipeline parallel DNN training. ArXiv, abs/1806.03377, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.


Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs/1811.06965, 2018.

Huo, Z., Gu, B., and Huang, H. Training neural networks using features replay. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6659–6668. Curran Associates, Inc., 2018a.

Huo, Z., Gu, B., Yang, Q., and Huang, H. Decoupled parallel backpropagation with convergence guarantee. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2098–2106, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018b. PMLR.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp. 448–456. JMLR.org, 2015.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1627–1635, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Jia, Z., Tillman, B., Maggioni, M., and Scarpazza, D. P. Dissecting the Graphcore IPU architecture via microbenchmarking. arXiv preprint arXiv:1912.03413, 2019.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images, 2009.

Kunle Olukotun. Accelerating software 2.0, 2 2020. URL https://info.matroid.com/scaledml-media-archive-2020.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399, 2018.

Li, Y. and Pedram, A. CATERPILLAR: Coarse grain reconfigurable architecture for accelerating the training of deep neural networks. CoRR, abs/1706.00517, 2017.

Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745, 2015.

Masters, D. and Luschi, C. Revisiting small batch training for deep neural networks. ArXiv, abs/1804.07612, 2018.

Nicol, C. A coarse grain reconfigurable array (CGRA) for statically scheduled data flow computing, 2017.

O'Donoghue, B. and Candes, E. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.

Petrowski, A., Dreyfus, G., and Girault, C. Performance analysis of a pipelined backpropagation parallel algorithm. IEEE Transactions on Neural Networks, 4(6):970–981, 1993.

Podobas, A., Sano, K., and Matsuoka, S. A survey on coarse-grained reconfigurable architectures from a performance perspective. arXiv preprint arXiv:2004.04509, 2020.

Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. L. Weight standardization. CoRR, abs/1903.10520, 2019.

Sean Lie. Wafer-scale ML, 2 2020. URL https://info.matroid.com/scaledml-media-archive-2020.

Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Singh, S. and Krishnan, S. Filter response normalizationlayer: Eliminating batch dependence in the training ofdeep neural networks. arXiv preprint arXiv:1911.09737,2019.

Vassilieva, N. Neural network parallelism atwafer scale, Apr 2020. URL https://www.cerebras.net/data-model-pipeline-parallel-training-neural-networks/.

Venigalla, A., Kosson, A., Chiley, V., and Koster, U. Adap-tive Braking for Mitigating Gradient Delay. In Interna-tional Conference on Machine Learning Workshop on

Page 13: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

Beyond First-Order Optimization Methods in MachineLearning, 2020.

Wu, Y. and He, K. Group normalization. In The Euro-pean Conference on Computer Vision (ECCV), September2018.

Xu, A., Huo, Z., and Huang, H. Diversely stale pa-rameters for efficient training of cnns. arXiv preprintarXiv:1909.02625, 2019.

Yang, B., Zhang, J., Li, J., Re, C., Aberger, C. R., andDe Sa, C. Pipemare: Asynchronous pipeline parallelDNN training. arXiv preprint arXiv:1910.05124, 2019.

Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl,G., Shallue, C., and Grosse, R. B. Which algorithmicchoices matter at which batch sizes? insights from a noisyquadratic model. In Advances in Neural InformationProcessing Systems, pp. 8196–8207, 2019a.

Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization:Residual learning without normalization. In 7th Inter-national Conference on Learning Representations, ICLR2019, pp. 1–16, 2019b.

Zhang, J. and Mitliagkas, I. Yellowfin and the art of mo-mentum tuning. arXiv preprint arXiv:1706.03471, 2017.

Zhang, Y., Rucker, A., Vilim, M., Prabhakar, R., Hwang,W., and Olukotun, K. Scalable interconnects for recon-figurable spatial architectures. In 2019 ACM/IEEE 46thAnnual International Symposium on Computer Architec-ture (ISCA), pp. 615–628. IEEE, 2019c.

Zhuang, H., Wang, Y., Liu, Q., and Lin, Z. Fully decoupledneural network learning using delayed gradients. ArXiv,abs/1906.09108, 2019.

Page 14: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

A PIPELINED BACKPROPAGATION

Pipeline parallelism is an interesting alternative or supple-ment to data parallelism (Appendix B). To perform SGDtraining using pipeline parallelism, the same weights mustbe used on the forward and backwards passes. To satisfy thisthe pipeline needs to be empty before updating the weights.While the pipeline is filling or draining some workers sitidle which lowers utilization. The fill and drain overhead isillustrated in Figure 2.

Assume a pipeline has S pipeline stages and each stageperforms a single forward and a single backward transfor-mation at each time step. Each sample is processed in 2S−1time steps. Performing a mini-batch SGD update with Nsamples takes roughly N + 2S − 2 ≈ N + 2S steps. 4 Thework performed only corresponds to N fully utilized stepsso the overall utilization is upper bounded by

N

N + 2S(26)

Unless N � S this represents a significant overhead.

Pipelined Backpropagation (Petrowski et al., 1993) avoidsthe fill and drain overhead by relaxing the constraint that thesame weights must be used for the forward and backwardspasses. In PB the pipeline is not drained before an updateis applied, instead the parameters are updated as soon asN gradients have been obtained. This keeps all workersutilized after the pipeline is filled for the first time (Figure2). We assume an update size (N ) of one and scale hyper-parameters appropriately (Appendix D). We compare theweight updates of PB and SGD. We write SGD as:

θt+1 = θt − η∇L(xt; θt) (27)

where θ is the set of all model weights, xt is the sampleat time t, η is the learning rate, and L is the loss function.For PB we define wsi to be the weights for pipeline stages ∈ [0, ..., S − 1] as seen by the ith sample, xi, as it prop-agates backwards through the network. Wi is defined asthe concatenation (denoted by ||) of wsi for all stages andcorresponds to the weights on the blue line in Figure 2:

Wi = w0i || w1

i || · · · || wS−1i (28)

The weight update for xi can then be written as:

Wi+1 = Wi − ηG(xi;Fi,Wi) (29)

where G approximates the gradient and Fi is the networkstate used for the forward pass of the network and corre-sponds to the weights on the red line in Figure 2. For PBwith N = 1:

Fi = w0i−2(S−1) || w

1i−2(S−2) || · · · || w

S−1i (30)

4This is assuming the workers are unable to speed up process-ing when they only perform one of the transformations, otherwiseit may be about N + S.

(28) - (30) reveal that PB differs from SGD in two ways:inconsistent weights and stale gradients.

Inconsistent Weights Different weight are used during theforward and backwards pass, Wi 6= Fi. The resulting sam-ple gradient is not the true sample gradient. The inconsis-tency is greater for earlier stages in the pipeline. If WeightStashing (Harlap et al., 2018) is used to mitigate weightinconsistency the update is:

Wi+1 = Wi − ηG(xi;Fi, Fi) = Wi − η∇L(xi;Fi) (31)

Weight Stashing (WS) requires the overhead of storing pa-rameter versions along with the activations.

Stale Gradients In PB each gradient is obtained usingweights from various time steps. When the gradient is ob-tained the weights have been updated. This results in stalegradients (aka. delayed gradients), an issue that also occursin asynchronous SGD training (Lian et al., 2015; Avronet al., 2015). The gradient staleness varies by stage, earlierstages suffer from a greater degree of staleness. The lengthof the grey lines in Figure 2 is proportional to the age ofthe weights, which is also a measure of the gradient delayfor each stage. The depth of the pipeline determines themaximum delay. Weight Stashing does not address gradientdelay because Fi in (31) is a delayed version of Wi.

B BATCH PARALLEL VS PIPELINEPARALLEL COMPUTATION

Pipeline parallelism differs from batch parallelism in severalways:

• The training memory requirements differ. In both caseswe assume an L layer network trained with W work-ers. During neural network training, the activations ofmany layers must be stored for the gradient calculation.For batch parallelism the activation memory required isO(LW ). To compute the backwards pass, each workerhas to store activations for roughly every layer. In thepipeline parallel setting, each worker is responsiblefor storing the activations of approximately L/W lay-ers. The first worker must store its activations for 2Wsteps. The second worker needs to keep activations for2(W − 1) steps and so on. The total activation mem-ory comes out to be approximately the same, O(LW ),however, the per worker memory requirements can bevery different. Pipeline parallelism generally requiresless memory for storing model parameters potentiallyrequiring only a single copy of each parameter. Unlessspecial methods are used, batch parallelism may needto keep W copies of the model.

• The communication pattern is different. In pipelineparallelism each worker sends activations and the cor-responding gradients to their neighbors. In distributed

Page 15: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

mini-batch training every worker must send the gra-dients for all model parameters and receive updatedvalues after every batch. The bandwidth requirementsin each case depend on the exact model used, the batchsize, as well as other factors.

• Both pipeline parallel training and synchronized dis-tributed batch parallel training can suffer from workerbalancing bottlenecks. When using pipeline paral-lelism, care must be taken to balance the throughputof all workers since the overall speed is determined bythe slowest worker. This load balancing issue couldbe handled in software (Harlap et al., 2018) withoutrequiring users to manually specify the model division.In synchronized distributed SGD care must be taken tobalance the throughput and master node communica-tion of all workers since the overall speed is determinedby the slowest worker.

• Batch normalization (Ioffe & Szegedy, 2015) requiresbatch parallelism. In our work we are interested inreplacing batch parallelism with fine-grained pipelineparallelism. We therefore operate at a micro-batchsize of one which does not work well with Batch Nor-malization. Newer normalization techniques such asGroup Normalization (Wu & He, 2018), Weight Stan-dardization (Qiao et al., 2019), Filter Response Nor-malization (Singh & Krishnan, 2019) and Online Nor-malization (Chiley et al., 2019; 2020) are alternativenormalization techniques which work well and can beused with small batch sizes. Alternatively initializationmethods can be used to enable training without normal-ization (Zhang et al., 2019b; Dauphin & Schoenholz,2019; De & Smith, 2020).

C GPU BOTTLENECKS FORFINE-GRAINED PIPELINED TRAINING

A GPU consists of multiple cores with limited local memoryand large off-chip main memory. The main memory hashigh latency and relatively low bandwidth compared to thearithmetic throughput of the cores. To achieve high computeutilization, a GPU kernel must be designed to work aroundthese memory bandwidth and latency limitations. Latencylimitations are often managed by running a large number ofthreads in parallel and context switching to another threadwhen waiting on memory. Large batch operations can pro-vide a workload with enough threads to hide latency issuesbut the arithmetic intensity of operations must also be highenough to alleviate bandwidth bottlenecks. Without suffi-cient cache reuse bandwidth becomes an issue. This canhappen with small batch sizes and certain operations such asdepth-wise convolutions, element-wise functions and sparsematrix multiplications.

Due to the activation memory requirements of pipelined

training, modern network using this paradigm must usesmall batch sizes to fit in GPU memory. At small batchsizes, the amount of computation per kernel might be insuf-ficient to utilize all compute resources, but a large number ofkernels can run in parallel which can significantly increasecompute utilization. The compute throughput of the GPU isequal to the rate at which kernels are launched multipliedby the work done by each kernel. As the work per kernelis decreased, the kernel launch rate must be increased tomaintain compute throughput. At batch size one it may notbe possible to launch kernels at a sufficient rate to keeputilization high on small networks such as ResNet-20 forCIFAR-10. Larger networks tend to have more work doneper kernel and do not suffer from this problem.

Without significant weight reuse, GPUs become memorybandwidth limited. For convolutional layers the weightsare reused over the spatial and batch dimensions. Weightreuse increases as the spatial dimensions of the inputs in-crease. This makes bandwidth less of an issue for ImageNet(i1k) scale networks when compared to CIFAR-10 scalenetworks.

There are a few other challenges to small batch sized train-ing. At small batch sizes optimizer overheads become signif-icant. Each optimizer step requires loading the entire model,consuming significant memory bandwidth. At large batchsizes this is amortized over the batch size. For a batch sizeof one the optimizer steps consume a large fraction of thetotal memory bandwidth. Similarly, the time required forany new memory allocations cannot be amortized over thebatch size. While persistent kernels can enable some of theadvantages and speedups of worker specialization (Diamoset al., 2016), the total on-chip memory of modern GPUs istoo limited to run large models using fine-grained pipelinedparalellism with persistent kernels; running multiple kernelson multiple threads concurrently limits the resources avail-able to each threads making it impossible to keep weightsin local memory or use persistent kernels for our work.

D SMALL BATCH SIZE TRAINING

We define the per-worker batch size to be the number ofsamples that each pipeline stage processes at a time and theupdate-size to be the number of samples that contribute tothe gradient in each update. We set both of these to one inour experiments. Alternatively the update-size could be setto match some reference for which known hyperparametersexist but this is outside the scope of this work.

Since the optimal learning rate and momentum depend onthe update size N , we scale the values used by the SGDMreference according to Chiley et al. (2019). This correspondsto scaling the effective learning rate linearly with the updatesize while scaling the momentum such that the decay per

Page 16: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0Epoch

10

20

30

40

50

60

70

80

Accu

racy

1128

(a) Training accuracy.

2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0Epoch

30

40

50

60

70

80

Accu

racy

1128

(b) Validation accuracy.

Figure 14. Hyperparameter comparison using CIFAR-10 VGG11. Showing mean (shading is standard deviation) of ten runs.

sample is the same. This allows for a fair comparison oftechniques even though different update sizes are used. Thescaling rules are:

m = mN/Nrr , η =

(1−m)N

(1−mr)Nrηr (32)

where ηr, mr and Nr are the reference learning rate, mo-mentum coefficient and update size and η, m and N are thenew values (we use N = 1).

Figures 14a and 14b shows that the hyperparameters pro-duced using these scaling rule result in batch size 1 trainingcurves similar to the reference when training VGG11 on theCIFAR-10 dataset. A similar result can be seen for ResNettraining in Figure 10.

E INCONSISTENT WEIGHTS VS STALEGRADIENTS

In Pipelined Backpropagation gradients are delayed andcomputed with inconsistent weights. This can lead to ac-curacy degradation and instability. In this section we in-vestigate the relative importance of the effects. We do thisby comparing training with delayed gradients using eitherinconsistent or consistent weights. In Appendix K we de-scribe how we can simulate this in PyTorch (Paszke et al.,2019) without using Pipelined Backpropagation.

Figure 15 shows the effects of delay on the final accuracyof CIFAR-10 ResNet-20 training with or without inconsis-tent weights. As can be seen, even modest delays affectthe final accuracy of training. Weight inconsistency doesnot cause an additional loss of accuracy for small delaysbut causes a rapid loss of accuracy beyond a certain delay.This transition point where weight inconsistency starts toaffect training will depend on the dataset and architecture.Harlap et al. (2018) and Chen et al. (2018) make oppos-ing claims about the effect of weight inconsistency. Harlapet al. (2018) introduce Weight Stashing to fix weight in-

0 1 2 3 4 5 6 7 8Delay [batches of 32]

76

78

80

82

84

86

88

90

92

Fina

l Val

idat

ion

Accu

racy

Forward Delay OnlyConsistent Delay

Figure 15. The effect of weight inconsistency on the final valida-tion accuracy of CIFAR-10 ResNet-20 (with Group Normalization)for different delays. Consistent Delay uses the same old versionof the weights for both the forward and backward passes. This pro-duces delayed gradients. Forward Delay Only uses old versionsof the weights on the forward pass and current weights on the back-wards pass, resulting in weight inconsistency. Delayed gradientsresult in a loss of final accuracy. Adding weight inconsistency onlyincurs additional degradation for large delays.

consistency and claim its use is necessary for convergence.Chen et al. (2018) show that Weight Stashing has no effecton training in their experiments so it should not be used toavoid memory overhead. Our results suggest that the effectsof weight inconsistency depend on the magnitude of delaysreconciling the two claims.

We also investigate the effect of weight inconsistency inour fine-grained Pipelined Backpropagation setup. Table 2compares PB training with and without Weight Stashing.The results suggest that Weight Stashing is not beneficialin our setup so we do not use it in other experiments. Thisindicates that weight inconsistency is likely not an issue andthe accuracy losses of PB primarily stem from the gradientdelay. As mentioned in the conclusion, the small batchsizes we use combined with the hyperparameter scalingmay reduce the effects of the delay. For larger batch sizesweight inconsistency may be a bigger issue.

Page 17: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

F FORMS OF WEIGHT PREDICTION

The goal of weight prediction is to approximate futureweights to combat gradient delay and weight inconsistency.Linear Weight Prediction (LWP) gives a general form forpredicting the network state T steps into the future by usingthe velocity. In Pipelined Backpropagation the delay variesfor different stages. By default (LWPD) we set T equalto the delay for every stage (see red arrows in Figure 16left). Other works have proposed related forms of weightprediction.

LWP is closely related to the weight prediction proposed inSpecTrain (Chen et al., 2018). SpecTrain extends the pre-diction horizon such that all stages predict to the same timestep. This form of time synchronization is first described byHarlap et al. (2018) as Vertical Sync. The forward predic-tion horizon is depicted in green in Figure 16 left. With theextended prediction horizon, SpecTrain must also predictsweights on the backwards pass to address inconsistency. Theprediction horizon for the backward pass weights is depictedin blue in Figure 16 left. This can be seen as using a stagedependent extended prediction horizon (Appendix I).

Discrepancy correction (Yang et al., 2019) can be seen asa form of weight prediction. Whereas LWP and SpecTrainpredict weights into the future to mitigate for gradient de-lay and weight inconsistency, PipeMare approximates theweights used on the forward pass during the backward pass.This can only deal with weight inconsistency, but potentiallyprovides a more accurate prediction. Discrepancy correc-tion uses a separate exponential tracker for their prediction.LWP uses the optimizer velocity directly. In Appendix E weshow that weight inconsistency is not a significant issue inour setting so we primarily focus on mitigating the effectsof gradient delay.

DANA (Hakimi et al., 2019) is another variant of weightprediction that has been used in the ASGD setting but is notdirectly applicable to Pipelined Backpropagation.

G MITIGATION METHODS OVERHEAD

Figure 16 highlights two network stages and their associatedoverheads for different mitigation. While only two of thestages are highlighted these overheads exist for every stagein the network. The two stages are highlighted with reddashed lines. The yellow dashed line shows a vertical syn-chronization boundary. Figure 16 left depicts the forwardand backward weight predictions of SpecTrain in green andblue, respectively. SpecTrain has the memory overheadof storing the forward and backward state as well as thecompute overhead of predicting those states. The forwardweight prediction of LWP is shown in red which has ex-actly half the compute and memory overhead of SpecTrain.Pipemare’s discrepancy correction technique is a form of

Pipeline Step

Pipe

line

Stag

e

Pipeline Step

Pipe

line

Stag

e

Figure 16. Compute and Memory Overhead of PB Mitigations.Two of the stages are highlighted in red dashed lines and theyellow dashed line depicts a vertical synchronization boundary.Left: Shows the forward and backward prediction horizons ofSpecTrain in green and blue respectively. The forward predictionhorizon of LWPD is shown in red and the backward predictionhorizon of Pipemare’s discrepancy correction is shown in purple.Right: show the buffer lengths for Pipedreams Weight Stashing inblue and the gradient buffers of Vertical Sync in green.

backward weight prediction (shown in purple) and has thecompute and memory overhead of storing and predictingthe forward pass network state for use in the backwardspass; they also introduce a state transition tracker to makethis prediction. Their proposed warmup epochs would alsoincur the fill and drain overhead of pipelined training. Fig-ure 16 right shows the weight and gradient tensor buffersrequired for Pipedream’s Weight Stashing and Vertical Syncin blue and green respectively. While Pipedream has byfar the largest memory overhead, their methods which haveno compute overhead. SC and Gradient Shrinking simplymodify the weight update equations and have no memoryor compute overhead.

H STATE TRANSITION EQUATIONS

In order to analyze and compare our methods, we view theoptimization as a dynamical system in terms of its state tran-sition equation. A similar approach is used in (O’donoghue& Candes, 2015; Goh, 2017; Giladi et al., 2019). We assumethat L(wt) is the underlying loss function we are trying tominimize where wt are the weights at time t. For neuralnetworks, L could be the mean training loss, the expectedloss over all training samples. We assume that for a givensample or time step, the gradient with respect to the weightsis ∇L(wt) + R where R = R(wt) is a random variable.The expectation of R (over all samples) is assumed to bezero.

We are interested in comparing the dynamics of delayedSGDM, weight prediction, Spike Compensation and thecombined mitigation. These can all be seen as special casesof the combined mitigation given in Section 3.3 for theappropriate choice of a, b and T . The velocity form of thecombined mitigation, LWPv+SC, results in a complicatedstate transition equation which can not be easily analyzedwithout further simplifications. The velocity form can beapproximated with the weight difference form, LWPw+SC.

Page 18: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

This form is simple to analyze so we use it for the rest ofthe analysis.

We analyze the systems in expectation and do not try toestimate the variance. Let wt and vt be the expected weightsand velocity at time t. We can then write the expected stateupdate for the combined mitigation at time t in terms ofprevious expected values as:

vt+1 = E[mvt + gt]

= mvt + gt (33)wt+1 = E[wt − η · (avt+1 + bgt)]

= wt − η · (avt+1 + bgt) (34)

where a, b are the coefficients for General Spike Compen-sation and gt := E[gt] is the expected gradient arriving attime t. This gradient is calculated using weight predictionwith horizon T from weights delayed by D time steps:

gt = E[∇L (wt−D + T · (wt−D − wt−D−1)) +R]

= ∇L (wt−D + T · (wt−D − wt−D−1)) (35)

We can isolate vt+1 from equation (34):

vt+1 =−1

ηa(wt+1 − wt)−

b

agt (36)

Shifting the time index we obtain an expression for vt whichwe can insert into equation (33):

vt+1 =−mηa

(wt − wt−1)− bm

agt−1 + gt (37)

Combining equations (34), (35) and (37) we obtain a statetransition equation in terms of the expected weights withoutthe velocity:

wt+1 = (1 +m)wt −mwt−1 (38)− η · (a+ b)∇L ((T + 1)wt−D − Twt−D−1))

+ ηmb∇L ((T + 1)wt−D−1 − Twt−D−2))

By inserting appropriate values for T , a and b we can ob-tain the state transition equations for General Spike Com-pensation (GSC, T = 0), linear weight prediction (LWP,a = 1, b = 0) and SGDM with delay (a = 1, b = 0, T = 0):

SGDM: wt+1 = (1 +m)wt −mwt−1 (39)− η∇L(wt−D)

GSC: wt+1 = (1 +m)wt −mwt−1 (40)− η · (a+ b)∇L(wt−D)

+ ηmb∇L(wt−D−1)

LWP: wt+1 = (1 +m)wt −mwt−1 (41)− η∇L ((T + 1)wt−D − Twt−D−1)

We note that unlike state transition equation of SGDM theequations for LWP and GSC both contain wt−D−1. Thismeans that the mitigation methods generally do not corre-spond to a simple change in the hyperparameter values ofSGDM. Similarly, the combination of GSC and LWP hasan additional wt−D−2 term and thus does not simply corre-spond to a different setting of a, b or T for either method.

H.1 Comparing LWP and GSC

The equations for LWP and GSC contain the same weightterms which could indicate that they operate in similar ways.If the gradient is well approximated as a linear function onthe line segment:

{wt−D−1 + α(T + 1)(wt−D − wt−D−1) | α ∈ [0; 1]}

we have:

∇L ((T + 1)wt−D − Twt−D−1)

≈ (T + 1)∇L(wt−D)− T∇L(wt−D−1)(42)

In this case GSC and LWP are equivalent for the samelearning rate and momentum if:

a+ b = 1 + T (43)mb = T (44)

When the approximation in equation (42) holds, LWP isequivalent to our default choice of a and b (see equation (7))if:

T = m1−mD

1−m(45)

This is equivalent to assuming zero future gradient over theprediction horizon in equation (10) instead of a constantvelocity. GSC is equivalent to LWP with horizon T for thesame learning rate if the approximation in (42) holds and:

a = 1− 1−mm

T, b =T

m(46)

This shows that LWP and GSC are closely related. Bothmethods compensate for a delay but at different points intime. Weight prediction changes how the gradient is com-puted, Spike Compensation changes how it is applied. Eachmethod has its advantages. Spike Compensation has mini-mal overhead and doesn’t require an estimate of the delayahead of time. Weight prediction might introduce memoryoverhead by adding a new copy of the weights (dependingon the implementation and hardware), but may help reduceweight inconsistency. The combination of the two methodscan be useful in cases where we want to overcompensatefor the delay. A similar effect can be achieved with ei-ther method by changing the horizon but their combinationoffers increased weight consistency without requiring anadditional weight prediction on the backwards pass.

Page 19: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

I EXTENDED WEIGHT PREDICTIONHORIZONS

In Section 3.4 we discuss how overcompensating for delayscan help improve convergence speed. One way to do thisis to predict weights more than D (the delay) steps intothe future with linear weight prediction. Figure 17 showsthe effect of scaling the weight prediction horizon on theconvergence rate when optimizing a convex quadratic. Wesee that horizon lengths of around T = 2D seem to give thebest results.

We repeated this experiment for ResNet-20 (with groupnormalization) trained on CIFAR-10 using the simplifieddelay setup described in Appendix K. We used a delayD = 4 for all layers with consistent weights and a batchsize of 32 for a total delay of 128 samples (which is inthe range of many of our PB experiments). The learningrate and momentum were scaled according to (32) usingthe default reference values referenced in the experimentssection. The results can be seen in Figure 18. We can seethat the training loss curve looks somewhat similar to theconvergence speed for the convex quadratic, with the lowestloss obtained for T ≈ 2D. The validation accuracy alsopeaks for T ≈ 2D.

0 2 4 6 8 10Prediction Scale

2.0

2.5

3.0

3.5

4.0

4.5

5.0

Min

inum

Hal

flife

(log

10) = 103, D=4

= 103, D=10= 105, D=4

Figure 17. The convergence speed for a convex quadratic with dif-ferent condition numbers (κ) and delays (D). A weight predictionwith horizon T = αD is used where α is the prediction scaleshown on the horizontal axis.

We also test this hypothesis in the Pipelined Backpropaga-tion setting. We explore the use of weight prediction witha horizon which is double that of the delay (LWP2D). Wealso experiment with overcompensating for the delay bydoubling the effect of Spike Compensation (SC2D which re-places D with 2D in (7)). We observe that overcompensatingcan improve the final accuracy in most cases (Table 55). Wenote that in these networks weight inconsistency does notseem to be an issue (see Appendix E). In cases where weightinconsistency is an issue, doubling the prediction horizoncan reduce training stability. The same may apply to net-works with large delays. One such example may be training

5In each row, in each column pair, the values within one stan-dard error of the maximum accuracy are highlighted.

0 2 4 6 8 10Prediction Scale

89.00

89.25

89.50

89.75

90.00

90.25

90.50

Fina

l Val

idat

ion

Accu

racy

[%]

0.0250.0300.0350.0400.0450.0500.0550.060

Fina

l Tra

inin

g Lo

ss

Figure 18. The effects of different weight prediction horizons onthe final loss and accuracy when training ResNet-20 on CIFAR-10.A prediction scale of α scales the horizon to be T = αD whereD = 4 is the delay. The delay is the same for all layers andconsistent weights are used. Each point is the mean of multipleruns, 25 for 1.75 ≤ α ≤ 2.5, and 10 for other α values.

ResNet-110 on CIFAR-10 (Table 5) where standard weightprediction outperforms methods which overcompensate fordelay.

J EFFECTS OF MOMENTUM SCALING

Throughout this work we heuristically scale the momentumand learning rate for small batch size training accordingto (32). This enables us to use Pipelined Backpropaga-tion without further hyperparameter tuning for existing net-works which is important for the practicality of PB training.These rules increase the momentum significantly comparedto other heuristics which might keep it constant or lower it.In Section 3.4 we show that momentum loses some of itsbenefits with delays. However, our compensation methods,Spike Compensation and Linear Weight Prediction, likelybenefit from high momentum. In this section we look at theeffects of different momentum values, while keeping thetotal contribution from each gradient the same. We do thisby selecting a specific value of m in (32) (ignoring the firstexpression) and then scaling the learning rate according tothe second expression.

The experiments involve training ResNet-20 (with groupnormalization) on CIFAR-10 using the simplified delaysetup described in Appendix K. We use a batch size of8 and a delay of 12 for all layers for a total delay of 96 sam-ples (which is in the range of many of our PB experiments).Figure 19a shows this when consistent weights are used. Wecan see that for the baseline with no delay a wide range ofmomentum values can be used, including no momentum,but very large values cause accuracy loss. With delay, smallvalues of momentum are better and the accuracy falls offrelatively quickly for larger values. With our compensationmethods the best accuracy is obtained for large momentumvalues. Spike Compensation has no effect for low (zero) mo-mentum values and therefore matches the delayed baseline

Page 20: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

Table 5. CIFAR-10 validation accuracy (mean±std.dev of five runs) for ResNet (RN) and VGG training with overcompensation.

NETWORK SGDM PB PB+LWPD PB+LWP2D PB+SCD PB+SC2D

VGG11 91.16±0.19 90.83±0.20 91.05±0.11 91.27±0.14 91.08±0.19 91.03±0.22VGG13 92.57±0.15 92.59±0.15 92.51±0.11 92.57±0.21 92.38±0.27 92.60±0.17VGG16 92.24±0.19 92.06±0.21 92.22±0.24 92.28±0.18 92.45±0.30 92.42±0.21RN20 90.63±0.31 90.44±0.24 90.68±0.30 91.05±0.10 90.80±0.29 90.95±0.40RN32 91.68±0.23 91.46±0.09 91.66±0.10 91.98±0.22 91.55±0.14 91.96±0.24RN44 92.19±0.14 91.71±0.25 92.00±0.14 92.29±0.09 92.13±0.16 92.21±0.21RN56 92.39±0.20 91.89±0.40 92.31±0.14 92.41±0.17 92.33±0.16 92.68±0.23RN110 92.77±0.22 91.81±0.15 92.76±0.05 71.83±36.912 92.28±0.29 92.35±0.85

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0-log10(1-m)

86

87

88

89

90

91

Fina

l Val

idat

ion

Accu

racy

No Delay BaselineD=12SCD, D=12LWPD, D=12LWPv

D+SCD, D=12

(a) Using consistent weights.

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0-log10(1-m)

84

86

88

90

Fina

l Val

idat

ion

Accu

racy

No Delay BaselineD=12SCD, D=12LWPD, D=12LWPv

D+SCD, D=12

(b) Using inconsistent weights.

Figure 19. Effect of momentum on CIFAR-10 ResNet-20 training with delay. Showing the mean of three runs (six for the no-delay case).

for small momentum values. Weight prediction for smallmomentum values tries to predict future weights based onrecent gradients without sufficient smoothing and performsworse than the baseline. The combined mitigation exceedsthe best results for the no-delay baseline for a range of largemomentum values.

Figure 19b shows the same experiment performed withinconsistent weights (using the most recent weights on thebackwards pass instead of the delayed weights used on theforward pass). Most of the observations from the previousexperiment hold in this case as well. The most notabledifference is the poor performance of all methods when lowmomentum is used. This suggests that small momentumvalues adversely affect weight consistency. These runs donot use a tuned learning rate or a learning rate warmupwhich could likely help stabilize lower momentum values.Using our formulation of momentum causes a warmup in thestep size while the velocity is building up. This effect couldcontribute to larger momentum values performing better.Another factor may be the exponential smoothing of weightupdates with momentum. Without this, a couple of relativelylarge gradients could cause a large weight inconsistency forsome time steps, potentially destabilize training.

K SIMULATING DELAYED GRADIENTS

Weight inconsistency and delayed gradients are potentialissues in Pipelined Backpropagation. To better understand

the issues we simulated weight inconsistency and delayedgradients in a PyTorch (Paszke et al., 2019) environmentusing a modified optimizer. The modified optimizer has abuffer of old parameter values. To apply a delay D, themodel is loaded with parameters from D time steps ago, aforward and backward pass is performed. The resulting gra-dients are then used to update a master copy of the weights.Weight inconsistency is simulated by loaded the model withparameters from D time steps ago, doing the forward passthen loading the model with the master weights before do-ing the backwards pass. While this was not an exact modelof PB, this setup allows for the simulation of PB’s issuesand fast iterate of potential methods to overcome the issues.This technique can also be used to simulate PB by havingdifferent delays for different layers based on the depth ofthe layer. This simulation method does not allow simul-taneously launching multiple kernels and is therefore notefficient for small batch sizes. Our simulations are doneusing a constant delay across layers. This upper boundsthe effect of weight inconsistency and delayed gradients.This setup can also be used to simulated ASGD training bymaking D a random variable which models the distributionof GPU communications with the master node in ASGD.

Page 21: Abstract arXiv:2003.11666v1 [cs.LG] 25 Mar 2020 · Pipelined Backpropagation at Scale: Training Large Models without Batches Atli Kosson yVitaliy Chiley Abhinav Venigalla Joel Hestness

Pipelined Backpropagation at Scale

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0Epoch

10

20

30

40

50

60

70

80

90

Accu

racy

PyTorchSGDSGDFill&Drain_SGD

(a) Training accuracy.

2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0Epoch

30

40

50

60

70

80

Accu

racy

PyTorchSGDSGDFill&Drain_SGD

(b) Validation accuracy.

Figure 20. Validation of the GProp framework using CIFAR-10 VGG11. Showing mean (shading is standard deviation) of ten runs.

Table 6. CIFAR-10 validation accuracy (mean±std.dev of five runs) comparing LWPvD and LWPw

D on ResNet (RN) and VGG training.

NETWORK SGDM PB PB+LWPvD+SCD PB+LWPw

D+SCD

VGG11 91.16±0.19 90.83±0.20 91.12±0.18 90.93±0.15VGG13 92.57±0.15 92.59±0.15 92.56±0.14 92.55±0.08VGG16 92.24±0.19 92.06±0.21 92.38±0.27 92.09±0.10RN20 90.63±0.31 90.44±0.24 90.92±0.25 90.85±0.41RN32 91.68±0.23 91.46±0.09 92.04±0.13 91.99±0.16RN44 92.19±0.14 91.71±0.25 92.16±0.26 92.20±0.36RN56 92.39±0.20 91.89±0.40 92.48±0.11 92.32±0.06RN110 92.77±0.22 91.81±0.15 92.41±0.16 91.85±0.16

L EXPERIMENT DETAILS

L.1 VGG Experiments

Simonyan & Zisserman (2014) do not provide a setup fortraining VGG on CIFAR-10. We adopt the VGG model,hyperparameters, and data preprocessing from Fu (2019).

L.2 GProp validation

To validate our framework implementation, we comparebatch parallel SGD, and fill & drain SGD training. Wetrained each setting, as well as the same network in PyTorch,10 times to validate similar behavior. Figure 20 shows theoptimization of the different SGD training modes for thefirst 20 epochs. Numerical precision, network initialization,and data loading / augmentation randomness makes a nu-merical comparison for distinct runs impractical. Instead weshow the mean and standard deviation of 10 runs. The dif-ferent SGD modes in GProp are consistent and also matchPyTorch’s SGD convergence.

L.3 ResNetv2

He et al. (2016b) modified the original ResNet formulationgiven by He et al. (2016a) by introducing the ResNet pre-activation block. We adopt the hyperparameters and data

preprocessing from Chiley et al. (2019). Our experimentsare done at batch size one where Batch Normalization is noteffective. We replace Batch Normalization with Group Nor-malization or Online Normalization. For ImageNet ResNet-50 training, we used an initial group size of two as outlinedin the Group Normalization paper. Wu & He (2018) donot tune Group Normalization for CIFAR-10 training. Weuse the same initial group size of two for our CIFAR-10experiments. For our Online Normalization experiments weuse the default forward and backward decay factors.

L.4 LWPvD vs LWPw

D

Table 65 shows the results of using the two variants of LWP.When combined with SC, LWPv

D outperforms LWPwD. When

the weight form is used the most recent gradient has a largeeffect on the velocity estimate used for the weight prediction.For small batch sizes this approximation might be noisydecreasing the effectiveness of LWP. A similar effect canbe observed for LWP in general (Appendix J) when verysmall momentum values are used which also leads to noisypredictions.