
OverSketched Newton: Fast Convex Optimization for Serverless Systems

Vipul Gupta1, Swanand Kadhe1, Thomas Courtade1, Michael W. Mahoney2, and Kannan Ramchandran1

1Department of EECS, University of California, Berkeley
2ICSI and Statistics Department, University of California, Berkeley

Abstract

Motivated by recent developments in serverless systems for large-scale machine learning as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale smooth and strongly-convex problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. We establish that OverSketched Newton has a linear-quadratic convergence rate, and we empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of ∼50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes.

1 Introduction

Compared to first-order optimization algorithms, second-order methods—which use the gradient as well as Hessian information—enjoy superior convergence rates, both in theory and practice. For instance, Newton's method converges quadratically for strongly convex and smooth problems, compared to the linear convergence of gradient descent. Moreover, second-order methods do not require step-size tuning, and a unit step-size provably works for most problems. These methods, however, are not vastly popular among practitioners, partly since they are more sophisticated, and partly due to their higher computational complexity per iteration. In each iteration, they require the calculation of the Hessian followed by solving a linear system involving the Hessian, and this can be prohibitive when the training data is large.

Recently, there has been a lot of interest in distributed second-order optimization methods for more traditional server-based systems [1–6]. These methods can substantially decrease the number of iterations, and hence reduce communication, which is desirable in distributed computing environments, at the cost of more computation per iteration. However, these methods are approximate, as they do not calculate the exact Hessian, due to data being stored distributedly among workers. Instead, each worker calculates an approximation of the Hessian using the data that is stored locally. Such methods, while still better than gradient descent, forego the quadratic convergence property enjoyed by the exact Newton's method.

Figure 1: Job times for 3000 AWS Lambda nodes, where the median job time is around 40 seconds, around 5% of the nodes take 100 seconds, and two nodes take as much as 375 seconds to complete the same job (figure borrowed from [11]).

In recent years, there has been tremendous growth in users performing distributed computing operations on the cloud due to extensive and inexpensive commercial offerings like Amazon Web Services (AWS), Google Cloud, Microsoft Azure, etc. Serverless platforms—such as AWS Lambda, Cloud Functions and Azure Functions—penetrate a large user base by provisioning and managing the servers on which the computation is performed. This abstracts away the need for maintaining servers, since this is done by the cloud provider and is hidden from the user—hence the name serverless. Moreover, allocation of these servers is done expeditiously, which provides greater elasticity and easy scalability. For example, up to ten thousand machines can be allocated on AWS Lambda in less than ten seconds [7–9]. Using serverless systems for large-scale computation has been the central theme of several recent works [10–13].

Due to several crucial differences between HPC/server-based and serverless architectures, existing distributed algorithms cannot, in general, be extended to serverless computing. Unlike server-based computing, the number of inexpensive workers in serverless platforms is flexible, often scaling into the thousands [9]. This heavy gain in computation power, however, comes with the disadvantage that the commodity workers in the serverless architecture are ephemeral, have low memory and do not communicate amongst themselves.1 The workers read/write data directly from/to a single data storage entity (for example, cloud storage like AWS S3).

Furthermore, unlike HPC/server-based systems, nodes in serverless systems suffer degradation due to what is known as system noise. This can be a result of limited availability of shared resources, hardware failure, network latency, etc. [14, 15]. This results in job time variability, and hence a subset of much slower nodes, often called stragglers. These stragglers significantly slow the overall computation time, especially in large or iterative jobs. In [11], the authors plotted empirical statistics of worker job times for 3000 workers on the AWS Lambda system, as shown in Fig. 1. Such experiments consistently demonstrate that at least 2% of the workers take significantly longer than the median job time, severely degrading the overall efficiency of the system.

1 For example, serverless nodes in AWS Lambda, Google Cloud Functions and Microsoft Azure Functions have a maximum memory of 3 GB, 2 GB and 1.5 GB, respectively, and a maximum runtime of 900 seconds, 540 seconds and 300 seconds, respectively (these numbers may change over time).

Moreover, each call to serverless computing platforms such as AWS Lambda requires invocation time (during which AWS assigns Lambda workers), setup time (where the required Python packages are downloaded) and communication time (where the data is downloaded from the cloud). The ephemeral nature of the workers in serverless systems requires that new workers be invoked every few iterations and data be communicated to them. Due to the aforementioned issues, first-order methods tend to perform poorly on distributed serverless architectures. In fact, their slower convergence is made worse on serverless platforms due to persistent stragglers. The straggler effect incurs a heavy slowdown due to the accumulation of tail times as a result of a subset of slow workers occurring in each iteration.

In this paper, we argue that second-order methods are highly compatible with serverless systems, which provide extensive computing power by invoking thousands of workers but are limited by the communication costs and hence the number of iterations [9, 10]. To address the challenges of ephemeral workers and stragglers in serverless systems, we propose and analyze a randomized and distributed second-order optimization algorithm, called OverSketched Newton. OverSketched Newton uses the technique of matrix sketching from Randomized Numerical Linear Algebra (RandNLA) [16, 17] to obtain a good approximation for the Hessian.

In particular, we use the sparse sketching scheme proposed in [11], which is based on the basic RandNLA primitive of approximate randomized matrix multiplication [18], for straggler-resilient Hessian calculation in serverless systems. Such randomization also provides inherent straggler resiliency while calculating the Hessian. For straggler mitigation during gradient calculation, we use the recently proposed technique of employing error-correcting codes to create redundant computation [19, 20]. We prove that OverSketched Newton has linear-quadratic convergence when the Hessian is calculated using the sparse sketching scheme proposed in [11]. Moreover, we show that at least linear convergence is assured when the optimization algorithm starts from any random point in the constraint set. Our experiments on AWS Lambda demonstrate empirically that OverSketched Newton is significantly faster than existing distributed optimization schemes and the vanilla Newton's method that calculates the exact Hessian.

1.1 Related Work

Existing Straggler Mitigation Schemes: Strategies like speculative execution have been traditionally used to mitigate stragglers in popular distributed computing frameworks like Hadoop MapReduce [21] and Apache Spark [22]. Speculative execution works by detecting workers that are running slower than expected and then allocating their tasks to new workers without shutting down the original straggling task. The worker that finishes first communicates its results. This has several drawbacks. For example, constant monitoring of tasks is required, where the worker pauses its job and provides its running status. Additionally, it is possible that a worker will straggle only at the end of the task, say, while communicating the results. By the time the task is reallocated, the overall efficiency of the system would have suffered already.

Recently, many coding-theoretic ideas have been proposed to introduce redundancy into the distributed computation for straggler mitigation [19, 20, 23–27], many of them catering to distributed matrix-vector multiplication [19, 20, 23]. In general, the idea of coded computation is to generate redundant copies of the result of the distributed computation by encoding the input data using error-correcting codes. These redundant copies can then be used to decode the output of the missing stragglers. We use tools from [20] to compute gradients in a distributed straggler-resilient manner using codes, and we compare the performance with speculative execution.

Approximate Newton Methods: Several works in the literature prove convergence guarantees for Newton's method when the Hessian is computed approximately using ideas from RandNLA [28–32]. However, these algorithms are designed for a single machine. Our goal in this paper is to use ideas from RandNLA to design a distributed approximate Newton method for a serverless system that is resilient to stragglers.

Distributed Second-Order Methods: There has been growing research interest in designing and analyzing distributed (synchronous) implementations of second-order methods [1–6]. However, these implementations are tailored for server-based distributed systems. Our focus, on the other hand, is on serverless systems. Our motivation behind considering serverless systems stems from their usability benefits, cost efficiency, and extensive and inexpensive commercial offerings [9, 10]. We implement our algorithms using the recently developed serverless framework called PyWren [9]. While there are works that evaluate existing algorithms on serverless systems [12, 33], this is the first work that proposes a large-scale distributed optimization algorithm for serverless systems. We exploit the advantages offered by serverless systems while mitigating drawbacks such as stragglers and the additional overhead per invocation of workers.

2 OverSketched Newton

2.1 Problem Formulation and Newton’s Method

We are interested in solving a problem of the following form on serverless systems in a distributed and straggler-resilient manner:

\[ f(w^*) = \min_{w \in \mathcal{C}} f(w), \tag{1} \]

where f : R^d → R is a closed and twice-differentiable convex function bounded from below, and C ⊆ R^d is a given convex and closed set. We assume that the minimizer w∗ exists and is


uniquely defined. Let γ and β denote, respectively, the minimum and maximum eigenvalues of the Hessian of f evaluated at the minimum, i.e., ∇²f(w∗). In addition, we assume that the Hessian ∇²f(w) is Lipschitz continuous with modulus L, that is, for any ∆ ∈ R^d,

\[ \|\nabla^2 f(w + \Delta) - \nabla^2 f(w)\|_{op} \le L \|\Delta\|_2, \tag{2} \]

where || · ||op is the operator norm.

In Newton's method, the update at the (t+1)-th iteration is obtained by minimizing the Taylor expansion of the objective function f(·) at w_t, within the constraint set C, that is,

\[ w_{t+1} = \arg\min_{w \in \mathcal{C}} \; f(w_t) + \nabla f(w_t)^T (w - w_t) + \frac{1}{2} (w - w_t)^T \nabla^2 f(w_t) (w - w_t). \tag{3} \]

For the unconstrained case, that is when C = Rd, Eq. (3) becomes

\[ w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t). \tag{4} \]

Given a good initial point w0 such that

\[ \|w_0 - w^*\|_2 \le \frac{\gamma}{8L}, \tag{5} \]

Newton's method satisfies the following update:

\[ \|w_{t+1} - w^*\|_2 \le \frac{2L}{\gamma} \|w_t - w^*\|_2^2, \tag{6} \]

implying quadratic convergence [34,35].

In many applications like machine learning, where the training data itself is noisy, using the exact Hessian is not necessary. Indeed, many results in the literature prove convergence guarantees for Newton's method when the Hessian is computed approximately using ideas from RandNLA for a single machine [28–31]. In particular, these methods perform a form of dimensionality reduction for the Hessian using random matrices, called sketching matrices. Many popular sketching schemes have been proposed in the literature, for example, sub-Gaussian, Hadamard, random row sampling, sparse Johnson-Lindenstrauss, etc. [16, 36].

Next, we present OverSketched Newton for solving problems of the form (1) distributedly in serverless systems using ideas from RandNLA.

2.2 OverSketched Newton

OverSketched Newton computes the full gradient in each iteration by using tools from error-correcting codes [19, 20, 24]. The exact method of obtaining the full gradient in a distributed straggler-resilient way depends on the optimization problem at hand. Our key observation is that, for several commonly encountered optimization problems, gradient computation relies on matrix-vector multiplication (see Sec. 3.2 for examples). We leverage the coded matrix multiplication technique from [20] to perform the large-scale matrix-vector multiplication in a distributed straggler-resilient manner. The main idea of coded matrix multiplication is explained in Fig. 2; detailed steps are provided in Algorithm 1.

Algorithm 1: Straggler-resilient distributed computation of Ax using codes
Input: Matrix A ∈ R^{t×s}, vector x ∈ R^s, and block size parameter b
Result: y = Ax, where y ∈ R^t is the product of matrix A and vector x
1 Initialization: Divide A into T = t/b row-blocks, each of dimension b × s
2 Encoding: Generate the coded matrix A_c in parallel using a 2D product code, by arranging the row-blocks of A in a 2D structure of dimension √T × √T and adding blocks across rows and columns to generate parities; see Fig. 2 in [20] for an illustration
3 for i = 1 to T + 2√T + 1 do
4     1. Worker W_i receives the i-th row-block of A_c, say A_c(i, :), and x from cloud storage
5     2. W_i computes y(i) = A_c(i, :) x
6     3. The master receives y(i) from worker W_i
7 end
8 Decoding: The master checks if it has received results from enough workers to reconstruct y. Once it does, it decodes y from the available results using the peeling decoder
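To make the encoding and decoding steps concrete, the following NumPy sketch mimics the toy setting of Fig. 2: two row chunks plus one parity chunk, with workers simulated serially and one straggler dropped. The function name and the straggler-simulation logic are ours, not part of the paper's implementation, and a real deployment would use the full 2D product code of Algorithm 1 rather than a single parity block.

```python
import numpy as np

def coded_matvec(A, x, straggler=None):
    """Toy coded computation of A @ x with one parity chunk, tolerating one straggler."""
    t = A.shape[0]
    A1, A2 = A[: t // 2], A[t // 2:]            # two row chunks of A
    chunks = {0: A1, 1: A2, 2: A1 + A2}          # encoding: the third worker gets the parity A1 + A2

    # each "worker" multiplies its chunk by x; the straggler never reports back
    results = {i: C @ x for i, C in chunks.items() if i != straggler}

    # decoding at the master: recover A1 x and A2 x from any two of the three results
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:                           # the worker holding A2 straggled
        y1, y2 = results[0], results[2] - results[0]
    else:                                        # the worker holding A1 straggled
        y1, y2 = results[2] - results[1], results[1]
    return np.concatenate([y1, y2])

A = np.random.randn(6, 4)
x = np.random.randn(4)
assert np.allclose(coded_matvec(A, x, straggler=1), A @ x)
```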

Similar to the gradient computation, the particular method of obtaining the Hessian in a distributed straggler-resilient way depends on the optimization problem at hand. For several commonly encountered optimization problems, Hessian computation involves matrix-matrix multiplication of a pair of large matrices (see Sec. 3.2 for several examples). For computing the large-scale matrix-matrix multiplication in parallel in serverless systems, we propose to use a straggler-resilient scheme called OverSketch from [11]. OverSketch does blocked partitioning of the input matrices and is hence more communication efficient than existing coding-based straggler mitigation schemes that do naive row-column partitioning of the input matrices [25, 26]. We note that it is well known in HPC that blocked partitioning of input matrices can lead to communication-efficient methods for distributed multiplication [11, 37, 38].

OverSketch uses a sparse sketching matrix based on the Count-Sketch [36]. It has computational efficiency and accuracy guarantees similar to those of the Count-Sketch, with two additional properties: it is amenable to distributed implementation, and it is resilient to stragglers. More specifically, the OverSketch matrix is given by [11]:

\[ S = \frac{1}{\sqrt{N}} \left( S_1, S_2, \cdots, S_{N+e} \right), \tag{7} \]

where S_i ∈ R^{n×b}, for all i ∈ [1, N+e], are i.i.d. Count-Sketch matrices with sketch dimension b, and e is the maximum number of stragglers per N+e blocks. Note that S ∈ R^{n×(m+eb)}, where m = Nb is the required sketch dimension and e is the over-provisioning parameter that provides resiliency against e stragglers per N+e workers. We assume that e = ηN for some constant redundancy factor η, 0 < η < 1.


Figure 2: Coded matrix-vector multiplication: Matrix A is divided into 2 row chunks, A1 and A2. During encoding, a redundant chunk A1 + A2 is created. Three workers obtain A1, A2 and A1 + A2 from the cloud storage S3, respectively, multiply them by x, and write the results back to the cloud. The master M can decode Ax from the results of any two workers, thus being resilient to one straggler (W2 in this case).

Figure 3: OverSketch-based approximate Hessian computation: First, the matrix A—satisfying A^T A = ∇²f(w_t)—is sketched in parallel using the sketch in (7). Then, each worker receives one block each of the sketched matrices A^T S and S^T A, multiplies them, and communicates back its result for reduction. During reduction, stragglers can be ignored by virtue of "over" sketching. For example, here the desired sketch dimension m is increased by the block-size b to obtain resiliency against one straggler for each block of H.

Each of the Count-Sketch matrices S_i is constructed (independently of the others) as follows. First, for every row j, j ∈ [n], of S_i, independently choose a column h(j) ∈ [b]. Then, select a uniformly random element from {−1, +1}, denoted by σ(j). Finally, set S_i(j, h(j)) = σ(j) and set S_i(j, l) = 0 for all l ≠ h(j). (See [11, 36] for details.) We can leverage the straggler resiliency of OverSketch to obtain the sketched Hessian in a distributed straggler-resilient manner. An illustration of OverSketch is provided in Fig. 3; see Algorithm 2 for details.
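As a concrete illustration of this construction, the snippet below builds an OverSketch matrix from i.i.d. Count-Sketch blocks as in (7) and checks the quality of the resulting Hessian approximation after e "straggler" blocks are dropped. The function names, the chosen dimensions, and the serial simulation of stragglers are ours.

```python
import numpy as np

def count_sketch(n, b, rng):
    """One Count-Sketch matrix S_i in R^{n x b}: a single random +-1 entry per row."""
    S = np.zeros((n, b))
    cols = rng.integers(0, b, size=n)            # h(j): column chosen for row j
    signs = rng.choice([-1.0, 1.0], size=n)      # sigma(j): random sign for row j
    S[np.arange(n), cols] = signs
    return S

def oversketch(n, b, N, e, seed=0):
    """OverSketch matrix S = (1/sqrt(N)) [S_1, ..., S_{N+e}] in R^{n x (m + e*b)}, m = N*b."""
    rng = np.random.default_rng(seed)
    return np.hstack([count_sketch(n, b, rng) for _ in range(N + e)]) / np.sqrt(N)

# sanity check: drop e "straggler" blocks, keep N, and compare A^T S S^T A with A^T A
n, d, b, N, e = 2000, 20, 25, 20, 2
S = oversketch(n, b, N, e)
survivors = np.random.default_rng(1).choice(N + e, size=N, replace=False)
cols = np.concatenate([np.arange(k * b, (k + 1) * b) for k in survivors])
A = np.random.randn(n, d)
H_exact = A.T @ A
H_sketch = (S[:, cols].T @ A).T @ (S[:, cols].T @ A)
print("relative error:", np.linalg.norm(H_sketch - H_exact) / np.linalg.norm(H_exact))
```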

The model update for OverSketched Newton is given by

\[ w_{t+1} = \arg\min_{w \in \mathcal{C}} \; f(w_t) + \nabla f(w_t)^T (w - w_t) + \frac{1}{2} (w - w_t)^T H_t (w - w_t), \tag{8} \]

where H_t = A_t^T S_t S_t^T A_t, A_t is the square root of the Hessian ∇²f(w_t), and S_t is an independent realization of (7) at the t-th iteration.
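For the unconstrained case C = R^d, the update (8) reduces to a Newton step with the sketched Hessian. The following minimal sketch of that step runs serially; in OverSketched Newton the products S_t^T A_t and the subsequent block multiplication would be carried out by serverless workers (Algorithm 2), and the small d × d system would be solved at the master. The helper name is ours.

```python
import numpy as np

def oversketched_newton_step(w, grad, A_t, S_t):
    """One unconstrained update w - H_t^{-1} grad, with H_t = A_t^T S_t S_t^T A_t.

    A_t is a square root of the Hessian at w (A_t^T A_t = Hessian) and S_t is an
    independent OverSketch realization, e.g. built with the oversketch() helper above.
    """
    A_sketched = S_t.T @ A_t                     # m x d sketched square root
    H_t = A_sketched.T @ A_sketched              # d x d approximate Hessian
    return w - np.linalg.solve(H_t, grad)        # Newton step with the sketched Hessian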

2.3 Local and Global Convergence Guarantees

Next, we prove the following local convergence guarantee for OverSketched Newton, which uses the sketch matrix in (7) and the full gradient for the approximate Hessian computation. Such a convergence is referred to as "local" in the literature [28, 30] since it assumes that the initial starting point w_0 satisfies (5). Let m be the resultant sketch dimension of S obtained after ignoring the stragglers, that is, m = Nb.

Algorithm 2: Approximate Hessian calculation on serverless systems using OverSketch
Input: Matrix A ∈ R^{n×d}, required sketch dimension m, straggler tolerance e, block-size b. Define N = m/b
Result: H ≈ A^T A
1 Sketching: Use the sketch in Eq. (7) to obtain Ã = S^T A distributedly (see Algorithm 5 in [11] for details)
2 Block partitioning: Divide Ã into an (N + e) × d/b grid of b × b blocks
3 Computation phase: Each worker takes one block of Ã and one block of Ã^T and multiplies them. This step invokes (N + e)d²/b² workers, where N + e workers compute each block of H
4 Termination: Stop the computation for a block of H when any N out of its N + e workers have returned their results
5 Reduction phase: Invoke d²/b² workers to aggregate results from the computation phase, where each worker calculates one block of H

Theorem 2.1 (Local convergence). Let w∗ be the optimal solution of (1), and let γ and β be the minimum and maximum eigenvalues of ∇²f(w∗), respectively. Let ε ∈ (0, γ/(8β)] and µ > 0. Then, using an OverSketch matrix with sketch dimension m = Ω(d^{1+µ}/ε²) and number of column-blocks N + e = Θ_µ(1/ε), the updates for OverSketched Newton with initialization satisfying (5) follow

\[ \|w_{t+1} - w^*\|_2 \le \frac{25L}{8\gamma} \|w_t - w^*\|_2^2 + \frac{5\epsilon\beta}{\gamma} \|w_t - w^*\|_2, \]

with probability at least 1 − O(1/d).

Proof. See Section 5.1.

Theorem 2.1 implies that the convergence is linear-quadratic in the error ∆_t = w_t − w∗. Initially, when ||∆_t||_2 is large, the first term of the RHS will dominate and the convergence will be quadratic, that is, ||∆_{t+1}||_2 ≲ (25L/8γ) ||∆_t||_2². In later stages, when ||w_t − w∗||_2 becomes sufficiently small, the second term of the RHS will start to dominate and the convergence will be linear, that is, ||∆_{t+1}||_2 ≲ (5εβ/γ) ||∆_t||_2. At this stage, the sketch dimension can be increased to reduce ε, diminishing the effect of the linear term and improving the convergence rate in practice.

The above guarantee implies that when w_0 is sufficiently close to w∗, OverSketched Newton has a linear-quadratic convergence rate. However, in general, a good starting point may not be available. Thus, we prove the following "global" convergence guarantee, which shows that OverSketched Newton converges from any random initialization w_0 ∈ R^d with high probability. In the following, we focus our attention on the unconstrained case, i.e., C = R^d. We assume that f(·) is smooth and strongly convex on R^d, that is,

\[ kI \preceq \nabla^2 f(w) \preceq KI, \tag{9} \]

for some 0 < k ≤ K < ∞ and all w ∈ R^d. In addition, we use line-search to choose the step-size, and the update is given below:

\[ w_{t+1} = w_t + \alpha_t p_t, \]


where

\[ p_t = -H_t^{-1} \nabla f(w_t), \tag{10} \]

and, for some constant β ∈ (0, 1/2],

\[ \alpha_t = \max \{ \alpha \;:\; \alpha \le 1 \text{ and } f(w_t + \alpha p_t) \le f(w_t) + \alpha \beta \, p_t^T \nabla f(w_t) \}. \tag{11} \]

Recall that H_t = A_t^T S_t S_t^T A_t, where A_t is the square root of the Hessian ∇²f(w_t), and S_t is an independent realization of (7) at the t-th iteration.

Note that (11) can be solved approximately in single-machine systems using Armijo backtracking line search [35]. In Section 2.4, we describe how to implement distributed line-search in serverless systems when the data is stored in the cloud.

Theorem 2.2 (Global convergence). Let w∗ be the optimal solution of (1), where f(·) satisfies the constraints in (9). Let ε and µ be positive constants. Then, using an OverSketch matrix with sketch dimension m = Ω(d^{1+µ}/ε²) and number of column-blocks N + e = Θ_µ(1/ε), the updates for OverSketched Newton, for any w_t ∈ R^d, satisfy

\[ f(w_{t+1}) - f(w^*) \le (1 - \rho)(f(w_t) - f(w^*)), \]

with probability at least 1 − 1/poly(d), where ρ = 2α_t βk / (K(1+ε)). Moreover, the step-size satisfies α_t ≥ 2(1−β)(1−ε)k / K.

Proof. See Section 5.2.

Theorem 2.2 guarantees the global convergence of OverSketched Newton, starting from any initial estimate w_0 ∈ R^d, to the optimal solution w∗ with at least a linear rate.

2.4 Distributed Line Search

In our experiments in Section 4, line-search was not required for the three synthetic and real datasets on which OverSketched Newton was employed. However, that might not be true in general, as the global guarantee in Theorem 2.2 depends on line-search. In general, line-search may be required until the current model estimate is close to w∗, that is, until it satisfies the condition in (5). Here, we describe a line-search procedure for distributed serverless optimization, which is inspired by the line-search method from [5] for server-based systems.2

To solve for the step-size α_t as described in the optimization problem in (11), we set β = 0.1 and choose a candidate set S = {4^0, 4^{-1}, · · · , 4^{-5}}. After the master calculates the descent direction p_t in the t-th iteration according to (10), the i-th worker calculates f_i(w_t + αp_t) for all values of α in the candidate set S, where f_i(·) depends on the local data available at the i-th worker and f(·) = Σ_i f_i(·). The master then sums the results from the workers to obtain f(w_t + αp_t) for all values of α in S and finds the largest α that satisfies the Armijo condition in (11).

2 Note that codes can be used to mitigate stragglers during distributed line-search in a manner similar to the gradient computation phase.

Note that the line search requires an additional round of communication where the master communicates p_t to the workers through the cloud. The workers then calculate the functions f_i(·) using local data and send the results back to the master. Finally, the master finds the right step-size and updates the model estimate w_t.
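A serial sketch of this line-search round is given below; the shards f_i(·) stand in for the per-worker objective pieces, and the helper name and the exact candidate set are ours. In the serverless implementation each f_i(w + αp) evaluation would be returned by a worker and summed at the master.

```python
import numpy as np

def distributed_armijo_step(f_shards, w, p, grad, beta=0.1):
    """Largest candidate step satisfying the Armijo condition (11); f(w) = sum_i f_i(w)."""
    candidates = [4.0 ** (-k) for k in range(6)]            # {4^0, 4^-1, ..., 4^-5}, decreasing
    f_w = sum(f_i(w) for f_i in f_shards)                    # master aggregates f(w_t)
    slope = float(p @ grad)                                   # p_t^T grad f(w_t); negative for a descent direction
    for alpha in candidates:
        f_trial = sum(f_i(w + alpha * p) for f_i in f_shards) # workers report f_i(w_t + alpha * p_t)
        if f_trial <= f_w + alpha * beta * slope:
            return alpha                                       # first (largest) admissible step size
    return candidates[-1]                                      # fall back to the smallest candidate
```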

3 OverSketched Newton on Serverless Systems

3.1 Logistic Regression using OverSketched Newton on Serverless Systems

The optimization problem for supervised learning using Logistic Regression takes the form

\[ \min_{w \in \mathbb{R}^d} \; f(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right) + \frac{\lambda}{2} \|w\|_2^2. \tag{12} \]

Here, x_1, · · · , x_n ∈ R^{d×1} and y_1, · · · , y_n ∈ R are the training sample vectors and labels, respectively. The goal is to learn the feature vector w∗ ∈ R^{d×1}. Let X = [x_1, x_2, · · · , x_n] ∈ R^{d×n} and y = [y_1, · · · , y_n] ∈ R^{n×1} be the feature and label matrices, respectively. Using Newton's method to solve (12) first requires evaluation of the gradient

\[ \nabla f(w) = \frac{1}{n} \sum_{i=1}^{n} \frac{-y_i x_i}{1 + e^{y_i w^T x_i}} + \lambda w. \]

Calculation of ∇f(w) involves two matrix-vector products, α = X^T w and ∇f(w) = (1/n) Xβ + λw, where β_i = −y_i / (1 + e^{y_i α_i}) for all i ∈ [1, · · · , n]. These matrix-vector products are performed distributedly. Faster convergence can be obtained by second-order methods, which additionally compute the Hessian H = (1/n) X Λ X^T + λ I_d, where Λ is a diagonal matrix with entries given by Λ(i, i) = e^{y_i α_i} / (1 + e^{y_i α_i})². The product X Λ X^T is computed approximately in a distributed straggler-resilient manner using (7).
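As a single-machine reference for these formulas, the snippet below computes the logistic-regression gradient and the OverSketch-ed Hessian; the two matrix-vector products and the sketched matrix-matrix product are exactly the operations that Algorithm 3 offloads to serverless workers. The sketching matrix S is assumed to be an OverSketch matrix as in (7) (e.g., from the oversketch() helper sketched earlier), and the function name is ours.

```python
import numpy as np

def logistic_grad_and_sketched_hessian(X, y, w, lam, S):
    """X: d x n feature matrix, y: labels in {-1, +1}, w: current model, S: n x m' sketch."""
    n = X.shape[1]
    alpha = X.T @ w                                   # first distributed matrix-vector product
    sig = 1.0 / (1.0 + np.exp(y * alpha))             # = 1 / (1 + e^{y_i alpha_i})
    grad = X @ (-y * sig) / n + lam * w               # second distributed matrix-vector product
    gamma = sig * (1.0 - sig)                         # = e^{y_i alpha_i} / (1 + e^{y_i alpha_i})^2
    A = np.sqrt(gamma)[:, None] * X.T                 # square root of X diag(gamma) X^T
    A_sk = S.T @ A                                    # OverSketch-ed square root (Algorithm 2)
    H = A_sk.T @ A_sk / n + lam * np.eye(X.shape[0])  # approximate regularized Hessian
    return grad, H
```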

We provide a detailed description of OverSketched Newton for large-scale logistic regression on serverless systems in Algorithm 3. Here, steps 4, 8, and 14 are computed in parallel on AWS Lambda. All other steps are simple vector operations that can be performed locally at the master, for instance, the user's laptop. We assume that the number of features is small enough to fit the Hessian matrix H locally at the master and compute the update w_{t+1} = w_t − H^{-1}∇f(w_t) efficiently. Steps 4 and 8 are executed in a straggler-resilient fashion using the coding scheme in [20], as illustrated in Fig. 2 and described in detail in Algorithm 1.

We use the coding scheme in [20] since the encoding can be implemented in parallel and requires less communication per worker compared to other schemes, for example the schemes in [19, 26] that use Maximum Distance Separable (MDS) codes. Moreover, the decoding scheme takes linear time and is applicable to real-valued matrices. Note that since the example matrix X is constant in this problem, the encoding of X is done only once before starting the optimization algorithm. Thus, the encoding cost can be amortized over iterations. Moreover, decoding the resultant product vector requires negligible time and space, even when n scales into the millions.

Algorithm 3: OverSketched Newton: Logistic Regression for Serverless Computing
1 Input (stored in cloud storage): Example matrix X ∈ R^{d×n} and vector y ∈ R^{n×1}, regularization parameter λ, number of iterations T, sketch S as defined in Eq. (7)
2 Initialization: Define w_1 = 0_{d×1}, β = 0_{n×1}, γ = 0_{n×1}; encode X and X^T as described in Algorithm 1
3 for t = 1 to T do
4     α = X^T w_t ; // Compute in parallel using Algorithm 1
5     for i = 1 to n do
6         β_i = −y_i / (1 + e^{y_i α_i}) ;
7     end
8     g = Xβ ; // Compute in parallel using Algorithm 1
9     ∇f(w_t) = (1/n) g + λ w_t ;
10    for i = 1 to n do
11        γ(i) = e^{y_i α_i} / (1 + e^{y_i α_i})² ;
12    end
13    A = √diag(γ) X^T
14    H = A^T A ; // Compute in parallel using Algorithm 2
15    H = (1/n) H + λ I_d ;
16    w_{t+1} = w_t − H^{-1} ∇f(w_t) ;
17 end
Result: w∗ = w_{T+1}

The same is, however, not true for the matrix multiplication in the Hessian calculation (step 14 of Algorithm 3), as the matrix A changes in each iteration; thus, encoding costs would be incurred in every iteration if error-correcting codes were used. Moreover, encoding and decoding a huge matrix stored in the cloud incurs heavy communication costs and becomes prohibitive. Motivated by this, we use OverSketch in step 14, as described in Algorithm 2, to calculate an approximate matrix multiplication, and hence the Hessian, efficiently in serverless systems with inbuilt straggler resiliency.3

3 We also evaluate the exact Hessian-based algorithm with speculative execution, i.e., recomputing the straggling jobs, and compare it with OverSketched Newton in Sec. 4.


3.2 Example Problems

In this section, we describe several commonly encountered optimization problems (besides logistic regression) that can be solved using OverSketched Newton.

Ridge Regularized Linear Regression: The optimization problem is

\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{2n} \|X^T w - y\|_2^2 + \frac{\lambda}{2} \|w\|_2^2. \tag{13} \]

The gradient in this case can be written as (1/n) X(β − y) + λw, where β = X^T w, and the training matrix X and label vector y were defined previously. The Hessian is given by ∇²f(w) = (1/n) XX^T + λI. For n ≫ d, this can be computed approximately using the sketch matrix in (7).
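A small NumPy sketch of the ridge case, reusing the same pattern (the names and helper structure are ours; S stands for an OverSketch matrix as in (7)):

```python
import numpy as np

def ridge_grad_and_sketched_hessian(X, y, w, lam, S):
    """X: d x n with n >> d. Gradient and sketched Hessian of (1/2n)||X^T w - y||^2 + (lam/2)||w||^2."""
    n = X.shape[1]
    beta = X.T @ w
    grad = X @ (beta - y) / n + lam * w
    Xs = S.T @ X.T                                   # sketch the n x d square-root factor X^T
    H = Xs.T @ Xs / n + lam * np.eye(X.shape[0])     # approximates (1/n) X X^T + lam I
    return grad, H
```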

Linear programming via interior point methods: The following linear program can be solved using OverSketched Newton:

\[ \underset{Ax \le b}{\text{minimize}} \;\; c^T x, \tag{14} \]

where x ∈ R^{m×1}, c ∈ R^{m×1}, b ∈ R^{n×1}, and A ∈ R^{n×m} is the constraint matrix with n > m. In algorithms based on interior point methods, the following sequence of problems is solved using Newton's method:

\[ \min_{x \in \mathbb{R}^m} f(x) = \min_{x \in \mathbb{R}^m} \left( \tau c^T x - \sum_{i=1}^{n} \log(b_i - a_i x) \right), \tag{15} \]

where a_i is the i-th row of A, and τ is increased geometrically such that, when τ is very large, the logarithmic term does not affect the objective value and serves its purpose of keeping all intermediate solutions inside the constraint region. The update in the t-th iteration is given by x_{t+1} = x_t − (∇²f(x_t))^{-1}∇f(x_t), where x_t is the estimate of the solution in the t-th iteration. The gradient can be written as ∇f(x) = τc + A^T β, where β_i = 1/(b_i − α_i) and α = Ax.

The Hessian for the objective in (15) is given by

\[ \nabla^2 f(x) = A^T \, \mathrm{diag}\!\left( \frac{1}{(b_i - \alpha_i)^2} \right) A. \tag{16} \]

The square root of the Hessian is given by ∇²f(x)^{1/2} = diag(1/|b_i − α_i|) A. The computation of the Hessian requires O(nm²) time and is the bottleneck in each iteration. Thus, we can use sketching to mitigate stragglers while evaluating the Hessian efficiently, i.e., ∇²f(x) ≈ (S^T ∇²f(x)^{1/2})^T (S^T ∇²f(x)^{1/2}), where S is the OverSketch matrix defined in (7).
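A compact NumPy version of these barrier quantities is sketched below (illustrative only; S is again an OverSketch-style sketching matrix and the function name is ours):

```python
import numpy as np

def lp_barrier_grad_and_sketched_hessian(A, b, c, x, tau, S):
    """Gradient and sketched Hessian of f(x) = tau * c^T x - sum_i log(b_i - a_i x), cf. (15)-(16)."""
    resid = b - A @ x                                # b_i - alpha_i; strictly positive inside the polytope
    grad = tau * c + A.T @ (1.0 / resid)             # tau*c + A^T beta with beta_i = 1/(b_i - alpha_i)
    root = A / np.abs(resid)[:, None]                # Hessian square root: diag(1/|b_i - alpha_i|) A
    root_sk = S.T @ root                             # sketch the n x m square root
    H = root_sk.T @ root_sk                          # approximates A^T diag(1/(b_i - alpha_i)^2) A
    return grad, H
```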

Lasso Regularized Linear Regression: The optimization problem takes the following form:

\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{2} \|Xw - y\|_2^2 + \lambda \|w\|_1, \tag{17} \]


where X ∈ R^{n×d} is the measurement matrix, the vector y ∈ R^n contains the measurements, λ ≥ 0, and d ≫ n. To solve (17), we consider its dual variation

\[ \min_{\|X^T z\|_\infty \le \lambda, \; z \in \mathbb{R}^n} \; \frac{1}{2} \|y - z\|_2^2, \]

which is amenable to interior point methods and can be solved by optimizing the following sequence of problems, where τ is increased geometrically:

\[ \min_{z} f(z) = \min_{z} \left( \frac{\tau}{2} \|y - z\|_2^2 - \sum_{j=1}^{d} \log(\lambda - x_j^T z) - \sum_{j=1}^{d} \log(\lambda + x_j^T z) \right), \]

where x_j is the j-th column of X. The gradient can be expressed in a few matrix-vector multiplications as ∇f(z) = τ(z − y) + X(β − γ), where β_i = 1/(λ − α_i), γ_i = 1/(λ + α_i), and α = X^T z. Similarly, the Hessian can be written as ∇²f(z) = τI + X Λ X^T, where Λ is a diagonal matrix whose entries are given by Λ_ii = 1/(λ − α_i)² + 1/(λ + α_i)² for all i ∈ [1, d].
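The corresponding NumPy sketch for the Lasso dual barrier is below; here the Hessian is formed exactly for clarity, and in the serverless setting the product X diag(Λ) X^T would be OverSketched just like the Hessians above (the function name is ours).

```python
import numpy as np

def lasso_dual_grad_and_hessian(X, y, z, lam, tau):
    """Gradient and Hessian of the barrier objective for the Lasso dual; X: n x d, z: n-vector."""
    alpha = X.T @ z                                     # alpha_j = x_j^T z, one entry per column of X
    beta = 1.0 / (lam - alpha)
    gamma = 1.0 / (lam + alpha)
    grad = tau * (z - y) + X @ (beta - gamma)
    diag = beta ** 2 + gamma ** 2                       # entries of the diagonal weight matrix Lambda
    H = tau * np.eye(X.shape[0]) + (X * diag) @ X.T     # tau*I + X diag(Lambda) X^T
    return grad, H
```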

Other common problems to which OverSketched Newton is applicable include linear regression, Support Vector Machines (SVMs), semidefinite programs, etc.

4 Experimental Results

In this section, we evaluate OverSketched Newton on AWS Lambda using real-world and synthetic datasets, and we compare it with state-of-the-art distributed optimization algorithms.4 We use the serverless computing framework PyWren, developed recently in [9]. Our experiments are focused on logistic regression, a popular supervised learning problem, but they can be reproduced for the other problems described in Section 3.2.

4.1 Comparison with Existing Second-order Methods

For comparison of OverSketched Newton with existing distributed optimization schemes, we choose the recently proposed Globally Improved Approximate Newton Direction (GIANT) [5] as a representative algorithm. This is because GIANT boasts a better convergence rate than many existing distributed second-order methods for linear and logistic regression when n ≫ d. In GIANT, and other similar distributed second-order algorithms, the training data is evenly divided among workers, and the algorithms proceed in two stages. First, the workers compute partial gradients using local training data, which are then aggregated by the master to compute the exact gradient. Second, the workers receive the full gradient to calculate their local second-order estimate, which is then averaged by the master. An illustration is shown in Fig. 4.

For straggler mitigation in such server-based algorithms, [24] proposes a scheme for coding gradient updates called gradient coding, where the data at each worker is repeated multiple times to compute redundant copies of the gradient; see Figure 5b for an illustration. Figure 5a illustrates the scheme that waits for all workers, and Figure 5c illustrates the approach of ignoring stragglers. We use the three schemes for dealing with stragglers illustrated in Figure 5 during the two stages of GIANT, and we compare their convergence with OverSketched Newton. We also evaluate the exact Newton's method with speculative execution for straggler mitigation, that is, reassigning and recomputing the work of straggling workers, and compare its convergence rate with OverSketched Newton.

4 A working implementation of OverSketched Newton is available at https://github.com/vvipgupta/OverSketchedNewton

Figure 4: GIANT: the two-stage second-order distributed optimization scheme with four workers. First, the master calculates the full gradient by aggregating local gradients from the workers. Second, the master calculates the approximate Hessian using local second-order updates from the workers.

Figure 5: Different gradient descent schemes in server-based systems in the presence of stragglers. (a) Simple gradient descent, where each worker stores a one-fourth fraction of the whole data and sends back a partial gradient corresponding to its own data to the master. (b) Gradient coding described in [24] with W3 straggling; to get the global gradient, the master computes g1 + g2 + g3 + g4 = 3(g1 + g2/2) − (g2/2 − g3) + (g4 − 2g1). (c) Mini-batch gradient descent, where the stragglers are ignored during gradient aggregation and the gradient is later scaled according to the size of the mini-batch.


Figure 6: Convergence comparison of GIANT (employed with different straggler mitigation methods), the exact Newton's method, and OverSketched Newton for logistic regression on AWS Lambda. The synthetic dataset considered has 300,000 examples and 3000 features.

Remark 1. We note that conventional distributed second-order methods for server-based systems—which distribute training examples evenly across workers (such as [1–5])—typically find a "local approximation" of the second-order update at each worker and then aggregate it. OverSketched Newton, on the other hand, utilizes the massive storage and compute power in serverless systems to find a "global approximation". As we demonstrate next using extensive experiments, it performs significantly better than existing server-based methods.

4.2 Experiments on Synthetic Data

In Figure 6, we present our experimental results on a randomly generated dataset with n = 300,000 and d = 3000 for logistic regression on AWS Lambda. Each column x_i ∈ R^d, for all i ∈ [1, n], is sampled uniformly at random from the cube [−1, 1]^d. The labels y_i are sampled from the logistic model, that is, P[y_i = 1] = 1/(1 + exp(x_i^T w + b)), where the weight vector w and bias b are generated randomly from the normal distribution.

The orange, blue and red curves demonstrate the convergence for GIANT with the full gradient (that waits for all the workers), gradient coding, and mini-batch gradient (that ignores the stragglers while calculating gradient and second-order updates) schemes, respectively. The purple and green curves depict the convergence of the exact Newton's method and OverSketched Newton, respectively. The gradient coding scheme is applied for one straggler, that is, the data is repeated twice at each worker. We use 60 Lambda workers for executing GIANT in parallel. Similarly, for Newton's method, we use 60 workers for the matrix-vector multiplications in steps 4 and 8 of Algorithm 3, 3600 workers for exact Hessian computation, and 600 workers for sketched Hessian computation with a sketch dimension of 10d = 30,000 in step 14 of Algorithm 3.

An important point to note from Fig. 6 is that the uncoded scheme (that is, the one that waits for all stragglers) has the worst performance. The implication is that good straggler/fault mitigation algorithms are essential for computing in the serverless setting. Secondly, the mini-batch scheme outperforms the gradient coding scheme by 25%. This is because gradient coding requires additional communication of data to serverless workers (twice when coding for one straggler, see [24] for details) at each invocation of AWS Lambda. On the other hand, the exact Newton's method converges much faster than GIANT, even though it requires more time per iteration.

The green plot shows the convergence of OverSketched Newton. It can be noted that the number of iterations required for convergence is similar for OverSketched Newton and exact Newton (which exactly computes the Hessian), but OverSketched Newton converges in almost half the time due to significant time savings during the computation of the Hessian, which is the computational bottleneck in each iteration.

Figure 7: Comparison of (a) training and (b) testing errors for several Newton-based schemes with straggler mitigation, for logistic regression on the EPSILON dataset. Testing error closely follows training error.

Figure 8: Comparison of (a) training and (b) testing errors for several Newton-based schemes with straggler mitigation, for logistic regression on the web page dataset. Testing error closely follows training error.


4.3 Experiments on Real Data

In Figure 7, we use the EPSILON classification dataset obtained from [39], with n = 400,000 and d = 2000. We plot training and testing errors for logistic regression for the schemes described in the previous section. Here, we use 100 workers for GIANT, and 100 workers for the matrix-vector multiplications in the gradient calculation of OverSketched Newton. We use gradient coding designed for three stragglers in GIANT. This scheme performs worse than uncoded GIANT that waits for all the stragglers, due to the repetition of training data at the workers. Hence, one can conclude that the communication costs dominate the straggling costs. In fact, it can be observed that the mini-batch gradient scheme that ignores the stragglers outperforms the gradient coding and uncoded schemes for GIANT.

During exact Hessian computation, we use 10,000 serverless workers with speculative execution to mitigate stragglers (i.e., recomputing the straggling jobs), compared to OverSketched Newton, which uses 1500 workers with a sketch dimension of 15d = 30,000. OverSketched Newton requires a significantly smaller number of workers, since once the square root of the Hessian is sketched in a distributed fashion, it can be copied into the local memory of the master due to dimension reduction, and the Hessian can be calculated locally. Testing error follows training error closely, and the important conclusions remain the same as in Figure 6. OverSketched Newton significantly outperforms GIANT and exact Newton-based optimization in terms of running time.

We repeated the above experiments for classification on the web page dataset [39], with n = 49,749 and d = 300. We used 30 workers for each iteration in GIANT and for any matrix-vector multiplications. Exact Hessian calculation invokes 900 workers, as opposed to 300 workers for OverSketched Newton, where the sketch dimension was 10d = 3000. The results are shown in Figure 8 and follow the trends witnessed heretofore.

4.4 Coded computing versus Speculative Execution

In Figure 9, we compare the effect of the straggler mitigation schemes, namely speculative execution (that is, restarting the jobs of straggling workers) and coded computing, on the convergence rate during training and testing. We regard OverSketch-based matrix multiplication as a coding scheme in which some redundancy is introduced during "over" sketching for matrix multiplication. There are four different cases, corresponding to gradient and Hessian calculation using either speculative execution or coded computing. For speculative execution, we wait for at least 90% of the workers to return (this works well as the number of stragglers is generally less than 5%) and restart the jobs that have not returned by this point.

For both the exact Hessian and OverSketched Newton, using codes for distributed gradient computation outperforms speculative-execution-based straggler mitigation. Moreover, computing the Hessian using OverSketch is significantly better than exact computation in terms of running time, as calculating the Hessian is the computational bottleneck in each iteration.


Figure 9: Comparison of speculative execution and coded computing schemes for computing the gradient and Hessian on the EPSILON dataset: (a) training error, (b) testing error. OverSketched Newton, that is, coding the gradients and sketching the Hessian, outperforms all other schemes.

4.5 Comparison with Gradient Descent on AWS Lambda

In Figure 10, we compare gradient descent with OverSketched Newton for logistic regression on the EPSILON dataset. The statistics for OverSketched Newton were obtained as described in the previous section. We observed that for first-order methods, there is only a slight difference in convergence for a mini-batch gradient when the batch size is 95%. Hence, for gradient descent, we use 100 workers in each iteration while ignoring the stragglers. The step-size was chosen using backtracking line search, which determines the maximum amount to move along a given descent direction. As can be noted, OverSketched Newton significantly outperforms gradient descent.

4.6 Comparison with Server-based Optimization

In Fig. 11, we compare OverSketched Newton on AWS Lambda with the existing distributed optimization algorithm GIANT on a server-based system (AWS EC2 in our case). The results are plotted for synthetically generated data for logistic regression. For server-based programming, we use the Message Passing Interface (MPI) with one c3.8xlarge master and 60 t2.medium workers on AWS EC2. In [10], the authors observed that many large-scale linear algebra operations on serverless systems take at least 30% more time compared to MPI-based computation on server-based systems. However, as shown in Fig. 11, we observe that OverSketched Newton outperforms the MPI-based optimization that uses the existing state-of-the-art optimization algorithm. This is because OverSketched Newton exploits the flexibility and massive scale at its disposal in serverless systems, and thus produces a better approximation of the


Figure 10: Convergence rate comparison of gradient descent and OverSketched Newton for the EPSILON dataset.

Figure 11: Convergence rate comparison of AWS EC2 and AWS Lambda, which use GIANT and OverSketched Newton, respectively, to solve a large-scale logistic regression problem.

second-order update than GIANT.5

5 We do not compare with exact Newton in server-based systems since the training data is large and stored in the cloud. Thus, computing the exact Hessian would require a large number of workers (e.g., we use 10,000 workers for exact Newton on the EPSILON dataset), which is infeasible in the server-based setting as it incurs a heavy cost.

5 Proofs

5.1 Proof of Theorem 2.1

As w_{t+1} is the optimal solution of the RHS in (8), we have, for any w in C,

\[
\begin{aligned}
& f(w_t) + \nabla f(w_t)^T (w - w_t) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \\
& \qquad \ge f(w_t) + \nabla f(w_t)^T (w_{t+1} - w_t) + \frac{1}{2} (w_{t+1} - w_t)^T H_t (w_{t+1} - w_t), \\
\Rightarrow \; & \nabla f(w_t)^T (w - w_{t+1}) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) - \frac{1}{2} (w_{t+1} - w_t)^T H_t (w_{t+1} - w_t) \ge 0, \\
\Rightarrow \; & \nabla f(w_t)^T (w - w_{t+1}) + \frac{1}{2} \left[ (w - w_t)^T H_t (w - w_{t+1}) + (w - w_{t+1})^T H_t (w_{t+1} - w_t) \right] \ge 0.
\end{aligned}
\]

Substituting w = w∗ in the above expression and writing ∆_t = w_t − w∗, we get

\[
\begin{aligned}
& -\nabla f(w_t)^T \Delta_{t+1} + \frac{1}{2} \left[ \Delta_{t+1}^T H_t (2\Delta_t - \Delta_{t+1}) \right] \ge 0, \\
\Rightarrow \; & \Delta_{t+1}^T H_t \Delta_t - \nabla f(w_t)^T \Delta_{t+1} \ge \frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1}.
\end{aligned}
\]

Now, due to the optimality of w∗, we have ∇f(w∗)^T ∆_{t+1} ≥ 0. Hence, we can write

\[ \Delta_{t+1}^T H_t \Delta_t - (\nabla f(w_t) - \nabla f(w^*))^T \Delta_{t+1} \ge \frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1}. \]


Next, substituting ∇f(w_t) − ∇f(w∗) = (∫_0^1 ∇²f(w∗ + p(w_t − w∗)) dp)(w_t − w∗) in the above inequality, we get

\[ \Delta_{t+1}^T (H_t - \nabla^2 f(w_t)) \Delta_t + \Delta_{t+1}^T \left( \nabla^2 f(w_t) - \int_0^1 \nabla^2 f(w^* + p(w_t - w^*)) \, dp \right) \Delta_t \ge \frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1}. \]

Using the Cauchy–Schwarz inequality on the LHS above, we get

\[ \|\Delta_{t+1}\|_2 \|\Delta_t\|_2 \left( \|H_t - \nabla^2 f(w_t)\|_{op} + \int_0^1 \|\nabla^2 f(w_t) - \nabla^2 f(w^* + p(w_t - w^*))\|_{op} \, dp \right) \ge \frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1}. \]

Now, using the Lipschitz property of f(·) in (2) in the inequality above, we get

\[
\begin{aligned}
\frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1} &\le \|\Delta_{t+1}\|_2 \|\Delta_t\|_2 \|H_t - \nabla^2 f(w_t)\|_{op} + L \|\Delta_{t+1}\|_2 \|\Delta_t\|_2^2 \int_0^1 (1 - p) \, dp, \\
\Rightarrow \; \frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1} &\le \|\Delta_{t+1}\|_2 \left( \|\Delta_t\|_2 \|H_t - \nabla^2 f(w_t)\|_{op} + \frac{L}{2} \|\Delta_t\|_2^2 \right). \qquad (18)
\end{aligned}
\]

To complete the proof, we will need the following two lemmas. The first lemma provides an upper bound on the first term of the RHS.

Lemma 5.1. Let H_t = A_t^T S_t S_t^T A_t, where S_t is the sparse sketch matrix in (7) with sketch dimension m = Ω(d^{1+µ}/ε²) and N = Θ_µ(1/ε). Then, the following holds

\[ \|H_t - \nabla^2 f(w_t)\|_{op} \le \epsilon \|A_t\|_{op}^2 = \epsilon \|\nabla^2 f(w_t)\|_{op} \tag{19} \]

with probability at least 1 − 1/poly(d).

Proof. Note that for the positive definite matrix ∇²f(w_t) = A_t^T A_t, we have ||A_t||²_op = ||∇²f(w_t)||_op. Moreover,

\[ \|H_t - \nabla^2 f(w_t)\|_{op} = \|A_t^T (S_t S_t^T - I) A_t\|_{op} \le \|A_t\|_{op}^2 \|S_t S_t^T - I\|_{op}. \]

Next, we note that N is the number of non-zero elements per row in the sketch S_t in (7) after ignoring stragglers. Moreover, we use Theorem 8 in [40] to bound the singular values of the sparse sketch S_t in (7) with sketch dimension m = Ω(d^{1+µ}/ε²) and N = Θ(1/ε). It says that P(∀ x ∈ R^n, ||S_t^T x||_2 ∈ (1 ± ε/3)||x||_2) > 1 − 1/d^α, where α > 0 and the constants in Θ(·) and Ω(·) depend on α and µ. Thus, ||S_t^T x||_2 ∈ (1 ± ε/3)||x||_2, which implies that

\[ \|S_t^T x\|_2^2 \in (1 + \epsilon^2/9 \pm 2\epsilon/3) \|x\|_2^2, \]

with probability at least 1 − 1/d^α. For ε ≤ 1/2, this leads to the following inequality:

\[ \|S_t^T x\|_2^2 \in (1 \pm \epsilon) \|x\|_2^2 \;\Rightarrow\; |x^T (S_t S_t^T - I) x| \le \epsilon \|x\|_2^2 \quad \forall \, x \in \mathbb{R}^n. \]

This implies that ||S_t S_t^T − I||_op ≤ ε with probability 1 − 1/d^α, which proves the desired result.


The next lemma provides a lower bound on the LHS in (18).

Lemma 5.2. For the sketch matrix S_t ∈ R^{n×m} in (7) with sketch dimension m = Ω(d/ε²), the following holds

\[ x^T S_t S_t^T x \ge (1 - \epsilon) \|x\|_2^2 \quad \forall \, x \in \mathbb{R}^n, \tag{20} \]

with probability at least 1 − 1/d.

Proof. The result follows directly by substituting A = x^T, B = x and θ = 1/d in Theorem 4.1 of [11].

Using Lemma 5.1 to bound the RHS of (18), we have, with probability at least 1 − 1/poly(d),

\[ \frac{1}{2} \Delta_{t+1}^T H_t \Delta_{t+1} \le \|\Delta_{t+1}\|_2 \left( \epsilon \|\nabla^2 f(w_t)\|_{op} \|\Delta_t\|_2 + \frac{L}{2} \|\Delta_t\|_2^2 \right). \]

Since H_t = A_t^T S_t S_t^T A_t and the sketch dimension is m = Ω(d^{1+µ}/ε²), using Lemma 5.2 in the above inequality, we get, with probability at least 1 − O(1/d),

\[
\begin{aligned}
\frac{1}{2} (1 - \epsilon) \|A_t \Delta_{t+1}\|_2^2 &\le \|\Delta_{t+1}\|_2 \left( \epsilon \|\nabla^2 f(w_t)\|_{op} \|\Delta_t\|_2 + \frac{L}{2} \|\Delta_t\|_2^2 \right), \\
\Rightarrow \; \frac{1}{2} (1 - \epsilon) \Delta_{t+1}^T \nabla^2 f(w_t) \Delta_{t+1} &\le \|\Delta_{t+1}\|_2 \left( \epsilon \|\nabla^2 f(w_t)\|_{op} \|\Delta_t\|_2 + \frac{L}{2} \|\Delta_t\|_2^2 \right).
\end{aligned}
\]

Now, since γ and β are the minimum and maximum eigenvalues of ∇²f(w∗), we get

\[ \frac{1}{2} (1 - \epsilon) \|\Delta_{t+1}\|_2 (\gamma - L \|\Delta_t\|_2) \le \epsilon (\beta + L \|\Delta_t\|_2) \|\Delta_t\|_2 + \frac{L}{2} \|\Delta_t\|_2^2 \]

by the Lipschitzness of ∇²f(w), that is, |∆_{t+1}^T (∇²f(w_t) − ∇²f(w∗)) ∆_{t+1}| ≤ L ||∆_t||_2 ||∆_{t+1}||_2². Rearranging, for ε ≤ γ/(8β) < 1/2, we get

\[ \|\Delta_{t+1}\|_2 \le \frac{4\epsilon\beta}{\gamma - L\|\Delta_t\|_2} \|\Delta_t\|_2 + \frac{5L}{2(\gamma - L\|\Delta_t\|_2)} \|\Delta_t\|_2^2. \tag{21} \]

Now, we use induction to prove that ||∆_t||_2 ≤ γ/(5L). We can verify the base case using the initialization condition from (5), i.e., ||∆_0||_2 ≤ γ/(5L). Now, assuming that ||∆_t||_2 ≤ γ/(5L) and using it in (21), we get

\[
\begin{aligned}
\|\Delta_{t+1}\|_2 &\le \frac{4\epsilon\beta}{\gamma} \times \frac{\gamma}{5L} + \frac{5L}{2\gamma} \times \frac{\gamma^2}{25L^2} \\
&= \frac{4\epsilon\beta}{5L} + \frac{\gamma}{10L} \\
&\le \frac{\gamma}{L} \left( \frac{1}{10} + \frac{1}{10} \right) \le \frac{\gamma}{5L},
\end{aligned}
\]

where the last inequality uses the fact that ε ≤ γ/(8β). Thus, by induction, ||∆_t||_2 ≤ γ/(5L) for all t ≥ 0. Using this in (21), we get the desired result, that is,

\[ \|\Delta_{t+1}\|_2 \le \frac{5\epsilon\beta}{\gamma} \|\Delta_t\|_2 + \frac{25L}{8\gamma} \|\Delta_t\|_2^2. \]


5.2 Proof of Theorem 2.2

Let us define wα = wt + αpt, where the descent direction pt is defined in (10) as $p_t = -H_t^{-1}\nabla f(w_t)$. Also, from Lemma 5.1, we have

\[
\lambda_{\min}(H_t) \ge (1-\epsilon)\,\lambda_{\min}(\nabla^2 f(w_t)) \quad \text{and} \quad \lambda_{\max}(H_t) \le (1+\epsilon)\,\lambda_{\max}(\nabla^2 f(w_t)),
\]

with probability at least 1 − 1/poly(d). Using the above inequalities in the strong convexity and smoothness condition of f(·) from (9), we get

\[
(1-\epsilon)\,k\,I \preceq H_t \preceq (1+\epsilon)\,K\,I, \tag{22}
\]

with probability at least 1− 1/poly(d).
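Explicitly, the eigenvalue bounds above follow from the bound $\|S_t S_t^T - I\|_{op} \le \epsilon$ established in the proof of Lemma 5.1: for any $u \in \mathbb{R}^d$,
\[
u^T H_t\, u = u^T A_t^T S_t S_t^T A_t\, u = \|S_t^T A_t u\|_2^2 \in (1 \pm \epsilon)\,\|A_t u\|_2^2 = (1 \pm \epsilon)\, u^T \nabla^2 f(w_t)\, u,
\]
which gives the sandwich on the eigenvalues of $H_t$ and, combined with the strong convexity and smoothness condition $kI \preceq \nabla^2 f(w_t) \preceq KI$ from (9), yields (22).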

Next, we show that there exists an α > 0 such that the Armijo line search condition in (11) is satisfied. From the smoothness of f(·), we get (see [41], Theorem 2.1.5)

\begin{align*}
f(w_\alpha) - f(w_t) &\le (w_\alpha - w_t)^T \nabla f(w_t) + \frac{K}{2}\,\|w_\alpha - w_t\|^2 \\
&= \alpha\, p_t^T \nabla f(w_t) + \alpha^2\,\frac{K}{2}\,\|p_t\|^2.
\end{align*}

Now, for wα to satisfy the Armijo rule, α should satisfy

\begin{align*}
\alpha\, p_t^T \nabla f(w_t) + \alpha^2\,\frac{K}{2}\,\|p_t\|^2 &\le \alpha\beta\, p_t^T \nabla f(w_t) \\
\Rightarrow\ \alpha\,\frac{K}{2}\,\|p_t\|^2 &\le (\beta - 1)\, p_t^T \nabla f(w_t) \\
\Rightarrow\ \alpha\,\frac{K}{2}\,\|p_t\|^2 &\le (1 - \beta)\, p_t^T H_t\, p_t,
\end{align*}

where the last inequality follows from the definition of pt. Now, using the lower bound from (22), wα satisfies the Armijo rule for all

\[
\alpha \le \frac{2(1-\beta)(1-\epsilon)\,k}{K}.
\]

Hence, we can always find an $\alpha_t \ge \frac{2(1-\beta)(1-\epsilon)k}{K}$ using backtracking line search such that wt+1 satisfies the Armijo condition, that is,

\begin{align*}
f(w_{t+1}) - f(w_t) &\le \alpha_t \beta\, p_t^T \nabla f(w_t) \\
&= -\alpha_t \beta\, \nabla f(w_t)^T H_t^{-1} \nabla f(w_t) \\
&\le -\frac{\alpha_t \beta}{\lambda_{\max}(H_t)}\,\|\nabla f(w_t)\|^2,
\end{align*}

which in turn implies

\[
f(w_t) - f(w_{t+1}) \ge \frac{\alpha_t \beta}{K(1+\epsilon)}\,\|\nabla f(w_t)\|^2 \tag{23}
\]


with probability at least 1 − 1/poly(d). Here the last inequality follows from the bound in (22). Moreover, strong convexity of f(·) as in (9) implies (see [41], Theorem 2.1.10)

\[
f(w_t) - f(w^*) \le \frac{1}{2k}\,\|\nabla f(w_t)\|^2.
\]

Using the inequality from (23) in the above inequality, we get

\begin{align*}
f(w_t) - f(w_{t+1}) &\ge \frac{2\alpha_t \beta k}{K(1+\epsilon)}\,\big(f(w_t) - f(w^*)\big) \\
&\ge \rho\,\big(f(w_t) - f(w^*)\big),
\end{align*}

where $\rho = \frac{2\alpha_t \beta k}{K(1+\epsilon)}$. Rearranging, we get

\[
f(w_{t+1}) - f(w^*) \le (1-\rho)\,\big(f(w_t) - f(w^*)\big)
\]

with probability at least 1− 1/poly(d), which proves the desired result.
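The backtracking procedure invoked in this proof is standard Armijo backtracking along the (approximate) Newton direction. The sketch below is a minimal self-contained illustration, not the paper's serverless implementation; the quadratic least-squares objective, the Gaussian data, the use of the exact Hessian in place of the sketched $H_t$, and the shrinking factor 0.5 are all illustrative assumptions:

import numpy as np

def backtracking_armijo(f, grad_f, w, p, beta=0.1, shrink=0.5, alpha0=1.0):
    # Return a step size alpha satisfying the Armijo condition
    #   f(w + alpha*p) - f(w) <= alpha * beta * p^T grad_f(w),
    # found by repeatedly multiplying alpha by the shrinking factor.
    alpha, g = alpha0, grad_f(w)
    while f(w + alpha * p) - f(w) > alpha * beta * p.dot(g):
        alpha *= shrink
    return alpha

# Illustrative smooth and strongly convex problem: f(w) = 0.5*||A w - b||^2.
rng = np.random.default_rng(1)
n, d = 500, 20
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad_f = lambda w: A.T @ (A @ w - b)

w = np.zeros(d)
for t in range(5):
    H = A.T @ A                          # exact Hessian here; OverSketched Newton would use a sketched H_t
    p = -np.linalg.solve(H, grad_f(w))   # descent direction p_t = -H_t^{-1} grad f(w_t), as in (10)
    alpha = backtracking_armijo(f, grad_f, w, p, beta=0.1)
    w = w + alpha * p
    print(f"iteration {t}: f(w) = {f(w):.6e}, step size = {alpha}")

For this quadratic objective with the exact Newton direction, the unit step already satisfies the Armijo test whenever the parameter beta is below 1/2, so the loop accepts alpha = 1 and the update reduces to the plain Newton step.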

6 Conclusions

We proposed OverSketched Newton, a straggler-resilient distributed optimization algorithm for serverless systems. It uses the idea of matrix sketching from RandNLA to find an approximate second-order update in each iteration. We proved that OverSketched Newton has a linear-quadratic convergence rate, where the contribution of the linear term can be made to diminish by increasing the sketch dimension. By exploiting the massive scalability of serverless systems, OverSketched Newton produces a global approximation of the second-order update. Empirically, this translates into faster convergence than state-of-the-art distributed optimization algorithms on AWS Lambda.

Acknowledgments

This work was partially supported by NSF grants CCF-1748585 and CNS-1748692 to SK, and NSF grants CCF-1704967 and CCF-0939370 (Center for Science of Information) to TC, and ARO, DARPA, NSF, and ONR grants to MWM, and NSF Grant CCF-1703678 to KR. The authors would like to additionally thank AWS for providing promotional cloud credits for research.

References

[1] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method,” in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. II–1000–II–1008, JMLR.org, 2014.


[2] V. Smith, S. Forte, C. Ma, M. Takac, M. I. Jordan, and M. Jaggi, “Cocoa: A general framework for communication-efficient distributed optimization,” arXiv preprint arXiv:1611.02189, 2016.

[3] Y. Zhang and X. Lin, “Disco: Distributed optimization for self-concordant empirical loss,” in Proceedings of the 32nd International Conference on Machine Learning (F. Bach and D. Blei, eds.), vol. 37 of Proceedings of Machine Learning Research, (Lille, France), pp. 362–370, PMLR, 07–09 Jul 2015.

[4] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “On variance reduction in stochastic gradient descent and its asynchronous variants,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, (Cambridge, MA, USA), pp. 2647–2655, MIT Press, 2015.

[5] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney, “Giant: Globally improved approximate Newton method for distributed optimization,” arXiv preprint arXiv:1709.03528, 2017.

[6] C. Duenner, A. Lucchi, M. Gargiani, A. Bian, T. Hofmann, and M. Jaggi, “A distributed second-order algorithm you can trust,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, (Stockholmsmässan, Stockholm, Sweden), pp. 1358–1366, PMLR, 10–15 Jul 2018.

[7] I. Baldini, P. C. Castro, K. S.-P. Chang, P. Cheng, S. J. Fink, V. Ishakian, N. Mitchell, V. Muthusamy, R. M. Rabbah, A. Slominski, and P. Suter, “Serverless computing: Current trends and open problems,” CoRR, vol. abs/1706.03178, 2017.

[8] J. Spillner, C. Mateos, and D. A. Monge, “Faaster, better, cheaper: The prospect of serverless scientific computing and HPC,” in Latin American High Performance Computing Conference, pp. 154–168, Springer, 2017.

[9] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht, “Occupy the cloud: distributed computing for the 99%,” in Proceedings of the 2017 Symposium on Cloud Computing, pp. 445–451, ACM, 2017.

[10] V. Shankar, K. Krauth, Q. Pu, E. Jonas, S. Venkataraman, I. Stoica, B. Recht, and J. Ragan-Kelley, “numpywren: serverless linear algebra,” ArXiv e-prints, Oct. 2018.

[11] V. Gupta, S. Wang, T. Courtade, and K. Ramchandran, “Oversketch: Approximate matrix multiplication for the cloud,” IEEE International Conference on Big Data, Seattle, WA, USA, 2018.

[12] A. Aytekin and M. Johansson, “Harnessing the Power of Serverless Runtimes for Large-Scale Optimization,” arXiv e-prints, p. arXiv:1901.03161, Jan. 2019.


[13] L. Feng, P. Kudva, D. D. Silva, and J. Hu, “Exploring serverless computing for neural network training,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), vol. 00, pp. 334–341, Jul 2018.

[14] J. Dean and L. A. Barroso, “The tail at scale,” Commun. ACM, vol. 56, pp. 74–80, Feb. 2013.

[15] T. Hoefler, T. Schneider, and A. Lumsdaine, “Characterizing the influence of system noise on large-scale applications by simulation,” in Proc. of the ACM/IEEE Int. Conf. for High Perf. Comp., Networking, Storage and Analysis, pp. 1–11, 2010.

[16] M. W. Mahoney, Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, Boston: NOW Publishers, 2011.

[17] P. Drineas and M. W. Mahoney, “RandNLA: Randomized numerical linear algebra,” Communications of the ACM, vol. 59, pp. 80–90, 2016.

[18] P. Drineas, R. Kannan, and M. W. Mahoney, “Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication,” SIAM Journal on Computing, vol. 36, pp. 132–157, 2006.

[19] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.

[20] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, “Straggler-proofing massive-scale distributed matrix multiplication with d-dimensional product codes,” in IEEE Int. Sym. on Information Theory (ISIT), 2018, IEEE, 2018.

[21] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, pp. 107–113, Jan. 2008.

[22] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10, 2010.

[23] Y. Yang, P. Grover, and S. Kar, “Coded distributed computing for inverse problems,” in Advances in Neural Information Processing Systems 30, pp. 709–719, Curran Associates, Inc., 2017.

[24] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3368–3376, PMLR, 2017.

[25] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in IEEE Int. Sym. on Information Theory (ISIT), 2017, pp. 2418–2422, IEEE, 2017.


[26] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4403–4413, Curran Associates, Inc., 2017.

[27] J. Zhu, Y. Pu, V. Gupta, C. Tomlin, and K. Ramchandran, “A sequential approximation framework for coded distributed optimization,” in Annual Allerton Conf. on Communication, Control, and Computing, 2017, pp. 1240–1247, IEEE, 2017.

[28] M. Pilanci and M. J. Wainwright, “Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence,” SIAM Jour. on Opt., vol. 27, pp. 205–245, 2017.

[29] F. Roosta-Khorasani and M. W. Mahoney, “Sub-Sampled Newton Methods I: Globally Convergent Algorithms,” arXiv e-prints, p. arXiv:1601.04737, Jan. 2016.

[30] F. Roosta-Khorasani and M. W. Mahoney, “Sub-Sampled Newton Methods II: Local Convergence Rates,” arXiv e-prints, p. arXiv:1601.04738, Jan. 2016.

[31] R. Bollapragada, R. Byrd, and J. Nocedal, “Exact and Inexact Subsampled Newton Methods for Optimization,” arXiv e-prints, p. arXiv:1609.08502, Sep 2016.

[32] A. S. Berahas, R. Bollapragada, and J. Nocedal, “An Investigation of Newton-Sketch and Subsampled Newton Methods,” arXiv e-prints, p. arXiv:1705.06211, May 2017.

[33] V. Ishakian, V. Muthusamy, and A. Slominski, “Serving deep learning models in a serverless platform,” arXiv e-prints, p. arXiv:1710.08460, Oct. 2017.

[34] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.

[35] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.

[36] D. P. Woodruff, “Sketching as a tool for numerical linear algebra,” Found. Trends Theor. Comput. Sci., vol. 10, pp. 1–157, 2014.

[37] E. Solomonik and J. Demmel, “Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms,” in Proceedings of the 17th International Conference on Parallel Processing, pp. 90–109, 2011.

[38] R. A. van de Geijn and J. Watts, “Summa: Scalable universal matrix multiplication algorithm,” tech. rep., 1995.

[39] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.


[40] J. Nelson and H. L. Nguyen, “Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings,” in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 117–126, Oct 2013.

[41] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 ed., 2014.
