arxiv:1810.07896v3 [cs.ds] 20 oct 2020

44
Solving Linear Programs in the Current Matrix Multiplication Time Michael B. Cohen * MIT & Microsoft Research [email protected] Yin Tat Lee * UW & Microsoft Research [email protected] Zhao Song * UT-Austin & UW [email protected] Abstract This paper shows how to solve linear programs of the form min Ax=b,x0 c > x with n variables in time O * ((n ω + n 2.5-α/2 + n 2+1/6 ) log(n/δ)) where ω is the exponent of matrix multiplication, α is the dual exponent of matrix multiplication, and δ is the relative accuracy. For the current value of ω 2.37 and α 0.31, our algorithm takes O * (n ω log(n/δ)) time. When ω =2, our algorithm takes O * (n 2+1/6 log(n/δ)) time. Our algorithm utilizes several new concepts that we believe may be of independent interest: • We define a stochastic central path method. • We show how to maintain a projection matrix WA > (AW A > ) -1 A W in sub-quadratic time under 2 multiplicative changes in the diagonal matrix W . * Work was done while the first two authors were visiting Microsoft Research and hosted by Sébastien Bubeck, and the third author was visiting University of Washington. The authors would like to express their sincere gratitude to Rasmus Kyng for his questions about inverse maintenance that initiated this project. A preliminary version of this paper appeared in the Proceedings of 51th Annual ACM Symposium on Theory of Computing (STOC 2019). The full version of this paper appeared in the Journal of the Association for Computing Machinery (JACM 2020). arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Upload: others

Post on 29-Dec-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Solving Linear Programs inthe Current Matrix Multiplication Time

Michael B. Cohen∗

MIT & Microsoft [email protected]

Yin Tat Lee∗

UW & Microsoft [email protected]

Zhao Song∗

UT-Austin & [email protected]

Abstract

This paper shows how to solve linear programs of the form minAx=b,x≥0 c>x with n variablesin time

O∗((nω + n2.5−α/2 + n2+1/6) log(n/δ))

where ω is the exponent of matrix multiplication, α is the dual exponent of matrix multiplication,and δ is the relative accuracy. For the current value of ω ∼ 2.37 and α ∼ 0.31, our algorithmtakes O∗(nω log(n/δ)) time. When ω = 2, our algorithm takes O∗(n2+1/6 log(n/δ)) time.

Our algorithm utilizes several new concepts that we believe may be of independent interest:• We define a stochastic central path method.• We show how to maintain a projection matrix

√WA>(AWA>)−1A

√W in sub-quadratic

time under `2 multiplicative changes in the diagonal matrix W .

∗Work was done while the first two authors were visiting Microsoft Research and hosted by Sébastien Bubeck, andthe third author was visiting University of Washington. The authors would like to express their sincere gratitude toRasmus Kyng for his questions about inverse maintenance that initiated this project. A preliminary version of thispaper appeared in the Proceedings of 51th Annual ACM Symposium on Theory of Computing (STOC 2019). Thefull version of this paper appeared in the Journal of the Association for Computing Machinery (JACM 2020).

arX

iv:1

810.

0789

6v3

[cs

.DS]

20

Oct

202

0

Page 2: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

1 Introduction

Linear programming is one of the key problems in computer science. In both theory and practice,many problems can be reformulated as linear programs to take advantage of fast algorithms. For anarbitrary linear program minAx=b,x≥0 c

>x with n variables and d constraints1, the fastest algorithmtakes O∗(

√d · nnz(A) + d2.5)2 where nnz(A) is the number of non-zeros in A [LS14, LS15].

For the generic case d = Ω(n) we focus in this paper, the current fastest runtime is dominatedby O∗(n2.5). This runtime has not been improved since a result by Vaidya on 1989 [Vai87, Vai89b].The n2.5 bound originated from two factors: the cost per iteration n2 and the number of iterations√n. The n2 cost per iteration looks optimal because this is the cost to compute Ax for a dense

A. Therefore, many efforts [Kar84, Ren88, NN89, Vai89a, LS14] have been focused on decreasingthe number of iterations while maintaining the cost per iteration. As for many important linearprograms (and convex programs), the number of iterations has been decreased, including maximumflow [Mad13, Mad16], minimum cost flow [CMSV17], geometric median [CLM+16], matrix scalingand balancing [CMTV17], and `p regression [BCLL18]. Unfortunately, beating

√n iterations (or√

d when d n) for the general case remains one of the biggest open problems in optimization.Avoiding this open problem, this paper develops a stochastic central path method that has a

runtime of O∗(nω + n2.5−α/2 + n2+1/6), where ω is the exponent of matrix multiplication and α isthe dual exponent of matrix multiplication3. For the current value of ω ∼ 2.38 and α ∼ 0.31, theruntime is simply O∗(nω). This achieves a natural barrier for solving linear programs because linearsystems are a special case of linear program and this is the best known runtime for solving linearsystems. Although [AFLG15, AW18, Alm19] showed that the exact and similar4 approaches usedin [CW87, Wil12, DS13, LG14] cannot give a bound on ω better than 2.168, we believe improvingthe additive 2 + 1/6 term remains important for understanding linear programming. A recent work[JSWZ20] improved the 2 + 1/6 term to 2 + 1/18.

Our method is a stochastic version of the short step central path method. This short stepmethod takes O∗(

√n) steps and each step decreases xisi by a 1− 1/

√n factor for all i where x is

the primal variable and s is the dual variable [Ren88] (See the definition of s in (1)). This resultsin O∗(

√n) × n = O∗(n1.5) coordinate updates. Our method takes the same number of steps but

only updates O(√n) coordinates each step. Therefore, we only update O∗(n) coordinates in total,

which is nearly optimal.Our framework is efficient enough to take a much smaller step while maintaining the same

running time. For the current value of ω ∼ 2.38, we show how to obtain the same runtime of O∗(nω)by taking O∗(n) steps and O(1) coordinates update per steps. This is because the complexity ofeach step decreases proportionally when the step size decreases. Beyond the cost per iteration, weremark that our algorithm is one of the very few central path algorithms [PRT02, Mad13, Mad16]that does not maintain xisi close to some ideal vector in `2 norm. We are hopeful that our stochasticmethod and our proof will be useful for future research on interior point methods.

1Throughout this paper, we assume there is no redundant constraints and hence n ≥ d. Note that papers indifferent communities uses different symbols to denote the number of variables and constraints in a linear program.

2We use O∗ to hide no(1) and logO(1)(1/δ) factors and O to hide logO(1)(n/δ) factors.3The dual exponent of matrix multiplication α is the supremum among all a ≥ 0 such that it takes n2+o(1) time

to multiply an n× n matrix by an n× na matrix.4Improving the matrix multiplication constant boils down to constructing/analyzing the tensors in better sense.

Those work about limitation of matrix multiplication constant explore the exact same tensor and also variation oftensor in the previous work. For more details, we refer the readers to matrix multiplication literatures.

1

Page 3: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

1.1 Related Work

Interior point method has a long history, for more detailed surveys, we refer the readers to [Wri97,Ye97, Ren01, RTV05, Meg12, Ter13]. This paper is in part inspired by the use of data-structure inLaplacian solvers [ST04, KMP10, KMP11, CKM+11, KOSZ13, CKM+14, KLP+16, KS16, CKK+18,KPSZ18], in particular the cycle update in [KOSZ13].

In a few recent follow-ups of this paper, the techniques developed in this work are generalized to amore broad class of optimization problems, i.e., Empirical Risk Minimization [LSZ19], Cutting planemethod [JLSW20], semi-definite programming [JKL+20], deep neural network training [BPSW20,CLP+20]. A deterministic variant of our algorithm has been developed [Bra20], a sketching variantof our algorithm has been developed [SY20], a streaming variant has been developed [LSZ20], theadditive 1/6 term has been improved to 1/18 [JSWZ20], and the runtime has been improved tonearly linear time for dense linear programs with n d [BLSS20].

Matrix vector multiplication is a subtask of our iterative algorithm for solving linear programs.Online matrix-vector multiplication [LW17, HKNS15, CKL18] is closely related to our problem, butusually the computational model is different than our setting.

2 Results and Techniques

Theorem 2.1 (Main result). Given a linear program minAx=b,x≥0 c>x with no redundant con-

straints. Assume that the polytope has diameter R in `1 norm, namely, for any x ≥ 0 with Ax = b,we have ‖x‖1 ≤ R.

Then, for any 0 < δ ≤ 1, Main(A, b, c, δ) outputs x ≥ 0 such that

c>x ≤ minAx=b,x≥0

c>x+ δ · ‖c‖∞R and ‖Ax− b‖1 ≤ δ ·

R∑i,j

|Ai,j |+ ‖b‖1

in expected time (

nω+o(1) + n2.5−α/2+o(1) + n2+1/6+o(1))· log(

n

δ)

where ω is the exponent of matrix multiplication, α is the dual exponent of matrix multiplication.For the current value of ω ∼ 2.38 and α ∼ 0.31, the expected time is simply nω+o(1) log(nδ ).

See [Ren88] and [LS13, Sec E, F] on the discussion on converting an approximation solutionto an exact solution. For integral A, b, c, it suffices to pick δ = 2−O(L) to get an exact solutionwhere L = log(1 + dmax + ‖c‖∞ + ‖b‖∞) is the bit complexity and dmax is the largest absolutevalue of the determinant of a square sub-matrix of A. For many combinatorial problems, L =O(log(n+ ‖b‖∞ + ‖c‖∞)).

In this paper, we assume all floating point calculations are done exactly for simplicity. In general,the algorithm can be carried out with O(L) bits of accuracy. See [Ren88] for some discussions onthe numerical stability of the interior point methods.

If T (n) is the current cost of matrix multiplication and inversion with T (n) ∼ n2.38, our runtimeis simply O(T (n) log n log(nδ )). The log(nδ ) comes from iteration count and the log n factor comesfrom the doubling trick (|yπ(1.5r)| ≥ (1− 1/ log n)|yπ(r)|) in the projection maintenance section. Weleft the problem of obtaining O(T (n) log(nδ )) as an open problem.

2

Page 4: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Finally, we note that our runtime holds for any square and rectangular matrix multiplicationalgorithm as long as ω ≤ 3 − α (See Lemma A.4)5 For example, Strassen algorithm together witha simple rectangular multiplication algorithm gives a runtime of roughly n2.807.

2.1 Central Path Method

Our algorithm relies on two new ingredients: stochastic central path and projection maintenance.The central path method considers the linear programs

minAx=b,x≥0

c>x (primal) and maxA>y≤c

b>y (dual)

with A ∈ Rd×n. Any solution of the linear program satisfies the following optimality conditions:

xisi = 0 for all i, (1)Ax = b,

A>y + s = c,

xi, si ≥ 0 for all i.

We call (x, s, y) feasible if it satisfies the last three equations above. For any feasible (x, s, y),the duality gap of (x, s, y) is

∑i xisi. The central path method finds a solution of the linear

program by following the central path which uniformly decrease the duality gap. The central path(xt, st, yt) ∈ Rn+n+d is a path parameterized by t and defined by

xt,ist,i = t for all i, (2)Axt = b,

A>yt + st = c,

xt,i, st,i ≥ 0 for all i.

where xt,i is the i-th coordinate of xt and st,i is the i-th coordinate of st. It is known [YTM94] howto transform linear programs by adding O(n) many variables and constraints so that:

• The optimal solution remains the same.• The central path at t = 1 is near (1n, 1n, 0d) where 1n and 0d are all 1 and all 0 vectors with

lengths n and d.• It is easy to convert an approximate solution of the transformed program to the original one.

For completeness, a theoretical version of such result is included in Lemma A.6. This result showsthat it suffices to move gradually (x1, s1, y1) to (xt, st, yt) for small enough t.

2.1.1 Short Step Central Path Method

The short step central path method maintains xisi = µi for some vector µ such that∑i

(µi − t)2 = O(t2) for some scalar t > 0. (3)

Since the duality gap is∑

i µi, it suffices to find x and s satisfying the above equation with smallenough t. There are many variants of central path methods. We will focus on the version thatdecreases t and takes a step of µ at the same time. The purpose of moving µ is to maintain the

5[CGLZ20] proved a stronger result ω + 0.5ωα ≤ 3.

3

Page 5: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

invariant (3) and the purpose of decreasing t is decrease the duality gap, which is roughly nt. Onenatural way to maintain the invariant (3) is to do a gradient descent step on the energy

∑i(µi− t)2

defined in (3), namely, moving µ to µ− h(µ− t) with step size h6.More generally, say we want to move from µ to µ+δµ, we approximate the term (x+δx)i(s+δs)i

by xisi + xiδs,i + siδx,i and obtain the following system:

Xδs + Sδx = δµ,

Aδx = 0, (4)

A>δy + δs = 0,

where X = diag(x) and S = diag(s). This equation is the linear approximation of the original goal(moving from µ to µ+ δµ), and that the step is explicitly given by the formula

δx =X√XS

(I − P )1√XS

δµ and δs =S√XS

P1√XS

δµ, (5)

where P =√

XS A> (AX

S A>)−1

A√

XS is an orthogonal projection and the formulas X√

XS, XS , · · · are

the diagonal matrices of the corresponding vectors.In turns out that one can decrease t by 1 − 1√

na multiplicative factor every iteration while

maintaining the invariant (3). This requires O(√n) iterations to converge. Combining this with

the inverse maintenance technique [Vai87], this gives a total runtime of n2.5. More precisely, thealgorithm maintains the invariant

∑i(µi − t)2 = O(t2) by making steps bring µi closer to t while

taking steps to decrease µi uniformly. The progress of the whole algorithm is measured by t becausethe duality gap of (xt, st, yt) is bounded by nt.

2.1.2 Stochastic Central Path Method

This part discusses how to modify the short step central path to decrease the cost per iterationto roughly nω−

12 . Since our goal is to implement a central path method in sub-quadratic time per

iteration, we do not even have the budget to compute Ax every iterations. Therefore, instead ofmaintaining

(AXS A>)−1 as shown in previous papers, we will study the problem of maintaining a

projection matrix P =√

XS A> (AX

S A>)−1

A√

XS due to the formula of δx and δs (5).

However, even if the projection matrix P is given explicitly for free, it is difficult to multiplythe dense projection matrix with a dense vector δµ in time o(n2). To avoid moving along a denseδµ, we move along an O(k) sparse direction δµ defined by

δµ,i =

δµ,i/pi, with probability pidef= k ·

(δ2µ,i∑l δ

2µ,l

+ 1n

);

0, else.(6)

The sparse direction is defined so that we are moving in the same direction in expectation (E[δµ,i] =

δµ,i) and that the direction has as small variance as possible (E[δ2µ,i] ≤

∑i δ

2µ,i

k ). If the projectionmatrix is given explicitly, we can apply the projection matrix on δµ in time O(nk). This paper

6The classical view of central path method is to take a Newton step on the system (2), which turns out to be sameas taking a gradient step on the energy defined in (3). However, our main algorithm will choose a different energyand this gradient descent view is crucial for designing our algorithm.

4

Page 6: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

picks k ∼ √n and the sum of the cost of projection vector multiplications in the whole algorithm isabout nk2 = n2.

During the whole algorithm, we maintain a projection matrix

P =

√X

SA>(AX

SA>)−1

A

√X

S

for vectors x and s such that xi and si are multiplicative approximations of xi and si respectivelyfor all i. Since we maintain the projection at a nearby point (x, s), our stochastic step x← x+ δx,s← s+ δs and y ← y + δy are defined by

Xδs + Sδx = δµ,

Aδx = 0, (7)

A>δy + δs = 0,

which is different from (4) on both sides of the first equation. Note that this system uses X andS because we have only maintained this projection matrix. The main goal of Section 4 is to showX = Θ(X) and S = Θ(S) is good enough for our interior point method. Similar to (5), Lemma 4.2shows that

δx =X√XS

(I − P )1√XS

δµ and δs =S√XS

P1√XS

δµ. (8)

The previously fastest algorithm involves maintaining the matrix inverse (AXS A>)−1 using sub-

space embedding techniques [Sar06, CW13, NN13] and leverage score sampling [SS11]. In this paper,we maintain the projection directly using lazy updates.

The key departure from the central path we present is that we can only maintain

0.9t ≤ µi = xisi ≤ 1.1t for some t > 0

instead of µ close to t in `2 norm. We will further explain the proof in Section 4.1.

2.2 Projection Maintenance via Lazy Update

The projection matrix we maintain is of the form√WA>

(AWA>

)−1A√W where W = diag(x/s).

For intuition, we only explain how to maintain the matrix Mwdef= A>(AWA>)−1A for the short

step central path step here. In this case, we have∑

i

(wnewi −wiwi

)2= O(1) for each step. Given this,

there are mainly two extreme cases, w changes uniformly on all coordinates and w changes only ona few coordinates.

If the changes(wnewi −wiwi

)2is uniform across all the coordinates, then wnew

i = (1 ± 1√n

)wi forall i. Since it takes

√n steps to change all coordinates by a constant factor and we only need to

maintain Mv for some vi = Θ(wi) for all i, we can update the matrix every√n steps. Hence, the

average cost per iteration of maintaining the projection matrix is nω−12 , which is exactly what we

desired.For the other extreme case when w changes on only a few coordinates, only

√n coordinates are

changed by a constant factor during all√n iterations. In this case, instead of updating Mw every

step, we can compute Mwh online by the Woodbury matrix identity.

5

Page 7: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Fact 2.2 ([Woo50]). The Woodbury matrix identity is

(M + UCV )−1 = M−1 −M−1U(C−1 + VM−1U)−1VM−1.

Let S ⊂ [n] denote the set of coordinates that is changed by more than a constant factor andr = |S|. Using the identity above, we have that

Mwnew = Mw − (Mw)S(∆−1S,S + (Mw)S,S)−1((Mw)S)>, (9)

where ∆ = diag(wnew−w), (Mw)S ∈ Rn×r is the r columns from S ofMw and (Mw)S,S ,∆S,S ∈ Rr×rare the r rows and columns from S of Mw and ∆.

As long as there are only few coordinates violating vi = Θ(wi), (9) can be applied onlineefficiently. In another case, we can use (9) instead to update the matrix Mw and the cost isdominated by multiplying a n× n matrix with a n× r matrix.

Theorem 2.3 (Rectangular matrix multiplication, [LGU18]). Let the dual exponent of matrix mul-tiplication α be the supremum among all a ≥ 0 such that it takes n2+o(1) time to multiply an n× nmatrix by an n× na matrix.

Then, for any n ≥ r, multiplying an n× r with an r × n matrix or n× n with n× r takes time

n2+o(1) + rω−21−αn2−α(ω−2)

1−α +o(1).

Furthermore, we have α > 0.31389.

See Lemma A.5 for the origin of the formula. Since the cost of multiplying n × n matrix by an × 1 matrix is same as the cost for n × n with n × n0.31, (9) should be used to update at leastn0.31 coordinates. In the extreme case only few wi are changing, we only need to update the matrixn

12−0.31 times during the whole algorithm and each takes n2 time, and hence the total cost is less

than nω for the current value of ω ∼ 2.37.In previous papers [Kar84, Vai89b, NN91, NN94, LS14, LS15], the matrix is updated in a fixed

schedule independent of the input sequence w. This leads to sub-optimal bounds if used in thispaper. We instead define a potential function to measure the distance between the approximatevector v and the target vector w. When there are less than nα coordinates of v that is far from w,we are lazy and do not update the matrix. We simply apply the Woodbury matrix identity online.When there are more than nα coordinates, we update v by a certain greedy step. As in the extremecases, the worst case for our algorithm is when w changes uniformly across all coordinates and hencethe worst case runtime is nω−

12 per iteration. We will further explain the potential in Section 5.1.

3 Notations

For notational convenience, we assume the number of variables n ≥ 10 and there are no redundantconstraints. In particular, this implies that the constraint matrix A is full rank and n ≥ d.

For a positive integer n, let [n] denote the set 1, 2, · · · , n.For any function f , we define O(f) to be f · logO(1)(f). In addition to O(·) notation, for two

functions f, g, we use the shorthand f . g (resp. &) to indicate that f ≤ Cg (resp. ≥) for someabsolute constant C.

We use sinhx to denote ex−e−x2 and coshx to denote ex+e−x

2 .For vectors a, b ∈ Rn and accuracy parameter ε ∈ (0, 1), we use a ≈ε b to denote that (1− ε)bi ≤

ai ≤ (1 + ε)bi, ∀i ∈ [n]. Similarly, for any scalar t, we use a ≈ε t to denote that (1 − ε)t ≤ ai ≤(1 + ε)t,∀i ∈ [n].

6

Page 8: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

For a vector x ∈ Rn and s ∈ Rn, we use xs to denote a length n vector with the i-th coordinate(xs)i is xi · si. Similarly, we extend other scalar operations to vector coordinate-wise.

Given vectors x, s ∈ Rn, we use X and S to denote the diagonal matrix of those two vectors. Weuse X

S to denote the diagonal matrix given (XS )i,i = xi/si. Similarly, we extend other scalar opera-

tions to diagonal matrix diagonal-wise. Note that matrix√

XS A>(AX

S A>)−1A

√XS is an orthogonal

projection matrix.

7

Page 9: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Convex body

N∞ = xs ≈0.1 t, for some t

Φλ(x) ≤ n3

t = 0

t = 1

xs = t

StochasticStep

ClassicalStep

Figure 1: ClassicalStep happens with n−2 probability

4 Stochastic Central Path Method

4.1 Proof Outline

The short step central path method is defined using the approximation (x+ δx)i(s+ δs)i ∼ xisi +xiδs,i + siδx,i. This approximate is accurate if ‖X−1δx‖∞ ≤ 1/2 and ‖S−1δs‖∞ ≤ 1/2. For the δxstep, we have

X−1δx =1√XS

(I − P )1√XS

δµ ∼1

t(I − P )δµ, (10)

where we used xisi ∼ t for all i.If we know that ‖δµ‖2 ≤ t/4, then the `∞ norm can be roughly bounded as follows:

‖X−1δx‖∞ ≤ ‖X−1δx‖2 .1

t‖(I − P )δµ‖2 ≤

1

t‖δµ‖2 ≤ 1/2,

where we used that I − P is an orthogonal projection matrix. This is the reason why a standardchoice of δµ,i is −ct/

√n for all i for some small constant c.

For the stochastic step, δµ,i ∼ − t√nnk for roughly k coordinates where the term n

k is used to

preserve the expectation of the step. Therefore, the `2 norm of δµ is very large (‖δµ‖2 ∼ t√

nk ).

After the projection, we have ‖X−1δx‖2 ∼ 1t ‖(I − P )δµ‖2 ∼

√nk . Hence, the bound of ‖X−1δx‖∞

using ‖X−1δx‖2 is too weak. To improve the bound, we use Chernoff bounds to estimate ‖X−1δx‖∞.To simplify the proof, we use a loop in Algorithm 1 to ensure both the sup norm is always smallnot just with high probability.

Beside the `∞ norm bound, the proof sketch in (10) also requires using xisi ∼ t for all i. Theshort step central path proof maintains an invariant that

∑i(xisi− t)2 = O(t2). However, since our

stochastic step has a stochastic noise with `2 norm as large as t√

nk , one cannot hope to maintain

xisi close to t in `2 norm. Instead, we follow an idea in [LS14, LSW15] and maintain the followingpotential

n∑i=1

cosh(λ(xisi

t− 1))

= nO(1)

8

Page 10: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Algorithm 11: procedure StochasticStep(mp, x, s, δµ, k, ε) . Lemma 4.2,4.3,4.82: w ← x

s , v ← mp.Update(w) . Algorithm 3

3: x← x√

vw , s← s

√wv . It guarantees that x

s = v and xs = xs4: repeat5: Generate δµ such that . Compute a sparse direction

6: δµ,i ←δµ,i/pi, with prob. pi = min(1, k · ((δ2

µ,i/∑n

l=1 δ2µ,l) + 1/n));

0 else.7: . Compute an approximate step8: . Find (δx, δs, δy) such that these three equations hold

Xδs + Sδx = δµ,

Aδx = 0,

A>δy + δs = 0.

9: pµ ← mp.Query( 1√XS

δµ) . Algorithm 3

10: δs ← S√XS

pµ . According to (11)

11: δx ← 1Sδµ − X√

XSpµ . According to (12)

12: until ‖s−1δs‖∞ ≤ 1100 logn and ‖x−1δx‖∞ ≤ 1

100 logn

13: return (x+ δx, s+ δs)14: end procedure

with λ = Θ(log n). This potential is a variant of soft-max. Note that the potential bounded bynO(1) implies that xisi is a multiplicative approximation of t. To bound the potential, considerri = xisi

t and Φ(r) be the potential above. Then, we have that

E[Φ(rnew)] ≤ Φ(r) + 〈∇Φ(r),E[rnew − r]〉+O(1)E[‖rnew − r‖2∇2Φ(r)].

The first order term can be bounded efficiently because E[rnew− r] is close to the short step centralpath step. The second term is a variance term which scales like 1/k due to the k independentcoordinates. Therefore, the potential changed by 1/k ∼ 1/

√n factor each step. Hence, we can

maintain it for roughly√n steps.

To make sure the potential Φ is bounded during the whole algorithm, our step is the mixturesof two steps of the form δµ ∼ − t√

n− t ∇Φ‖∇Φ‖2 . The first term is to decrease t and the second term is

to decrease Φ.Since the algorithm is randomized, there is a tiny probability that Φ is large. In that case, we

switch to a short step central path method. See Figure 1, Algorithm 1, and Algorithm 2. The firstpart of the proof involves bounding every quantity listed in Table 1. In the second part, we areusing these quantities to bound the expectation of Φ.

To decouple the proof in both parts, we will make the following assumption in the first part. Itwill be verified in the second part.

Assumption 4.1. Assume the following for the input of the procedure StochasticStep (seeAlgorithm 1):

9

Page 11: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Algorithm 2 Our main algorithm1: procedure Main(A, b, c, δ) . Theorem 2.12: ε← 1

40000 logn , εmp ← 140000 , k ←

1000ε√n log2 n

εmp.

3: λ← 40 log n, δ ← min( δ2 ,1λ), a← min(α, 2/3).

4: Modify the linear program and obtain an initial x and s according to Lemma A.6.5: MaintainProjection mp6: mp.Initialize(A, xs , εmp, a) . Algorithm 37: t← 1 . Initialize t8: while t > δ2/(32n3) do . We stop once the error is small enough9: tnew ← (1− ε

3√n

)t10: µ← xs11: δµ ← ( t

new

t − 1)xs− ε2 · tnew · ∇Φλ(µ/t−1)

‖∇Φλ(µ/t−1)‖2 . Φλ is defined in Lemma 4.1212: (xnew, snew)← StochasticStep(mp, x, s, δµ, k, ε) . Algorithm 113: if Φλ(µnew/tnew − 1) > n3 then . When potential function is large14: (xnew, snew)← ClassicalStep(x, s, tnew) . Lemma A.2, [Vai89b]15: mp.Initialize(A, x

new

snew , εmp, a) . Restart the data structure16: end if17: (x, s)← (xnew, snew), t← tnew

18: end while19: Return an approximate solution of the original linear program according to Lemma A.6.20: end procedure

• xs ≈0.1 t with t > 0.• mp.Update(w) outputs v such that w ≈εmp v with εmp ≤ 1/40000.• ‖δµ‖2 ≤ εt with 0 < ε < 1/(40000 log n).• k ≥ 1000ε

√n log2 n/εmp.

The data structure mp in both Algorithm 1 and Algorithm 2 is used to maintain some approx-imation of the projection matrix. It is formally defined in Section 5. For this section, the onlyfacts we need is that w ≈εmp v stated in the assumption and that the vector mp.Query(w) outputssatisfies line 8 in Algorithm 1.

4.2 Bounding each quantity of stochastic step

First, we give an explicit formula for our step, which will be used in all subsequent calculations.

Lemma 4.2. The procedure StochasticStep(mp, x, s, δµ, k, ε) (see Algorithm 1) finds a solutionδx, δs ∈ Rn to (7) by the formula

δx =X√XS

(I − P )1√XS

δµ (11)

δs =S√XS

P1√XS

δµ (12)

with

P =

√X

SA>(AX

SA>)−1

A

√X

S. (13)

10

Page 12: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Quantity Bound Place‖E[s−1δs]‖2, ‖E[x−1δx]‖2, ‖E[µ−1δµ]‖2 O(ε) Part 1, Lemma 4.3‖E[µ−1(µnew − µ− δµ)]‖2 O(εmp · ε) Part 1, Lemma 4.8‖E[µ−1(µnew − µ)]‖2 O(ε) Part 1, Lemma 4.8Var[s−1

i δs,i],Var[x−1i δx,i],Var[µ−1

i δµ,i] O(ε2/k) Part 2, Lemma 4.3Var[µ−1

i µnew] O(ε2/k) Part 2, Lemma 4.8‖s−1δs‖∞, ‖x−1δx‖∞, ‖µ−1δµ‖∞ O(1/ log n) Part 3, Lemma 4.3‖µ−1(µnew − µ)‖∞ O(1/ log n) Part 3, Lemma 4.8

Table 1: The bound of each quantity under Assumption 4.1. For intuition, think ε ∼ εmp ∼ 1/10and k ∼ √n.

Proof. For the first equation of (7), we multiply AS−1 on both sides,

AS−1Xδs +Aδx = AS

−1δµ.

Since the second equation gives Aδx = 0, then we know that AS−1Xδs = AS

−1δµ.

Multiplying AS−1X on both sides of the third equation of (7), we have

−AS−1XA>δy = AS

−1Xδs = AS

−1δµ.

Thus,

δy = − (AS−1XA>)−1AS

−1δµ,

δs = A>(AS−1XA>)−1AS

−1δµ,

δx = S−1δµ − S−1

XA>(AS−1XA>)−1AS

−1δµ.

Recall we define P as (13), then we have

δs =S√XS·

√X

SA>(A

X

SA>)−1

√X

S· 1√

XSδµ =

S√XS

P1√XS

δµ,

and

δx = S−1δµ −

X√XS·

√X

SA>(A

X

SA>)−1

√X

S· 1√

XSδµ =

X√XS

(I − P )1√XS

δµ.

which are matching (11) and (12).To see why the StochasticStep outputs δx, δs satisfying (11) and (12), we note that

pµ =√V A>

(AX

SA>)−1

A√V

1√XS

δµ = P1√XS

δµ

because of Theorem 5.1.

Using the explicit formula, we are ready to bound all quantities we needed in the following twosubsubsections.

11

Page 13: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

4.2.1 Bounding δs, δx and δµ

Lemma 4.3. Under the Assumption 4.1, the two vectors δx and δs found by StochasticStepsatisfy :1. ‖E[s−1δs]‖2 ≤ 2ε, ‖E[x−1δx]‖2 ≤ 2ε, ‖E[s−1δs]‖2 ≤ 2ε, ‖E[x−1δx]‖2 ≤ 2ε, ‖E[µ−1δµ]‖2 ≤ 4ε.

2. Var[δs,isi

] ≤ 2ε2

k ,Var[δx,ixi

] ≤ 2ε2

k ,Var[δs,isi

] ≤ 2ε2

k ,Var[δx,ixi

] ≤ 2ε2

k ,Var[δµ,iµi

] ≤ 8ε2

k .3. ‖s−1δs‖∞ ≤ 0.01

logn , ‖s−1δs‖∞ ≤ 0.02logn , ‖x−1δx‖∞ ≤ 0.01

logn , ‖x−1δx‖∞ ≤ 0.02logn , ‖µ−1δµ‖∞ ≤ 0.02

logn .

Remark 4.4. For notational simplicity, the E and Var in the proof are for the case without re-sampling (Line 12). Since the all the additional terms due to resampling are polynomially boundedand since we can set failure probability to an arbitrarily small inverse polynomial (see Claim 4.7),if we took into account the extra variance from resampling, the proof does not change and the resultremains the same.

Proof.

Claim 4.5 (Part 1, bounding the `2 norm of expectation).

‖E[s−1δs]‖2 ≤ 2ε, ‖E[x−1δx]‖2 ≤ 2ε, ‖E[s−1δs]‖2 ≤ 2ε, ‖E[x−1δx]‖2 ≤ 2ε, ‖E[µ−1δµ]‖2 ≤ 4ε.

Proof. For ‖s−1δs‖∞, we consider the i-th coordinate of the vector

s−1i δs,i =

1√xisi

n∑j=1

P i,jδµ,j√xjsj

.

Then, we have

E[s−1i δs,i

]=

1√xisi

n∑j=1

P i,jE[δµ,j ]√xjsj

=1√xisi

n∑j=1

P i,jδµ,j√xjsj

.

Since xs ≈0.1 t and ‖δµ‖2 ≤ εt, we have ‖ δµ√xs‖2 ≤ 1.1εt√

t. Since P is an orthogonal projection

matrix, we have ‖P δµ√xs‖2 ≤ ‖ δµ√

xs‖2. Putting all the above facts and xs = xs, we can show

∥∥∥E[s−1δs]∥∥∥2

2=

n∑i=1

1√xisi

n∑j=1

P i,jδµ,j√xjsj

2

=

n∑i=1

1

xisi

n∑j=1

P i,jδµ,j√xjsj

2

≤ 1

0.9t

n∑i=1

n∑j=1

P i,jδµ,j√xjsj

2

=1

0.9t‖P δµ√

xs‖22

≤ 1

0.9t‖ δµ√

xs‖22 ≤

(1.1)2

0.9t· (εt)2

t≤ 1.4ε2,

which implies that ∥∥∥E[s−1δs]∥∥∥

2≤ 1.2ε. (14)

Notice that the proof for x is identical to the proof for s because (I−P ) is also a projection matrix.Since s ≈0.1 s and x ≈0.1 x, then we can also prove the next two inequalities in the Claim statement.

Now, we are ready to bound ‖E[µ−1δµ]‖2‖E[µ−1δµ]‖2 = ‖E[s−1x−1(xδs + sδx)]‖2 ≤ ‖E[s−1δs]‖2 + ‖E[x−1δx]‖2 ≤ 4ε.

by using µ = xs = xs and xδs + sδx = δµ from (7).

12

Page 14: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Claim 4.6 (Part 2, bounding the variance per coordinate).

Var[s−1i δs,i] ≤

2ε2

k,Var[x−1

i δx,i] ≤2ε2

k,Var[s−1

i δs,i] ≤2ε2

k,Var[x−1

i δx,i] ≤2ε2

k,Var[µ−1

i δµ,i] ≤8ε2

k.

Proof. Consider the i-th coordinate of the vector

s−1i δs,i =

1√xisi

n∑j=1

P i,jδµ,j√xjsj

.

For variance of s−1i δs,i, we have

Var[s−1i δs,i] =

1

xisi

n∑j=1

P2i,j

xjsjVar[δµ,j ] by all δµ,j are independent

≤ 1

xisi

n∑j=1

P2i,j

xjsj

1

k

δ2µ,j

δ2µ,j∑nl=1 δ

2µ,l

+ 1n

by (6)

≤ 1

xisi

n∑j=1

P2i,j

xjsj

1

k

n∑l=1

δ2µ,l

≤ 1.3

t2

n∑j=1

P2i,j

1

k

n∑l=1

δ2µ,l ≤

1.3ε2

k, by xisi = xisi ≈1/10 t

where we used that∑n

j=1 P2i,j = P i,i ≤ 1, ‖δµ‖2 ≤ εt at the end.

The proof for the other three inequalities in the Claim statement are identical to this one. Weomit here.

For the variance of µ−1i δµ,i,

Var[µ−1i δµ,i] = Var[x−1

i s−1i (xiδs,i + siδx,i)]

≤ 2Var[x−1i xis

−1i δs,i] + 2Var[s−1

i six−1i δx,i]

= 2Var[s−1i δs,i] + 2Var[x−1

i δx,i] ≤ 8ε2/k.

where we used the definition µ = xs = xs and (7) in the first step, the triangle inequality in thesecond step and, Var[s−1

i δs,i],Var[x−1i δx,i] ≤ 2ε2/k at the end.

Claim 4.7 (Part 3, bounding the probability of success). Without resampling, the following holdswith probability 1− 2n exp(− 0.003k

ε√n logn

).

‖s−1δs‖∞ ≤0.01

log n, ‖s−1δs‖∞ ≤

0.02

log n, ‖x−1δx‖∞ ≤

0.01

log n, ‖x−1δx‖∞ ≤

0.02

log n, ‖µ−1δµ‖∞ ≤

0.02

log n.

With resampling, it always holds.

Proof. We can write s−1i δs,i−E[s−1

i δs,i] =∑

j Yj where Yj are independent random variables definedby

Yj =1√xisi

P i,jδµ,j√xjsj

− 1√xisi

P i,jδµ,j√xjsj

.

13

Page 15: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

We bound the sum using Bernstein inequality. Note that Yj are mean 0 and that Claim 4.6 showsthat

∑nj=1 E[Y 2

j ] = Var[s−1i δs,i] ≤ 2ε2

k . We also need to give an upper bound for Yj

|Yj | =∣∣∣∣∣ 1√xisi

P i,j

(δµ,j − δµ,j√

xjsj

)∣∣∣∣∣≤ 1.2

t|δµ,j − δµ,j | by |P i,j | ≤ 1, xisi ≈1/10 t

≤ 1.2

t|δµ,j/pj | by δµ,j ∈ [0, δµ,j/pj ]

=1.2

t

1

k

1

(δµ,i∑nl=1 δ

2µ,l

+ 1nδµ,i

)by (6)

≤ 0.6

t

1

k

(n

n∑l=1

δ2µ,l

)1/2

by a2 + b2 ≥ 2ab

≤ 0.6ε√n

k

def= M. by ‖δµ‖2 ≤ εt

Now, we can apply Bernstein inequality

Pr

∣∣∣∣∣∣n∑j=1

Yj

∣∣∣∣∣∣ > b

≤ 2 exp

(− b2/2∑n

j=1 E[Y 2j ] +Mb/3

)

≤ 2 exp

(− b2/2

2ε2/k + (0.6ε√n/k) · b/3

).

We choose b = 0.005logn and use ε ≤ 1

400 logn and n ≥ 10 to get

Pr

∣∣∣∣∣∣n∑j=1

Yj

∣∣∣∣∣∣ ≥ 0.05

log n

≤ 2 exp

(− 0.003k

ε√n log n

).

Since ‖E[s−1i δs,i]‖2 ≤ 2ε ≤ 0.005

logn , we have that |s−1i δs,i| ≤ 0.01

logn with probability 1−2 exp(− 0.003kε√n logn

).

Taking a union bound, we have that ‖s−1δs‖∞ ≤ 0.01logn with probability 1− 2n exp(− 0.003k

ε√n logn

). Sim-ilarly, this holds for the other 3 terms.

Now, the last term follows by the calculation

|µ−1i δµ,i| = |x−1

i s−1i (xiδs,i + siδx,i)| = |s−1

i δs,i|+ |x−1i δx,i| ≤

0.02

log n.

4.2.2 Bounding µnew − µLemma 4.8. Under the Assumption 4.1, the vector µnew

idef= (xi + δx,i)(si + δs,i) satisfies

1. ‖E[µ−1(µnew − µ− δµ)]‖2 ≤ 10εmp · ε and ‖E[µ−1(µnew − µ)]‖2 ≤ 5ε.2. Var[µ−1

i µnewi ] ≤ 50ε2/k for all i.

3. ‖µ−1(µnew − µ)‖∞ ≤ 0.021logn .

14

Page 16: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Claim 4.9 (Part 1 of Lemma 4.8).

‖E[µ−1(µnew − µ− δµ)]‖2 ≤ 10εmp · ε, and ‖E[µ−1(µnew − µ)]‖2 ≤ 5ε.

Proof.

µnew = (x+ δx)(s+ δs) = µ+ xδs + sδx + δxδs = µ+ xδs + sδx︸ ︷︷ ︸δµ

+ (x− x)δs + (s− s)δx + δxδs︸ ︷︷ ︸εµ

.

Taking the expectation on both sides, we have

E[µnew − µ− δµ] = (x− x)E[δs] + (s− s)E[δx] + E[δxδs].

Hence, we have that

‖µ−1 E[µnew − µ− δµ]‖2≤ ‖µ−1(x− x)s · s−1 E[δs]‖2 + ‖µ−1(s− s)x · x−1 E[δx]‖2 + ‖µ−1 E[δxδs]‖2≤ ‖µ−1(x− x)s‖∞ · ‖s−1 E[δs]‖2 + ‖µ−1(s− s)x‖∞ · ‖x−1 E[δx]‖2 + ‖µ−1 E[δxδs]‖2≤ εmp · ‖s−1 E[δs]‖2 + εmp · ‖x−1 E[δx]‖2 + ‖µ−1 E[δxδs]‖2≤ 4εmp · ε+ ‖µ−1 E[δxδs]‖2, (15)

where we used the triangle inequality in the first step, ‖ab‖2 ≤ ‖a‖∞ · ‖b‖2 in the second step,‖µ−1(x− x)s‖∞ ≤ εmp and ‖µ−1(s− s)x‖∞ ≤ εmp (since x ≈εmp x, s ≈εmp s) in the third step, and‖E[s−1δs]‖2 ≤ 2ε and ‖E[x−1δx]‖2 ≤ 2ε (Part 1 of Lemma 4.3) at the end.

To bound the last term, using E[δs] = δs and E[δx] = δx, we note that

E[δx,iδs,i] = δx,iδs,i + E[(δx,i − δx,i)(δs,i − δs,i)].

Hence, we have

‖µ−1 E[δxδs]‖2 ≤ ‖µ−1δxδs‖2 +

(n∑i=1

(E[x−1i (δx,i − δx,i) · s−1

i (δs,i − δs,i)])2)1/2

≤ 4ε2 +1

2

(n∑i=1

(Var[x−1

i δx,i] + Var[s−1i δs,i]

)2)1/2

≤ 4ε2 +1

2

(n∑i=1

2(Var[x−1i δx,i])

2 + 2(Var[s−1i δs,i])

2

)1/2

≤ 4ε2 + 2√n · ε4/k2 ≤ 4ε2 + 2ε · εmp ≤ 6ε · εmp, (16)

where we used ‖µ−1δxδs‖2 ≤ ‖x−1δx‖2 · ‖s−1δs‖2 ≤ 4ε2 (Part 1 of Lemma 4.3) and 2ab ≤ a2 + b2 inthe second step, (a+b)2 ≤ 2a2+2b2 in the third step, Var[x−1

i δx,i] ≤ 2ε2/k andVar[s−1i δs,i] ≤ 2ε2/k

(Part 2 of Lemma 4.3) in the fourth step, and k ≥ ε√n

εmpat the end.

Combining (15) and (16), we have that

‖µ−1(E[µnew − µ− δµ])‖2 ≤ 4εmp · ε+ ‖µ−1 E[δxδs]‖2 ≤ 10εmp · ε.

where we used ε ≤ εmp.

15

Page 17: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

From Part 1 of Lemma 4.3, we know that ‖µ−1 E[δµ]‖2 ≤ 4ε. Thus using triangle inequality, weknow

‖µ−1(E[µnew − µ])‖2 ≤ 10εmp · ε+ 4ε ≤ 5ε.

Claim 4.10 (Part 2 of Lemma 4.8). Var[µ−1i µnew

i ] ≤ 50ε2/k for all i.

Proof. Recall that

µnew = µ+ δµ + (x− x)δs + (s− s)δx + δxδs.

We can upper bound the variance of µ−1i µnew

i ,

Var[µ−1i µnew

i ] ≤ 4Var[µ−1i δµ,i] + 4Var[µ−1

i (xi − xi)δs,i] + 4Var[µ−1i (si − si)δx,i] + 4Var[µ−1

i δx,iδs,i]

≤ 32ε2

k+ 4

ε2

k+ 4

ε2

k+ Var[µ−1

i δx,iδs,i]

= 40ε2

k+ Var[x−1

i δx,i · s−1i δs,i]

≤ 40ε2

k+ 2Sup[(x−1

i δx,i)2] ·Var[s−1

i δs,i] + 2Sup[(s−1i δs,i)

2] ·Var[x−1i δx,i]

≤ 40ε2

k+ 2 · ( 0.02

log n)2 · ε

2

k+ 2 · ( 0.02

log n)2 · ε

2

k≤ 50

ε2

k.

where the second step follows by the inequality Var[µ−1i δµ,i] ≤ 8ε2/k (Part 2 of Lemma 4.3),

Var[µ−1i (xi − xi)δs,i] = Var[x−1

i (xi − xi)s−1i δs,i] ≤ 2ε2mp Var[s−1

i δs,i] ≤ ε2/k.

and a similar inequality Var[µ−1i (si − si)δx,i] ≤ ε2/k, the third step follows by the definition

µ = xs, the fourth step follows by the inequality Var[xy] ≤ 2Sup[x2]Var[y] + 2Sup[y2]Var[x](Lemma A.1) with Sup denoting the deterministic maximum of the random variable, the fifth stepfollows by the inequalitiesVar[s−1

i δs,i] ≤ 2ε2/k andVar[x−1i δx,i] ≤ 2ε2/k (Part 2 of Lemma 4.3).

Claim 4.11 (Part 3 of Lemma 4.8). ‖µ−1(µnew − µ)‖∞ ≤ 0.021logn .

Proof. We again note that

µnew = µ+ δµ + (x− x)δs + (s− s)δx + δxδs.

Hence, we have

|µ−1i (µnew

i − µi − δµ,i)|≤ |(x− x)iµ

−1i δs,i|+ |(s− s)iµ−1

i δx,i|+ |µ−1i δx,iδs,i|

= |(x− x)ix−1i | · |s−1

i δs,i|+ |(s− s)is−1i | · |x−1

i δx,i|+ |x−1i δx,i| · |s−1

i δs,i|≤ εmp|s−1

i δs,i|+ εmp|x−1i δx,i|+ |s−1

i δs,i||x−1i δx,i|

≤ εmp ·0.2

log n+ εmp ·

0.02

log n+ (

0.02

log n)2 ≤ 1

1000 log n,

16

Page 18: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

where the first step follows by the triangle inequality, the second step follows by the definitionµi = xisi, the third step follows by the invariants x ≈εmp x and s ≈εmp s, the fifth step follows bythe inequalities |s−1

i δs,i| ≤ 0.02logn and |x−1

i δx,i| ≤ 0.02logn (Part 3 of Lemma 4.3).

Since we know that |µ−1i δµ,i| ≤ 0.02

logn (Part 3 of Lemma 4.3), we have

|µ−1i (µnew

i − µi)| ≤1

1000 log n+

0.02

log n≤ 0.021

log n.

4.3 Stochastic central path

Now, we are ready to prove xisi ≈0.1 t during the whole algorithm. As explained in the proofoutline (see Section 4.1), we will prove this bound by analyzing the potential Φλ(µ/t − 1) whereΦλ(r) =

∑ni=1 cosh(λri).

First, we give some basic properties of Φλ.

Lemma 4.12 (Basic properties of potential function). Let Φλ(r) =∑n

i=1 cosh(λri) for some λ > 0.For any vector r ∈ Rn,1. For any vector ‖v‖∞ ≤ 1/λ, we have that

Φλ(r + v) ≤ Φλ(r) + 〈∇Φλ(r), v〉+ 2‖v‖2∇2Φλ(r).

2. ‖∇Φλ(r)‖2 ≥ λ√n

(Φλ(r)− n).

3.(∑n

i=1 λ2 cosh2(λri)

)1/2 ≤ λ√n+ ‖∇Φλ(r)‖2.

Proof. For each i ∈ [n], we use ri to denote the i-th coordinate of vector r.Proof of Part 1. Using mean-value forms of Taylor’s theorem, we have that

cosh(λ(ri + vi)) = cosh(λri) + λ sinh(λri)vi +λ2

2cosh(ζi)v

2i ,

where ζi is between λri and λ(ri + vi). By definition of cosh and the assumption that ‖v‖∞ ≤ 12λ ,

we have that

cosh(ζi) =1

2exp(ζi) +

1

2exp(−ζi) ≤ exp(1) · 1

2(exp(λri) + exp(−λri)) ≤ 3 cosh(λri).

Hence, we have

cosh(λ(ri + vi)) ≤ cosh(λri) + λ sinh(λri)vi + 2λ2 cosh(λri)v2i .

Summing over all the coordinates gives

n∑i=1

cosh(λ(ri + vi)) ≤n∑i=1

[cosh(λri) + 2λ sinh(λri)vi + λ2 cosh(λri)v

2i

]=⇒ Φλ(r + v) ≤ Φλ(r) + 〈∇Φλ(r), v〉+ 2‖v‖2∇2Φλ(r).

Proof of Part 2. Since Φλ(r) =∑n

i=1 cosh(λri), then

∇Φλ(r) =[λ sinh(λr1) λ sinh(λr2) · · · λ sinh(λrn)

]>.

17

Page 19: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Thus, we can lower bound ‖∇Φλ(r)‖2 in the following way,

‖∇Φλ(r)‖2 =

(n∑i=1

λ2 sinh2(λri)

)1/2

=

(n∑i=1

λ2(cosh2(λri)− 1)

)1/2

by cosh2(y)− sinh2(y) = 1,∀y

≥ λ√n

n∑i=1

√cosh2(λri)− 1 by ‖ · ‖2 ≥

1√n‖ · ‖1

≥ λ√n

n∑i=1

(cosh(λri)− 1) by cosh(λri) ≥ 1

=λ√n

(Φλ(r)− n). by def of Φ(r)

Proof of Part 3.

(n∑i=1

λ2 cosh2(λri)

)1/2

=

(n∑i=1

λ2 + λ2 sinh2(λri)

)1/2

by cosh2(y)− sinh2(y) = 1,∀y

≤ (nλ2)1/2 +

(n∑i=1

λ2 sinh2(λri)

)1/2

= λ√n+ ‖∇Φλ(r)‖2.

The following lemma shows that the potential Φ is decreasing in expectation when Φ is large.

Lemma 4.13. Under the Assumption 4.1, we have

E

[Φλ

(µnew

tnew− 1

)]≤ Φλ

(µt− 1)− λε

15√n

(Φλ

(µt− 1)− 10n

).

Proof. Let εµ = µnew − µ− δµ. From the definition, we have

µnew − tnew = µ+ δµ + εµ − tnew,

which implies

µnew

tnew− 1 =

µ

tnew+

1

tnew(δµ + εµ)− 1

t

t

tnew+

1

tnew(δµ + εµ)− 1

t+µ

t(t

tnew− 1) +

1

tnew(δµ + εµ)− 1

t− 1 +

µ

t(t

tnew− 1) +

1

tnew(δµ + εµ)︸ ︷︷ ︸

v

. (17)

18

Page 20: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

To apply Lemma 4.12 with r = µ/t − 1 and r + v = µnew/tnew − 1, we first compute theexpectation of v

E[v] =µ

t(t

tnew− 1) +

1

tnew(E[δµ] + E[εµ])

t(t

tnew− 1) +

1

tnew(δµ + E[εµ])

t(t

tnew− 1) +

1

tnew

(((tnew

t− 1)µ− ε

2tnew ∇Φλ(µ/t− 1)

‖∇Φλ(µ/t− 1)‖2

)+ E[εµ]

)= − ε

2

∇Φλ(µ/t− 1)

‖∇Φλ(µ/t− 1)‖2+

1

tnewE[εµ], (18)

where the third step follows by the definition of δµ.Next, we bound the ‖v‖∞ as follows

‖v‖∞ ≤∥∥∥∥µt (

t

tnew− 1)

∥∥∥∥∞

+

∥∥∥∥ 1

tnew(δµ + εµ)

∥∥∥∥∞≤ ε√

n+‖µ−1(µnew − µ)‖∞

0.9

≤ ε√n

+0.021

0.9 log n≤ 1

λ.

where we used Part 3 of Lemma 4.8 and ε ≤ 1400 logn .

Since ‖v‖∞ ≤ 1λ , we can apply Part 1 of Lemma 4.12 and get

E[Φλ(µ/t+ v − 1)]

≤ Φλ(µ/t− 1) + 〈∇Φλ(µ/t− 1),E[v]〉+ 2E[‖v‖2∇2Φλ(µ/t+v−1)]

= Φλ(µ/t− 1)− ε

2‖∇Φλ(µ/t− 1)‖2 +

t

tnew〈∇Φλ(µ/t− 1),E[t−1εµ]〉+ 2E[‖v‖2∇2Φλ(µ/t−1)]

≤ Φλ(µ/t− 1)− ε

2‖∇Φλ(µ/t− 1)‖2 +

t

tnew‖∇Φλ(µ/t− 1)‖2 · ‖E[t−1εµ]‖2 + 2E[‖[v]‖2∇2Φλ(µ/t−1)]

≤ Φλ(µ/t− 1)− ε

2‖∇Φλ(µ/t− 1)‖2 + 10εmp · ε‖∇Φλ(µ/t− 1)‖2 + 2E[‖v‖2∇2Φλ(µ/t−1)],

where we substituted E[v] by (18) in the second step, we used 〈a, b〉 ≤ ‖a‖2 · ‖b‖2 in the third step,and ‖E[t−1εµ]‖2 ≤ 10εmp · ε (from Part 1 of Lemma 4.8 and µ ≈0.1 t) at the end

We still need to bound E[‖v‖2∇2Φλ(µ/t−1)]. Before bounding it, we first bound E[v2i ],

E[v2i ] ≤ 2E

[(µit

(t

tnew− 1)

)2]

+ 2E

[(1

tnew(δµ,i + δµ,i)

)2]

≤ ε2/n+ 2.5E[((µnew

i − µi)/µi)2]

= ε2/n+ 2.5Var[(µnewi − µi)/µi] + 2.5(E[(µnew

i − µi)/µi])2

≤ ε2/n+ 125ε2/k + 2.5(E[(µnewi − µi)/µi])2

≤ 126ε2/k + 3(E[(µnewi − µi)/µi])2, (19)

where we used the definition of v (see (17)) in the first step, µ ≈0.1 t and (t/tnew − 1)2 ≤ ε2/(4n)in the second step, E[x2] = Var[x] + (E[x])2 in the third step, Part 2 of Lemma 4.8 in the fourthstep, and n ≥ k at the end.

19

Page 21: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Now, we are ready to bound E[‖v‖2∇2Φλ(µ/t−1)]

E[‖v‖2∇2Φλ(µ/t−1)]

= λ2n∑i=1

E[Φλ(µ/t− 1)iv2i ]

≤ λ2n∑i=1

Φλ(µ/t− 1)i · (126ε2/k + 3(E[(µnewi − µi)/µi])2)

= 126λ2ε2

kΦλ(µ/t− 1) + 3λ2

n∑i=1

Φλ(µ/t− 1)i · (E[(µnewi − µi)/µi])2

≤ 126λ2ε2

kΦλ(µ/t− 1) + 3λ

(n∑i=1

λ2Φλ(µ/t− 1)2i

)1/2

· ‖E[µ−1(µnew − µ)]‖24

≤ 126λ2ε2

kΦλ(µ/t− 1) + 3λ

(λ√n+ ‖∇Φλ(µ/t− 1)‖2

)· (5ε)2,

where the first step follows from the fact Φλ(x)i = cosh(λxi), the second step follows from (19), thefourth step follows from Cauchy-Schwarz inequality, the fifth step follows from Part 3 of Lemma 4.12and the fact that ‖E[µ−1(µnew − µ)]‖24 ≤ ‖E[µ−1(µnew − µ)]‖22 ≤ (5ε)2 (Lemma 4.8).

Then,

E[Φλ(µ/t+ v − 1)]

≤ Φλ(µ/t− 1)− (ε

2− 10εmp · ε)‖∇Φλ(µ/t− 1)‖2 + 252

λ2ε2

kΦλ(µ/t− 1)

+ 150λ2ε2√n+ 150λε2‖Φλ(µ/t− 1)‖2

≤ Φλ(µ/t− 1)− ε

3‖∇Φλ(µ/t− 1)‖2 + 252

λ2ε2

kΦλ(µ/t− 1) + 150λ2ε2

√n

≤ Φλ(µ/t− 1)− λε

3√n

(Φλ(µ/t− 1)− n) + 252λ2ε2

kΦλ(µ/t− 1) + 150λ2ε2

√n

≤ Φλ(µ/t− 1)− λε

3√n

(Φλ(µ/t− 1)/5− 2n),

where the second step follows from the inequalities 1000λε ≤ 1 and 1000εmp ≤ 1, the third stepfollows from Part 2 of Lemma 4.12, and the last step follows from the inequalities 1000λεmp ≤ log n

and k ≥√nε lognεmp

.

As a corollary, we have the following:

Lemma 4.14. During the Main algorithm, Assumption 4.1 is always satisfied. Furthermore, theClassicalStep happens with probability O( 1

n2 ) each step.

Proof. The second and the fourth assumptions simply follow from the choice of εmp and k.Let Φ(k) be the potential at the k-th iteration of the Main. The ClassicalStep ensures that

Φ(k) ≤ n3 at the end of each iteration. By the definition of Φ and the choice of λ in Main, we havethat ∥∥∥xs

t− 1∥∥∥∞≤ ln(2n3)

λ≤ 0.1.

20

Page 22: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

This proves the first assumption xs ≈0.1 t with t > 0.For the third assumption, we note that

‖δµ‖2 =

∥∥∥∥( tnew

t− 1

)xs− ε

2· tnew · ∇Φλ(µ/t− 1)

‖∇Φλ(µ/t− 1)‖2

∥∥∥∥2

≤∣∣∣∣ tnew

t− 1

∣∣∣∣ ‖xs‖2 +ε

2tnew

≤ ε

3√n· 1.1√nt+ 1.01 · ε

2t ≤ εt,

where we used xs ≈0.1 t and the formula of tnew. Hence, we proved all assumptions in Assump-tion 4.1.

Now, we bound the probability that ClassicalStep happens. In the beginning of the Main,Lemma A.6 is used to modify the linear program with parameter min( δ2 ,

1λ). Hence, the initial

point x and s satisfies xs ≈1/λ 1. Therefore, we have Φ(0) ≤ 10n. Lemma 4.13 shows E[Φ(k+1)] ≤(1− λε

15√n

)E[Φ(k)]+ λε15√n

10n. By induction, we have that E[Φ(k)] ≤ 10n for all k. Since the potentialis positive, Markov inequality shows that for any k, Φ(k) ≥ n3 with probability at most O( 1

n2 ).

4.4 Analysis of cost per iteration

To apply the data structure for projection maintenance (Theorem 5.1), we need to first prove theinput vector w does not change too much for each step.

Lemma 4.15. Let xnew = x+ δx and snew = s+ δs. Let w = xs and wnew = xnew

snew . Then we have

n∑i=1

(E[lnwnewi ]− lnwi)

2 ≤ 64ε2,n∑i=1

(Var[lnwnewi ])2 ≤ 1000ε2.

Proof. From the definition, we know that

wnewi

wi=

1

s−1i xi

xi + δx,i

si + δs,i=

1 + x−1i δx,i

1 + s−1i δs,i

.

Part 1. For each i ∈ [n], we have

E[lnwnewi ]− lnwi = E

[ln(1 + x−1

i δx,i)− ln(1 + s−1i δs,i)

]≤ 2|E[x−1

i δx,i − s−1i δs,i]| by |s−1

i δs,i|, |x−1i δx,i| ≤ 0.2,Lemma 4.3

≤ 2|E[x−1i δx,i]|+ 2|E[s−1

i δs,i]|. by triangle inequality

Thus, summing over all the coordinates gives

n∑i=1

(E[lnwnewi ]− lnwi)

2 ≤n∑i=1

8(E[x−1i δx,i])

2 + 8(E[s−1i δs,i])

2 ≤ 64ε2.

where the first step follows by the triangle inequality, the last step follows by the inequalities‖E[s−1δs]‖22, ‖E[x−1δx]‖22 ≤ 4ε2 (Part 1 of Lemma 4.3).

21

Page 23: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Part 2. For each i ∈ [n], we have

Var[wnewi ] ≤E

[(lnwnew

i − lnwi)2]

= E

(ln1 + x−1

i δx,i

1 + s−1i δs,i

)2

≤ 2E[(x−1i δx,i − s−1

i δs,i)2]

≤ 2E[2(x−1i δx,i)

2 + 2(s−1i δs,i)

2]

= 4E[(x−1i δx,i)

2] + 4E[(s−1i δs,i)

2]

= 4Var[x−1i δx,i] + 4(E[x−1

i δx,i])2 + 4Var[s−1

i δs,i] + 4(E[s−1i δs,i])

2

≤ 16ε2/k + 4(E[x−1i δx,i])

2 + 4(E[s−1i δs,i])

2,

where we used Var[x−1i δx,i],Var[s−1

i δs,i] ≤ 2ε2/k (Part 2 of Lemma 4.3) at the end.Thus summing over all the coordinates

n∑i=1

(Var[wnewi ])2 ≤ 512nε4

k2+ 64

n∑i=1

((E[x−1

i δx,i])4 + (E[s−1

i δs,i])4)

≤ 512nε4

k2+ 2048ε4 ≤ 1000ε2,

where we used ‖E[s−1δs]‖22, ‖E[x−1δx]‖22 ≤ 4ε2 and k ≥ √nε at the end.

Now, we analyze the cost per iteration in procedure Main. This is a direct application of ourprojection maintenance result.

Lemma 4.16. For ε ≥ 1√n, each iteration of Main (Algorithm 2) takes

n1+a+o(1) + ε · (nω−1/2+o(1) + n2−a/2+o(1))

expected time per iteration in amortized where 0 ≤ a ≤ α controls the batch size in the data structureand α is the dual exponent of matrix multiplication.

Proof. Lemma 4.14 shows that ClassicalStep happens with only O(1/n2) probability each step.Since the cost of each step only takes O(n2.5), the expected cost is only O(n0.5).

Lemma 4.15 shows that the conditions in Theorem 5.1 holds with the parameter C1 = O(ε), C2 =O(ε), εmp = Θ(1).

In the procedure StochasticStep, Theorem 5.1 shows that the amortized time per iterationis mainly dominated by two steps:

1. mp.Update(w): O(ε · (nω−1/2+o(1) + n2−a/2+o(1))).2. mp.Query( 1√

XSδµ): O(n · ‖δµ‖0 + n1+a+o(1)).

Combining both running time and using E[‖δµ‖0] = O(1 + k) = O(ε√n log2 n) (according to

the probability of success in Claim 4.7 and matching Assumption 4.1), we have the result.

22

Page 24: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

4.5 Main result

Proof of Theorem 2.1. In the beginning of the Main algorithm, Lemma A.6 is called to modify thelinear program. Then, we run the stochastic central path method on this modified linear program.

When the algorithm stops, we obtain a vector x and s such that xs ≈0.1 t with t ≤ δ2

32n3 . Hence,the duality gap is bounded by

∑i xisi ≤ (δ/4n)2. Lemma A.6 shows how to obtain an approximate

solution of the original linear program with the guarantee needed using the x and s we just found.Since t is decreased by 1− ε

3√nfactor each iteration, it takes O(

√nε · log(nδ )) iterations in total.

In Lemma 4.16, we proved that each iteration takes

n1+a+o(1) + ε · (nω−1/2+o(1) + n2−a/2+o(1)).

and hence the total runtime is

O(n2.5−a/2+o(1) + nω+o(1) +n1.5+a+o(1)

ε) · log(

n

δ).

Since ε = Θ( 1logn), the total runtime is

O(n2.5−a/2+o(1) + nω+o(1) + n1.5+a+o(1)) · log(n

δ).

Finally, we note that the optimal choice of a is min(23 , α), which gives the promised runtime.

Using the same proof, but different choice of the parameters, we can analyze the ultra short stepstochastic central path method, where each step involves sampling only polylogarithmic coordinates.As we mentioned before, the runtime is still around nω.

Corollary 4.17. Under the same assumption as Theorem 2.1, if we choose ε = Θ(1/√n) and

a = min(13 , α), the expected time of Main (Algorithm 2) is(

nω+o(1) + n2.5−α/2+o(1) + n2+1/3+o(1))· log(

n

δ).

23

Page 25: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

5 Projection Maintenance

The goal of this section is to prove the following theorem:

Theorem 5.1 (Projection maintenance). Given a full rank matrix A ∈ Rd×n with n ≥ d and atolerance parameter 0 < εmp < 1/4. Given any positive number a such that a ≤ α where α is thedual exponent of matrix multiplication. There is a deterministic data structure (Algorithm 3) thatapproximately maintains the projection matrices

√WA>(AWA>)−1A

√W

for positive diagonal matrices W through the following two operations:1. Update(w): Output a vector v such that for all i,

(1− εmp)vi ≤ wi ≤ (1 + εmp)vi.

2. Query(h): Output√V A>(AV A>)−1A

√V h for the v outputted by the last call to Update.

The data structure takes n2dω−2 time to initialize and each call of Query(h) takes time

n · ‖h‖0 + n1+a+o(1).

Furthermore, if the initial vector w(0) and the (random) update sequence w(1), · · · , w(T ) ∈ Rn satis-fies

n∑i=1

(E[lnw

(k+1)i ]− lnw

(k)i

)2≤ C2

1 andn∑i=1

(Var[lnw(k+1)i ])2 ≤ C2

2

with the expectation and variance is conditional on w(k)i for all k = 0, 1, · · · , T − 1. Then, the

amortized expected time7 per call of Update(w) is

(C1/εmp + C2/ε2mp) · (nω−1/2+o(1) + n2−a/2+o(1)).

Remark 5.2. For our linear program algorithm, we have C1 = O(1/ log n), C2 = O(1/ log n) andεmp = Θ(1). See Lemma 4.15.

5.1 Proof outline

For intuition, we consider the case C1 = Θ(1), C2 = Θ(1), and εmp = Θ(1) in this explanation.The correctness of the data structure (Algorithm 3) directly follows from Woodbury matrix identity.(The update rule in Line 34 correctly maintainsM = A>(AV A>)−1A). The amortized time analysisis based on a potential function that measures the distance of the approximate vector v and thetarget vector w. We will show that

• The cost to update the projection M is proportional to the decrease of the potential.• Each call to query increase the potential by a fixed amount.

Combining both together gives the amortized runtime bound of our data structure.7If the input is deterministic, so is the output and the runtime.

24

Page 26: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Algorithm 3 Projection Maintenance Data Structure1: datastructure MaintainProjection . Theorem 5.12:3: members4: w ∈ Rn . Target vector5: v, v ∈ Rn . Approximate vectors v ≈εmp

w and v ≈εmpw

6: A ∈ Rd×n7: M ∈ Rn×n . Matrix M = A>(AV A>)−1A8: εmp ∈ (0, 1/4) . Tolerance9: a ∈ (0, α] . Batch Size na for Update10: end members11:12: procedure Initialize(A,w, εmp, a) . Lemma 5.313: w ← w, v ← w, εmp ← εmp, A← A, a← a14: M ← A>(AV A>)−1A15: end procedure16:17: procedure Update(wnew) . Lemma 5.418: yi ← lnwnew

i − ln vi, ∀i ∈ [n]19: r ← the number of indices i such that |yi| ≥ εmp/2.20: if r < na then21: vnew ← v22: Mnew ←M23: else24: Let π : [n]→ [n] be a sorting permutation such that |yπ(i)| ≥ |yπ(i+1)|25: while 1.5 · r < n and |yπ(d1.5·re)| ≥ (1− 1/ log n)|yπ(r)| do26: r ← min(d1.5 · re, n)27: end while

28: vnewπ(i) ←wnewπ(i) i ∈ 1, 2, · · · , r

vπ(i) i ∈ r + 1, · · · , n29: . Compute Mnew = A>(AV newA>)−1A via Woodbury matrix identity30: ∆← diag(vnew − v) . ∆ ∈ Rn×n and ‖∆‖0 = r31: Let S ← π([r]) be the first r indices in the permutation.32: Let MS ∈ Rn×r be the r columns from S of M .33: Let MS,S ,∆S,S ∈ Rr×r be the r rows and columns from S of M and ∆.34: Mnew ←M −MS · (∆−1S,S +MS,S)−1 · (MS)>

35: end if36: w ← wnew, v ← vnew, M ←Mnew

37: vi ←vi if | lnwi − ln vi| < εmp/2

wi otherwise38: return v39: end procedure40:41: procedure Query(h) . Lemma 5.542: Let S be the indices i such that | lnwi − ln vi| ≥ εmp/2.43: return

√V · (M · (

√V · h))−

√V · (MS · ((∆−1S,S +MS,S)−1 · (M>

S

√V h)))

44: end procedure45:46: end datastructure

Now, we explain the definition of the potential. Consider the k-th round of the algorithm. Forall i ∈ [n], we define x(k)

i = lnw(k)i − ln v

(k)i . Note that |x(k)

i | measures the relative distance between

25

Page 27: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

w(k)i and v

(k)i . Our algorithm fixes the indices with largest error x(k)

i . To capture the fact thatupdating in a larger batch is more efficient, we define the potential as a weighted combination ofthe error where we put more weight to higher x(k)

i . Formally, we sort the coordinates of x(k) suchthat |x(k)

i | ≥ |x(k)i+1| and define the potential by

Ψk =n∑i=1

gi · ψ(x(k)i ).

where gi are positive decreasing numbers to be chosen and ψ is a symmetric (ψ(x) = ψ(−x)) positivefunction that increases on both sides. For intuition, one can think ψ(x) behaves roughly like |x|.

Each iteration we update the projection matrix such that the error of |x1|, · · · , |xr| drops fromroughly εmp to 0. This decreases the potential of ψ(x

(k)i ) by Ω(εmp) from i = 1, · · · , r. Therefore,

the whole potential decreases by Ω(εmp∑r

i=1 gi). To make the term∑r

i=1 gi proportional to thetime to update a rank r part of the projection matrix, we set

gi =

n−a, if i < na;

iω−21−a−1n−

a(ω−2)1−a , otherwise.

(20)

where ω is the exponent of matrix multiplication and a is any positive number less than orequals to the dual exponent of matrix multiplication. Lemma A.4 shows that g is indeed non-increasing and Lemma 5.4 shows that the update time of data-structure is indeed O(rgrn

2+o(1)) =O(∑r

i=1 gin2+o(1)) for any r ≥ na.

Each call to Update, the expectation of the error vector x(k) moves roughly in an unit `2 ball.Therefore, the changes of the potential is roughly upper bounded (

∑ni=1 g

2i )

1/2 ≈ nω−5/2. Since ittakes us n2+o(1) time to decrease the potential by roughly 1 in the update step, the total time isroughly nω−1/2.

For the case of stochastic central path, we note that the variance of the vector x is quite small.By choosing a smooth potential function ψ (see (21)), we can essentially give the same result as ifthere is no variance.

5.2 Proof of Theorem 5.1

Now, we give the proof of Theorem 5.1. We will defer some simple calculations into later sections.

Proof of Theorem 5.1.Proof of Correctness. The definition of v in Line 37 ensures that (1−εmp)vi ≤ wi ≤ (1+εmp)vi.Using the Woodbury matrix identity, one can verify that the update rule in Line 34 correctly

maintains M = A>(AV A>)−1A. See the deviation of the formula in Lemma 5.3. By the samereasoning, the Line 43 outputs the vector

√V A>(AV A>)−1A

√V h. This completes the proof of

correctness.Definition of x and y. Consider the k-th round of the algorithm. For all i ∈ [n], we define

x(k)i , x(k+1)

i and y(k)i as follows:

x(k)i = lnw

(k)i − ln v

(k)i , y

(k)i = lnw

(k+1)i − ln v

(k)i , x

(k+1)i = lnw

(k+1)i − ln v

(k+1)i .

Note that the difference between x(k)i and y(k)

i is that w is changing. The difference between y(k)i

and x(k+1)i is that v is changing.

26

Page 28: arXiv:1810.07896v3 [cs.DS] 20 Oct 2020

Assume sorting. Assume the coordinates of vector x(k) ∈ Rn are sorted such that |x(k)i | ≥

|x(k)i+1|. Let τ and π are permutations such that |x(k+1)

τ(i) | ≥ |x(k+1)τ(i+1)| and |y

(k)π(i)| ≥ |y

(k)π(i+1)|.

Definition of Potential function. Let g be defined as in (20). Let ψ : R → R be defined by

    ψ(x) = |x|^2 / ε_mp,                        |x| ∈ [0, ε_mp/2];
    ψ(x) = ε_mp/2 − (ε_mp − |x|)^2 / ε_mp,      |x| ∈ (ε_mp/2, ε_mp];
    ψ(x) = ε_mp/2,                              |x| ∈ (ε_mp, +∞).                  (21)

We define the potential at the k-th round by

    Ψ_k = Σ_{i=1}^n g_i · ψ(x^(k)_{τ_k(i)}),

where τ_k is the permutation such that |x^(k)_{τ_k(i)}| ≥ |x^(k)_{τ_k(i+1)}|.
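
As a quick illustration of (21) and of the potential Ψ_k, here is a small Python sketch (an illustrative reimplementation under the stated definitions, not the paper's code); eps plays the role of ε_mp and the weight vector g is assumed to come from (20).

import numpy as np

def psi(x, eps):
    """Smoothed |x|-like potential from (21): quadratic near 0, capped at eps/2."""
    ax = np.abs(x)
    out = np.where(ax <= eps / 2, ax**2 / eps, eps / 2 - (eps - ax)**2 / eps)
    return np.where(ax > eps, eps / 2, out)

def potential(x, g, eps):
    """Psi_k = sum_i g_i * psi(x_{tau_k(i)}), with the coordinates sorted by decreasing |x_i|."""
    order = np.argsort(-np.abs(x))      # tau_k: largest |x_i| first
    return float(np.sum(g * psi(x[order], eps)))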

Bounding the potential. We can express Ψ_{k+1} − Ψ_k as follows:

    Ψ_{k+1} − Ψ_k = Σ_{i=1}^n g_i · ( ψ(x^(k+1)_{τ(i)}) − ψ(x^(k)_i) )
                  = Σ_{i=1}^n g_i · ( ψ(y^(k)_{π(i)}) − ψ(x^(k)_i) )        [w move]
                    − Σ_{i=1}^n g_i · ( ψ(y^(k)_{π(i)}) − ψ(x^(k+1)_{τ(i)}) )   [v move].      (22)

Now, using Lemmas 5.6 and 5.9, the fact that Ψ_0 = 0 and Ψ_T ≥ 0, and (22), we get

    0 ≤ Ψ_T − Ψ_0 = Σ_{k=0}^{T−1} (Ψ_{k+1} − Ψ_k)
      ≤ Σ_{k=0}^{T−1} ( O(C_1 + C_2/ε_mp) · √(log n) · (n^{−a/2} + n^{ω−5/2}) − Ω(ε_mp r_k g_{r_k} / log n) )
      = T · O(C_1 + C_2/ε_mp) · √(log n) · (n^{−a/2} + n^{ω−5/2}) − Σ_{k=1}^{T} Ω(ε_mp r_k g_{r_k} / log n),

where the third step follows from Lemma 5.6 and Lemma 5.9, and r_k is the number of coordinates we update during iteration k.

Therefore, we get

    Σ_{k=1}^{T} r_k g_{r_k} = O( T · (C_1/ε_mp + C_2/ε_mp^2) · log^{3/2} n · (n^{ω−5/2} + n^{−a/2}) ).

Proof of running time. See Section 5.3.

5.3 Initialization time, update time, query time

To formalize the amortized runtime proof, we first analyze the initialization time (Lemma 5.3), update time (Lemma 5.4), and query time (Lemma 5.5) of our projection maintenance data structure.

Lemma 5.3 (Initialization time). The initialization time of the data structure MaintainProjection (Algorithm 3) is O(n^2 d^{ω−2}).

Proof. Given a matrix A ∈ R^{d×n} and a diagonal matrix V ∈ R^{n×n}, computing A^⊤(AVA^⊤)^{-1}A takes O(n^2 d^{ω−2}) time.

Lemma 5.4 (Update time). The update time of the data structure MaintainProjection (Algorithm 3) is O(r g_r n^{2+o(1)}), where r is the number of indices we updated in v.

Proof. Let A_S ∈ R^{d×r} denote the r columns of A indexed by S. From the k-th query to the (k+1)-th query, we have

    A^⊤(AV^(k+1)A^⊤)^{-1}A
      = A^⊤(A(V^(k) + ∆)A^⊤)^{-1}A
      = A^⊤( (AV^(k)A^⊤)^{-1} − (AV^(k)A^⊤)^{-1}A_S(∆^{-1}_{S,S} + A_S^⊤(AV^(k)A^⊤)^{-1}A_S)^{-1}A_S^⊤(AV^(k)A^⊤)^{-1} )A
      = A^⊤(AV^(k)A^⊤)^{-1}A − A^⊤(AV^(k)A^⊤)^{-1}A_S(∆^{-1}_{S,S} + A_S^⊤(AV^(k)A^⊤)^{-1}A_S)^{-1}A_S^⊤(AV^(k)A^⊤)^{-1}A
      = M^(k) − M^(k)_S(∆^{-1}_{S,S} + M^(k)_{S,S})^{-1}(M^(k)_S)^⊤,

where the second step follows from the Woodbury matrix identity and the last step follows from the definition of M^(k) ∈ R^{n×n}.

Thus the update rule for the matrix M^(k+1) ∈ R^{n×n} can be written as

    M^(k+1) = M^(k) − M^(k)_S(∆^{-1}_{S,S} + M^(k)_{S,S})^{-1}(M^(k)_S)^⊤.

The update in round k can be split into four parts:
1. Adding two r × r matrices takes O(r^2) time.
2. Computing the inverse of an r × r matrix takes O(r^{ω+o(1)}) time.
3. Computing the product of an n × r and an r × n matrix takes O(r g_r · n^{2+o(1)}) time, where we used that r ≥ n^a (Lemma 2.3).
4. Adding two n × n matrices takes O(n^2) time.

Hence, the total cost is

    O(r^2 + r^{ω+o(1)} + r g_r · n^{2+o(1)} + n^2) = O(r^2 + r^{ω+o(1)} + r g_r · n^{2+o(1)}) = O(r g_r · n^{2+o(1)}),

where the first step uses that r g_r ≥ 1 for all r ≥ n^a.
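
To make the update rule above concrete, here is a short NumPy sketch of the rank-r Woodbury update (an illustration, not the paper's implementation); the names M, A, V_diag, delta_diag and S are assumptions of this sketch.

import numpy as np

def woodbury_update(M, V_diag, delta_diag, S):
    """Given M = A^T (A V A^T)^{-1} A, return the same matrix for V + Delta,
    where Delta is diagonal and supported on the index set S (|S| = r)."""
    M_S = M[:, S]                                   # n x r block of M
    M_SS = M[np.ix_(S, S)]                          # r x r block of M
    inv_delta = np.diag(1.0 / delta_diag[S])        # Delta_{S,S}^{-1}
    core = np.linalg.inv(inv_delta + M_SS)          # r x r inverse, O(r^omega)
    return M - M_S @ core @ M_S.T                   # rank-r correction, dominates the cost

# Sanity check against recomputing from scratch on a small random instance.
rng = np.random.default_rng(0)
d, n, r = 5, 12, 3
A = rng.standard_normal((d, n))
V = rng.uniform(1.0, 2.0, n)
M = A.T @ np.linalg.inv(A @ np.diag(V) @ A.T) @ A
S = np.array([1, 4, 7])
delta = np.zeros(n); delta[S] = rng.uniform(0.1, 0.5, r)
assert np.allclose(woodbury_update(M, V, delta, S),
                   A.T @ np.linalg.inv(A @ np.diag(V + delta) @ A.T) @ A)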

Lemma 5.5 (Query time). The query time of the data structure MaintainProjection (Algorithm 3) is O(n · ‖h‖_0 + n^{1+a+o(1)}).

Proof. Let ∆ be the diagonal matrix satisfying V = V' + ∆, where V' denotes the diagonal matrix for which M is currently maintained (so that M = A^⊤(AV'A^⊤)^{-1}A). Let S ⊂ [n] denote the support of ∆; then |S| ≤ n^a. Let r denote |S|. We abuse notation and let ∆ denote both the n × n diagonal matrix and the corresponding length-n vector.

Using the Woodbury matrix identity and the definition of M, the same derivation as for the update time (Lemma 5.4) shows that

    A^⊤(AVA^⊤)^{-1}A = M − M_S(∆^{-1}_{S,S} + M_{S,S})^{-1}M_S^⊤,

where ∆_{S,S} has size r × r, M_{S,S} has size r × r and M_S has size n × r.

To compute √V A^⊤(AVA^⊤)^{-1}A √V h, we just need to compute

    √V M √V h − √V M_S(∆^{-1}_{S,S} + M_{S,S})^{-1}M_S^⊤ √V h.

Note that computing the first term of the above expression only takes O(n · ‖h‖_0) time.

Next, we analyze the cost of computing the second term. It consists of several parts:
1. Computing M_S^⊤ · (√V h) ∈ R^r takes O(r ‖h‖_0) time.
2. Computing (∆^{-1}_{S,S} + M_{S,S})^{-1} ∈ R^{r×r}, the inverse of an r × r matrix, takes r^{ω+o(1)} time.
3. Computing the matrix-vector product of the r × r matrix (∆^{-1}_{S,S} + M_{S,S})^{-1} and the r × 1 vector M_S^⊤ √V h takes O(r^2) time.
4. Computing the matrix-vector product of the n × r matrix M_S and the r × 1 vector (∆^{-1}_{S,S} + M_{S,S})^{-1}M_S^⊤ √V h takes O(nr) time.
5. Computing the entry-wise product of two length-n vectors takes O(n) time.

Thus, overall the running time is

    O(r ‖h‖_0 + r^{ω+o(1)} + r^2 + nr + n) = O(r^{ω+o(1)} + nr) = O(n^{a·ω+o(1)} + n^{1+a}).

Finally, we note that ω ≤ 3 − α ≤ 3 − a (Lemma A.4), and hence a·ω ≤ a(3 − a) ≤ 1 + a. Therefore, the runtime is n^{1+a+o(1)}.
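
The cost breakdown above corresponds to evaluating the correction term from right to left, never forming an n × n product. A small NumPy sketch (illustrative only; M, S, delta, v and h are assumed inputs, with M maintained for the diagonal matrix diag(v) − diag(delta), where delta is supported on S):

import numpy as np

def query(M, S, delta, v, h):
    """Return sqrt(V) A^T (A V A^T)^{-1} A sqrt(V) h using the maintained M
    plus a rank-|S| Woodbury correction, evaluated right to left."""
    sv = np.sqrt(v)
    z = sv * h                                    # sqrt(V) h
    first = sv * (M @ z)                          # sqrt(V) M sqrt(V) h (dense matvec here;
                                                  #   O(n * nnz(h)) if the sparsity of h is exploited)
    t = M[:, S].T @ z                             # M_S^T sqrt(V) h, an r-vector
    core = np.linalg.inv(np.diag(1.0 / delta[S]) + M[np.ix_(S, S)])   # r x r inverse
    second = sv * (M[:, S] @ (core @ t))          # n x r matrix times an r-vector
    return first - second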

5.4 Bounding w move

The goal of this section is to prove Lemma 5.6.

Lemma 5.6 (w move). We have

    Σ_{i=1}^n g_i · E[ψ(y^(k)_{π(i)}) − ψ(x^(k)_i)] ≤ O(C_1 + C_2/ε_mp) · √(log n) · (n^{−a/2} + n^{ω−5/2}).

Proof. Observe that since the errors |x^(k)_i| are sorted in descending order and ψ(x) is a symmetric function that is non-decreasing for x ≥ 0, the values ψ(x^(k)_i) are also in decreasing order. In addition, since g is decreasing, we have

    Σ_{i=1}^n g_i ψ(x^(k)_{π(i)}) ≤ Σ_{i=1}^n g_i ψ(x^(k)_i).                 (23)

Hence the first term in (22) can be upper bounded as follows:

    E[ Σ_{i=1}^n g_i · (ψ(y^(k)_{π(i)}) − ψ(x^(k)_i)) ] ≤ E[ Σ_{i=1}^n g_i · (ψ(y^(k)_{π(i)}) − ψ(x^(k)_{π(i)})) ]      by (23)
      = Σ_{i=1}^n g_i · E[ψ(y^(k)_{π(i)}) − ψ(x^(k)_{π(i)})]
      = O(C_1 + C_2/ε_mp) · √(log n) · (n^{−a/2} + n^{ω−5/2}).                  by Lemma 5.7

Thus, we complete the proof of the w move lemma.


It remains to prove the following lemma.

Lemma 5.7. We have

    Σ_{i=1}^n g_i · E[ψ(y^(k)_{π(i)}) − ψ(x^(k)_{π(i)})] = O(C_1 + C_2/ε_mp) · √(log n) · (n^{−a/2} + n^{ω−5/2}).

Proof. We separate the sum into two terms:

    Σ_{i=1}^n g_i · E[ψ(y^(k)_{π(i)}) − ψ(x^(k)_{π(i)})]
      = Σ_{i=1}^n g_{π^{-1}(i)} · E[ψ(y^(k)_i) − ψ(E[y^(k)_i])] + Σ_{i=1}^n g_{π^{-1}(i)} · (ψ(E[y^(k)_i]) − ψ(x^(k)_i)).

For the first term, the mean value theorem shows that

    ψ(y^(k)_i) − ψ(E[y^(k)_i]) = ψ'(E[y^(k)_i])(y^(k)_i − E[y^(k)_i]) + (1/2)ψ''(ζ)(y^(k)_i − E[y^(k)_i])^2
      ≤ ψ'(E[y^(k)_i])(ln w^(k+1)_i − E[ln w^(k+1)_i]) + (L_2/2)(ln w^(k+1)_i − E[ln w^(k+1)_i])^2,

where L_2 = max_x ψ''(x). Let γ_i = Var[ln w^(k+1)_i]. Summing over i and taking the conditional expectation given w^(k) on both sides, we get

    Σ_{i=1}^n g_{π^{-1}(i)} E[ψ(y^(k)_i) − ψ(E[y^(k)_i])]
      ≤ Σ_{i=1}^n g_{π^{-1}(i)} ψ'(E[y^(k)_i]) E[ln w^(k+1)_i − E[ln w^(k+1)_i]] + (L_2/2) Σ_{i=1}^n g_{π^{-1}(i)} γ_i
      = (L_2/2) · Σ_{i=1}^n g_{π^{-1}(i)} γ_i
      ≤ (L_2/2) · ‖g‖_2 · ( Σ_{i=1}^n γ_i^2 )^{1/2}
      ≤ (L_2/2) · C_2 · ‖g‖_2,

where the second step uses that E[ln w^(k+1)_i − E[ln w^(k+1)_i]] = 0 and the third step uses the Cauchy-Schwarz inequality.

For the second term, we define β_i = E[ln w^(k+1)_i] − ln w^(k)_i. The Lipschitz constant of ψ shows that

    Σ_{i=1}^n g_{π^{-1}(i)} (ψ(E[y^(k)_i]) − ψ(x^(k)_i)) ≤ L_1 · Σ_{i=1}^n g_{π^{-1}(i)} |E[y^(k)_i] − x^(k)_i|
      = L_1 · Σ_{i=1}^n g_{π^{-1}(i)} |β_i|
      ≤ L_1 · C_1 · ‖g‖_2,

where the last step uses the Cauchy-Schwarz inequality and Σ_{i=1}^n β_i^2 ≤ C_1^2.

Now, combining both terms and using that L_1 = O(1) and L_2 = O(1/ε_mp) (from part 4 of Lemma 5.10) and that ‖g‖_2 ≤ √(log n) · O(n^{−a/2} + n^{ω−5/2}) (from Lemma 5.8), we have

    Σ_{i=1}^n g_i · E[ψ(y^(k)_{π(i)}) − ψ(x^(k)_{π(i)})] ≤ O(C_1 + C_2/ε_mp) · √(log n) · (n^{−a/2} + n^{ω−5/2}).


Lemma 5.8. We have

    ( Σ_{i=1}^n g_i^2 )^{1/2} ≤ √(log n) · O(n^{−a/2} + n^{ω−5/2}).

Proof. Since the function g behaves differently for i ≤ n^a and for i > n^a, we split the sum into two parts.

For the first part, we have

    Σ_{i=1}^{n^a} g_i^2 = Σ_{i=1}^{n^a} n^{−2a} = n^{−a}.

For the second part, we have

    Σ_{i=n^a}^{n} g_i^2 = Σ_{i=n^a}^{n} i^{2(ω−2)/(1−a) − 2} n^{−2a(ω−2)/(1−a)} = Σ_{i=n^a}^{n} (1/i) · i^{2(ω−2)/(1−a) − 1} n^{−2a(ω−2)/(1−a)}.

Note that

    max_{i ∈ [n^a, n]} i^{2(ω−2)/(1−a) − 1} n^{−2a(ω−2)/(1−a)} = max( n^{2a(ω−2)/(1−a) − a} n^{−2a(ω−2)/(1−a)}, n^{2(ω−2)/(1−a) − 1} n^{−2a(ω−2)/(1−a)} ) = max(n^{−a}, n^{2ω−5}).

Thus, the second part is

    Σ_{i=n^a}^{n} g_i^2 ≤ Σ_{i=n^a}^{n} (1/i) · max(n^{−a}, n^{2ω−5}) = O(log n) · max(n^{−a}, n^{2ω−5}).

Combining the first part and the second part completes the proof.
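
As a quick numerical sanity check of Lemma 5.8 (illustrative only; the constants omega = 2.373 and a = 0.31 below are assumed values), one can compare ‖g‖_2 with √(log n) · (n^{−a/2} + n^{ω−5/2}) in Python:

import numpy as np

def g_norm_check(n, omega=2.373, a=0.31):
    """Compare ||g||_2 with sqrt(log n) * (n^{-a/2} + n^{omega-5/2}) as in Lemma 5.8."""
    i = np.arange(1, n + 1, dtype=float)
    g = np.where(i < n**a, n**(-a),
                 i**((omega - 2) / (1 - a) - 1) * n**(-a * (omega - 2) / (1 - a)))
    lhs = np.sqrt(np.sum(g**2))
    rhs = np.sqrt(np.log(n)) * (n**(-a / 2) + n**(omega - 2.5))
    return lhs, rhs

for n in [10**3, 10**4, 10**5]:
    print(n, g_norm_check(n))    # lhs stays below a small constant times rhs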

5.5 Bounding v move

Lemma 5.9 (v move). We have

    Σ_{i=1}^n g_i · ( ψ(y^(k)_{π(i)}) − ψ(x^(k+1)_{τ(i)}) ) ≥ Ω(ε_mp r_k g_{r_k} / log n).

Proof. We first establish some simple facts that are useful later in the proof. From the definition of x^(k+1)_i, we know that x^(k+1) has r_k coordinates equal to 0 and hence ‖y^(k) − x^(k+1)‖_0 = r_k. The difference between these two vectors is that the largest r_k coordinates of y^(k) are set to zero in x^(k+1). Then for each i ∈ [n − r_k], x^(k+1)_{τ(i)} = y^(k)_{π(i+r_k)}. For convenience, we define y^(k)_{π(n+i)} = 0 for all i ∈ [r_k].

We split the proof into two cases.

Case 1. We exit the while loop when 1.5 r_k ≥ n.

Let u* denote the largest u such that |y^(k)_{π(u)}| ≥ ε_mp/2. If u* = r_k, we have |y^(k)_{π(r_k)}| ≥ ε_mp/2 ≥ ε_mp/100. Otherwise, the condition of the loop shows that

    |y^(k)_{π(r_k)}| ≥ (1 − 1/log n)^{log_{1.5} r_k − log_{1.5} u*} |y^(k)_{π(u*)}| ≥ (1 − 1/log n)^{log_{1.5} n} ε_mp/2 ≥ ε_mp/100,

where we used that n ≥ 4.


According to the definition of x^(k+1)_{τ(i)}, we have

    Σ_{i=1}^n g_i (ψ(y^(k)_{π(i)}) − ψ(x^(k+1)_{τ(i)})) = Σ_{i=1}^n g_i (ψ(y^(k)_{π(i)}) − ψ(y^(k)_{π(i+r_k)})) ≥ Σ_{i=n/3+1}^n g_i (ψ(y^(k)_{π(i)}) − ψ(y^(k)_{π(i+r_k)}))
      ≥ Σ_{i=n/3+1}^n g_i ψ(y^(k)_{π(i)}) ≥ Σ_{i=n/3+1}^{2n/3} g_i ψ(ε_mp/100) ≥ Ω(r_k g_{r_k} ε_mp),

where the first step follows from x^(k+1)_{τ(i)} = y^(k)_{π(i+r_k)}, the second step follows from the facts that ψ(|x|) is non-decreasing (part 2 of Lemma 5.10) and |y^(k)_{π(i)}| is non-increasing, the third step follows from 1.5 r_k ≥ n and hence ψ(y^(k)_{π(i+r_k)}) = 0 for i ≥ n/3 + 1, the fourth step follows from the facts that ψ is non-decreasing and |y^(k)_{π(i)}| ≥ |y^(k)_{π(r_k)}| ≥ ε_mp/100 for all i ≤ 2n/3, and the last step follows from the fact that g is decreasing and from part 3 of Lemma 5.10.

Case 2. We exit the while loop when 1.5 r_k < n and |y^(k)_{π(1.5 r_k)}| < (1 − 1/log n)|y^(k)_{π(r_k)}|.

By the same argument as in Case 1, we have that |y^(k)_{π(r_k)}| ≥ ε_mp/100. Part 3 of Lemma 5.10, together with the fact that

    |y^(k)_{π(1.5 r_k)}| < min( ε_mp/2, |y^(k)_{π(r_k)}| · (1 − 1/log n) ),

shows that

    ψ(|y^(k)_{π(r_k)}|) − ψ(|y^(k)_{π(1.5 r_k)}|) = Ω(ε_mp / log n).           (24)

Putting it all together, we have

    Σ_{i=1}^n g_i · (ψ(y^(k)_{π(i)}) − ψ(x^(k+1)_{τ(i)}))
      = Σ_{i=1}^n g_i · (ψ(y^(k)_{π(i)}) − ψ(y^(k)_{π(i+r_k)}))                by x^(k+1)_{τ(i)} = y^(k)_{π(i+r_k)}
      ≥ Σ_{i=r_k/2}^{r_k} g_i · (ψ(y^(k)_{π(i)}) − ψ(y^(k)_{π(i+r_k)}))        by ψ(y^(k)_{π(i)}) − ψ(y^(k)_{π(i+r_k)}) ≥ 0
      ≥ Σ_{i=r_k/2}^{r_k} g_i · (ψ(y^(k)_{π(r_k)}) − ψ(y^(k)_{π(1.5 r_k)}))
      ≥ Σ_{i=r_k/2}^{r_k} g_i · Ω(ε_mp / log n)                                by (24)
      ≥ Σ_{i=r_k/2}^{r_k} g_{r_k} · Ω(ε_mp / log n)                            by g_i being decreasing
      = Ω(ε_mp r_k g_{r_k} / log n),

where the third step follows from the facts that |y^(k)_{π(i)}| is decreasing and ψ is non-decreasing (part 2 of Lemma 5.10).


Figure 2: ψ(x), ψ'(x) and ψ''(x), for ε_mp ∈ (0, 1). [Plots omitted in this text version: ψ rises quadratically to its cap ε_mp/2, ψ' ranges over [−1, 1], and ψ'' takes the values ±2/ε_mp and 0.]

5.6 Potential function ψ

Lemma 5.10 (Properties of the function ψ). Let the function ψ be defined as in (21). Then ψ satisfies the following properties:
1. Symmetric (ψ(−x) = ψ(x)) and ψ(0) = 0;
2. ψ(|x|) is non-decreasing;
3. |ψ'(x)| = Ω(1), for all |x| ∈ [0.01 ε_mp, ε_mp];
4. L_1 := max_x ψ'(x) = 1 and L_2 := max_x ψ''(x) = 2/ε_mp.

Proof. A direct computation gives

    ψ'(x) = 2x/ε_mp,                       |x| ∈ [0, ε_mp/2];
    ψ'(x) = 2 sign(x)(ε_mp − |x|)/ε_mp,    |x| ∈ (ε_mp/2, ε_mp];
    ψ'(x) = 0,                             |x| ∈ (ε_mp, +∞);

and

    ψ''(x) = 2/ε_mp,     |x| ∈ [0, ε_mp/2];
    ψ''(x) = −2/ε_mp,    |x| ∈ (ε_mp/2, ε_mp];
    ψ''(x) = 0,          |x| ∈ (ε_mp, +∞).

From ψ'(x) and ψ''(x), it is straightforward to verify the stated properties.
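
A quick numerical check of part 4 via finite differences (illustrative only; psi is a reimplementation of (21) and eps stands for ε_mp):

import numpy as np

def psi(x, eps):
    ax = np.abs(x)
    out = np.where(ax <= eps / 2, ax**2 / eps, eps / 2 - (eps - ax)**2 / eps)
    return np.where(ax > eps, eps / 2, out)

eps, h = 0.1, 1e-6
xs = np.linspace(-2 * eps, 2 * eps, 20001)
d1 = (psi(xs + h, eps) - psi(xs - h, eps)) / (2 * h)                    # central difference for psi'
d2 = (psi(xs + h, eps) - 2 * psi(xs, eps) + psi(xs - h, eps)) / h**2    # central difference for psi''
print(d1.max())   # approximately 1        (= L_1)
print(d2.max())   # approximately 2 / eps  (= L_2)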

Acknowledgement

This work was supported in part by NSF Awards CCF-1740551, CCF-1749609, and DMS-1839116. We thank Sébastien Bubeck and Aaron Sidford for helpful discussions. We thank Rasmus Kyng for bringing up the question and providing some fixes in the proof of projection maintenance. We thank Josh Alman for some useful discussions about matrix multiplication. We thank Swati Padmanabhan for writing suggestions. We thank Eric Price for the suggestion of the title of this paper. We thank Shunhua Jiang and Hengjie Zhang for drawing several beautiful pictures. Finally, we thank the anonymous STOC and JACM reviewers for their detailed feedback.

References

[AFLG15] Andris Ambainis, Yuval Filmus, and François Le Gall. Fast matrix multiplication: limitations of the Coppersmith-Winograd method. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (STOC), pages 585–593. ACM, 2015.

[Alm19] Josh Alman. Limits on the universal method for matrix multiplication. In 34th Computational Complexity Conference (CCC), pages 12:1–12:24, 2019.

[AW18] Josh Alman and Virginia Vassilevska Williams. Limits on all known (and some unknown) approaches to matrix multiplication. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2018.


[BCLL18] Sébastien Bubeck, Michael B Cohen, Yin Tat Lee, and Yuanzhi Li. An homotopy method for ℓp regression provably beyond self-concordance and in input-sparsity time. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1130–1137. ACM, 2018.

[BLSS20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In 52nd Annual ACM Symposium on Theory of Computing (STOC), pages 775–788, 2020.

[BPSW20] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (over-parametrized) neural networks in near-linear time. arXiv preprint arXiv:2006.11648, 2020.

[Bra20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259–278. SIAM, 2020.

[CGLZ20] Matthias Christandl, François Le Gall, Vladimir Lysikov, and Jeroen Zuiddam. Barriers for rectangular matrix multiplication. In 35th Computational Complexity Conference (CCC). https://arXiv.org/pdf/2003.03019.pdf, 2020.

[CKK+18] Michael B Cohen, Jonathan Kelner, Rasmus Kyng, John Peebles, Richard Peng, Anup B Rao, and Aaron Sidford. Solving directed Laplacian systems in nearly-linear time through sparse LU factorizations. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 898–909. IEEE, 2018.

[CKL18] Diptarka Chakraborty, Lior Kamma, and Kasper Green Larsen. Tight cell probe bounds for succinct boolean matrix-vector multiplication. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1297–1306, 2018.

[CKM+11] Paul Christiano, Jonathan A Kelner, Aleksander Madry, Daniel A Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the forty-third annual ACM symposium on Theory of computing (STOC), pages 273–282. ACM, 2011.

[CKM+14] Michael B Cohen, Rasmus Kyng, Gary L Miller, Jakub W Pachocki, Richard Peng, Anup B Rao, and Shen Chen Xu. Solving SDD linear systems in nearly m log^{1/2} n time. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC), pages 343–352. ACM, 2014.

[CLM+16] Michael B Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 9–21. ACM, 2016.

[CLP+20] Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, and Christopher Re. MONGOOSE: A learnable LSH framework for efficient neural network training. In OpenReview.net. https://openreview.net/forum?id=wWK7yXkULyh, 2020.

[CMSV17] Michael B Cohen, Aleksander Mądry, Piotr Sankowski, and Adrian Vladu. Negative-weight shortest paths and unit capacity minimum cost flow in Õ(m^{10/7} log W) time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 752–771. SIAM, 2017.


[CMTV17] Michael B Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 902–913. IEEE, 2017.

[CW87] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the nineteenth annual ACM symposium on Theory of computing (STOC), pages 1–6. ACM, 1987.

[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.

[DS13] Alexander Munro Davie and Andrew James Stothers. Improved bound for complexity of matrix multiplication. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 143(2):351–369, 2013.

[HKNS15] Monika Henzinger, Sebastian Krinninger, Danupon Nanongkai, and Thatchaphol Saranurak. Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (STOC), pages 21–30, 2015.

[JKL+20] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS). https://arxiv.org/pdf/2009.10217.pdf, 2020.

[JLSW20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In 52nd Annual ACM Symposium on Theory of Computing (STOC), pages 944–953. https://arxiv.org/pdf/2004.04250.pdf, 2020.

[JSWZ20] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster LPs. arXiv preprint. https://arxiv.org/2004.07470, 2020.

[Kar84] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the sixteenth annual ACM symposium on Theory of computing (STOC), pages 302–311. ACM, 1984.

[KLP+16] Rasmus Kyng, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Daniel A Spielman. Sparsified Cholesky and multigrid solvers for connection Laplacians. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 842–850. ACM, 2016.

[KMP10] Ioannis Koutis, Gary L Miller, and Richard Peng. Approaching optimality for solving SDD linear systems. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 235–244. IEEE, 2010.

[KMP11] Ioannis Koutis, Gary L Miller, and Richard Peng. A nearly-m log n time solver for SDD linear systems. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pages 590–598. IEEE, 2011.


[KOSZ13] Jonathan A Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 911–920. ACM, https://arxiv.org/pdf/1301.6628.pdf, 2013.

[KPSZ18] Rasmus Kyng, Richard Peng, Robert Schwieterman, and Peng Zhang. Incomplete nested dissection. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 404–417. ACM, 2018.

[KS16] Rasmus Kyng and Sushant Sachdeva. Approximate Gaussian elimination for Laplacians: fast, sparse, and simple. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 573–582. IEEE, 2016.

[LG14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation (ISSAC), pages 296–303. ACM, https://arxiv.org/pdf/1401.7714.pdf, 2014.

[LGU18] François Le Gall and Florent Urrutia. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1029–1046. SIAM, 2018.

[LS13] Yin Tat Lee and Aaron Sidford. Path finding I: Solving linear programs with Õ(√rank) linear system solves. arXiv preprint arXiv:1312.6677, 2013.

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 424–433. IEEE, 2014.

[LS15] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 230–249. IEEE, 2015.

[LSW15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 1049–1065. IEEE, 2015.

[LSZ19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory (COLT), pages 2140–2157. https://arxiv.org/pdf/1905.04447.pdf, 2019.

[LSZ20] S Cliff Liu, Zhao Song, and Hengjie Zhang. Breaking the n-pass barrier: A streaming algorithm for maximum weight bipartite matching. arXiv preprint arXiv:2009.06106, 2020.

[LW17] Kasper Green Larsen and Ryan Williams. Faster online matrix-vector multiplication. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2182–2189. SIAM, 2017.

[Mad13] Aleksander Madry. Navigating central path with electrical flows: From flows to matchings, and back. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 253–262. IEEE, 2013.


[Mad16] Aleksander Madry. Computing maximum flow with augmenting electrical flows. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 593–602. IEEE, 2016.

[Meg12] Nimrod Megiddo. Progress in Mathematical Programming: Interior-Point and Related Methods. Springer Science & Business Media, 2012.

[NN89] Yu Nesterov and Arkadi Nemirovsky. Self-concordant functions and polynomial-time methods in convex programming. Report, Central Economic and Mathematic Institute, USSR Acad. Sci, 1989.

[NN91] Yu Nesterov and Arkadi Nemirovsky. Acceleration and parallelization of the path-following interior point method for a linearly constrained convex quadratic problem. SIAM Journal on Optimization, 1(4):548–564, 1991.

[NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.

[NN13] Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.

[Pan84] Victor Pan. How to multiply matrices faster. Lecture Notes in Computer Science, 179, 1984.

[PRT02] Jiming Peng, Cornelis Roos, and Tamás Terlaky. Self-regular functions and new search directions for linear and semidefinite optimization. Mathematical Programming, 93(1):129–171, 2002.

[Ren88] James Renegar. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40(1-3):59–93, 1988.

[Ren01] James Renegar. A mathematical view of interior-point methods in convex optimization, volume 3. SIAM, 2001.

[RTV05] Cornelis Roos, Tamás Terlaky, and J-Ph Vial. Interior point methods for linear optimization. Springer Science & Business Media, 2005.

[Sar06] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In 47th Annual Symposium on Foundations of Computer Science (FOCS), pages 143–152. IEEE, 2006.

[SS11] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing (STOC), pages 81–90. ACM, 2004.

[SY20] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In OpenReview.net. https://openreview.net/forum?id=fGiKxvF-eub, 2020.


[Ter13] Tamás Terlaky. Interior point methods of mathematical programming, volume 5. Springer Science & Business Media, 2013.

[Vai87] Pravin M Vaidya. An algorithm for linear programming which requires O(((m+n)n^2 + (m+n)^{1.5}n)L) arithmetic operations. In Proceedings of the nineteenth annual ACM symposium on Theory of computing (STOC), pages 29–38. ACM, 1987.

[Vai89a] Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 338–343. IEEE, 1989.

[Vai89b] Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 332–337. IEEE, 1989.

[Wil12] Virginia Vassilevska Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 887–898. ACM, 2012.

[Woo50] Max A Woodbury. Inverting modified matrices. Memorandum Report, 42(106):336, 1950.

[Wri97] Stephen J Wright. Primal-dual interior-point methods, volume 54. SIAM, 1997.

[Ye97] Yinyu Ye. Interior point algorithms: theory and analysis. Springer, 1997.

[YTM94] Yinyu Ye, Michael J Todd, and Shinji Mizuno. An O(√n L)-iteration homogeneous and self-dual linear programming algorithm. Mathematics of Operations Research, 19(1):53–67, 1994.

A Appendix

Lemma A.1. Let x and y denote (possibly dependent) random variables such that |x| ≤ c_x and |y| ≤ c_y almost surely. Then, we have

    Var[xy] ≤ 2c_x^2 · Var[y] + 2c_y^2 · Var[x].

Proof. Recall that Var[xy] ≤ E[(xy − t)^2] for any scalar t. Hence,

    Var[xy] ≤ E[(xy − E[x]E[y])^2] = E[(xy − xE[y] + xE[y] − E[x]E[y])^2]
      ≤ 2E[(xy − xE[y])^2] + 2E[(xE[y] − E[x]E[y])^2]
      ≤ 2c_x^2 · Var[y] + 2c_y^2 · Var[x].
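
A quick Monte Carlo sanity check of Lemma A.1 in Python (illustrative only; the bounded, dependent pair (x, y) below is an arbitrary choice, not taken from the paper):

import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(-1.0, 1.0, 1_000_000)
x = np.clip(u + 0.1 * rng.standard_normal(u.size), -1.0, 1.0)   # |x| <= c_x = 1
y = np.clip(u**2, 0.0, 1.0)                                      # |y| <= c_y = 1, dependent on x
lhs = np.var(x * y)
rhs = 2 * 1.0**2 * np.var(y) + 2 * 1.0**2 * np.var(x)
print(lhs, rhs, lhs <= rhs)   # Var[xy] <= 2 c_x^2 Var[y] + 2 c_y^2 Var[x] holds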

Lemma A.2 ([Vai89b]). Given a matrix A ∈ R^{d×n} and vectors b ∈ R^d, c ∈ R^n, suppose x, s ∈ R^n and y ∈ R^d satisfy xs ≈_{0.1} t, Ax = b and A^⊤y + s = c for some t > 0. For any ε ∈ (0, 1/2], in O(n^{2.5} log(n/ε)) time, we can find vectors x^{new}, s^{new} ∈ R^n and y^{new} ∈ R^d such that

    ‖x^{new} s^{new} − t‖_2 ≤ ε,    Ax^{new} = b,    A^⊤y^{new} + s^{new} = c.


Remark A.3. Instead of using the algorithm in [Vai89b], one can also run our algorithm with k = n for O(√n log n) iterations. Since k = n, there is no randomness involved and hence Φ will decrease deterministically to O(n).

Lemma A.4. ω ≤ 3 − α.

Proof. We consider an n × n matrix A multiplied by another n × n matrix B. We split A into n^{1−α} fat matrices, each of size n^α × n. By the definition of the dual exponent α, multiplying each such n^α × n block by B takes n^{2+o(1)} time. Since ω is the best exponent of matrix multiplication, we therefore know that

    n^{ω+o(1)} ≤ n^{1−α} · n^{2+o(1)}.

Taking n → ∞, this implies ω ≤ 3 − α.

Note that the bound in Lemma A.4 can be improved to ω + (1/2)ωα ≤ 3 [CGLZ20] via tensor rank.

Lemma A.5 (Rectangular matrix multiplication). For any n ≥ r, multiplying an n × r matrix with an r × n matrix, or an n × n matrix with an n × r matrix, takes time

    n^{2+o(1)} + r^{(ω−2)/(1−α)} n^{2 − α(ω−2)/(1−α) + o(1)}.

Proof. The cost of multiplying an n × n and an n × r matrix is the same as that of multiplying an n × r and an r × n matrix [Pan84, page 51]. So, we focus on the latter case.

For the case r ≤ n^α, the bound follows from the rectangular matrix multiplication result in [LGU18]. For the case r ≥ n^α, we let k = (n/r)^{1/(1−α)}. We can view the problem as multiplying a k × k^α and a k^α × k block matrix, where each block has size (n/k) × (n/k). Therefore, the total cost is

    k^{2+o(1)} × (n/k)^{ω+o(1)} = r^{(ω−2)/(1−α)} n^{2 − α(ω−2)/(1−α) + o(1)}.
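
The exponent arithmetic in Lemma A.5 is easy to get wrong, so here is a tiny Python helper (illustrative only; the constants ω = 2.373 and α = 0.31 are assumed values) that reports the exponent of the cost of an n × r times r × n product when r = n^β:

def rect_mm_exponent(beta, omega=2.373, alpha=0.31):
    """Exponent of n in n^2 + r^{(w-2)/(1-a)} n^{2 - a(w-2)/(1-a)} with r = n^beta."""
    blocked = beta * (omega - 2) / (1 - alpha) + 2 - alpha * (omega - 2) / (1 - alpha)
    return max(2.0, blocked)

for beta in [0.1, 0.31, 0.5, 1.0]:
    print(beta, round(rect_mm_exponent(beta), 4))
# beta = 1 recovers exponent omega; beta <= alpha gives exponent 2.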

Lemma A.6. Let A ∈ R^{d×n}, b ∈ R^d and c ∈ R^n. For a matrix A, we define ‖A‖_1 to be Σ_{i,j} |A_{i,j}|. Consider a linear program min_{Ax=b, x≥0} c^⊤x with n variables and d constraints. Assume that:
1. Diameter of the polytope: for any x ≥ 0 with Ax = b, we have ‖x‖_∞ ≤ R.
2. Lipschitz constant of the linear program: ‖c‖_∞ ≤ L.

For any δ ∈ (0, 1], the modified linear program min_{Āx̄=b̄, x̄≥0} c̄^⊤x̄ with

    Ā = [ A        0   (1/R)b − A 1_n
          1_n^⊤    1   0              ] ∈ R^{(d+1)×(n+2)},
    b̄ = [ (1/R)b
          n + 1   ] ∈ R^{d+1},
    c̄ = [ (δ/L)·c
          0
          1       ] ∈ R^{n+2}

satisfies the following:

1. x̄ = [ 1_n ; 1 ; 1 ] ∈ R^{n+2}, ȳ = [ 0_d ; −1 ] ∈ R^{d+1} and s̄ = [ 1_n + (δ/L)·c ; 1 ; 1 ] ∈ R^{n+2} are feasible primal-dual vectors.

2. For any feasible primal-dual vectors (x̄, ȳ, s̄) ∈ R^{(n+2)×(d+1)×(n+2)} with duality gap ≤ δ^2, the vector x = R · x̄_{1:n} ∈ R^n (where x̄_{1:n} denotes the first n coordinates of x̄) is an approximate solution to the original linear program in the following sense:

    c^⊤x ≤ min_{Ax=b, x≥0} c^⊤x + LR · δ,
    ‖Ax − b‖_1 ≤ 4nδ · (R‖A‖_1 + ‖b‖_1),
    x ≥ 0.


Proof. Part 1. For the first result, straightforward calculations show that (x̄, ȳ, s̄) ∈ R^{(n+2)×(d+1)×(n+2)} are feasible, i.e.,

    Ā x̄ = [ A   0   (1/R)b − A 1_n ;  1_n^⊤  1  0 ] · [ 1_n ; 1 ; 1 ]
         = [ A 1_n + (1/R)b − A 1_n ;  n + 1 ]
         = [ (1/R)b ;  n + 1 ] = b̄

and

    Ā^⊤ ȳ + s̄ = [ A^⊤   1_n ;  0   1 ;  (1/R)b^⊤ − 1_n^⊤A^⊤   0 ] · [ 0_d ; −1 ] + [ 1_n + (δ/L)·c ; 1 ; 1 ]
               = [ −1_n ; −1 ; 0 ] + [ 1_n + (δ/L)·c ; 1 ; 1 ]
               = [ (δ/L)·c ; 0 ; 1 ] = c̄.

Part 2. For the second result, we let

    OPT = min_{Ax=b, x≥0} c^⊤x   and   ŌPT = min_{Āx̄=b̄, x̄≥0} c̄^⊤x̄.

For any optimal x ∈ R^n of the original LP, we consider the following x̄ ∈ R^{n+2}:

    x̄ = [ (1/R)x ;  n + 1 − (1/R)Σ_{i=1}^n x_i ;  0 ]                            (25)

and recall

    c̄ = [ (δ/L)·c ;  0 ;  1 ] ∈ R^{n+2}.                                         (26)

We want to argue that x̄ ∈ R^{n+2} is feasible for the modified LP. It is obvious that x̄ ≥ 0 (note that (1/R)Σ_{i=1}^n x_i ≤ n since ‖x‖_∞ ≤ R); it remains to show Āx̄ = b̄ ∈ R^{d+1}. We have

    Ā x̄ = [ A   0   (1/R)b − A 1_n ;  1_n^⊤  1  0 ] · [ (1/R)x ;  n + 1 − (1/R)Σ_{i=1}^n x_i ;  0 ]
         = [ (1/R)Ax ;  n + 1 ]
         = [ (1/R)b ;  n + 1 ] = b̄,

where the third step follows from Ax = b, and the last step follows from the definition of b̄. Therefore, using the definition of x̄ in (25), we have that

    ŌPT ≤ c̄^⊤x̄ = [ (δ/L)·c^⊤   0   1 ] · [ (1/R)x ;  n + 1 − (1/R)Σ_{i=1}^n x_i ;  0 ] = (δ/(LR)) · c^⊤x = (δ/(LR)) · OPT,      (27)


where the first step follows from the fact that the modified program is a minimization problem, the second step follows from the definitions of x̄ ∈ R^{n+2} in (25) and c̄ ∈ R^{n+2} in (26), and the last step follows from the fact that x ∈ R^n is an optimal solution of the original linear program.

Now consider any feasible (x̄, ȳ, s̄) ∈ R^{(n+2)×(d+1)×(n+2)} with duality gap δ^2. Write x̄ = [ x̄_{1:n} ; τ ; θ ] ∈ R^{n+2} for some τ ≥ 0, θ ≥ 0. We can compute c̄^⊤x̄, which equals (δ/L)·c^⊤x̄_{1:n} + θ. Then, we have

    (δ/L)·c^⊤x̄_{1:n} + θ ≤ ŌPT + δ^2 ≤ (δ/(LR))·OPT + δ^2,                       (28)

where the first step follows from the definition of the duality gap, and the last step follows from (27). Hence, we can upper bound the objective value of x = R·x̄_{1:n} in the original program as follows:

    c^⊤x = R·c^⊤x̄_{1:n} = (LR/δ)·(δ/L)c^⊤x̄_{1:n} ≤ (RL/δ)·( (δ/(LR))·OPT + δ^2 ) = OPT + LR·δ,

where the first step follows from x = R·x̄_{1:n} and the third step follows from (28). Note that

    (δ/L)·c^⊤x̄_{1:n} ≥ −(δ/L)‖c‖_∞‖x̄_{1:n}‖_1 = −(δ/L)‖c‖_∞‖(1/R)x‖_1 ≥ −(δ/L)‖c‖_∞·(n/R)·‖x‖_∞ ≥ −δn,      (29)

where the second step follows from the definition of x = R·x̄_{1:n}, and the last step follows from ‖c‖_∞ ≤ L and ‖x‖_∞ ≤ R.

We can upper bound θ as follows:

    θ ≤ (δ/(LR))·OPT + δ^2 + δn ≤ 2nδ + δ^2 ≤ 4nδ,                               (30)

where the first step follows from (28) and (29), the second step follows from OPT = min_{Ax=b, x≥0} c^⊤x ≤ nLR (because ‖c‖_∞ ≤ L and ‖x‖_∞ ≤ R), and the last step follows from δ ≤ 1 ≤ n.

The constraint in the new polytope shows that

    A x̄_{1:n} + ( (1/R)b − A 1_n ) θ = (1/R)b.

Using x = R·x̄_{1:n}, we have

    A·(1/R)x + ( (1/R)b − A 1_n ) θ = (1/R)b.

Rewriting this, we have Ax − b = (R·A 1_n − b)θ ∈ R^d and hence

    ‖Ax − b‖_1 = ‖(R·A 1_n − b)θ‖_1 ≤ θ·(‖R·A 1_n‖_1 + ‖b‖_1) ≤ θ·(R‖A‖_1 + ‖b‖_1) ≤ 4nδ·(R‖A‖_1 + ‖b‖_1),

where the second step follows from the triangle inequality, the third step follows from ‖A 1_n‖_1 ≤ ‖A‖_1 (by the definition of the entry-wise ℓ_1 norm), and the last step follows from (30).

Thus, we complete the proof.
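
For concreteness, the construction of (Ā, b̄, c̄) and the initial primal-dual point from Lemma A.6 can be written out as a small NumPy sketch (illustrative only; A, b, c, R, L and delta are assumed inputs satisfying the lemma's assumptions):

import numpy as np

def modified_lp(A, b, c, R, L, delta):
    """Build (A_bar, b_bar, c_bar) and the initial feasible (x_bar, y_bar, s_bar) of Lemma A.6."""
    d, n = A.shape
    ones = np.ones(n)
    A_bar = np.zeros((d + 1, n + 2))
    A_bar[:d, :n] = A
    A_bar[:d, n + 1] = b / R - A @ ones        # last column: (1/R) b - A 1_n
    A_bar[d, :n] = 1.0                         # bottom row: 1_n^T, 1, 0
    A_bar[d, n] = 1.0
    b_bar = np.concatenate([b / R, [n + 1.0]])
    c_bar = np.concatenate([(delta / L) * c, [0.0, 1.0]])
    x_bar = np.ones(n + 2)
    y_bar = np.concatenate([np.zeros(d), [-1.0]])
    s_bar = np.concatenate([ones + (delta / L) * c, [1.0, 1.0]])
    # Feasibility checks corresponding to part 1 of the lemma.
    assert np.allclose(A_bar @ x_bar, b_bar)
    assert np.allclose(A_bar.T @ y_bar + s_bar, c_bar)
    return A_bar, b_bar, c_bar, x_bar, y_bar, s_bar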


B Generalized Projection Maintenance

Given the usefulness of projection maintenance, we state a more general version of Theorem 5.1 for future use.

Theorem B.1. Assume the following about the cost of matrix operations:
• In O(t_k) time, we can multiply an n × n and an n × k matrix, and we can multiply an n × k and a k × n matrix.
• In O(s_n) time, we can invert an n × n matrix, and we can multiply an n × n and an n × n matrix.
• t_k/k is decreasing in k.

Given a matrix A ∈ R^{d×n} with n ≥ d, a tolerance parameter 0 < ε_mp < 1/4 and k* ∈ [n], there is a deterministic data structure that approximately maintains the projection matrices

    √W A^⊤(AWA^⊤)^{-1}A √W

and the inverse matrices (AWA^⊤)^{-1} ∈ R^{d×d} for positive diagonal matrices W ∈ R^{n×n} through the following operations:
• Update(w): Output a vector v such that for all i, (1 − ε_mp)v_i ≤ w_i ≤ (1 + ε_mp)v_i.
• Query1(h): Output √V A^⊤(AVA^⊤)^{-1}A √V h ∈ R^n for the v outputted by the last call to Update.
• Query2(h): Output (AVA^⊤)^{-1}h ∈ R^d for the v outputted by the last call to Update.
• Insert(a, w_a): Insert a column a into A and a weight w_a into w.
• Delete(a): Delete a column a from A and its corresponding weight from w.
• Output(): Output √V A^⊤(AVA^⊤)^{-1}A √V ∈ R^{n×n} and (AVA^⊤)^{-1} ∈ R^{d×d} for the v outputted by the last call to Update.

Suppose that the number of columns is O(n) during the whole algorithm and that for any call of Update, we have

    Σ_{i=1}^n ( E[ln w_i] − ln(w^{old}_i) )^2 ≤ C_1^2   and   Σ_{i=1}^n ( Var[ln(w_i)] )^2 ≤ C_2^2,

where w is the input of the call, w^{old} is the weight before the call, and the expectation and variance are conditional on w^{old}. Then, we have that:
• The data structure takes O(s_n + t_n) time to initialize.
• Each call of Query takes time O(n · ‖h‖_0 + s_{k*} + n k*).
• Each call of Output() takes O(t_{k*}) time.
• Each call of Insert and Delete takes O(n^2) time.
• Each call of Update takes

    O( (C_1/ε_mp + C_2/ε_mp^2) · ( t_{k*}^2/k* + Σ_{i=k*}^n t_i^2/i^2 )^{1/2} · log n )

expected amortized time.

Proof. The proof is essentially the same as that of Theorem 5.1. The way we maintain (AVA^⊤)^{-1} ∈ R^{d×d} is almost identical to the way we maintain √V A^⊤(AVA^⊤)^{-1}A √V ∈ R^{n×n}. Updating both matrices under insertion and deletion can be done via the Sherman-Morrison formula in O(n^2) time. We note that these updates do not increase our potential and hence they do not affect the amortized cost of Update. Finally, to bound the runtime using t_k and s_n instead of α and ω, we use the same potential Ψ_k = Σ_{i=1}^n g_i ψ(x^(k)_i) with a new definition of g:

    g_i = t_i/i,        if i ≥ k*;
    g_i = t_{k*}/k*,    otherwise.

Now, the update time in Lemma 5.4 becomes O(r g_r). The rest of the proof is identical. In particular, Lemma 5.8 becomes

    ‖g‖_2 = ( t_{k*}^2/k* + Σ_{i=k*}^n t_i^2/i^2 )^{1/2}

and hence Lemma 5.7 gives the bound

    O(C_1 + C_2/ε_mp) · ( t_{k*}^2/k* + Σ_{i=k*}^n t_i^2/i^2 )^{1/2}.
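
To connect Theorem B.1 back to Theorem 5.1, one can plug in the rectangular-matrix-multiplication cost from Lemma A.5 as t_k and tabulate the generalized weights g_i and ‖g‖_2. A small Python sketch (illustrative only; the cost model t and the constants ω = 2.373, α = 0.31 are assumptions of this sketch):

import numpy as np

def g_general(n, k_star, t):
    """Generalized weights: g_i = t(i)/i for i >= k_star, and t(k_star)/k_star below."""
    i = np.arange(1, n + 1, dtype=float)
    return np.where(i >= k_star, t(i) / i, t(k_star) / k_star)

omega, alpha = 2.373, 0.31
n, k_star = 10_000, int(10_000 ** 0.31)
t = lambda k: n**2 + k**((omega - 2) / (1 - alpha)) * n**(2 - alpha * (omega - 2) / (1 - alpha))
g = g_general(n, k_star, t)
print(np.linalg.norm(g))   # compare with (t(k_star)^2/k_star + sum_{i >= k_star} t(i)^2/i^2)^{1/2}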

Figure 3: In this figure we illustrate the naive hard threshold (left panel) and the soft threshold (right panel); the plots are omitted in this text version. The x-axis represents the sorted n coordinates, and the y-axis represents the errors y_{π(i)}. All the coordinates smaller than the threshold k are updated. In the left figure, we choose the hard threshold k as the smallest coordinate such that y_{π(k)} ≤ ε_mp. In the right figure, we choose the soft threshold k as the smallest coordinate that satisfies both y_{π(k)} ≤ ε_mp and y_{π(1.5k)} < (1 − 1/log(n)) y_{π(k)}.
