
ASAGA: Asynchronous Parallel SAGA
Rémi Leblond, Fabian Pedregosa, Simon Lacoste-Julien
INRIA / ENS, Paris, France, and Université de Montréal

SUMMARY

Motivating issue: variance reduction optimization methods need to be adapted to the parallel setting to leverage modern computer architectures. SAGA [1] is a natural candidate as it does not have any synchronization steps.
Analysis: we prove linear speedups even without sparsity, using a novel, simple framework of analysis.

Contributions:
i. A sparse variant of the linearly convergent SAGA algorithm.
ii. ASAGA, its asynchronous parallel adaptation.
iii. A better framework for asynchronous analysis enabling correct and simpler proof techniques: the "after read" labeling.
iv. An investigation of classical assumptions, including the overlap quantity $\tau$.

Problem setting

Minimization of a finite sum of factors (ERM):
$$\min_{x \in \mathbb{R}^d} f(x), \qquad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$
where each $f_i$ is $\mu$-strongly convex and $L$-smooth.

Stochastic gradient descent

Idea: use cheap gradient estimates:
$$x^+ = x - \gamma f'_i(x), \qquad i \sim U(1, \dots, n).$$
Problem: this introduces non-vanishing variance, requiring diminishing step sizes for convergence, hence a sublinear rate of convergence. The goal is the best of both worlds: the linear rate of deterministic methods with the $O(1)$ iteration cost of stochastic methods, together with robustness to the step size. [Figure: log(excess cost) vs. time for stochastic, deterministic and hybrid methods.]

Improvement: SAGA (variance reduction [1])

Idea: use a cleverer (unbiased) estimator of the gradient, with naturally vanishing variance.
Iterates: $x^+ = x - \gamma \big( f'_i(x) - \alpha_i + \tfrac{1}{n} \sum_j \alpha_j \big)$.
Memory (historical gradient): $\alpha_i^+ = f'_i(x)$.

Sparse SAGA

SAGA updates are dense due to $\tfrac{1}{n} \sum_j \alpha_j$, the average historical gradient. Idea: project this term on $S_i$, the support of the current gradient $f'_i(x)$:
$$x^+ = x - \gamma \big( f'_i(x) - \alpha_i + P_{S_i} D \tfrac{1}{n} \sum_j \alpha_j \big),$$
where $D$ is a diagonal renormalization matrix that ensures unbiasedness.

Comparison to lagged updates: this is a simpler way of leveraging sparsity than the previous trick using lagged updates [1]. It is much easier to parallelize (no counters to synchronize), and also linearly convergent. Insight: anticipated vs. lagged updates.

Asynchronous sparse SAGA

Algorithm 1 Asaga
1: Initialize shared variables $x$ and $(\alpha_i)_{i=1}^n$
2: keep doing in parallel
3:   $\hat{x}, (\hat{\alpha}_i)_{i=1}^n$ = inconsistent reads of $x$ and $(\alpha_i)_{i=1}^n$
4:   Sample $i$ uniformly in $\{1, \dots, n\}$
5:   $[\bar{\alpha}]_{S_i} := \tfrac{1}{n} \sum_{k=1}^n [\hat{\alpha}_k]_{S_i}$
6:   $[\delta x]_{S_i} := -\gamma \big( f'_i(\hat{x}) - \hat{\alpha}_i + D_i [\bar{\alpha}]_{S_i} \big)$
7:   for $v$ in $S_i$ do
8:     $[x]_v \leftarrow [x]_v + [\delta x]_v$   // atomic
9:     $[\alpha_i]_v \leftarrow [f'_i(\hat{x})]_v$
10:  end for
11: end parallel loop

Inconsistent reads and writes (lock-free); atomic coordinate updates.

Definitions

$\Delta := \tfrac{1}{n} \max_{v=1..d} |\{i : v \in S_i\}|$: a sparsity measure, proportional to the maximum number of data points sharing a given feature.
$\tau$: a uniform bound on the maximum delay between two iterations processed concurrently.
$\kappa := L/\mu$: the condition number.

Theory

Theorem. Suppose $\tau \leq O(n)$ and $\tau \leq O\big( \tfrac{1}{\sqrt{\Delta}} \max\{1, \tfrac{n}{\kappa}\} \big)$. Then, with an adequate step size, ASAGA converges linearly with rate factor $\Omega\big( \min\{ \tfrac{1}{n}, \tfrac{1}{\kappa} \} \big)$ (the same as SAGA); thus the speedup is linear in the number of cores.

Interpretation: if $n > \kappa$, ASAGA gets the same rate as SAGA even without sparsity ($\Delta = 1$) for $\tau < O(n/\kappa)$. In this regime SAGA enjoys a wide range of step sizes with similar rate factors, so we can use a smaller step size for ASAGA at no cost. If $n < \kappa$, sparsity is required for a linear speedup, with a bound on $\tau$ of $O(\sqrt{n})$ in the best case.

Perturbed iterate framework [3]

Problem: analysis for parallel algorithms is hard. Solution: cast them as sequential algorithms working on perturbed inputs. Distinguish:
$\hat{x}_t$: the inconsistent quantity read by the cores;
$x_t$: the virtual iterate defined by $x_{t+1} := x_t - \gamma g(\hat{x}_t, i_t)$.
Interpret $\hat{x}_t$ as a noisy version of $x_t$ due to asynchrony.

Basic recursive inequality:
$$a_{t+1} \leq \Big(1 - \frac{\gamma\mu}{2}\Big) a_t + \gamma^2 \mathbb{E}\|g_t\|^2 - 2\gamma e_t + \underbrace{\gamma\mu \mathbb{E}\|\hat{x}_t - x_t\|^2 + 2\gamma \mathbb{E}\langle \hat{x}_t - x_t, g_t \rangle}_{\text{additional asynchrony terms}},$$
where $a_t := \mathbb{E}\|x_t - x^*\|^2$ and $e_t := \mathbb{E} f(\hat{x}_t) - f(x^*)$.
Insight: get a handle on $\hat{x}_t - x_t$ to analyze the convergence for $x_t$.

Motivating issue: the "after write" labeling

What is $t$? How is it defined? The classical answer, following [2], is that $t$ is the number of successful writes to $x$. This is the "after write" approach. Problem: the stochastic samples are no longer independent of the iterates, so the update can be biased! Example: 2 cores, $f = \tfrac{1}{2}(f_1 + f_2)$ with $|S_1| = 1$ and $|S_2| = 10^6$; then
$$\mathbb{E}\, x_1 = x_0 - \gamma \big( \tfrac{3}{4} f'_1(x_0) + \tfrac{1}{4} f'_2(x_0) \big),$$
which breaks the crucial unbiasedness condition (4). Diagnostic: dependency is injected through the labeling assignment.

"After read" labeling

[3] introduced the "before read" labeling: $t$ is incremented at the start of each iteration. No dependency is injected! Complication: $\hat{x}_t$ can depend on $i_r$ for $r > t$ because some updates are faster than others. Solution: the "after read" approach: $t$ is incremented after $\hat{x}_t$ has been read. Key benefit: $\hat{x}_t$ does not depend on future terms, allowing us to write
$$\hat{x}_t - x_t = \gamma \sum_{u=(t-\tau)_+}^{t-1} S^t_u \, g(\hat{x}_u, \hat{\alpha}_u, i_u),$$
where the $S^t_u$ are diagonal matrices with entries in $\{+1, 0\}$ encoding update delays.

On the overlap constant: an overlooked question

What does $\tau$ represent? It is usually dismissed as a proxy for the number of cores. Problem: $\tau$ is orders of magnitude bigger in experiments. Insight: $\tau$ encompasses a lot more complexity; it heavily depends on the heterogeneity of the iteration computation times. Factors that can influence $\tau$ include clock speeds, sparsity patterns and the overall length of the computation. Experimentally, $\tau$ is indeed orders of magnitude bigger than the number of cores (but still smaller than $n$), with a linear dependency on the number of cores until 30 cores and a possible phase transition afterwards (panel labels: n = 698k, n = 581k, n = 3232k).

Experimental results

Setup: we compare 3 algorithms, KROMAGNON (asynchronous SVRG), Hogwild (asynchronous SGD) and ASAGA, on a 40-core machine, using 3 datasets: RCV1-full, URL and Covtype.
Results: linear speedups in number of iterations; up to 10x speedup in running time (hardware overhead); visible impact of sparsity (dense vs. sparse datasets).

Conclusions

• Most convergence proofs rely on the assumption that the gradient update is unbiased.
• This assumption is unverified in the traditional framework of analysis.
• Our novel, simpler "after read" labeling solves this issue.

References

[1] Defazio et al. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, NIPS 2014.
[2] Niu et al. Hogwild: a lock-free approach to parallelizing stochastic gradient descent, NIPS 2011.
[3] Mania et al. Perturbed iterate analysis for asynchronous stochastic optimization, arXiv:1507.06970, 2015.
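To make the Sparse SAGA update above concrete, here is a minimal single-threaded sketch in Python/NumPy. It is not the authors' implementation: the helper names (`make_renormalization`, `sparse_saga_step`), the dense storage of the historical gradients, and the specific diagonal choice $D_{vv} = n/d_v$ (with $d_v$ the number of data points whose support contains coordinate $v$, one choice consistent with the unbiasedness requirement under uniform sampling) are our own illustrative assumptions.

```python
import numpy as np

def make_renormalization(supports, n, d):
    """Diagonal entries D_vv = n / d_v, where d_v = #{i : v in S_i}.
    With uniform sampling of i, E_i[P_{S_i} D alpha_bar] = alpha_bar,
    which is the property the projection needs (illustrative choice)."""
    d_v = np.zeros(d)
    for S in supports:
        d_v[S] += 1.0
    return n / np.maximum(d_v, 1.0)

def sparse_saga_step(x, alpha, alpha_bar, i, grad_fi, supports, D, gamma, n):
    """One Sparse SAGA step: the dense average-gradient term is projected
    on the support S_i and renormalized by D to keep the update unbiased."""
    S = supports[i]
    g_i = grad_fi(i, x)                       # gradient of f_i, nonzero only on S
    delta_alpha = g_i[S] - alpha[i][S]
    # x^+ = x - gamma * (f'_i(x) - alpha_i + P_{S_i} D alpha_bar), restricted to S_i
    x[S] -= gamma * (delta_alpha + D[S] * alpha_bar[S])
    # memory updates: alpha_i <- f'_i(x); alpha_bar kept in sync incrementally
    alpha[i][S] = g_i[S]
    alpha_bar[S] += delta_alpha / n
    return x, alpha, alpha_bar
```

Only the coordinates in $S_i$ are touched, which is what makes the later lock-free parallel version practical on sparse data.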




We consider algorithms that execute in parallel the following four steps, where $t$ is a global labeling that needs to be defined:

1. Read the information in shared memory ($\hat{x}_t$).
2. Sample $i_t$.
3. Perform some computations using ($\hat{x}_t$, $i_t$).
4. Write an update to shared memory.   (5)

The "After Write" Approach. We call the "after write" approach the standard global labeling scheme used in [2] and re-used in all the later papers that we mentioned in the related work section, with the notable exceptions of [3] and ?. In this approach, $t$ is a (virtual) global counter recording the number of successful writes to the shared memory $x$ (incremented after step 4 in (5)); $x_t$ thus represents the (true) content of the shared memory after $t$ updates. The interpretation of the crucial equation (3) then means that $\hat{x}_t$ represents the (delayed) local copy value of the core that made the $(t+1)$-th successful update; $i_t$ represents the factor sampled by this core for this update. Notice that in this framework, the values of $\hat{x}_t$ and $i_t$ are unknown at "time $t$"; we have to wait until the later time when the next core writes to memory to finally determine that its local variables are the ones labeled by $t$. We thus see that here $\hat{x}_t$ and $i_t$ are not necessarily independent: they share dependence through the assignment of the $t$ label. In particular, if some values of $i_t$ yield faster updates than others, this will influence the label assignment defining $\hat{x}_t$.

Let us now provide a concrete example of this possible dependency. Suppose that we have two cores and that $f$ has two factors: $f_1$, which has support on only one variable, and $f_2$, which has support on $10^6$ variables and thus yields a gradient step that is significantly more expensive to compute. $x_0$ is the initial content of the memory, and we do not officially know yet whether $\hat{x}_0$ is the local copy read by the first core or the second core, but we are sure that $\hat{x}_0 = x_0$, as no update can occur in shared memory without incrementing the counter. There are four possibilities for the next step defining $x_1$, depending on which index $i$ was sampled on each core. If either core samples $i = 1$, we know that $x_1 = x_0 - \gamma f'_1(x_0)$, as it will be the first (much faster) update to complete. This happens in 3 out of 4 possibilities; we thus have

$$\mathbb{E}\, x_1 = x_0 - \gamma \big( \tfrac{3}{4} f'_1(x_0) + \tfrac{1}{4} f'_2(x_0) \big).$$

We see that this analysis scheme does not satisfy the crucial unbiasedness condition (4).

To understand this subtle point better, note that in this very simple example, $i_0$ and $i_1$ are not independent: we can show that $P(i_1 = 2 \mid i_0 = 2) = 1$. They share dependence through the labeling assignment.

The only way we can think of to resolve this issue and ensure unbiasedness is to assume that the computation time for the algorithm running on a core is independent of the sample $i$ chosen. This assumption seems overly strong in the context of potentially heterogeneous factors $f_i$, and is thus a fundamental flaw for analyzing non-uniform asynchronous computation that has mostly been ignored in the recent asynchronous optimization literature.^5

5. We note that ? briefly discussed this issue (see their Section 7.8.3), stressing that their analysis for SGD required that the scheduling of computation be independent of the randomness from SGD, but they did not offer any solution if this assumption was not satisfied. Both the "before read" labeling from [3] and our proposed "after read" labeling resolve this issue.
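The 3/4 vs. 1/4 weighting in the example can be checked with a short simulation. This is our own sketch, under the simplifying assumption made in the text that a core sampling the cheap factor $f_1$ always finishes before a core working on $f_2$; the function name `first_completed_sample` is hypothetical.

```python
import random

def first_completed_sample(trials=100_000, seed=0):
    """Empirically estimate which factor produces the first write x_1 under the
    'after write' labeling: two cores each sample an index uniformly, and a core
    that sampled i = 1 always writes first (simplifying assumption)."""
    rng = random.Random(seed)
    counts = {1: 0, 2: 0}
    for _ in range(trials):
        i_core_a = rng.randint(1, 2)
        i_core_b = rng.randint(1, 2)
        # the first successful write uses f'_1 unless *both* cores sampled i = 2
        first = 2 if (i_core_a == 2 and i_core_b == 2) else 1
        counts[first] += 1
    return {i: c / trials for i, c in counts.items()}

print(first_completed_sample())   # roughly {1: 0.75, 2: 0.25}, matching E x_1 above
```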


Figure 4: Theoretical speedups. Suboptimality with respect to number of iterations for Asaga, Kromagnon and Hogwild with 1 and 10 cores. Curves almost coincide, which means the theoretical speedup is almost the number of cores $p$, hence linear.

other factors that can influence this quantity. We will now attempt to give a few qualitative arguments as to what these other factors might be and how they relate to $\tau$.

Number of cores. The first of these factors is indeed the number of cores. If we have $p$ cores, $\tau \geq p - 1$. Indeed, in the best-case scenario where all cores have exactly the same execution speed for a single iteration, $\tau = p - 1$.

Length of an iteration. To get more insight into what $\tau$ really encompasses, let us now try to define the worst-case scenario in the preceding example. Consider 2 cores. In the worst case, one core runs while the other is stuck. Then the overlap is $t$ for all $t$ and eventually grows to $+\infty$. If we assume that one core runs twice as fast as the other, then $\tau = 2$. If both run at the same speed, $\tau = 1$.

It appears then that a relevant quantity is $R$, the ratio between the slowest and the fastest execution time for a single iteration. We have $\tau \leq (p - 1)R$, which can be arbitrarily bigger than $p$.

There are several factors at play in $R$ itself. These include:

• the speed of execution of the cores themselves (i.e. clock time);
• the data matrix itself: different support sizes for the $f_i$ mean different gradient computation times. If one $f_i$ has support of size $n$ while all the others have support of size 1, for example, $R$ may eventually become very big;
• the length of the computation itself: the longer our algorithm runs, the more likely it is to explore the potential corner cases of the data matrix.

The overlap is then upper bounded by the number of cores multiplied by the ratio of the maximum iteration time over the minimum iteration time (which is linked to the sparsity distribution of the data matrix). This is an upper bound, which means that in some cases it will not really be useful. For example, if one factor has support size 1 and all others have support size $d$, the probability of the event which corresponds to the upper bound is
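As a rough illustration of why $\tau$ can exceed $p - 1$, the following sketch simulates $p$ cores drawing heterogeneous iteration durations and measures the largest number of iterations that overlap in time with any single one. The simulation model, the parameters (`slow_prob`, `slow_factor`), and the function name are our own assumptions, not the paper's experimental protocol; the measured quantity is only a crude stand-in for the overlap bound.

```python
import random

def simulate_overlap(p=4, iters_per_core=300, slow_prob=0.01, slow_factor=100.0, seed=0):
    """Each core runs iterations back to back; most iterations take 1 time unit,
    a few take slow_factor units (e.g. a very dense factor). Returns the maximum
    number of other iterations overlapping in time with a single iteration."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(p):
        t = 0.0
        for _ in range(iters_per_core):
            dur = slow_factor if rng.random() < slow_prob else 1.0
            intervals.append((t, t + dur))
            t += dur
    max_overlap = 0
    for (s1, e1) in intervals:
        overlap = sum(1 for (s2, e2) in intervals if s2 < e1 and e2 > s1) - 1
        max_overlap = max(max_overlap, overlap)
    return max_overlap

print(simulate_overlap())   # typically far larger than p - 1 = 3
```

One slow iteration overlaps with roughly `slow_factor` fast iterations on each other core, which is the mechanism behind the $(p-1)R$ bound discussed above.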


Journal of Machine Learning Research 1 (2000) 1-48   Submitted 4/00; Published 10/00

Asaga: Asynchronous Parallel Saga

Rémi Leblond [email protected]
INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France

Fabian Pedregosa [email protected]
INRIA - Sierra Project-Team, Chaire Havas-Dauphine Économie des Nouvelles Données, Paris, France

Simon Lacoste-Julien [email protected]
INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France

Editor:

Abstract

We describe Asaga, an asynchronous parallel version of the incremental gradient algorithm Saga that enjoys fast linear convergence rates. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We thereby prove that Asaga can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We show that our new framework can be applied to other asynchronous parallel optimization algorithms such as Kromagnon and Hogwild, both removing problematic assumptions and obtaining better theoretical results. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead.

Keywords: Optimization, Machine Learning, Large Scale, Asynchronous Parallel

1. Introduction

We consider the unconstrained optimization problem of minimizing a finite sum of smooth convex functions:

$$\min_{x \in \mathbb{R}^d} f(x), \qquad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where each $f_i$ is assumed to be convex with $L$-Lipschitz continuous gradient, $f$ is $\mu$-strongly convex and $n$ is large (for example, the number of data points in a regularized empirical risk minimization setting). We define a condition number for this problem as $\kappa := L/\mu$.

©2000 Rémi Leblond, Fabian Pedregosa and Simon Lacoste-Julien.
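To make the setting of (1) concrete, the sketch below instantiates the finite sum for $\ell_2$-regularized logistic regression and computes the constants $L$, $\mu$ and $\kappa$. This example, the helper name `logistic_finite_sum`, and the toy data are our additions, not part of the paper; the bounds used (each $f_i$ is $(\|a_i\|^2/4 + \lambda)$-smooth, $f$ is $\lambda$-strongly convex) are standard.

```python
import numpy as np

def logistic_finite_sum(A, b, lam):
    """f(x) = (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||^2.
    Returns f, the per-factor gradient f'_i, and (L, mu, kappa)."""
    n, d = A.shape

    def f(x):
        margins = b * (A @ x)
        return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * (x @ x)

    def grad_i(i, x):
        margin = b[i] * (A[i] @ x)
        return -b[i] * A[i] / (1.0 + np.exp(margin)) + lam * x

    L = np.max(np.sum(A * A, axis=1)) / 4.0 + lam   # uniform bound on the L_i
    mu = lam
    return f, grad_i, L, mu, L / mu

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
b = np.sign(rng.standard_normal(1000))
f, grad_i, L, mu, kappa = logistic_finite_sum(A, b, lam=1e-3)
print(f"L={L:.2f}, mu={mu:.0e}, kappa={kappa:.1f}")
```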




• In contrast to the Svrg analysis from ?, Thm. 14, we obtain a better dependence on the condition number in our rate ($1/\kappa$ vs. $1/\kappa^2$ for them) and on the sparsity (they get $\tau \leq O(\Delta^{-1/3})$), while we remove their gradient bound assumption. We also give our convergence guarantee on $\hat{x}_t$ during the algorithm, whereas they only bound the error for the "last" iterate $\hat{x}_T$.

3.4 Proof of Theorem ??

We give here a detailed outline of the proof. Its full version can be found in Appendix ??.

Initial recursive inequality. Let $g_t := g(\hat{x}_t, \hat{\alpha}_t, i_t)$. By expanding the update equation (??) defining the virtual iterate $x_{t+1}$ and introducing $\hat{x}_t$ in the inner product term we get:

$$\|x_{t+1} - x^*\|^2 = \|x_t - \gamma g_t - x^*\|^2 = \|x_t - x^*\|^2 + \gamma^2 \|g_t\|^2 - 2\gamma \langle x_t - x^*, g_t \rangle = \|x_t - x^*\|^2 + \gamma^2 \|g_t\|^2 - 2\gamma \langle \hat{x}_t - x^*, g_t \rangle + 2\gamma \langle \hat{x}_t - x_t, g_t \rangle. \qquad (13)$$

Note that we introduce $\hat{x}_t$ in the inner product because $g_t$ is a function of $\hat{x}_t$, not $x_t$. In the sequential setting, we require $i_t$ to be independent of $x_t$ to get unbiasedness. In the perturbed iterate framework, we instead require that $i_t$ is independent of $\hat{x}_t$ (see Property ??). This crucial property enables us to use the unbiasedness condition (4) to write: $\mathbb{E}\langle \hat{x}_t - x^*, g_t \rangle = \mathbb{E}\langle \hat{x}_t - x^*, f'(\hat{x}_t) \rangle$. We thus take the expectation of (13), which allows us to use the $\mu$-strong convexity of $f$:^8

$$\langle \hat{x}_t - x^*, f'(\hat{x}_t) \rangle \geq f(\hat{x}_t) - f(x^*) + \frac{\mu}{2} \|\hat{x}_t - x^*\|^2. \qquad (14)$$

With further manipulations on the expectation of (13), including the use of the standard inequality $\|a + b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$ (see Appendix ??), we obtain our basic recursive contraction inequality:

$$a_{t+1} \leq \Big(1 - \frac{\gamma\mu}{2}\Big) a_t + \gamma^2 \mathbb{E}\|g_t\|^2 - 2\gamma e_t + \underbrace{\gamma\mu \mathbb{E}\|\hat{x}_t - x_t\|^2 + 2\gamma \mathbb{E}\langle \hat{x}_t - x_t, g_t \rangle}_{\text{additional asynchrony terms}}, \qquad (15)\text{--}(16)$$

where $a_t := \mathbb{E}\|x_t - x^*\|^2$ and $e_t := \mathbb{E} f(\hat{x}_t) - f(x^*)$.

Inequality (15) is a midway point between the one derived in the proof of Lemma 1 in ? and Equation (2.5) in ?, because we use the tighter strong convexity bound (14) than in the latter (giving us the important extra term $-2\gamma e_t$).

In the sequential setting, one crucially uses the negative suboptimality term $-2\gamma e_t$ to cancel the variance term $\gamma^2 \mathbb{E}\|g_t\|^2$ (thus deriving a condition on $\gamma$). In our setting, we need to bound the additional asynchrony terms using the same negative suboptimality in order to

8. Note that here is our departure point with ?, who replaced the $f(\hat{x}_t) - f(x^*)$ term with the lower bound $\frac{\mu}{2}\|\hat{x}_t - x^*\|^2$ in this relationship (see their Equation (2.4)), thus yielding an inequality too loose afterwards to get the fast rates for Svrg.
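The expansion in (13) is a purely algebraic identity, valid for any vectors and any step size. As a quick sanity check (our own, not part of the paper), it can be verified numerically on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 10, 0.3
x_t, x_hat, x_star, g_t = (rng.standard_normal(d) for _ in range(4))

x_next = x_t - gamma * g_t                         # virtual iterate update
lhs = np.linalg.norm(x_next - x_star) ** 2
rhs = (np.linalg.norm(x_t - x_star) ** 2
       + gamma ** 2 * np.linalg.norm(g_t) ** 2
       - 2 * gamma * np.dot(x_hat - x_star, g_t)   # inner product taken at x_hat ...
       + 2 * gamma * np.dot(x_hat - x_t, g_t))     # ... compensated by this extra term
assert np.isclose(lhs, rhs)                        # identity (13) holds for any vectors
```

Only the third and fourth terms involve $\hat{x}_t$; they are exactly what turns the standard sequential expansion into the perturbed one.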




semantics, which are heavily optimized at the processor level and have minimal overhead. Our experiments with non-thread-safe algorithms (i.e. where this property is not verified, see Figure ?? of Appendix ??) show that compare-and-swap is necessary to optimize to high accuracy.

Finally, as is standard in the literature, we make an assumption on the maximum delay asynchrony can cause; this is the partially asynchronous setting as defined in ?:

Assumption 5 (bounded overlaps) We assume that there exists a uniform bound, called $\tau$, on the maximum number of iterations that can overlap together. We say that iterations $r$ and $t$ overlap if at some point they are processed concurrently. One iteration is being processed from the start of the reading of the shared parameters to the end of the writing of its update. The bound $\tau$ means that iterations $r$ cannot overlap with iteration $t$ for $r \geq t + \tau + 1$, and thus that every coordinate update from iteration $t$ is successfully written to memory before iteration $t + \tau + 1$ starts.

Our result will give us conditions on $\tau$ subject to which we have linear speedups. $\tau$ is usually seen as a proxy for $p$, the number of cores (which lower bounds it). However, though $\tau$ appears to depend linearly on $p$, it actually depends on several other factors (notably the data sparsity distribution) and can be orders of magnitude bigger than $p$ in real-life experiments. We can upper bound $\tau$ by $(p - 1)R$, where $R$ is the ratio of the maximum over the minimum iteration time (which encompasses theoretical aspects as well as hardware overhead). More details can be found in Section ??.

Explicit effect of asynchrony. By using the overlap Assumption 5 in the expression (6) for the iterates, we obtain the following explicit effect of asynchrony that is crucially used in our proof:

$$\hat{x}_t - x_t = \gamma \sum_{u=(t-\tau)_+}^{t-1} S^t_u \, g(\hat{x}_u, \hat{\alpha}_u, i_u), \qquad (11)$$

where the $S^t_u$ are $d \times d$ diagonal matrices with terms in $\{+1, 0\}$. We know from our definition of $t$ and $x_t$ that every update in $\hat{x}_t$ is already in $x_t$; this is the 0 case. Conversely, some updates might be late: this is the +1 case. $\hat{x}_t$ may be lacking some updates from the "past" in some sense, whereas given our global ordering definition, it cannot contain updates from the "future".

3.3 Convergence and speedup results

We now state our main theoretical results. We give a detailed outline of the proof in Section ?? and its full details in Appendix ??.

We first define a notion of problem sparsity, as it will appear in our results.

Definition 6 (Sparsity) As in ?, we introduce $\Delta_r := \max_{v=1..d} |\{i : v \in S_i\}|$. $\Delta_r$ is the maximum right-degree in the bipartite graph of the factors and the dimensions, i.e., the maximum number of data points with a specific feature. For succinctness, we also define $\Delta := \Delta_r / n$. We have $1 \leq \Delta_r \leq n$, and hence $1/n \leq \Delta \leq 1$.
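Equation (11) simply says that $\hat{x}_t$ differs from the virtual iterate $x_t$ by the coordinate updates from the last $\tau$ iterations that the read happened to miss. The small sketch below (our own construction; the random masks and parameter values are arbitrary, not drawn from the paper) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, tau, gamma = 8, 20, 3, 0.1
g = rng.standard_normal((t, d))                  # past updates g(x_hat_u, alpha_hat_u, i_u)
x0 = rng.standard_normal(d)

x_t = x0 - gamma * g.sum(axis=0)                 # virtual iterate: all t updates applied

# the inconsistent read misses some coordinates of the last tau updates only
S = np.zeros((t, d))
S[t - tau:] = rng.integers(0, 2, size=(tau, d))  # 1 = update missed by the read, 0 = seen
x_hat_t = x0 - gamma * ((1 - S) * g).sum(axis=0)

# equation (11): x_hat_t - x_t = gamma * sum_u S^t_u g_u
assert np.allclose(x_hat_t - x_t, gamma * (S * g).sum(axis=0))
```

The masks play the role of the diagonal matrices $S^t_u$: a 1 entry marks a coordinate update written after the read, a 0 entry marks one the read already saw.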





Algorithm 1 Asaga
1: Initialize shared variables $x$ and $(\alpha_i)_{i=1}^n$
2: keep doing in parallel
3:   $\hat{x}, (\hat{\alpha}_i)_{i=1}^n$ = inconsistent reads of $x$ and $(\alpha_i)_{i=1}^n$
4:   Sample $i$ uniformly in $\{1, \dots, n\}$
5:   $[\bar{\alpha}]_{S_i} := \frac{1}{n} \sum_{k=1}^n [\hat{\alpha}_k]_{S_i}$
6:   $[\delta x]_{S_i} := -\gamma \big( f'_i(\hat{x}) - \hat{\alpha}_i + D_i [\bar{\alpha}]_{S_i} \big)$
7:   for $v$ in $S_i$ do
8:     $[x]_v \leftarrow [x]_v + [\delta x]_v$   // atomic
9:     $[\alpha_i]_v \leftarrow [f'_i(\hat{x})]_v$
10:  end for
11: end parallel loop

Algorithm 2 Asaga (implementation)
1: Initialize shared $x$, $(\alpha_i)_{i=1}^n$ and $\bar{\alpha}$
2: keep doing in parallel
3:   Sample $i$ uniformly in $\{1, \dots, n\}$
4:   Let $S_i$ be $f_i$'s support
5:   $[\hat{x}]_{S_i}$ = inconsistent read of $x$ on $S_i$
6:   $\hat{\alpha}_i$ = inconsistent read of $\alpha_i$
7:   $[\bar{\alpha}]_{S_i}$ = inconsistent read of $\bar{\alpha}$ on $S_i$
8:   $[\delta\alpha]_{S_i} = f'_i([\hat{x}]_{S_i}) - \hat{\alpha}_i$
9:   $[\delta x]_{S_i} = -\gamma \big( [\delta\alpha]_{S_i} + D_i [\bar{\alpha}]_{S_i} \big)$
10:  for $v$ in $S_i$ do
11:    $[x]_v = [x]_v + [\delta x]_v$   // atomic
12:    $[\alpha_i]_v = [\alpha_i]_v + [\delta\alpha]_v$   // atomic
13:    $[\bar{\alpha}]_v = [\bar{\alpha}]_v + \frac{1}{n} [\delta\alpha]_v$   // atomic
14:  end for
15: end parallel loop

Before stating our convergence result, we highlight some properties of Algorithm 1 and make one central assumption.

Property 2 (independence) Given the "after read" global ordering, $i_r$ is independent of $\hat{x}_t$ for all $r \geq t$.

We enforce the independence for $r = t$ in Algorithm 1 by having the core read all the shared data parameters and historical gradients before starting its iteration. Although this is too expensive to be practical if the data is sparse, it is required by the theoretical Algorithm 1 that we can analyze. As [3] stress, this independence property is assumed in most of the parallel optimization literature. The independence for $r > t$ is a consequence of using the "after read" global ordering instead of the "before read" one.

Property 3 (unbiased estimator) The update $g_t := g(\hat{x}_t, \hat{\alpha}_t, i_t)$ is an unbiased estimator of the true gradient at $\hat{x}_t$ (i.e. (??) yields (??) in conditional expectation).

This property is crucial for the analysis, as in most related literature. It follows from the independence of $i_t$ with $\hat{x}_t$ and from the computation of $\bar{\alpha}$ in Algorithm 1, which ensures that $\mathbb{E}\, \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^n [\hat{\alpha}_k]_{S_i} = [\bar{\alpha}]_{S_i}$, making the update unbiased. In practice, recomputing $\bar{\alpha}$ is not optimal, but storing it instead introduces potential bias issues in the proof (as detailed in Appendix ??).

Property 4 (atomicity) The shared parameter coordinate update of $[x]_v$ on line 11 is atomic.

Since our updates are additions, this means that there are no overwrites, even when several cores compete for the same resources. In practice, this is enforced by using compare-and-swap
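The control flow of Algorithm 2 can be sketched with Python threads as below. This is only an illustration of the structure (inconsistent reads on the support, then per-coordinate additive writes): Python's GIL serializes the numerical work and NumPy in-place additions are not the hardware compare-and-swap the analysis relies on, so this sketch reproduces neither the lock-free semantics nor the performance of the authors' implementation. All names, the toy least-squares problem, and the parameter values are our own assumptions.

```python
import threading
import numpy as np

def asaga_worker(x, alpha, alpha_bar, D, supports, grad_fi, gamma, n_steps, seed):
    """One worker following the structure of Algorithm 2: inconsistent reads of the
    shared x, alpha_i and alpha_bar on the support S_i, then coordinate-wise
    additive writes (atomic in the real implementation, not here)."""
    n = len(supports)
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = int(rng.integers(n))
        S = supports[i]
        x_S = x[S].copy()                         # inconsistent read of x on S_i
        delta_alpha = grad_fi(i, x_S) - alpha[i][S]
        delta_x = -gamma * (delta_alpha + D[S] * alpha_bar[S])
        x[S] += delta_x                           # per-coordinate write
        alpha[i][S] += delta_alpha
        alpha_bar[S] += delta_alpha / n

# toy usage: sparse least squares f_i(x) = 0.5 * (a_i^T x - b_i)^2 on random supports
rng = np.random.default_rng(0)
n, d, k = 200, 50, 5
supports = [np.sort(rng.choice(d, size=k, replace=False)) for _ in range(n)]
A = np.zeros((n, d))
for i, S in enumerate(supports):
    A[i, S] = rng.standard_normal(k)
b = rng.standard_normal(n)
grad_fi = lambda i, x_S: A[i, supports[i]] * (A[i, supports[i]] @ x_S - b[i])
d_v = np.zeros(d)
for S in supports:
    d_v[S] += 1
D = n / np.maximum(d_v, 1.0)                      # diagonal renormalization entries

x = np.zeros(d)
alpha = np.zeros((n, d))
alpha_bar = np.zeros(d)
threads = [threading.Thread(target=asaga_worker,
                            args=(x, alpha, alpha_bar, D, supports, grad_fi,
                                  1e-2, 2000, s)) for s in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

A faithful implementation would be written in a compiled language with atomic fetch-and-add or compare-and-swap on the shared arrays, as described above.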


(a) Suboptimality as a function of time. (b) Speedup as a function of the number of cores.

Figure 2: Convergence and speedup for asynchronous stochastic gradient descent methods. We display results for RCV1 and URL. Results for Covtype can be found in Section ??.

We then examine the speedup relative to the increase in the number of cores. The speedup is measured as the time to achieve a suboptimality of $10^{-5}$ ($10^{-3}$ for Hogwild) with one core divided by the time to achieve the same suboptimality with several cores, averaged over 3 runs. Again, we choose the step size leading to the fastest convergence. Results are displayed in Figure 2.

As predicted by our theory, we observe linear "theoretical" speedups (i.e. in terms of number of iterations, see Section ??). However, with respect to running time the speedups seem to taper off after 20 cores. This phenomenon can be explained by the fact that our hardware model is by necessity a simplification of reality. As noted in ?, in a modern machine there is no such thing as shared memory. Each core has its own levels of cache (L1, L2, L3) in addition to RAM. The more cores are used, the lower in the memory stack information goes and the slower it gets. More experimentation is needed to quantify that effect and potentially get better performance.

6.5 Effect of sparsity

Sparsity plays an important role in our theoretical results: while it is necessary in the "ill-conditioned" regime to get linear speedups, it is not in the "well-conditioned" regime. We confront this with real-life experiments by comparing the convergence and speedup performance of our three asynchronous algorithms on the Covtype dataset, which is fully dense after standardization. The results appear in Figure ??.

While we still see a significant improvement in speed when increasing the number of cores, this improvement is smaller than the one we observe for sparser datasets. The speedups we observe are consequently smaller, and taper off earlier than on our other datasets. However, since the observed "theoretical" speedup is linear (see Section ??), we can attribute this worse performance to higher hardware overhead. This is expected because each update is fully dense and thus the shared parameters are much more heavily contended for than with our sparse datasets.

One thing we notice when computing the $\Delta$ variable for our datasets is that it often fails to capture the full sparsity distribution, being essentially a maximum. This means that $\Delta$
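The speedup metric described above can be computed directly from recorded (time, suboptimality) traces. The helper below is our own utility mirroring that measurement protocol (time to reach a target suboptimality with one core divided by the time with several cores); the function names and trace format are illustrative assumptions.

```python
def time_to_target(times, subopts, target):
    """First recorded wall-clock time at which suboptimality drops below target;
    returns None if the run never reaches it."""
    for t, s in zip(times, subopts):
        if s <= target:
            return t
    return None

def speedup(trace_1core, trace_pcores, target=1e-5):
    """trace_* are (times, suboptimalities) pairs; speedup = t_1core / t_pcores."""
    t1 = time_to_target(*trace_1core, target)
    tp = time_to_target(*trace_pcores, target)
    if t1 is None or tp is None:
        raise ValueError("target suboptimality not reached by one of the runs")
    return t1 / tp
```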

Figure 5: Overlap. Overlap as a function of the number of cores for both Asaga and Hogwild on all three datasets.

we have introduced a novel analysis of the algorithm and proven that under mild conditions Asaga is linearly faster than Saga. Our empirical benchmarks confirm speedups up to 10x. Our proof technique accommodates more realistic settings than is usually the case in the literature (such as inconsistent reads and writes and an unbounded gradient); we obtain tighter conditions than in previous work. In particular, we show that sparsity is not always necessary to get linear speedups. Furthermore, we have proposed a novel perspective to clarify an important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms. Our analysis of Hogwild and Kromagnon shows that it can easily be applied to other algorithms.

? have shown that Sag enjoys much improved performance when combined with non-uniform sampling and line-search. We have also noticed that our $\Delta_r$ constant (being essentially a maximum) sometimes fails to accurately represent the full sparsity distribution of our datasets. Finally, while our algorithm can be directly ported to a distributed master-worker architecture, its communication pattern would have to be optimized to avoid prohibitive costs. Limiting communications can be interpreted as artificially increasing the delay, yielding an interesting trade-off between delay influence and communication costs.

These constitute interesting directions for future analysis, as well as a further exploration of the $\tau$ term, which we have shown encompasses more complexity than previously thought.

Acknowledgments

We would like to thank Xinghao Pan for sharing with us their implementation of Kromagnon. This work was partially supported by the MSR-Inria Joint Center. FP acknowledges financial support from the "Chaire Économie des Nouvelles Données", under the auspices of Institut Louis Bachelier, Havas-Media and Université Paris-Dauphine.