
KADABRA is an ADaptive Algorithm for Betweenness via Random Approximation∗

Michele Borassi1 and Emanuele Natale2

1IMT Institute for Advanced Studies, 55100 Lucca, Italy, [email protected]

2Sapienza University of Rome, 00185 Roma, Italy, [email protected]

15th August 2016

Abstract

We present KADABRA, a new algorithm to approximate betweenness centrality in directed and undirected graphs, which significantly outperforms all previous approaches on real-world complex networks. The efficiency of the new algorithm relies on two new theoretical contributions, of independent interest.

The first contribution focuses on sampling shortest paths, a subroutine used by most algorithms that approximate betweenness centrality. We show that, on realistic random graph models, we can perform this task in time $|E|^{\frac{1}{2}+o(1)}$ with high probability, obtaining a significant speedup with respect to the Θ(|E|) worst-case performance. We experimentally show that this new technique achieves similar speedups on real-world complex networks as well.

The second contribution is a new rigorous application of the adaptive sampling technique. This approach decreases the total number of shortest paths that need to be sampled to compute all betweenness centralities with a given absolute error, and it also handles more general problems, such as computing the k most central nodes. Furthermore, our analysis is general, and it might be extended to other settings as well.

1 Introduction

In this work we focus on estimating the betweenness centrality, which is one of the most famous measures of centrality for nodes and edges of real-world complex networks [21, 33]. The rigorous definition of betweenness centrality has its roots in sociology, dating back to the Seventies, when Freeman formalized the informal concept discussed in the previous decades in different scientific communities [6, 43, 42, 19, 15], although the definition already appeared in [4]. Since then, this notion has been very successful in network science [48, 34, 25, 33].

A probabilistic way to define the betweenness centrality¹ bc(v) of a node v in a graph G = (V, E) is the following. We choose two nodes s and t, and we go from s to t through a shortest path π; if the choices of s, t and π are made uniformly at random, the betweenness centrality of a node v is the probability that we pass through v.

In a seminal paper [16], Brandes showed that it is possible to exactly compute the betweenness centrality of all the nodes in a graph in time O(mn), where n is the number of nodes and m is the number of edges. A corresponding lower bound was proved in [13]: if we are able to compute the betweenness centrality of a single node in time $O(mn^{1-\varepsilon})$ for some ε > 0, then the Strong Exponential Time Hypothesis [26] is false.

∗ This work was done while the authors were visiting the Simons Institute for the Theory of Computing.

¹ As explained in Section 2, to simplify notation we consider the normalized betweenness centrality.

arXiv:1604.08553v4 [cs.DS] 12 Aug 2016

This result further motivates the rich line of research on computing approximations of betweenness centrality, with the goal of trading precision for efficiency. The main idea is to define a probability distribution over the set of all paths, by choosing two uniformly random nodes s, t, and then a uniformly distributed st-path π, so that Pr(v ∈ π) = bc(v). As a consequence, we can approximate bc(v) by sampling paths $\pi_1, \dots, \pi_\tau$ according to this distribution, and estimating $b(v) := \frac{1}{\tau}\sum_{i=1}^{\tau} X_i(v)$, where $X_i(v) = 1$ if $v \in \pi_i$ (and $v \neq s, t$), and 0 otherwise.

The tricky part of this approach is to provide probabilistic guarantees on the quality of this approximation: the goal is to obtain a 1 − δ confidence interval $I(v) = [b(v) - \lambda_L, b(v) + \lambda_U]$ for bc(v), which means that Pr(∀v ∈ V, bc(v) ∈ I(v)) ≥ 1 − δ. Thus, research on approximating betweenness centrality has focused on obtaining, as fast as possible, the smallest possible I.

Our Contribution

In this work, we propose a new and faster algorithm to approximate betweenness centrality in directed and undirected graphs, named KADABRA. In the standard task of approximating betweenness centralities with absolute error at most λ, we show that, on average, the new algorithm is more than 100 times faster than the previous ones, on graphs with approximately 10 000 nodes. Moreover, unlike previous approaches, our algorithm can perform more general tasks, since it does not need all confidence intervals to be equal. As an example, we consider the computation of the k most central nodes: all previous approaches compute all centralities with an error λ, and use this approximation to obtain the ranking. Conversely, our approach allows us to use small confidence intervals only where they are needed, and allows bigger confidence intervals for nodes whose centrality values are “well separated”. This way, we can compute for the first time an approximation of the k most central nodes in networks with millions of nodes and hundreds of millions of edges, like the Wikipedia citation network and the IMDB actor collaboration network.

Our results rely on two main theoretical contributions, which are interesting in their own right, since their generality naturally extends to other applications.

Balanced bidirectional breadth-first search. By leveraging recent advanced results, we prove that, on many realistic random models of real-world complex networks, it is possible to sample a random path between two nodes s and t in time $m^{\frac{1}{2}+o(1)}$ if the degree distribution has finite second moment, or $m^{\frac{4-\beta}{2}+o(1)}$ if the degree distribution is power law with exponent 2 < β < 3. The models considered are the Configuration Model [11], and all Rank-1 Inhomogeneous Random Graph models [45, Chapter 3], such as the Chung-Lu model [32], the Norros-Reittu model [35], and the Generalized Random Graph [45, Chapter 3]. Our proof techniques have the merit of adopting a unified approach that simultaneously works in all models considered. These models represent well the metric properties of real-world networks [14]: indeed, our results are confirmed by practical experiments.

The algorithm used is simply a balanced bidirectional BFS (bb-BFS): we perform a BFS from each of the two endpoints s and t, in such a way that the two BFSs are likely to explore about the same number of edges, and we stop as soon as the two BFSs “touch each other”. Rather surprisingly, this technique was never implemented to approximate betweenness centrality, and it is rarely used in the experimental algorithm community. Our theoretical analysis provides a clear explanation of the reason why this technique improves over the standard BFS: this means that many state-of-the-art algorithms for real-world complex networks can be improved by the bb-BFS.


Adaptive sampling made rigorous. To speed up the estimation of the betweenness centrality, previous works make use of the technique of adaptive sampling, which consists in testing during the execution of the algorithm whether some condition on the sample obtained so far has been met, and terminating the execution of the algorithm as soon as this happens. However, this technique introduces a subtle stochastic dependence between the time at which the algorithm terminates and the correctness of the given output, which previous papers claiming a formal analysis of the technique did not realize (see Section 3 for details). With an argument based on martingale theory, we provide a general analysis of this useful technique. Through this result, we not only improve previous estimators, but we also make it possible to define more general stopping conditions, which can be decided “on the fly”: this way, with minor modifications, we can adapt our algorithm to perform more general tasks than previous ones.

To better illustrate the power of our techniques, we focus on unweighted, static graphs, and on the centrality of nodes. However, our algorithm can be easily adapted to compute the centrality of edges and to handle weighted graphs; moreover, since its core part consists merely in sampling paths, we conjecture that it may be coupled with the existing techniques in [9] to handle dynamic graphs.

Related Work

Computing Betweenness Centrality. With the recent advent of big data, the major shortcoming of betweenness centrality has been the lack of efficient methods to compute it [16]. In the worst case, the best exact algorithm to compute the centrality of all the nodes is due to Brandes [16], and its time complexity is O(mn): the basic idea of the algorithm is to define the dependency $\delta_s(v) = \sum_{t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$, which can be computed in time O(m), for each v ∈ V (we denote by $\sigma_{st}(v)$ the number of shortest paths from s to t passing through v, and by $\sigma_{st}$ the number of st-shortest paths). In [13], it is also shown that Brandes' algorithm is almost optimal on sparse graphs: an algorithm that computes the betweenness centrality of a single vertex in time $O(mn^{1-\varepsilon})$ falsifies widely believed complexity assumptions, such as the Strong Exponential Time Hypothesis [26], the Orthogonal Vector conjecture [2], or the Hitting Set conjecture [49]. Corresponding results in the dense, weighted case are available in [1]: computing the betweenness centrality exactly is as hard as computing the All Pairs Shortest Path, and computing an approximation with a given relative error is as hard as computing the diameter. For both these problems, no algorithm with running time $O(n^{3-\varepsilon})$ is known, for any ε > 0. This shows that, for dense graphs, having an additive approximation rather than a multiplicative one is essential for a provably fast algorithm to exist.

These negative results further motivate the already rich line of research on approaches that overcome this barrier. A first possibility is to use heuristics, which do not provide analytical guarantees on their performance [41, 23, 46]. Another line of research has defined variants of betweenness centrality that might be easier to compute [17, 36, 20]. Finally, a third line of research has investigated approximation algorithms, which trade accuracy for speed [27, 18, 25, 29]. Our work follows the latter approach.

The first approximation algorithm proposed in the literature [27] adapts Eppstein and Wang's approach for computing closeness centrality [22], using Hoeffding's inequality and the union bound technique. This way, it is possible to obtain an estimate of the betweenness centrality of every node that is correct up to an additive error λ with probability at least 1 − δ, by sampling $O\left(\frac{D^2}{\lambda^2} \log \frac{n}{\delta}\right)$ nodes, where D is the diameter of the graph. In [25], it is shown that this can lead to an overestimation. Riondato and Kornaropoulos improve this sampling-based approach by sampling single shortest paths instead of the whole dependency of a node [39], introducing the use of the VC-dimension. As a result, the number of samples is decreased to $\frac{c}{\lambda^2}\left(\lfloor \log_2(VD-2) \rfloor + 1 + \log\frac{1}{\delta}\right)$, where VD is the vertex diameter, that is, the maximum number of nodes in a shortest path in G (it can be different from D + 1 if the graph is weighted). This use of the VC-dimension is further developed and generalized in [40]. Finally, many of these results were adapted to handle dynamic networks [9, 40].

Approximating the top-k betweenness centrality set. Let us order the nodes $v_1, \dots, v_n$ such that $\mathrm{bc}(v_1) \ge \dots \ge \mathrm{bc}(v_n)$ and define $TOP(k) = \{(v_i, \mathrm{bc}(v_i)) : i \le k\}$. In [39] and [40], the authors provide an algorithm that, for any given δ, ε, with probability 1 − δ outputs a set $\widetilde{TOP}(k) = \{(v_i, b(v_i))\}$ such that: i) if $v \in TOP(k)$, then $v \in \widetilde{TOP}(k)$ and $|\mathrm{bc}(v) - b(v)| \le \varepsilon\,\mathrm{bc}(v)$; ii) if $v \in \widetilde{TOP}(k)$ but $v \notin TOP(k)$, then $b(v) \le (b_k - \varepsilon)(1 + \varepsilon)$, where $b_k$ is the k-th largest betweenness given by a preliminary phase of the algorithm.

Adaptive sampling. In [5, 40], the number of samples required is substantially reduced using the adaptive sampling technique introduced by Lipton and Naughton in [31, 30]. Let us clarify that, by adaptive sampling, we mean that the termination of the sampling process depends on the sample observed so far (in other cases, the same expression refers to the fact that the distribution of the new samples is a function of the previous ones [3], while the sample size is fixed in advance). Except for [37], previous approaches tacitly assume that there is little dependency between the stopping time and the correctness of the output: indeed, they prove that, for each fixed τ, the probability that the estimate is wrong at time τ is below δ. However, the stopping time $\boldsymbol{\tau}$ (in bold, to distinguish it from a fixed value τ) is a random variable, and in principle there might be dependency between the event $\boldsymbol{\tau} = \tau$ and the event that the estimate is correct at time τ. As for [37], the authors consider a specific stopping condition, and their proof technique does not seem to extend to other settings. For a more thorough discussion of this issue, we refer the reader to Section 3.

Bidirectional BFS. The possibility of speeding up a breadth-first search for the shortest-path problem by performing, at the same time, a BFS from the final end-point, has been considered since the Seventies [38]. Unfortunately, because of the lack of theoretical results dealing with its efficiency, the bidirectional BFS has apparently not been considered a fundamental heuristic improvement [28]. However, in [39] (and in some public talks by M. Riondato), the bidirectional BFS was proposed as a possible way to improve the performance of betweenness centrality approximation algorithms.

Structure of the Paper

In Section 2, we describe our algorithm, and in Section 3 we discuss the main difficulty of the adaptive sampling, and the reasons why our techniques are not affected. In Section 4, we define the balanced bidirectional BFS, and we sketch the proof of its efficiency on random graphs. In Section 5, we show that our algorithm can be adapted to compute the k most central nodes. In Section 6 we experimentally show the effectiveness of our new algorithm. Finally, all our proofs are in the appendix.

2 Algorithm Overview

To simplify notation, we always consider the normalized betweenness centrality of a node v, which is defined by:

$$\mathrm{bc}(v) = \frac{1}{n(n-1)} \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths between s and t, and $\sigma_{st}(v)$ is the number of shortest paths between s and t that pass through v. Furthermore, to simplify the exposition, we use bold symbols to denote random variables, and light symbols to denote deterministic quantities. In line with previous works, our algorithm samples random paths $\pi_1, \dots, \pi_\tau$, where $\pi_i$ is chosen by selecting uniformly at random two nodes s, t, and then selecting uniformly at random one of the shortest paths from s to t. Then, it estimates bc(v) with $b(v) := \frac{1}{\tau}\sum_{i=1}^{\tau} X_i(v)$, where $X_i(v) = 1$ if $v \in \pi_i$, and 0 otherwise. By definition of $\pi_i$, $\mathbb{E}[b(v)] = \mathrm{bc}(v)$.
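To make the definition concrete, the following is a minimal brute-force sketch (not the algorithm of this paper) that evaluates the normalized formula directly on a small undirected, unweighted graph. The function names and the adjacency-dictionary representation are assumptions made for illustration; the sketch uses the fact that $\sigma_{st}(v) = \sigma_{sv}\sigma_{vt}$ whenever v lies on a shortest st-path, and it runs in cubic time, so it is only meant for tiny inputs.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): BFS distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:                # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:       # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def exact_betweenness(adj):
    """Normalized betweenness of every node (hypothetical helper, brute force)."""
    nodes = list(adj)
    n = len(nodes)
    info = {s: bfs_counts(adj, s) for s in nodes}
    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest s-t path iff d(s,v) + d(v,t) = d(s,t)
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    bc[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return {v: bc[v] / (n * (n - 1)) for v in nodes}

# Path a - b - c: only the ordered pairs (a, c) and (c, a) pass through b.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(exact_betweenness(path))
```

On the three-node path, the middle node is crossed by 2 of the n(n − 1) = 6 ordered pairs, so its normalized betweenness is 1/3, while the endpoints have betweenness 0.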

The tricky part is to bound the distance between b(v) and its expected value. With a straightforward application of Hoeffding's inequality (Lemma 5 in the appendix), it is possible to prove that $\Pr\left(|b(v) - \mathrm{bc}(v)| \ge \lambda\right) \le 2e^{-2\tau\lambda^2}$. A direct application of this inequality considers a union bound on all possible nodes v, obtaining $\Pr\left(\exists v \in V, |b(v) - \mathrm{bc}(v)| \ge \lambda\right) \le 2ne^{-2\tau\lambda^2}$. This means that the algorithm can safely stop as soon as $2ne^{-2\tau\lambda^2} \le \delta$, that is, after $\tau = \frac{1}{2\lambda^2}\log\left(\frac{2n}{\delta}\right)$ steps.
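As a small numerical illustration of this non-adaptive bound, the sketch below evaluates the sample size given by the formula above and the corresponding union-bound failure probability. The function names are hypothetical and the parameter values are arbitrary.

```python
import math

def hoeffding_sample_size(n, lam, delta):
    """Smallest integer tau with 2 * n * exp(-2 * tau * lam**2) <= delta."""
    return math.ceil(math.log(2 * n / delta) / (2 * lam ** 2))

def union_bound_failure(n, lam, tau):
    """Union bound on Pr(some node has absolute error >= lam) after tau samples."""
    return 2 * n * math.exp(-2 * tau * lam ** 2)

n, lam, delta = 10_000, 0.01, 0.1
tau = hoeffding_sample_size(n, lam, delta)
print(tau, union_bound_failure(n, lam, tau))
```

With these values the bound asks for roughly 61 000 samples, regardless of the structure of the graph; this fixed cost is what the adaptive approach of this section tries to reduce.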

In order to improve this idea, we can start from Lemma 7 in the appendix, instead of Hoeffding's inequality, obtaining that $\Pr\left(|b(v) - \mathrm{bc}(v)| \ge \lambda\right) \le 2\exp\left(-\frac{\tau\lambda^2}{2(\mathrm{bc}(v) + \lambda/3)}\right)$. If we assume the error λ to be small, this inequality is stronger than the previous one for all values of $\mathrm{bc}(v) < \frac{1}{4}$ (a condition which holds for almost all nodes, in almost all graphs considered). However, in order to apply this inequality, we have to deal with the fact that we do not know bc(v) in advance, and hence we do not know when to stop. Intuitively, to solve this problem, we make a “change of variable”, and we rewrite the previous inequality as

$$\Pr\left(\mathrm{bc}(v) \le b(v) - f\right) \le \delta_L^{(v)} \quad\text{and}\quad \Pr\left(\mathrm{bc}(v) \ge b(v) + g\right) \le \delta_U^{(v)}, \tag{1}$$

for some functions $f = f(b(v), \delta_L^{(v)}, \tau)$, $g = g(b(v), \delta_U^{(v)}, \tau)$. Our algorithm fixes at the beginning the values $\delta_L^{(v)}, \delta_U^{(v)}$ for each node v, and, at each step, it tests if $f(b(v), \delta_L^{(v)}, \tau)$ and $g(b(v), \delta_U^{(v)}, \tau)$ are small enough. If this condition is satisfied, the algorithm stops. Note that this approach lets us define very general stopping conditions, which might depend on the centralities computed until now, on the single nodes, and so on.
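The comparison between the Hoeffding bound and the bound obtained from Lemma 7 can be checked by comparing the two exponents directly, since both bounds share the same leading factor 2. The short derivation below is only a sanity check of the threshold quoted above; the numerical values at the end are chosen for illustration.

```latex
\frac{\tau\lambda^2}{2\left(\mathrm{bc}(v) + \lambda/3\right)} \;\ge\; 2\tau\lambda^2
\;\iff\; \mathrm{bc}(v) + \frac{\lambda}{3} \;\le\; \frac{1}{4}
\;\iff\; \mathrm{bc}(v) \;\le\; \frac{1}{4} - \frac{\lambda}{3}.
% Example: bc(v) = 0.01 and \lambda = 0.01 give
% \frac{1}{2(0.01 + 0.01/3)} = 37.5, against the Hoeffding coefficient 2,
% so the exponent is 18.75 times larger.
```

For small λ the threshold is close to 1/4, and for nodes with small centrality the exponent grows by a large factor, which translates into proportionally fewer samples for the same confidence.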

Remark 1. Instead of fixing the values $\delta_L^{(v)}, \delta_U^{(v)}$ at the beginning, one might want to decide them during the algorithm, depending on the outcome. However, this is not formally correct, because of dependency issues (for example, (1) does not even make sense if $\delta_L^{(v)}, \delta_U^{(v)}$ are random). Finding a way to overcome this issue is left as a challenging open problem (more details are provided in Section 3).

In order to implement this idea, we still need to solve an issue: (1) holds for each fixed time τ, but the stopping time of our algorithm is a random variable $\boldsymbol{\tau}$, and there might be dependency between the value of $\boldsymbol{\tau}$ and the probability in (1). To this purpose, we use a stronger inequality (Theorem 8 in the appendix), which holds even if $\boldsymbol{\tau}$ is a random variable. However, to use this inequality, we need to assume that $\boldsymbol{\tau} \le \omega$ for some deterministic ω: in our algorithm, we choose $\omega = \frac{c}{\lambda^2}\left(\lfloor \log_2(VD-2) \rfloor + 1 + \log\frac{2}{\delta}\right)$, because, by the results in [39], after ω samples, the maximum error is at most λ, with probability $1 - \frac{\delta}{2}$. Furthermore, f and g should also be modified, since they now depend on the value of ω. The pseudocode of the algorithm obtained is available in Algorithm 1 (as was done in previous approaches, we can easily parallelize the while loop in Line 5).

The correctness of the algorithm follows from the following theorem, which is the base of our adaptive sampling, and which we prove in Appendix C (where we also define the functions f and g).

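Theorem 2. Let b(v) be the output of Algorithm 1, and let $\boldsymbol{\tau}$ be the number of samples at the end of the algorithm. Then, with probability 1 − δ, the following conditions hold:

• if $\boldsymbol{\tau} = \omega$, $|b(v) - \mathrm{bc}(v)| < \lambda$ for all v;

• if $\boldsymbol{\tau} < \omega$, $-f(\boldsymbol{\tau}, b(v), \delta_L^{(v)}, \omega) \le \mathrm{bc}(v) - b(v) \le g(\boldsymbol{\tau}, b(v), \delta_U^{(v)}, \omega)$ for all v.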


Algorithm 1: our algorithm for approximating betweenness centrality.

Input: a graph G = (V, E)
Output: for each v ∈ V, an approximation b(v) of bc(v) such that Pr(∀v, |b(v) − bc(v)| ≤ λ) ≥ 1 − δ

1 ω ← (c/λ²) (⌊log₂(VD − 2)⌋ + 1 + log(2/δ));
2 (δ_L^(v), δ_U^(v)) ← computeDelta();
3 τ ← 0;
4 foreach v ∈ V do b(v) ← 0;
5 while τ < ω and not haveToStop(b, δ_L, δ_U, ω, τ) do
6     π = samplePath();
7     foreach v ∈ π do b(v) ← b(v) + 1;
      τ ← τ + 1;
8 end
9 foreach v ∈ V do b(v) ← b(v)/τ;
  return b

Remark 3. This theorem says that, at the beginning of the algorithm, we know that, with probability 1 − δ, one of the two conditions will hold when the algorithm stops, independently of the final value of $\boldsymbol{\tau}$. This is essential to avoid the stochastic dependence that we discuss in Section 3.

In order to apply this theorem, we choose λ such that our goal is reached if all centralities are known with error at most λ. Then, we choose the function haveToStop in a way that our goal is reached if the stopping condition is satisfied. This way, our algorithm is correct, both if $\boldsymbol{\tau} = \omega$ and if $\boldsymbol{\tau} < \omega$. For example, if we want to compute all centralities with bounded absolute error, we simply choose λ as the bound we want to achieve, and we plug the stopping condition f, g ≤ λ in the function haveToStop. Instead, if we want to compute an approximation of the k most central nodes, we need a different definition of f and g, which is provided in Section 5.

To complete the description of this algorithm, we need to specify the following functions.

computeDelta. The algorithm works for any choice of the $\delta_L^{(v)}$s and $\delta_U^{(v)}$s, but a good choice yields better running times. We propose a heuristic way to choose them in Appendix D.

samplePath. In order to sample a path between two random nodes s and t, we use a balanced bidirectional BFS, which is defined in Appendix E.
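The control flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative skeleton under explicit assumptions, not the authors' C++ implementation: `omega` is passed in directly rather than computed from c and VD, `have_to_stop` is a hypothetical callback whose default never stops early (the functions f and g are defined only in the appendix), the path sampler uses a plain BFS instead of the balanced bidirectional BFS, and the endpoints s and t are not counted, following the definition of $X_i(v)$ in the Introduction.

```python
import random
from collections import deque

def sample_path(adj, rng):
    """Inner nodes of a uniformly random shortest path between two random nodes."""
    s, t = rng.sample(list(adj), 2)
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w], preds[w] = dist[u] + 1, 0, []
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    if t not in dist:
        return []                      # s and t are disconnected
    path, v = [], t
    while v != s:
        # Picking a predecessor u with probability sigma[u] / sigma[v]
        # makes every shortest s-t path equally likely.
        v = rng.choices(preds[v], weights=[sigma[u] for u in preds[v]])[0]
        path.append(v)
    return path[:-1]                   # drop s; t was never appended

def kadabra_skeleton(adj, omega, have_to_stop=lambda b, tau: False, seed=0):
    """Sampling loop of Algorithm 1 with a pluggable (hypothetical) stopping test."""
    rng = random.Random(seed)
    b = {v: 0 for v in adj}
    tau = 0
    while tau < omega and not have_to_stop(b, tau):
        for v in sample_path(adj, rng):
            b[v] += 1
        tau += 1
    return {v: b[v] / max(tau, 1) for v in adj}, tau

# Star graph: 12 of the 20 ordered pairs pass through the center, so bc(c) = 0.6.
star = {"c": [1, 2, 3, 4], 1: ["c"], 2: ["c"], 3: ["c"], 4: ["c"]}
estimate, samples = kadabra_skeleton(star, omega=2000)
print(samples, estimate["c"])
```

Keeping the stopping test as a callback mirrors the design described above: the same loop serves both the absolute-error task and the top-k task, and only `have_to_stop` changes.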

3 Adaptive Sampling

In this section, we highlight the main technical difficulty in the formalization of adaptive sampling, which previous works claiming analogous results did not address. Furthermore, we sketch the way we overcome this difficulty: our argument is quite general, and it could be easily adapted to formalize these claims.

As already said, the problem is the stochastic dependence between the time $\boldsymbol{\tau}$ at which the algorithm terminates and the event $A_\tau$ = “at time τ, the estimate is within the required distance from the true value”, since both $\boldsymbol{\tau}$ and $A_\tau$ are functions of the same random sample. Since it is typically possible to prove that $\Pr(\neg A_\tau) \le \delta$ for every fixed τ, one may be tempted to argue that also $\Pr(\neg A_{\boldsymbol{\tau}}) \le \delta$, by applying these inequalities at time $\boldsymbol{\tau}$. However, this is not correct: indeed, if we have no assumptions on $\boldsymbol{\tau}$, then $\boldsymbol{\tau}$ could even be defined as the smallest τ such that $A_\tau$ does not hold!
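A toy simulation can make this pitfall tangible. The sketch below is not taken from the paper: it estimates the bias of a fair coin and compares the failure rate at a fixed time with the failure rate at the adversarial stopping time described above (the first time, after a short warm-up, at which the estimate is off by at least λ). All parameter values and names are assumptions chosen for illustration.

```python
import random

def failure_rates(p=0.5, lam=0.2, t_min=20, t_max=200, runs=2000, seed=1):
    """Return (failure rate at fixed time t_max, failure rate at adversarial stop)."""
    rng = random.Random(seed)
    fixed = adversarial = 0
    for _ in range(runs):
        heads = 0
        ever_wrong = False
        for t in range(1, t_max + 1):
            heads += rng.random() < p
            wrong = abs(heads / t - p) >= lam   # event "not A_t"
            if t >= t_min and wrong:
                ever_wrong = True               # the adversary would stop here
        fixed += wrong                          # estimate wrong at time t_max
        adversarial += ever_wrong               # estimate wrong at the stopping time
    return fixed / runs, adversarial / runs

fixed_rate, adversarial_rate = failure_rates()
print(fixed_rate, adversarial_rate)
```

The per-time failure probability is small at every fixed t, yet the estimate returned at the adversarial stopping time is wrong whenever the trajectory ever leaves the band, which happens far more often than at the fixed time `t_max`.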

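More formally, if we want to link $\Pr(\neg A_{\boldsymbol{\tau}})$ to $\Pr(\neg A_\tau)$, we have to use the law of total probability, which says that:

$$\Pr(\neg A_{\boldsymbol{\tau}}) = \sum_{\tau=1}^{\infty} \Pr(\neg A_\tau \mid \boldsymbol{\tau} = \tau)\Pr(\boldsymbol{\tau} = \tau) \tag{2}$$

$$= \Pr(\neg A_{\boldsymbol{\tau}} \mid \boldsymbol{\tau} < \tau)\Pr(\boldsymbol{\tau} < \tau) + \Pr(\neg A_{\boldsymbol{\tau}} \mid \boldsymbol{\tau} \ge \tau)\Pr(\boldsymbol{\tau} \ge \tau). \tag{3}$$

Then, if we want to bound $\Pr(\neg A_{\boldsymbol{\tau}})$, we need to assume that

$$\Pr(\neg A_\tau \mid \boldsymbol{\tau} = \tau) \le \Pr(\neg A_\tau) \quad\text{or that}\quad \Pr(\neg A_{\boldsymbol{\tau}} \mid \boldsymbol{\tau} \ge \tau) \le \Pr(\neg A_\tau), \tag{4}$$

which would allow us to bound (2) or (3) from above. The inequalities in (4) are implicitly assumed to be true in previous works adopting adaptive sampling techniques. Unfortunately, because of the stochastic dependence, it is quite difficult to prove such inequalities, even if some approaches managed to overcome these difficulties [37].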

For this reason, our proofs avoid dealing with such relations: in the proof of Theorem 2, we fix a deterministic time ω, we impose that $\boldsymbol{\tau} \le \omega$, and we apply the inequalities with τ = ω. Then, using martingale theory, we convert results that hold at time ω to results that hold at the stopping time $\boldsymbol{\tau}$ (see Appendix C).

4 Balanced Bidirectional BFS

A major improvement of our algorithm, with respect to previous counterparts, is that we sample shortest paths through a balanced bidirectional BFS, instead of a standard BFS. In this section, we describe this technique, and we bound its running time on realistic models of random graphs, with high probability. The idea behind this technique is very simple: if we need to sample a uniformly random shortest path from s to t, instead of performing a full BFS from s until we reach t, we perform at the same time a BFS from s and a BFS from t, until the two BFSs touch each other (if the graph is directed, we perform a “forward” BFS from s and a “backward” BFS from t).

More formally, assume that we have visited up to level $l_s$ from s and to level $l_t$ from t, let $\Gamma^{l_s}(s)$ be the set of nodes at distance $l_s$ from s, and similarly let $\Gamma^{l_t}(t)$ be the set of nodes at distance $l_t$ from t. If $\sum_{v \in \Gamma^{l_s}(s)} \deg(v) \le \sum_{v \in \Gamma^{l_t}(t)} \deg(v)$, we process all nodes in $\Gamma^{l_s}(s)$, otherwise we process all nodes in $\Gamma^{l_t}(t)$ (since the time needed to process level $l_s$ is proportional to $\sum_{v \in \Gamma^{l_s}(s)} \deg(v)$, this choice minimizes the time needed to visit the next level). Assume that we are processing the node $v \in \Gamma^{l_s}(s)$ (the other case is analogous). For each neighbor w of v we do the following:

• if w was never visited, we add w to $\Gamma^{l_s+1}(s)$;

• if w was already visited in the BFS from s, we do not do anything;

• if w was visited in the BFS from t, we add the edge (v, w) to the set Π of candidate edges in the shortest path.

After we have processed a level, we stop if $\Gamma^{l_s}(s)$ or $\Gamma^{l_t}(t)$ is empty (in this case, s and t are not connected), or if Π is not empty. In the latter case, we select an edge from Π, so that the probability of choosing the edge (v, w) is proportional to $\sigma_{sv}\sigma_{wt}$ (we recall that $\sigma_{xy}$ is the number of shortest paths from x to y, and it can be computed during the BFS as in [18]). Then, the path is selected by considering the concatenation of a random path from s to v, the edge (v, w), and a random path from w to t. These random paths can be easily chosen by backtracking, as shown in [39] (since the number of paths might be exponential in the input size, in order to avoid pathological cases, we assume that we can perform arithmetic operations in O(1) time).
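The procedure described above can be sketched as follows for undirected, unweighted graphs stored as adjacency dictionaries. This is a simplified illustration, not the implementation evaluated in Section 6: the name `bb_bfs_sample` and the data layout are assumptions, directed graphs (forward and backward searches) are not handled, and the backtracking step scans the neighbors of each node instead of storing predecessor lists.

```python
import random

def bb_bfs_sample(adj, s, t, rng):
    """Uniformly random shortest s-t path via balanced bidirectional BFS, or None."""
    if s == t:
        return [s]
    dist = [{s: 0}, {t: 0}]        # side 0 grows from s, side 1 grows from t
    sigma = [{s: 1}, {t: 1}]       # shortest-path counts from s and from t
    frontier = [[s], [t]]          # last level reached on each side

    def walk_back(x, side):
        """Random shortest path from x back to the root of the given side."""
        path = [x]
        while dist[side][x] > 0:
            prev = [u for u in adj[x] if dist[side].get(u) == dist[side][x] - 1]
            x = rng.choices(prev, weights=[sigma[side][u] for u in prev])[0]
            path.append(x)
        return path

    while frontier[0] and frontier[1]:
        # Balance: expand the side whose frontier has the smaller total degree.
        work = [sum(len(adj[v]) for v in frontier[i]) for i in (0, 1)]
        i = 0 if work[0] <= work[1] else 1
        o = 1 - i
        next_level, candidates = [], []
        for v in frontier[i]:
            for w in adj[v]:
                if w in dist[o]:                 # the two searches touch
                    candidates.append((v, w))
                    continue
                if w not in dist[i]:             # w was never visited
                    dist[i][w] = dist[i][v] + 1
                    sigma[i][w] = 0
                    next_level.append(w)
                if dist[i][w] == dist[i][v] + 1:
                    sigma[i][w] += sigma[i][v]
        if candidates:
            # Edge (v, w) lies on sigma_i(v) * sigma_o(w) shortest paths.
            weights = [sigma[i][v] * sigma[o][w] for v, w in candidates]
            v, w = rng.choices(candidates, weights=weights)[0]
            a, b = (v, w) if i == 0 else (w, v)  # a on the s side, b on the t side
            return walk_back(a, 0)[::-1] + walk_back(b, 1)
        frontier[i] = next_level
    return None                                   # s and t are not connected

rng = random.Random(0)
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(bb_bfs_sample(cycle, 0, 3, rng))
```

On the 6-cycle, nodes 0 and 3 are joined by two shortest paths, and repeated calls return each of them about half of the time. Candidate edges are collected only after a whole level has been processed, which is what makes the choice uniform over all shortest paths.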


4.1 Analysis on Random Graphs

In order to show the effectiveness of the balanced bidirectional BFS, we bound its running time in several models of random graphs: the Configuration Model (CM, [11]), and Rank-1 Inhomogeneous Random Graph models (IRG, [45, Chapter 3]), such as the Chung-Lu model [32], the Norros-Reittu model [35], and the Generalized Random Graph [45, Chapter 3]. In these models, we fix the number n of nodes, and we give a weight $\rho_u$ to each node. In the CM, we create edges by giving $\rho_u$ half-edges to each node u, and pairing these half-edges uniformly at random; in IRG we connect each pair of nodes (u, v) independently with probability close to $\rho_u\rho_v / \sum_{w \in V} \rho_w$. With some technical assumptions discussed in Appendix E, we prove the following theorem.
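As an illustration of the IRG family, the sketch below generates a Chung-Lu graph, in which each pair (u, v) is connected independently with probability $\min(1, \rho_u\rho_v / \sum_w \rho_w)$. The function name is hypothetical, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
import random

def chung_lu_graph(weights, seed=0):
    """Adjacency dictionary of a Chung-Lu random graph with the given node weights."""
    rng = random.Random(seed)
    n = len(weights)
    total = sum(weights)
    adj = {u: [] for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            # Connect u and v independently with probability rho_u * rho_v / total.
            if rng.random() < min(1.0, weights[u] * weights[v] / total):
                adj[u].append(v)
                adj[v].append(u)
    return adj

graph = chung_lu_graph([3.0] * 200)
average_degree = sum(len(neighbors) for neighbors in graph.values()) / len(graph)
print(average_degree)
```

With constant weights the expected degree of each node is close to its weight, which is the property used in the proof sketch below when degrees are replaced by weights.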

Theorem 4. Let G be a graph generated through the aforementioned models. Then, for each fixed ε > 0, and for each pair of nodes s, t, w.h.p., the time needed to compute an st-shortest path through a bidirectional BFS is $O(n^{\frac{1}{2}+\varepsilon})$ if the degree distribution λ has finite second moment, and $O(n^{\frac{4-\beta}{2}+\varepsilon})$ if λ is a power law distribution with 2 < β < 3.

Sketch of proof. The idea of the proof is that the time needed by a bidirectional BFS is proportional to the number of visited edges, which is close to the sum of the degrees of the visited nodes, which are very close to their weights. Hence, we have to analyze the weights of the visited nodes: for this reason, if V′ is a subset of V, we define the volume of V′ as $\rho_{V'} = \sum_{v \in V'} \rho_v$.

Our visit proceeds by “levels” in the BFS trees from s and t: if we never process a level with total weight at least $n^{\frac{1}{2}+\varepsilon}$, since the diameter is O(log n), the volume of the set of processed vertices is $O(n^{\frac{1}{2}+\varepsilon}\log n)$, and the number of visited edges cannot be much bigger (for example, this happens if s and t are not connected). Otherwise, assume that, at some point, we process a level $l_s$ in the BFS from s with total weight $n^{\frac{1}{2}+\varepsilon}$: then, the corresponding level $l_t$ in the BFS from t also has weight $n^{\frac{1}{2}+\varepsilon}$ (otherwise, we would have expanded from t, because weights and degrees are strongly correlated). We use the “birthday paradox”: level $l_s + 1$ in the BFS from s and level $l_t + 1$ in the BFS from t are random sets of nodes with size close to $n^{\frac{1}{2}+\varepsilon}$, and hence there is a node that is common to both, w.h.p. This means that the time needed by the bidirectional BFS is proportional to the volume of all levels in the BFS tree from s, until $l_s$, plus the volume of all levels in the BFS tree from t, until $l_t$ (note that we do not expand levels $l_s + 1$ and $l_t + 1$). All levels except the last have volume at most $n^{\frac{1}{2}+\varepsilon}$, and there are O(log n) such levels because the diameter is O(log n): it only remains to estimate the volume of the last level.

By definition of the models, the probability that a node v with weight $\rho_v$ belongs to the last level is about $\frac{\rho_v \rho_{\Gamma^{l_s-1}(s)}}{M} \le \rho_v n^{-\frac{1}{2}+\varepsilon}$: hence, the expected volume of $\Gamma^{l_s}(s)$ is at most $\sum_{v \in V} \rho_v \Pr(v \in \Gamma^{l_s}(s)) \le \sum_{v \in V} \rho_v^2 n^{-\frac{1}{2}+\varepsilon}$. Through standard concentration inequalities, we prove that this random variable is concentrated: hence, we only need to compute this expected value. If the degree distribution has finite second moment, then $\sum_{v \in V} \rho_v^2 = O(n)$, concluding the proof. If the degree distribution is power law with 2 < β < 3, then we have to consider separately nodes v such that $\rho_v < n^{\frac{1}{2}}$ and such that $\rho_v > n^{\frac{1}{2}}$. In the first case, $\sum_{\rho_v < n^{1/2}} \rho_v^2 \approx \sum_{d=0}^{n^{1/2}} n d^2 \lambda(d) \approx \sum_{d=0}^{n^{1/2}} n d^{2-\beta} \approx n^{1+\frac{3-\beta}{2}}$. In the second case, we prove that the volume of the set of nodes with weight bigger than $n^{\frac{1}{2}}$ is at most $n^{\frac{4-\beta}{2}}$. Hence, the total volume of $\Gamma^{l_s}(s)$ is at most $n^{-\frac{1}{2}+\varepsilon} n^{1+\frac{3-\beta}{2}} + n^{\frac{4-\beta}{2}} \approx n^{\frac{4-\beta}{2}}$.

5 Computing the k Most Central Nodes

Unlike previous approaches, our algorithm is flexible, making it possible to compute the betweenness centrality of different nodes with different precision. This feature can be exploited if we only want to rank the nodes: for instance, if v is much more central than all the other nodes, we do not need a very precise estimation of the centrality of v to say that it is the top node. Following this idea, in this section we adapt our approach to the approximation of the ranking of the k most central nodes: as far as we know, this is the first approach which computes the ranking without computing a λ-approximation of all betweenness centralities, allowing significant speedups. Clearly, we cannot expect our ranking to be always correct, since otherwise the algorithm would not terminate when two of the k most central nodes have the same centrality. For this reason, the user fixes a parameter λ, and, for each node v, the algorithm does one of the following:

• it provides the exact position of v in the ranking;

• it guarantees that v is not in the top-k;

• it provides a value b(v) such that |bc(v)− b(v)| ≤ λ.

In other words, similarly to what is done in [39], the algorithm provides a set of k′ ≥ k nodes containing the top-k nodes, and for each pair of nodes v, w in this subset, either we can rank correctly v and w, or v and w are almost even, that is, |bc(v) − bc(w)| ≤ 2λ. In order to obtain this result, we plug into Algorithm 1 the aforementioned conditions in the function haveToStop (see Algorithm 3 in the appendix).

Then, we have to adapt the function computeDelta to optimize the $\delta_L^{(v)}$s and the $\delta_U^{(v)}$s to the new stopping condition: in other words, we have to choose the values of $\lambda_L^{(v)}$ and $\lambda_U^{(v)}$ that should be plugged into the function computeDelta (we recall that the heuristic computeDelta chooses the $\delta_L^{(v)}$s so that we can guarantee as fast as possible that $b(v) - \lambda_L^{(v)} \le \mathrm{bc}(v) \le b(v) + \lambda_U^{(v)}$). To this purpose, we estimate the betweenness of all nodes with few samples and we sort all nodes according to these approximate values b(v), obtaining $v_1, \dots, v_n$. The basic idea is that, for the first k nodes, we set $\lambda_U^{(v_i)} = \frac{b(v_{i-1}) - b(v_i)}{2}$, and $\lambda_L^{(v_i)} = \frac{b(v_i) - b(v_{i+1})}{2}$ (the goal is to find confidence intervals that separate the betweenness of $v_i$ from the betweenness of $v_{i+1}$ and $v_{i-1}$). For nodes $v_i$ that are not in the top-k, we choose $\lambda_L^{(v_i)} = 1$ and $\lambda_U^{(v_i)} = b(v_k) - \lambda_L^{(v_k)} - b(v_i)$ (the goal is to prove that $v_i$ is not in the top-k). Finally, if $b(v_i) - b(v_{i+1})$ is small, we simply set $\lambda_L^{(v_i)} = \lambda_U^{(v_i)} = \lambda_L^{(v_{i+1})} = \lambda_U^{(v_{i+1})} = \lambda$, because we do not know if $\mathrm{bc}(v_{i+1}) > \mathrm{bc}(v_i)$, or vice versa.
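The choice of the $\lambda_L^{(v)}$s and $\lambda_U^{(v)}$s just described can be written down as a short routine. The sketch below follows the three rules above, but it has to fill in two details that the text leaves open, and both are assumptions of this example only: the top node $v_1$ has no predecessor, so its $\lambda_U$ is set to 1, and a gap is treated as “small” when it is below 2λ. The function name and the list-based interface are likewise hypothetical.

```python
def topk_lambdas(b_sorted, k, lam):
    """Return (lam_L, lam_U) for preliminary estimates sorted in non-increasing order."""
    n = len(b_sorted)
    assert 0 < k < n
    lam_L = [1.0] * n
    lam_U = [1.0] * n              # assumption: no upper constraint for the top node
    # First k nodes: separate each node from its neighbors in the ranking.
    for i in range(k):
        if i > 0:
            lam_U[i] = (b_sorted[i - 1] - b_sorted[i]) / 2
        lam_L[i] = (b_sorted[i] - b_sorted[i + 1]) / 2
    # Remaining nodes: only show that they lie below the k-th node.
    for i in range(k, n):
        lam_U[i] = b_sorted[k - 1] - lam_L[k - 1] - b_sorted[i]
    # Nearly tied neighbors: fall back to the user-provided lambda.
    for i in range(k):
        if b_sorted[i] - b_sorted[i + 1] < 2 * lam:   # assumption: "small" means < 2 * lam
            lam_L[i] = lam_U[i] = lam_L[i + 1] = lam_U[i + 1] = lam
    return lam_L, lam_U

lower, upper = topk_lambdas([0.5, 0.3, 0.1, 0.05], k=2, lam=0.01)
print(lower, upper)
```

In this example the two top nodes are well separated, so they receive wide intervals (about 0.1), and the fallback to λ is never triggered.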

6 Experimental Results

In this section, we test the four variations of our algorithm on several real-world networks, in order to evaluate their performance. The platform for our tests is a server with 1515 GB RAM and 48 Intel(R) Xeon(R) CPU E7-8857 v2 cores at 3.00GHz, running Debian GNU Linux 8. The algorithms are implemented in C++, and they are compiled using gcc 5.3.1. The source code of our algorithm is available at https://sites.google.com/a/imtlucca.it/borassi/publications.

Comparison with the State of the Art

The first experiment compares the performance of our algorithm KADABRA with the state of the art. The first competitor is the RK algorithm [39], available in the open-source NetworKit framework [44]. This algorithm uses the same estimator as our algorithm, but the stopping condition is different: it simply stops after sampling $k = \frac{c}{\varepsilon^2}\left(\lfloor \log_2(VD-2) \rfloor + 1 + \log\frac{1}{\delta}\right)$ paths, and it uses a heuristic to upper bound the vertex diameter. Following suggestions by the author of the NetworKit implementation, we set to 20 the number of samples used in the latter heuristic [7].

The second competitor is the ABRA algorithm [40], available at http://matteo.rionda.to/software/ABRA-radebetw.tbz2. This algorithm samples pairs of nodes (s, t), and it adds the fraction of st-paths passing through v to the approximation of the betweenness of v, for each node v. The stopping condition is based on a key result in statistical learning theory, and there is a scheduler that decides when it should be tested. Following the suggestions by the authors, we use both the automatic scheduler ABRA-Aut, which uses a heuristic approach to decide when the stopping condition should be tested, and the geometric scheduler ABRA-1.2, which tests the stopping condition after $(1.2)^i k$ iterations, for each integer i.

[Figure 1 plot: two panels, Undirected and Directed; horizontal axis: Network; vertical axis: Time (ticks at 0.1 sec, 1 sec, 1 min, 1 hour); series: KADABRA, RK, ABRA-Aut, ABRA-1.2.]

Figure 1: The time needed by the different algorithms, on all the graphs of our dataset.

The test is performed on a dataset made of 15 undirected and 15 directed real-world networks, taken from the datasets SNAP (snap.stanford.edu/), LASAGNE (piluc.dsi.unifi.it/lasagne), and KONECT (http://konect.uni-koblenz.de/networks/). As in [40], we have considered all values of $\lambda \in \{0.03, 0.025, 0.02, 0.015, 0.01, 0.005\}$, and $\delta = 0.1$. All the algorithms have to provide an approximation $b(v)$ of $\mathrm{bc}(v)$ for each $v$ such that $\Pr\left(\forall v, |b(v) - \mathrm{bc}(v)| \le \lambda\right) \ge 1 - \delta$. In Figure 1, we report the time needed by the different algorithms on every graph for $\lambda = 0.005$ (the behavior with different values of $\lambda$ is very similar). More detailed results are reported in Appendix F.

From the figure, we see that KADABRA is much faster than all the other algorithms, on all graphs: on average, our algorithm is about 100 times faster than RK in undirected graphs, and about 70 times faster in directed graphs; it is also more than 1 000 times faster than ABRA. The latter value is due to the fact that the ABRA algorithm has large running times on a few networks: in some cases, it did not even conclude its computation within one hour. The authors confirmed that this behavior might be due to some bugs in the code, which seem to affect it only on specific graphs: indeed, in most networks, ABRA performs better than the RK algorithm (but, still, not better than KADABRA).

In order to explain these data, we take a closer look at the improvements obtained through the bidirectional BFS, by considering the average number of edges $m_{avg}$ that the algorithm visits in order to sample a shortest path (for all our competitors, $m_{avg} = m$, since they perform a full BFS). In Figure 2, for each graph in our dataset, we plot $\alpha = \frac{\log(m_{avg})}{\log(m)}$ (intuitively, this means that the average number of edges visited is $m^\alpha$). The figure shows that, apart from a few cases, the number of edges visited is close to $n^{\frac{1}{2}}$, confirming the results in Section 4. This means that, since many of our networks have approximately 10 000 edges, the bidirectional BFS is about 100 times faster than the standard BFS. Finally, for each value of $\lambda$, we report in Figure 3 the number of samples needed by all the algorithms, averaged over all the graphs in the dataset.
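To make the quantity $\alpha$ concrete, the following Python sketch runs a bidirectional BFS on an undirected graph and counts the edges it scans. It is only an illustration under our own assumptions: the names `bidirectional_bfs` and `alpha`, the adjacency-dictionary input, and the rule of expanding the side whose frontier has the smaller total degree are hypothetical and are not taken from the C++ implementation; moreover, the sketch returns only the distance and does not sample a shortest path.

```python
import math


def bidirectional_bfs(adj, s, t):
    """Return (distance, edges_scanned) between s and t.

    adj maps every node of an undirected graph to a list of neighbours.
    At each step a whole level is expanded on the side whose frontier has
    the smaller total degree (an assumed balancing rule).
    """
    if s == t:
        return 0, 0
    dist = [{s: 0}, {t: 0}]      # distances from s (side 0) and from t (side 1)
    frontier = [[s], [t]]
    scanned = 0
    while frontier[0] and frontier[1]:
        cost = [sum(len(adj[u]) for u in side) for side in frontier]
        side = 0 if cost[0] <= cost[1] else 1
        other = 1 - side
        best = None
        next_frontier = []
        for u in frontier[side]:
            for w in adj[u]:
                scanned += 1
                if w in dist[other]:
                    # The two searches meet: record the candidate distance.
                    cand = dist[side][u] + 1 + dist[other][w]
                    best = cand if best is None else min(best, cand)
                elif w not in dist[side]:
                    dist[side][w] = dist[side][u] + 1
                    next_frontier.append(w)
        if best is not None:
            return best, scanned
        frontier[side] = next_frontier
    return math.inf, scanned     # s and t are not connected


def alpha(m_avg, m):
    """Exponent alpha such that m_avg = m ** alpha."""
    return math.log(m_avg) / math.log(m)
```

For instance, a graph with $m = 10\,000$ edges on which the search scans $m_{avg} = 100$ edges on average gives `alpha(100, 10000)`, that is, $\alpha = 0.5$.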

From the figure, KADABRA needs to sample the smallest number of shortest paths, and the average improvement over RK grows when $\lambda$ tends to 0, from a factor 1.14 (resp., 1.14) if $\lambda = 0.03$, to a factor 1.79 (resp., 2.05) if $\lambda = 0.005$ in the case of undirected (resp., directed) networks. Again, the behavior of ABRA is highly influenced by the behavior on a few networks, and as a consequence the average number of samples is higher. In any case, even in the graphs where ABRA performs well, KADABRA still needs a smaller number of samples.

[Plot: exponent α (from 0 to 1) per network; panels: Undirected Networks, Directed Networks.]

Figure 2: The exponent α such that the average number of edges visited during a bidirectional BFS is $n^\alpha$.

[Plot: number of iterations (log scale, $10^4$ to $10^6$) against λ (0.01 to 0.03); panels: Undirected Networks, Directed Networks; series: KADABRA, RK, ABRA-Aut, ABRA-1.2.]

Figure 3: The average number of samples needed by the different algorithms.

Computing Top-k Centralities

In the second experiment, we let KADABRA compute the top-k betweenness centralities of large graphs, which were infeasible to handle with the previous algorithms.

The first set of graphs is a series of temporal snapshots of the IMDB actor collaboration network, in which two actors are connected if they played together in a movie. The snapshots are taken every 5 years from 1940 to 2010, including a last snapshot in 2014, with 1 797 446 nodes and 145 760 312 edges. The graphs are extracted from the IMDB website (http://www.imdb.com), and they do not consider TV-series, awards-shows, documentaries, game-shows, news, realities and talk-shows, in accordance with what was done in http://oracleofbacon.org.

The other graph considered is the Wikipedia citation network, whose nodes are Wikipedia pages, and which contains an edge from page $p_1$ to page $p_2$ if the text of page $p_1$ contains a link to page $p_2$. The graph is extracted from DBPedia 3.7 (http://wiki.dbpedia.org/), and it consists of 4 229 697 nodes and 102 165 832 edges.


[Plot: total running time (15 min to 45 min) against millions of nodes (0 to 1.5) and against millions of edges (0 to 150).]

Figure 4: The total time of computation of KADABRA on increasing snapshots of the IMDB graph.

We have run our algorithm with $\lambda = 0.0002$ and $\delta = 0.1$: as discussed in Section 5, this means that either two nodes are ranked correctly, or their centrality is known with precision at most $\lambda$. As a consequence, if two nodes are not ranked correctly, the difference between their real betweenness is at most $2\lambda$. The full results are available in Appendix G.2.

All the graphs were processed in less than one hour, apart from the Wikipedia graph, which was processed in approximately 1 hour and 38 minutes. In Figure 4, we plot the running times for the actor graphs: from the figure, it seems that the time needed by our algorithm scales slightly sublinearly with respect to the size of the graph. This result is consistent with the results in Section 4, because the degrees in the actor collaboration network are power law distributed with exponent β ≈ 2.13 (http://konect.uni-koblenz.de/networks/actor-collaboration). Finally, we observe that the ranking is quite precise: indeed, most of the time, there are very few nodes in the top-5 with the same ranking, and the ranking rarely contains significantly more than 10 nodes.

Acknowledgements. The authors would like to thank Matteo Riondato for several constructive comments on an earlier version of this work. We also thank Elisabetta Bergamini, Richard Lipton, and Sebastiano Vigna for helpful discussions and Holger Dell for his help with the experiments.


References

[1] Amir Abboud, Fabrizio Grandoni, and Virginia Vassilevska Williams. Subcubic equivalences between graph centrality problems, APSP and diameter. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1681–1697. SIAM, 2015.

[2] Amir Abboud, Virginia V. Williams, and Joshua Wang. Approximation and fixed parameter subquadratic algorithms for radius and diameter. In Proceedings of the 26th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 377–391, May 2016. URL: http://arxiv.org/abs/1506.0179, arXiv:1506.0179.

[3] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive Sampling for k-Means Clustering. In Irit Dinur, Klaus Jansen, Joseph Naor, and José Rolim, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, number 5687 in Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2009.

[4] Jac M Anthonisse. The rush in a directed graph. Stichting Mathematisch Centrum. Mathematische Besliskunde, BN(9/71):1–10, 1971.

[5] David A. Bader, Shiva Kintali, Kamesh Madduri, and Milena Mihail. Approximating betweenness centrality. The 5th Workshop on Algorithms and Models for the Web-Graph, 2007.

[6] Alex Bavelas. A mathematical model for group structures. Human organization, 7(3):16–30, 1948.

[7] Elisabetta Bergamini. Private communication, 2016.

[8] Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, and Henning Meyerhenke. Computing top-k closeness centrality faster in unweighted graphs. In ALENEX, 2016.

[9] Elisabetta Bergamini and Henning Meyerhenke. Fully-dynamic approximation of betweenness centrality. In ESA, 2015.

[10] Paolo Boldi, Andrea Marino, Massimo Santini, and Sebastiano Vigna. BUbiNG: Massive crawling for the masses. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 227–228, 2014.

[11] Bela Bollobas. A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. European Journal of Combinatorics, 1(4):311–316, 1980. doi:10.1016/S0195-6698(80)80030-8.

[12] Bela Bollobas, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 31(1):3–122, 2007.

[13] Michele Borassi, Pierluigi Crescenzi, and Michel Habib. Into the square - On the complexity of some quadratic-time solvable problems. In Proceedings of the 16th Italian Conference on Theoretical Computer Science (ICTCS), pages 1–17, 2015.

[14] Michele Borassi, Pierluigi Crescenzi, and Luca Trevisan. An Axiomatic and an Average-Case Analysis of Algorithms and Heuristics for Metric Properties of Graphs. arXiv:1604.01445 [cs], April 2016.

[15] Stephen P. Borgatti and Martin G. Everett. A graph-theoretic perspective on centrality. Social Networks, 28:466–484, 2006.


[16] Ulrik Brandes. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology, 25(2):163–177, Jun 2001. doi:10.1080/0022250X.2001.9990249.

[17] Ulrik Brandes. On variants of shortest-path betweenness centrality and their generic computation. Social Networks, 30:136–145, 2008.

[18] Ulrik Brandes and Christian Pich. Centrality Estimation in Large Networks. International Journal of Bifurcation and Chaos, 17(07):2303–2318, 2007. doi:10.1142/S0218127407018403.

[19] Bernard S Cohn and McKim Marriott. Networks and centres of integration in Indian civilization. Journal of Social Research, 1(1):1–9, 1958.

[20] Shlomi Dolev, Yuval Elovici, and Rami Puzis. Routing betweenness centrality. J. ACM, 57, 2010.

[21] David A. Easley and Jon M. Kleinberg. Networks, crowds, and markets - reasoning about a highly connected world. In DAGLIB, 2010.

[22] David Eppstein and Joseph Wang. Fast approximation of centrality. J. Graph Algorithms Appl., 8:39–45, 2001.

[23] Dóra Erdős, Vatche Ishakian, Azer Bestavros, and Evimaria Terzi. A divide-and-conquer algorithm for betweenness centrality. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 433–441, 2015.

[24] Daniel Fernholz and Vijaya Ramachandran. The diameter of sparse random graphs. Random Structures and Algorithms, 31(4):482–516, 2007. doi:10.1002/rsa.

[25] Robert Geisberger, Peter Sanders, Dominik Schultes, and Daniel Delling. Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In Catherine C. McGeoch, editor, Experimental Algorithms: 7th International Workshop, WEA 2008, pages 319–333. Springer Berlin Heidelberg, 2008.

[26] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which Problems Have Strongly Exponential Complexity? Journal of Computer and System Sciences, 63(4):512–530, Dec 2001. doi:10.1006/jcss.2001.1774.

[27] Riko Jacob, Dirk Koschutzki, Katharina Anna Lehmann, Leon Peeters, and Dagmar Tenfelde-Podehl. Algorithms for centrality indices. In DAGSTUHL, 2004.

[28] Hermann Kaindl and Gerhard Kainz. Bidirectional heuristic search reconsidered. J. Artif. Intell. Res. (JAIR), 7:283–317, 1997.

[29] Yeon-sup Lim, Daniel S Menasche, Bruno Ribeiro, Don Towsley, and Prithwish Basu. Online estimating the k central nodes of a network. Proceedings of IEEE NSW, pages 118–122, 2011.

[30] Richard J. Lipton and Jeffrey F. Naughton. Query Size Estimation by Adaptive Sampling. Journal of Computer and System Sciences, 51(1):18–25, August 1995. doi:10.1006/jcss.1995.1050.

[31] Richard J. Lipton and Jeffrey F. Naughton. Estimating the size of generalized transitive closures. In Proceedings of the 15th Int. Conf. on Very Large Data Bases, 1989.

[32] Linyuan Lu and Fan R. K. Chung. Complex graphs and networks. Number 107 in CBMS regional conference series in mathematics. American Mathematical Society, 2006.

[33] Mark Newman. Networks: an introduction. OUP Oxford, 2010.


[34] Mark EJ Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 64(1):016132, 2001.

[35] Ilkka Norros and Hannu Reittu. On a conditionally Poissonian graph process. Advances in Applied Probability, 38(1):59–75, 2006.

[36] Jurgen Pfeffer and Kathleen M Carley. k-centralities: local approximations of global measures based on shortest paths. In Proceedings of the 21st international conference companion on World Wide Web, pages 1043–1050. ACM, 2012.

[37] Andrea Pietracaprina, Matteo Riondato, Eli Upfal, and Fabio Vandin. Mining Top-K Frequent Itemsets Through Progressive Sampling. Data Mining and Knowledge Discovery, 21(2):310–326, September 2010. doi:10.1007/s10618-010-0185-7.

[38] Ira Pohl. Bi-directional and heuristic search in path problems. PhD thesis, Dept. of Computer Science, Stanford University, 1969.

[39] Matteo Riondato and Evgenios M Kornaropoulos. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, 30(2):438–475, 2015.

[40] Matteo Riondato and Eli Upfal. ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages. arXiv preprint 1602.05866, pages 1–27, 2016. arXiv:1602.05866.

[41] Ahmet Erdem Sariyuce, Erik Saule, Kamer Kaya, and Umit V Catalyurek. Shattering and compressing networks for betweenness centrality. In SIAM Data Mining Conference (SDM). SIAM, 2013.

[42] Marvin E Shaw. Group structure and the behavior of individuals in small groups. The Journal of Psychology, 38(1):139–149, 1954.

[43] Alfonso Shimbel. Structural parameters of communication networks. The Bulletin of Mathematical Biophysics, 15(4):501–507, 1953.

[44] Christian L. Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. NetworKit: an interactive tool suite for high-performance network analysis. arXiv preprint 1403.3005, pages 1–25, 2014.

[45] Remco van der Hofstad. Random graphs and complex networks. Vol. II. Manuscript, 2014.

[46] Flavio Vella, Giancarlo Carbone, and Massimo Bernaschi. Algorithms and heuristics for scalable betweenness centrality computation on multi-GPU systems. CoRR, abs/1602.00963, 2016.

[47] Sebastiano Vigna. Private communication, 2016.

[48] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge University Press, 1994.

[49] Ryan Williams and Huacheng Yu. Finding orthogonal vectors in discrete structures. In Proceedings of the 24th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1867–1877, 2014. URL: http://epubs.siam.org/doi/abs/10.1137/1.9781611973402.135, doi:10.1137/1.9781611973402.135.


A Pseudocode

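Algorithm 2: the function computeDelta.

Input: a graph $G = (V, E)$, and two values $\lambda_L^{(v)}, \lambda_U^{(v)}$ for each $v \in V$
Output: for each $v \in V$, two values $\delta_L^{(v)}, \delta_U^{(v)}$

α ← ω/100;
ε ← 0.0001;
foreach i ∈ [1, α] do
    π = samplePath();
    foreach v ∈ π do b(v) ← b(v) + 1
end
foreach v ∈ V do
    b(v) ← b(v)/α;
    $c_L(v) \leftarrow \frac{2 b(v) \omega}{(\lambda_L^{(v)})^2}$;
    $c_U(v) \leftarrow \frac{2 b(v) \omega}{(\lambda_U^{(v)})^2}$;
end
Binary search to find $C$ such that $\sum_{v \in V} \exp\left(-\frac{C}{c_L(v)}\right) + \exp\left(-\frac{C}{c_U(v)}\right) = \frac{\delta}{2} - \varepsilon\delta$;
foreach v ∈ V do
    $\delta_L^{(v)} \leftarrow \exp\left(-\frac{C}{c_L(v)}\right) + \frac{\varepsilon\delta}{2n}$;
    $\delta_U^{(v)} \leftarrow \exp\left(-\frac{C}{c_U(v)}\right) + \frac{\varepsilon\delta}{2n}$;
end
return b;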
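The second half of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration and not our C++ code: the function name `compute_delta`, the dictionary-based interface, and the doubling-then-bisection search are assumptions, and the preliminary estimates `b` are taken as input instead of being obtained by sampling α paths. Terms with $c(v) = 0$ are treated as $\exp(-\infty) = 0$.

```python
import math


def compute_delta(b, lam_lower, lam_upper, omega, delta, eps=0.0001, iters=100):
    """Return (delta_L, delta_U), two dicts mapping each node to its share.

    b, lam_lower, lam_upper: dicts node -> preliminary estimate b(v),
    lambda_L^(v) and lambda_U^(v). omega, delta, eps as in Algorithm 2.
    """
    n = len(b)
    c_lower = {v: 2 * b[v] * omega / lam_lower[v] ** 2 for v in b}
    c_upper = {v: 2 * b[v] * omega / lam_upper[v] ** 2 for v in b}

    def term(C, c):
        # exp(-C / c), with the convention that the term vanishes when c == 0.
        return math.exp(-C / c) if c > 0 else 0.0

    def total(C):
        return sum(term(C, c_lower[v]) + term(C, c_upper[v]) for v in b)

    target = delta / 2 - eps * delta
    lo, hi = 0.0, 1.0
    while total(hi) > target:       # total(C) is non-increasing in C
        hi *= 2
    for _ in range(iters):          # bisection on C
        mid = (lo + hi) / 2
        if total(mid) > target:
            lo = mid
        else:
            hi = mid
    C = hi                          # total(C) <= target is guaranteed
    pad = eps * delta / (2 * n)     # keeps every delta strictly positive
    delta_L = {v: term(C, c_lower[v]) + pad for v in b}
    delta_U = {v: term(C, c_upper[v]) + pad for v in b}
    return delta_L, delta_U
```

By construction, the returned values are strictly positive and sum to at most $\frac{\delta}{2}$ (up to floating-point rounding), which is the constraint required by the analysis.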

Algorithm 3: the function haveToStop to compute the top-k nodes.

Input: for each node $v$, the values of $b(v)$, $\delta_L^{(v)}$, $\delta_U^{(v)}$, and the values of ω and τ
Output: True if the algorithm should stop, False otherwise

Sort nodes in decreasing order of $b(v)$, obtaining $v_1, \dots, v_n$;
for i ∈ [1, . . . , k] do
    if $f(b(v_i), \delta_L^{(v_i)}, \omega, \tau) > \lambda$ or $g(b(v_i), \delta_U^{(v_i)}, \omega, \tau) > \lambda$ then
        if $b(v_{i-1}) - f(b(v_{i-1}), \delta_L^{(v_{i-1})}, \omega, \tau) < b(v_i) + g(b(v_i), \delta_U^{(v_i)}, \omega, \tau)$ or $b(v_i) - f(b(v_i), \delta_L^{(v_i)}, \omega, \tau) < b(v_{i+1}) + g(b(v_{i+1}), \delta_U^{(v_{i+1})}, \omega, \tau)$ then
            return False;
        end
    end
end
for i ∈ [k + 1, . . . , n] do
    if $f(b(v_i), \delta_L^{(v_i)}, \omega, \tau) > \lambda$ or $g(b(v_i), \delta_U^{(v_i)}, \omega, \tau) > \lambda$ then
        if $b(v_k) - f(b(v_k), \delta_L^{(v_k)}, \omega, \tau) < b(v_i) + g(b(v_i), \delta_U^{(v_i)}, \omega, \tau)$ then
            return False;
        end
    end
end
return True;
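A possible Python rendering of Algorithm 3 is sketched below. The name `have_to_stop` and the list-based interface are assumptions made for illustration: the nodes are assumed to be already sorted in decreasing order of $b(v)$, and the lists `f_err` and `g_err` are assumed to contain the precomputed values $f(b(v_i), \delta_L^{(v_i)}, \omega, \tau)$ and $g(b(v_i), \delta_U^{(v_i)}, \omega, \tau)$. Since the pseudocode does not specify what happens for the missing neighbours $v_0$ and $v_{n+1}$, the sketch simply skips those comparisons, and it requires $1 \le k \le n$.

```python
def have_to_stop(b, f_err, g_err, k, lam):
    """Top-k stopping test on estimates sorted in decreasing order.

    b[i] is the estimate of the i-th node, f_err[i] and g_err[i] are its
    lower and upper error bounds; k is the number of top nodes requested
    and lam is the target absolute error.
    """
    n = len(b)
    lower = [b[i] - f_err[i] for i in range(n)]   # lower confidence bounds
    upper = [b[i] + g_err[i] for i in range(n)]   # upper confidence bounds

    # Nodes currently in the top-k: either precise enough, or separated
    # from both neighbours in the ranking.
    for i in range(k):
        if f_err[i] > lam or g_err[i] > lam:
            overlaps_prev = i > 0 and lower[i - 1] < upper[i]
            overlaps_next = i + 1 < n and lower[i] < upper[i + 1]
            if overlaps_prev or overlaps_next:
                return False

    # Remaining nodes: either precise enough, or provably below the k-th.
    for i in range(k, n):
        if f_err[i] > lam or g_err[i] > lam:
            if lower[k - 1] < upper[i]:
                return False
    return True
```

The function returns True as soon as every node is either known within error $\lambda$ or has a confidence interval that does not overlap the relevant neighbouring intervals.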

B Concentration Inequalities

Lemma 5 (Hoeffding’s inequality). Let $X_1, \dots, X_k$ be independent random variables such that $a_i < X_i < b_i$, and let $X = \sum_{i=1}^k X_i$. Then,

$$\Pr\left(|X - E[X]| \ge \lambda\right) \le 2\exp\left(-\frac{2\lambda^2}{\sum_{i=1}^k (b_i - a_i)^2}\right).$$

Remark 6. If we apply Hoeffding’s inequality with $X_i = X_{\pi_i}(v)$, $X = k\,b(v) = \sum_{i=1}^k X_{\pi_i}(v)$, $a_i = 0$, $b_i = 1$, we obtain that $\Pr\left(|b(v) - \mathrm{bc}(v)| > \lambda\right) < 2e^{-2k\lambda^2}$. Then, if we fix $\delta = 2e^{-2k\lambda^2}$, the error is $\lambda = \sqrt{\frac{\log(2/\delta)}{2k}}$, and the minimum $k$ needed to obtain an error $\lambda$ on the betweenness of a single node is $\frac{1}{2\lambda^2}\log(2/\delta)$.
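As a quick numerical illustration of Remark 6, the two formulas can be evaluated as follows (the function names are ours, and the logarithm is the natural one).

```python
import math


def hoeffding_error(k, delta):
    """Error lambda guaranteed for one node after k samples."""
    return math.sqrt(math.log(2 / delta) / (2 * k))


def hoeffding_samples(lam, delta):
    """Smallest integer k giving error lam on one node with probability 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * lam ** 2))


# With the parameters of our experiments, lambda = 0.005 and delta = 0.1,
# a single node needs about 6 * 10^4 samples.
k = hoeffding_samples(0.005, 0.1)
```

Note that this bound concerns a single node: obtaining the guarantee for all nodes simultaneously requires a union bound or the adaptive analysis of Appendix C.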

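Lemma 7 (Chernoff bound ([32])). Let $X_1, \dots, X_k$ be independent random variables such that $X_i \le M$ for each $1 \le i \le k$, and let $X = \sum_{i=1}^k X_i$. Then,

$$\Pr\left(X \ge E[X] + \lambda\right) \le \exp\left(-\frac{\lambda^2}{2\left(\sum_{i=1}^k E[X_i^2] + M\lambda/3\right)}\right).$$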

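Theorem 8 (McDiarmid ’98 ([32])). Let $X$ be a martingale associated with a filter $\mathcal{F}$, satisfying

• $\mathrm{Var}(X_i \mid \mathcal{F}_i) \le \sigma_i^2$ for $1 \le i \le \ell$,

• $|X_i - X_{i-1}| \le M$, for $1 \le i \le \ell$.

Then, we have

$$\Pr\left(X - E(X) \ge \lambda\right) \le \exp\left(-\frac{\lambda^2}{2\left(\sum_{i=1}^{\ell} \sigma_i^2 + M\lambda/3\right)}\right).$$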

C Proof of Theorem 2

In our algorithm, we sample $\tau$ shortest paths $\pi_i$, where $\tau$ is a random variable such that the event $\tau = t$ can be decided by looking at the first $t$ paths sampled (see Algorithm 1). Furthermore, thanks to Eq. (3) in [39], we assume that $\tau \le \omega$ for some fixed $\omega \in \mathbb{R}^+$ such that, after $\omega$ steps, $\Pr\left(\forall v, |b(v) - \mathrm{bc}(v)| \le \lambda\right) \ge 1 - \frac{\delta}{2}$. When the algorithm stops, our estimate of the betweenness is $b(v) := \frac{1}{\tau}\sum_{i=1}^{\tau} X_i(v)$, where $X_i(v)$ is 1 if $v$ belongs to $\pi_i$, 0 otherwise.

To estimate the error, we use the following theorem.

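Theorem 9. For each node $v$ and for every fixed real numbers $\delta_L, \delta_U$, it holds

$$\Pr\left(\mathrm{bc}(v) \le b(v) - f\left(b(v), \delta_L, \omega, \tau\right)\right) \le \delta_L \quad \text{and}$$

$$\Pr\left(\mathrm{bc}(v) \ge b(v) + g\left(b(v), \delta_U, \omega, \tau\right)\right) \le \delta_U,$$

where

$$f\left(b(v), \delta_L, \omega, \tau\right) = \frac{1}{\tau}\log\frac{1}{\delta_L}\left(\frac{1}{3} - \frac{\omega}{\tau} + \sqrt{\left(\frac{1}{3} - \frac{\omega}{\tau}\right)^2 + \frac{2 b(v) \omega}{\log\frac{1}{\delta_L}}}\right) \quad \text{and} \qquad (5)$$

$$g\left(b(v), \delta_U, \omega, \tau\right) = \frac{1}{\tau}\log\frac{1}{\delta_U}\left(\frac{1}{3} + \frac{\omega}{\tau} + \sqrt{\left(\frac{1}{3} + \frac{\omega}{\tau}\right)^2 + \frac{2 b(v) \omega}{\log\frac{1}{\delta_U}}}\right). \qquad (6)$$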

We prove Theorem 9 in Section C.1. In the rest of this section, we show how this theorem implies Theorem 2. To simplify notation, we often omit the arguments of the functions $f$ and $g$.
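The two error functions of Theorem 9 are straightforward to evaluate. The Python sketch below uses names of our choosing (`f_lower`, `g_upper`) and natural logarithms; it is an illustration of formulas (5) and (6), not an excerpt of our implementation.

```python
import math


def f_lower(b, delta_l, omega, tau):
    """Lower error f(b, delta_L, omega, tau) of formula (5)."""
    log_term = math.log(1 / delta_l)
    a = 1 / 3 - omega / tau
    return log_term / tau * (a + math.sqrt(a * a + 2 * b * omega / log_term))


def g_upper(b, delta_u, omega, tau):
    """Upper error g(b, delta_U, omega, tau) of formula (6)."""
    log_term = math.log(1 / delta_u)
    a = 1 / 3 + omega / tau
    return log_term / tau * (a + math.sqrt(a * a + 2 * b * omega / log_term))
```

With these functions, the confidence interval for a node with estimate `b` after `tau` samples is `[b - f_lower(b, dl, omega, tau), b + g_upper(b, du, omega, tau)]`; note that the upper error is always the larger of the two when `dl == du`.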

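Proof of Theorem 2. Let $E_1$ be the event $(\tau = \omega \wedge \exists v \in V, |b(v) - \mathrm{bc}(v)| > \lambda)$, and let $E_2$ be the event $(\tau < \omega \wedge (\exists v \in V, -f \ge \mathrm{bc}(v) - b(v) \vee \mathrm{bc}(v) - b(v) \ge g))$. Let us also denote $b_t(v) = \frac{1}{t}\sum_{i=1}^{t} X_i(v)$ for a fixed number of samples $t$ (note that $b_\tau(v) = b(v)$).

By our choice of $\omega$ and Eq. (3) in [39],

$$\Pr(E_1) \le \Pr\left(\exists v \in V, |b_\omega(v) - \mathrm{bc}(v)| > \lambda\right) \le \frac{\delta}{2},$$

where $b_\omega(v)$ is the approximate betweenness of $v$ after $\omega$ samples. Furthermore, by Theorem 9,

$$\Pr(E_2) \le \sum_{v \in V} \left[\Pr\left(\tau < \omega \wedge -f \ge \mathrm{bc}(v) - b(v)\right) + \Pr\left(\tau < \omega \wedge \mathrm{bc}(v) - b(v) \ge g\right)\right] \le \sum_{v \in V} \left(\delta_L^{(v)} + \delta_U^{(v)}\right) \le \frac{\delta}{2}.$$

By a union bound, $\Pr(E_1 \vee E_2) \le \Pr(E_1) + \Pr(E_2) \le \delta$, concluding the proof of Theorem 2.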

C.1 Proof of Theorem 9

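Since this theorem deals with a single node $v$, let us simply write $\mathrm{bc} = \mathrm{bc}(v)$, $b = b(v)$, $X_i = X_i(v)$. Let us consider $Y^t = \sum_{i=1}^{t} (X_i - \mathrm{bc})$ (we recall that $X_i = 1$ if $v$ is in the $i$-th path sampled, $X_i = 0$ otherwise). Clearly, $Y^t$ is a martingale, and $\tau$ is a stopping time for $Y^t$: this means that also $Z^t = Y^{\min(t, \tau)}$ is a martingale.

Let us apply Theorem 8 to the martingales $Z$ and $-Z$, recalling that $\tau \le \omega$ implies $Z^\omega = Y^\tau = \tau b - \tau\,\mathrm{bc}$: for each fixed $\lambda_L, \lambda_U > 0$ we have

$$\Pr\left(Z^\omega \ge \lambda_L\right) = \Pr\left(\tau b - \tau\,\mathrm{bc} \ge \lambda_L\right) \le \exp\left(-\frac{\lambda_L^2}{2\left(\omega\,\mathrm{bc} + \lambda_L/3\right)}\right) = \delta_L \quad \text{and} \qquad (7)$$

$$\Pr\left(-Z^\omega \ge \lambda_U\right) = \Pr\left(\tau b - \tau\,\mathrm{bc} \le -\lambda_U\right) \le \exp\left(-\frac{\lambda_U^2}{2\left(\omega\,\mathrm{bc} + \lambda_U/3\right)}\right) = \delta_U. \qquad (8)$$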

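We now show how to prove (5) from (7). The way to derive (6) from (8) is analogous. If we express $\lambda_L$ as a function of $\delta_L$ we get

$$\lambda_L^2 = 2\log\frac{1}{\delta_L}\left(\omega\,\mathrm{bc} + \frac{\lambda_L}{3}\right) \iff \lambda_L^2 - \frac{2}{3}\lambda_L\log\frac{1}{\delta_L} - 2\omega\,\mathrm{bc}\log\frac{1}{\delta_L} = 0,$$

which implies that

$$\lambda_L = \frac{1}{3}\log\frac{1}{\delta_L} \pm \sqrt{\frac{1}{9}\left(\log\frac{1}{\delta_L}\right)^2 + 2\omega\,\mathrm{bc}\log\frac{1}{\delta_L}}.$$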

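Since (7) holds for any positive value $\lambda_L$, it also holds for the value corresponding to the positive solution of this equation, that is,

$$\lambda_L = \frac{1}{3}\log\frac{1}{\delta_L} + \sqrt{\frac{1}{9}\left(\log\frac{1}{\delta_L}\right)^2 + 2\omega\,\mathrm{bc}\log\frac{1}{\delta_L}}.$$

Plugging this value into (7), we obtain

$$\Pr\left(\tau b - \tau\,\mathrm{bc} \ge \frac{1}{3}\log\frac{1}{\delta_L} + \sqrt{\frac{1}{9}\left(\log\frac{1}{\delta_L}\right)^2 + 2\omega\,\mathrm{bc}\log\frac{1}{\delta_L}}\right) \le \delta_L. \qquad (9)$$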

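By assuming $b - \mathrm{bc} \ge \frac{1}{3\tau}\log\frac{1}{\delta_L}$, the event in (9) can be rewritten as

$$(\tau\,\mathrm{bc})^2 - 2\,\mathrm{bc}\left(\tau^2 b + \omega\log\frac{1}{\delta_L} - \frac{1}{3}\tau\log\frac{1}{\delta_L}\right) - \frac{2}{3}\log\frac{1}{\delta_L}\,\tau b + (\tau b)^2 \ge 0.$$

By solving the previous quadratic equation w.r.t. $\mathrm{bc}$ we get

$$\mathrm{bc} \le b + \log\frac{1}{\delta_L}\left(\frac{\omega}{\tau^2} - \frac{1}{3\tau} - \sqrt{\left(\frac{b}{\log\frac{1}{\delta_L}} + \frac{\omega}{\tau^2} - \frac{1}{3\tau}\right)^2 - \left(\frac{b}{\log\frac{1}{\delta_L}}\right)^2 + \frac{2}{3\tau}\,\frac{b}{\log\frac{1}{\delta_L}}}\right),$$

where we only considered the solution which upper bounds $\mathrm{bc}$, since we assumed $b - \mathrm{bc} \ge \frac{1}{3\tau}\log\frac{1}{\delta_L}$. After simplifying the terms under the square root in the previous expression, we get

$$\mathrm{bc} \le b + \log\frac{1}{\delta_L}\left(\frac{\omega}{\tau^2} - \frac{1}{3\tau} - \sqrt{\left(\frac{\omega}{\tau^2} - \frac{1}{3\tau}\right)^2 + \frac{2 b \omega}{\tau^2\log\frac{1}{\delta_L}}}\right),$$

which means that

$$\Pr\left(\mathrm{bc} \le b - f\left(b, \delta_L, \omega, \tau\right)\right) \le \delta_L,$$

concluding the proof.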
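For completeness, the analogous derivation of (6) from (8) can be sketched as follows. This is a reconstruction that mirrors the steps used for (5); the shorthand $\ell_U = \log\frac{1}{\delta_U}$ is introduced here only for readability, and the sketch assumes $\mathrm{bc} - b \ge \frac{1}{3\tau}\ell_U$, so that both sides can be squared and only the larger root of the quadratic is relevant.

```latex
\begin{align*}
& \tau\,\mathrm{bc} - \tau b \;\ge\; \tfrac{1}{3}\ell_U
    + \sqrt{\tfrac{1}{9}\ell_U^2 + 2\omega\,\mathrm{bc}\,\ell_U} \\
\iff\; & (\tau\,\mathrm{bc})^2
    - 2\,\mathrm{bc}\left(\tau^2 b + \omega\ell_U + \tfrac{1}{3}\tau\ell_U\right)
    + \tfrac{2}{3}\ell_U\,\tau b + (\tau b)^2 \;\ge\; 0 \\
\iff\; & \mathrm{bc} \;\ge\; b + \ell_U\left(\frac{\omega}{\tau^2} + \frac{1}{3\tau}
    + \sqrt{\left(\frac{\omega}{\tau^2} + \frac{1}{3\tau}\right)^2
    + \frac{2b\omega}{\tau^2\,\ell_U}}\right)
    \;=\; b + g(b, \delta_U, \omega, \tau).
\end{align*}
```

The only differences with respect to the lower bound are the sign of the term $\frac{1}{3\tau}$ and the choice of the root, which is why $g$ contains $\frac{1}{3} + \frac{\omega}{\tau}$ where $f$ contains $\frac{1}{3} - \frac{\omega}{\tau}$.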

D How to Choose $\delta_L^{(v)}, \delta_U^{(v)}$

In Appendix C, we proved that our algorithm works for any choice of the values $\delta_L^{(v)}, \delta_U^{(v)}$. In this section, we show how we can heuristically compute such values, in order to obtain the best performance.

For each node $v$, let $\lambda_L^{(v)}, \lambda_U^{(v)}$ be the lower and the upper maximum error that we want to obtain on the betweenness of $v$: if we simply want all errors to be smaller than $\lambda$, we choose $\lambda_L^{(v)} = \lambda_U^{(v)} = \lambda$, but for other purposes different values might be needed. We want to minimize the time $\tau$ such that the approximation of the betweenness at time $\tau$ is in the confidence interval required. In formula, we want to minimize

$$\min\left\{\tau \in \mathbb{N} : \forall v \in V, \left(f\left(b_\tau(v), \delta_L^{(v)}, \omega, \tau\right) \le \lambda_L^{(v)} \wedge g\left(b_\tau(v), \delta_U^{(v)}, \omega, \tau\right) \le \lambda_U^{(v)}\right)\right\} \qquad (10)$$

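where $b_\tau(v)$ is the approximation of $\mathrm{bc}(v)$ obtained at time $\tau$, and

$$f\left(\tau, b_\tau, \delta_L, \omega\right) = \frac{1}{\tau}\log\frac{1}{\delta_L}\left(\frac{1}{3} - \frac{\omega}{\tau} + \sqrt{\left(\frac{1}{3} - \frac{\omega}{\tau}\right)^2 + \frac{2 b_\tau \omega}{\log\frac{1}{\delta_L}}}\right) \quad \text{and}$$

$$g\left(\tau, b_\tau, \delta_U, \omega\right) = \frac{1}{\tau}\log\frac{1}{\delta_U}\left(\frac{1}{3} + \frac{\omega}{\tau} + \sqrt{\left(\frac{1}{3} + \frac{\omega}{\tau}\right)^2 + \frac{2 b_\tau \omega}{\log\frac{1}{\delta_U}}}\right).$$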

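The goal of this section is to provide deterministic values of $\delta_L^{(v)}, \delta_U^{(v)}$ that minimize the value in (10), and such that $\sum_{v \in V} \delta_L^{(v)} + \delta_U^{(v)} < \frac{\delta}{2}$. To obtain our estimate, we replace $b_\tau(v)$ with an approximation $b(v)$, that we compute by sampling $\alpha$ paths, before starting the algorithm (in our code, $\alpha = \frac{\omega}{100}$). Furthermore, we consider a simplified version of (10): in most cases, $\lambda_L$ is much smaller than all other quantities in play, and since $\omega$ is proportional to $\frac{1}{\lambda_L^2}$, we can safely assume $f(\tau, b(v), \delta_L^{(v)}, \omega) \approx \sqrt{\frac{2 b(v) \omega}{\tau^2}\log\frac{1}{\delta_L}}$ and $g(\tau, b(v), \delta_U^{(v)}, \omega) \approx \sqrt{\frac{2 b(v) \omega}{\tau^2}\log\frac{1}{\delta_U}}$. Hence, in place of the value in (10), our heuristic tries to minimize

$$\min\left\{\tau \in \mathbb{N} : \forall v \in V, \sqrt{\frac{2 b(v) \omega}{\tau^2}\log\frac{1}{\delta_L^{(v)}}} \le \lambda_L^{(v)} \wedge \sqrt{\frac{2 b(v) \omega}{\tau^2}\log\frac{1}{\delta_U^{(v)}}} \le \lambda_U^{(v)}\right\}.$$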

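Solving with respect to $\tau$, we are trying to minimize

$$\max_{v \in V} \max\left(\sqrt{\frac{2 b(v) \omega}{\left(\lambda_L^{(v)}\right)^2}\log\frac{1}{\delta_L^{(v)}}}, \sqrt{\frac{2 b(v) \omega}{\left(\lambda_U^{(v)}\right)^2}\log\frac{1}{\delta_U^{(v)}}}\right),$$

which is the same as minimizing $\max_{v \in V} \max\left(c_L(v)\log\frac{1}{\delta_L^{(v)}}, c_U(v)\log\frac{1}{\delta_U^{(v)}}\right)$ for some constants $c_L(v), c_U(v)$, conditioned on $\sum_{v \in V} \delta_L^{(v)} + \delta_U^{(v)} < \frac{\delta}{2}$. We claim that, among the possible choices of $\delta_L^{(v)}, \delta_U^{(v)}$, the best choice makes all the terms in the maximum equal: otherwise, if two terms were different, we would be able to slightly increase and decrease the corresponding values, in order to decrease the maximum. This means that, for some constant $C$, for each $v$, $c_L(v)\log\frac{1}{\delta_L^{(v)}} = c_U(v)\log\frac{1}{\delta_U^{(v)}} = C$, that is, $\delta_L^{(v)} = \exp\left(-\frac{C}{c_L(v)}\right)$, $\delta_U^{(v)} = \exp\left(-\frac{C}{c_U(v)}\right)$. Since the sum of these values decreases when $C$ grows, we look for the smallest constant $C$ such that $\sum_{v \in V} \delta_L^{(v)} + \delta_U^{(v)} \le \frac{\delta}{2}$, and we find it through a binary search procedure on all possible constants $C$.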

Finally, if $c_L(v) = 0$ or $c_U(v) = 0$, this procedure chooses $\delta_L^{(v)} = 0$: to avoid this problem, we impose $\sum_{v \in V} \delta_L^{(v)} + \delta_U^{(v)} \le \frac{\delta}{2} - \varepsilon\delta$, and we add $\frac{\varepsilon\delta}{2n}$ to all the $\delta_L^{(v)}$s and all the $\delta_U^{(v)}$s (in our code, we choose $\varepsilon = 0.001$). The pseudocode of the algorithm is available in Algorithm 2.

E Balanced Bidirectional BFS on Random Graphs

In this appendix, we formally prove that the bidirectional BFS is efficient in several models of random graphs: the Configuration Model (CM, [11]), and Rank-1 Inhomogeneous Random Graph models (IRG, [45, Chapter 3]), such as the Chung-Lu model [32], the Norros-Reittu model [35], and the Generalized Random Graph [45, Chapter 3]. All these models are defined by fixing the number $n$ of nodes and $n$ weights $\rho_v$, and by creating edges at random, in a way that node $v$ gets degree close to $\rho_v$.

More formally, the edges are generated as follows:

• in the CM, each node is associated to $\rho_v$ half-edges, or stubs; edges are created by randomly pairing these $M = \sum_{v \in V} \rho_v$ stubs (we assume the number of stubs to be even, by adding a stub to a random node if necessary).

• in IRG, an edge between a node $v$ and a node $w$ exists with probability $f\left(\frac{\rho_v \rho_w}{M}\right)$, where $M = \sum_{v \in V} \rho_v$, and the existence of different edges is independent. Different choices of the function $f$ create different models.

– In general, we assume that $f$ satisfies the following conditions:

∗ $f$ is differentiable at least twice in 0;

∗ $f$ is increasing;

∗ $f'(0) = 1$;

– in the Chung-Lu model, $f(x) = \min(x, 1)$;

– in the Norros-Reittu model, $f(x) = 1 - e^{-x}$;

– in the Generalized Random Graph model, $f(x) = \frac{x}{1+x}$.
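The two families of models can be sketched in Python as follows. The function names, the quadratic-time loop over all pairs in `sample_irg`, and the list-of-pairs output are choices made here for illustration only; in the CM sketch the weights are assumed to be non-negative integers, and self-loops and multiple edges are not removed.

```python
import math
import random


def f_chung_lu(x):
    return min(x, 1.0)


def f_norros_reittu(x):
    return 1.0 - math.exp(-x)


def f_generalized(x):
    return x / (1.0 + x)


def sample_irg(rho, f, rng):
    """Rank-1 IRG: edge {v, w} exists independently with prob. f(rho_v rho_w / M)."""
    n = len(rho)
    M = sum(rho)
    edges = []
    for v in range(n):
        for w in range(v + 1, n):
            if rng.random() < f(rho[v] * rho[w] / M):
                edges.append((v, w))
    return edges


def sample_cm(rho, rng):
    """Configuration Model: pair the stubs uniformly at random."""
    stubs = [v for v, d in enumerate(rho) for _ in range(d)]
    if len(stubs) % 2 == 1:
        # Make the number of stubs even by adding one stub to a random node.
        stubs.append(rng.randrange(len(rho)))
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]
```

For example, `sample_cm([3, 2, 2, 1], random.Random(0))` returns four pairs of stubs in which node $v$ appears exactly $\rho_v$ times, while `sample_irg(rho, f_chung_lu, rng)` produces a Chung-Lu graph in which the degree of $v$ is only close to $\rho_v$ in expectation.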

It remains to define how we choose the weights $\rho_v$, when the number of nodes $n$ tends to infinity. In line with previous works [35, 24, 45], we consider a sequence of graphs $G_i$, whose number of nodes $n_i$ tends to infinity, and whose degree distributions $\lambda_i$ satisfy the following:

1. there is a probability distribution $\lambda$ such that the $\lambda_i$s tend to $\lambda$ in distribution;

2. $M_1(\lambda_i)$ tends to $M_1(\lambda) < \infty$, where $M_1(\lambda)$ is the first moment of $\lambda$;

3. one of the following two conditions holds:

(a) $M_2(\lambda_i)$ tends to $M_2(\lambda) < \infty$, where $M_2(\lambda)$ is the second moment of $\lambda$;

(b) $\lambda$ is a power law distribution with $2 < \beta < 3$, and there is a global constant $C$ such that, for each $d$, $\Pr(\lambda_i \ge d) \le \frac{C}{d^{\beta - 1}}$.

For example, these assumptions are satisfied with probability 1 if we choose the degrees independently, according to a distribution $\lambda$ with finite mean [45, Section 6.1, 7.2].

Remark 10. Note that an aspect often neglected in previous work when it comes to computing shortest paths is the fact that the number of shortest paths between a pair of nodes may be exponential, thus requiring to work with a linear number of bits. While real-world complex networks are typically sparse with logarithmic diameter, in order to avoid such issue it is sufficient to assume that addition and comparison require constant time.

Remark 11. These assumptions cover the Erdős-Rényi random graph with constant average degree, and all power law distributions with $\beta > 2$ (because, if $\beta > 3$, then $M_2(\lambda)$ is finite).

Remark 12. Assumption 3b seems less natural than the other assumptions. However, it is necessary to exclude pathological cases: for example, assume that $G_i$ has $n - 2$ nodes chosen according to a power law distribution, and 2 nodes $u, v$ with weight $n^{1-\varepsilon}$. All assumptions except 3b are satisfied, but the bidirectional BFS is not efficient, because if $s$ is a neighbor of $u$ with degree 1, and $t$ is a neighbor of $v$ with degree 1, then a bidirectional BFS from $s$ and $t$ needs to visit all neighbors of $u$ or all neighbors of $v$, and the time needed is $\Omega(n^{1-\varepsilon})$.

We say that a random graph has a property $\pi$ asymptotically almost surely (a.a.s.) if $\Pr(\pi(G_i))$ tends to 1 when $n$ tends to infinity. We say that a random graph has a property $\pi$ with high probability (w.h.p.) if $n_i^k \left(1 - \Pr(\pi(G_i))\right)$ tends to 0 for each $k > 0$.

Before proving the main theorem, we need two more definitions and a technical assumption.

Definition 13. In the CM, let $\rho_{res} = \frac{M_2(\lambda)}{M_1(\lambda)} - 1$. In IRG, let $\rho_{res} = \frac{M_2(\lambda)}{M_1(\lambda)}$ (if $\lambda$ is a power law distribution with $2 < \beta < 3$, we simply define $\rho_{res} = +\infty$).

Definition 14. Given a set $V' \subseteq V$, the volume of $V'$ is $\rho_{V'} = \sum_{v \in V'} \rho_v$. Furthermore, if $V' = \Gamma^l(s)$, we abbreviate $\rho_{\Gamma^l(s)}$ with $r^l(s)$.

The value $\rho_{res}$ is closely related to $\frac{r^{l+1}(s)}{r^l(s)}$: informally, the expected value of this fraction is $\rho_{res}$. For this reason, if $\rho_{res} < 1$, then the size of neighbors tends to decrease, and all connected components have $O(\log n)$ nodes. Conversely, if $\rho_{res} > 1$, then the size of neighbors tends to increase, and there is a giant component of size $\Theta(n)$ (for a proof of these facts, see [45, Section 2.3 and Chapter 4]). Our last assumption is that $\rho_{res} > 1$, in order to ensure the existence of the giant component.
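As an illustration of Definition 13, the following Python sketch computes $\rho_{res}$ from a finite list of weights. Using the empirical moments of the weights in place of the moments of the limit distribution $\lambda$ is an assumption made here for illustration, and the function name and the `model` argument are hypothetical.

```python
def rho_res(weights, model):
    """Empirical rho_res for model "CM" or "IRG" (see Definition 13)."""
    n = len(weights)
    m1 = sum(weights) / n                   # first moment
    m2 = sum(w * w for w in weights) / n    # second moment
    ratio = m2 / m1
    return ratio - 1 if model == "CM" else ratio


# In a 3-regular configuration model, each node reached through an edge
# has two further stubs, so rho_res = 2 > 1.
regular = rho_res([3] * 10, "CM")
```

Since the value is larger than 1 in this example, the last assumption above is satisfied and a giant component is expected.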

Under these assumptions, we prove Theorem 4, following the sketch in Section 4. We start by linking the degrees and the weights of nodes.

Lemma 15. For each node $v$, $\rho_v n^{-\varepsilon} \le \deg(v) \le \rho_v n^{\varepsilon}$ w.h.p..

Proof. We use [14, Lemmas 32 and 37] (see footnote 2): these lemmas imply that, for each $\varepsilon > 0$, if $\rho_v > n^{\varepsilon}$, $(1 - \varepsilon)\rho_v \le \deg(v) \le (1 + \varepsilon)\rho_v$ w.h.p.. We have to handle the case where $\rho_v < n^{\varepsilon}$: one of the two inequalities is empty, while for the other inequality we observe that, if we decrease the weight of $v$, the degree of $v$ can only decrease. Hence, if $\rho_v < n^{\varepsilon}$, $\deg(v) < (1 + \varepsilon)n^{\varepsilon}$, and the result follows by changing the value of $\varepsilon$.

Following the intuitive proof, we have linked the number of visited edges with their weights. Let us define an abbreviation for the volume of the nodes at distance $l$ from $s$.

Definition 16. We denote by $r^l(s)$ the volume of nodes at distance exactly $l$ from $s$. In the CM, we denote by $R^l(s)$ the set of stubs at distance $l$ from $s$.

Now, we need to show that, if $r^{l_s}(s), r^{l_t}(t) > n^{\frac{1}{2}+\varepsilon}$, then $d(s, t) \le l_s + l_t + 2$ w.h.p..

[Footnote 2] This paper uses a further assumption on IRG, but the proofs of Lemmas 32 and 39 do not rely on this assumption.

Lemma 17. Assume that $r^{l_s}(s) > n^{\frac{1}{2}+\varepsilon}$, $r^{l_t}(t) > n^{\frac{1}{2}+\varepsilon}$, and $r^{l_s-1}(s), r^{l_t-1}(t) < (1-\varepsilon)n^{\frac{1}{2}+\varepsilon}$. Then, $d(s, t) \le l_s + l_t + 2$.

Proof. Let us assume that we know the structure of $N^{l_s}(s)$ and $N^{l_t}(t)$, that is, for each possible structure $S$ of the subgraph induced by all nodes at distance $l_s$ from $s$ and distance $l_t$ from $t$, let $E_S$ be the event that $N^{l_s}(s)$ and $N^{l_t}(t)$ are exactly $S$. If we prove that $\Pr(d(s,t) > l_s + l_t + 2 \mid E_S) < \varepsilon$, then

$$\Pr(d(s,t) > l_s + l_t + 2) = \sum_{S} \Pr(d(s,t) > l_s + l_t + 2 \mid E_S)\Pr(E_S) < \sum_{S} \varepsilon \Pr(E_S) = \varepsilon.$$

First of all, if $S$ is such that the two neighborhoods touch each other, $\Pr(d(s,t) > l_s + l_t + 2 \mid E_S) = 0 < \varepsilon$. Otherwise, we consider separately the CM and IRG.

In the CM, conditioned on $E_S$, the stubs that are paired with stubs in $R^{l_s}(s)$ are a random subset of the set of stubs that are not paired in $S$. This random subset has size at least $\varepsilon n^{1/2+\varepsilon} \ge n^{(1+\varepsilon)/2}$ (because $\varepsilon$ is a fixed constant, and $n$ tends to infinity). Since the total number of stubs is $O(n)$, and since the number of stubs in $R^{l_t}(t)$ is at least $\varepsilon n^{(1+\varepsilon)/2}$, one of the stubs in $R^{l_t}(t)$ is paired with a stub in $R^{l_s}(s)$ w.h.p., and $d(s,t) \le l_s + l_t + 1$.

In IRG, the probability that a node $v$ is not connected to any node in $\Gamma^{l_s}(s)$ is at most

$$\prod_{w \in \Gamma^{l_s}(s)} \left(1 - f\left(\frac{\rho_v\rho_w}{M}\right)\right) = \prod_{w \in \Gamma^{l_s}(s)} \left(1 - \Omega\left(\frac{\rho_w}{M}\right)\right) = \exp\left(-\sum_{w \in \Gamma^{l_s}(s)} \Omega\left(\frac{\rho_w}{M}\right)\right) = \exp\left(-\Omega\left(\frac{r^{l_s}(s)}{M}\right)\right) = 1 - \Omega\left(\frac{r^{l_s}(s)}{M}\right) = 1 - \Omega\left(n^{-1/2+\varepsilon}\right).$$

This means that $v$ belongs to $\Gamma^{l_s+1}(s)$ with probability $\Omega(n^{-1/2+\varepsilon})$, and similarly it belongs to $\Gamma^{l_t+1}(t)$ with probability $\Omega(n^{-1/2+\varepsilon})$. Since the two events are independent, the probability that $v$ belongs to both is $\Omega(n^{-1+2\varepsilon})$. Since, for each node $v$, the events that $v$ belongs to $\Gamma^{l_s+1}(s) \cap \Gamma^{l_t+1}(t)$ are independent, by a straightforward application of Hoeffding's inequality, w.h.p., there is a node $v$ that belongs to $\Gamma^{l_s+1}(s) \cap \Gamma^{l_t+1}(t)$, and $d(s,t) \le l_s + l_t + 2$ w.h.p., concluding the proof.

The next ingredient is used to bound the first integers $l_s, l_t$ such that $r^{l_s}(s), r^{l_t}(t) > n^{1/2+\varepsilon}$.

Theorem 18 (Theorem 5.1 in [24] for the CM, Theorem 14.8 in [12] for IRG (see also [45, 14])). The diameter of a graph generated through the aforementioned models is $O(\log n)$.

The last ingredient of our proof is an upper bound on the size of $r^{l_s}(s)$ and $r^{l_t}(t)$.

Lemma 19. With high probability, for each $s \in V$ and for each $l$ such that $\sum_{i=0}^{l} r^{i}(s) < n^{1/2+\varepsilon}$, $r^{l+1}(s) < n^{1/2+3\varepsilon}$ if $\lambda$ has finite second moment, and $r^{l+1}(s) < n^{\frac{4-\beta}{2}+3\varepsilon}$ if $\lambda$ is power law with $2 < \beta < 3$.

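Proof. We consider separately nodes with weight at most $n^{1/2-2\varepsilon}$ from nodes with bigger weights: in the former case, we bound the number of such nodes that are in $R^{l+1}(s)$, while in the latter case we bound the total number of nodes with weight at least $n^{1/2-2\varepsilon}$. Let us start with the latter case.

Claim: for each $\varepsilon$, $\sum_{\rho_v \ge n^{1/2-\varepsilon}} \rho_v$ is smaller than $n^{1/2+3\varepsilon}$ if $\lambda$ has finite second moment, and it is smaller than $n^{\frac{4-\beta}{2}+3\varepsilon}$ if $\lambda$ is power law with $2 < \beta < 3$.

Proof of claim. If $\lambda$ has finite second moment, by Chebyshev inequality, for each $\alpha$,

$$\Pr\left(\lambda_i > n^{1/2+\alpha}\right) \le \frac{\mathrm{Var}(\lambda_i)}{n^{1+2\alpha}} \le \frac{M_2(\lambda_i)}{n^{1+2\alpha}} = O\left(\frac{M_2(\lambda)}{n^{1+2\alpha}}\right) = O\left(n^{-1-2\alpha}\right).$$

For $\alpha = \varepsilon$, this means that no node has weight bigger than $n^{1/2+\varepsilon}$, and for $\alpha = -\varepsilon$, this means that the number of nodes with weight bigger than $n^{1/2-\varepsilon}$ is at most $n^{2\varepsilon}$. We conclude that

$$\sum_{\rho_v \ge n^{1/2-\varepsilon}} \rho_v \le \sum_{\rho_v \ge n^{1/2-\varepsilon}} n^{1/2+\varepsilon} \le n^{1/2+3\varepsilon}.$$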
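The counting step behind "no node" and "at most $n^{2\varepsilon}$" is left implicit above. A short sketch of it, glossing over constants and over the exact notion of w.h.p., and assuming only linearity of expectation and Markov's inequality (neither is named in the original text), could read:

```latex
% Let N_alpha be the number of nodes with weight above n^{1/2+alpha}.
\begin{align*}
\mathbb{E}[N_\alpha]
  &= \sum_{i=1}^{n} \Pr\left(\lambda_i > n^{1/2+\alpha}\right)
   = n \cdot O\left(n^{-1-2\alpha}\right)
   = O\left(n^{-2\alpha}\right), \\
\alpha = \varepsilon:\quad
  & \Pr(N_\varepsilon \ge 1) \le \mathbb{E}[N_\varepsilon] = O\left(n^{-2\varepsilon}\right) \to 0, \\
\alpha = -\varepsilon:\quad
  & \mathbb{E}[N_{-\varepsilon}] = O\left(n^{2\varepsilon}\right), \\
\text{hence}\quad
  & \sum_{\rho_v \ge n^{1/2-\varepsilon}} \rho_v
    \le N_{-\varepsilon} \cdot n^{1/2+\varepsilon}
    = O\left(n^{2\varepsilon}\right) \cdot n^{1/2+\varepsilon}
    = O\left(n^{1/2+3\varepsilon}\right).
\end{align*}
```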


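If $\lambda$ is power law with $2 < \beta < 3$, by Assumption 3b the number of nodes with weight at least $d$ is at most $Cnd^{-\beta+1}$. Consequently, using Abel's summation technique,

$$\begin{aligned}
\sum_{\rho_v \ge n^{1/2-\varepsilon}} \rho_v
&= \sum_{d = n^{1/2-\varepsilon}}^{+\infty} d\,|\{v : \rho_v = d\}| \\
&= \sum_{d = n^{1/2-\varepsilon}}^{+\infty} d\left(|\{v : \rho_v \ge d\}| - |\{v : \rho_v \ge d+1\}|\right) \\
&= \sum_{d = n^{1/2-\varepsilon}}^{+\infty} d\,|\{v : \rho_v \ge d\}| - \sum_{d = n^{1/2-\varepsilon}+1}^{+\infty} (d-1)\,|\{v : \rho_v \ge d\}| \\
&= n^{1/2-\varepsilon}\,|\{v : \rho_v \ge n^{1/2-\varepsilon}\}| + \sum_{d = n^{1/2-\varepsilon}+1}^{+\infty} |\{v : \rho_v \ge d\}| \\
&\le C n^{1/2-\varepsilon}\, n^{1-(\frac{1}{2}-\varepsilon)(\beta-1)} + \sum_{d = n^{1/2-\varepsilon}+1}^{+\infty} C n d^{-\beta+1} \\
&= O\left(n^{\frac{4-\beta}{2}+\varepsilon\beta} + n^{1-(\frac{1}{2}-\varepsilon)(\beta-2)}\right) \\
&= O\left(n^{\frac{4-\beta}{2}+\varepsilon\beta}\right).
\end{aligned}$$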
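The Abel summation step above rewrites a weighted tail sum in terms of tail counts $|\{v : \rho_v \ge d\}|$. The following Python sketch checks that identity numerically on a small list of integer weights; the function names and the toy data are assumptions made for illustration and do not come from the paper.

```python
def tail_sum_direct(weights, d0):
    """Sum of all weights that are at least d0."""
    return sum(w for w in weights if w >= d0)


def tail_sum_abel(weights, d0):
    """Same quantity, written as d0 * |{w >= d0}| + sum_{d > d0} |{w >= d}|."""
    def count_at_least(d):
        return sum(1 for w in weights if w >= d)

    total = d0 * count_at_least(d0)
    # Tail counts vanish beyond the largest weight, so the sum is finite.
    for d in range(d0 + 1, max(weights) + 1):
        total += count_at_least(d)
    return total


# Toy integer weights (hypothetical data).
weights = [1, 2, 2, 5, 7, 7, 10]
print(tail_sum_direct(weights, 3), tail_sum_abel(weights, 3))
```

Both functions return the same value for any non-empty list of positive integer weights, which is exactly the equality between the first and the fourth line of the derivation.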

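By this claim, $\sum_{v \in \Gamma^{l+1}(s),\ \rho_v \ge n^{1/2-2\varepsilon}} \rho_v$ is smaller than $n^{1/2+6\varepsilon}$ if $\lambda$ has finite second moment, and it is smaller than $n^{\frac{4-\beta}{2}+6\varepsilon}$ if $\lambda$ is power law with $2 < \beta < 3$. To conclude the proof, we only have to bound $\sum_{v \in \Gamma^{l+1}(s),\ \rho_v < n^{1/2-2\varepsilon}} \rho_v$.

Claim: with high probability, $\sum_{v \in \Gamma^{l+1}(s),\ \rho_v < n^{1/2-2\varepsilon}} \rho_v < n^{1/2+\varepsilon}$ if $\lambda$ has finite second moment, and $\sum_{v \in \Gamma^{l+1}(s),\ \rho_v < n^{1/2-2\varepsilon}} \rho_v < n^{\frac{4-\beta}{2}+\varepsilon}$ if $\lambda$ is power law with $2 < \beta < 3$.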

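Proof of claim, CM. As in the proof of Lemma 17, we can safely assume that we know the structure $S$ of $N^{l}(s)$. Let us sort the stubs in $R^{l}(s)$, not paired by $S$, obtaining $a_1, \dots, a_k$, and let $a'_i$ be the stub paired with $a_i$. Let $\mathrm{res}(a)$ be the number of stubs of the node $a$, minus $a$, and let $X_i = \mathrm{res}(a'_i)$ if $\mathrm{res}(a'_i) < n^{1/2-2\varepsilon}$, $0$ otherwise: clearly, $\sum_{v \in \Gamma^{l+1}(s),\ \rho_v \le n^{1/2-2\varepsilon}} \rho_v \le \sum_{i=1}^{k} X_i$ (with equality if there are no horizontal or diagonal edges in the BFS tree). After the first $i-1$ stubs are paired, since $i < n^{1/2+\varepsilon}$ and since the number of stubs paired in $S$ is $O(n^{1/2+\varepsilon}\log n)$, for each $k < n^{1/2-2\varepsilon}$,

$$\begin{aligned}
\Pr(X_i = k) &= \Pr(\mathrm{res}(a'_i) = k) \\
&= \frac{|\{a \in A : a \text{ unpaired after } i \text{ rounds},\ \mathrm{res}(a) = k\}|}{|\{a \in A : a \text{ unpaired after } i \text{ rounds}\}|} \\
&= \frac{|\{a \in A : \mathrm{res}(a) = k\}| + O(n^{1/2+\varepsilon})}{|A| + O(n^{1/2+\varepsilon})} \\
&= \frac{(k+1)\lambda(k+1)}{M_1(\lambda)} + O\left(n^{-1/2+\varepsilon}\right).
\end{aligned}$$

Consequently, conditioned on all pairings of $a_j$ for $j < i$, $E[X_i] = \sum_{k=0}^{n^{1/2-2\varepsilon}} k\,\frac{(k+1)\lambda(k+1)}{M_1(\lambda)} + O(n^{-1/2+\varepsilon}\log n) = \alpha(n)$, where $\alpha(n) = O(1)$ if $\lambda$ has finite second moment, and $\alpha(n) = O(n^{\frac{3-\beta}{2}})$ if $\lambda$ is power law with $2 < \beta < 3$. Hence, for each $\varepsilon$, $\sum_{i=1}^{k} X_i - i(M_1(\lambda) + \varepsilon)$ is a supermartingale, and by Azuma's inequality

$$\Pr\left(\sum_{i=1}^{k} X_i - k\alpha(n) \ge \alpha(n)\right) \le \exp\left(-\frac{\alpha(n)^2}{2\sum_{i=1}^{k} n^{1/2-2\varepsilon}}\right) \le \exp(-n^{\varepsilon}).$$

Then, w.h.p., $\sum_{i=1}^{k} X_i \le n^{1/2+\varepsilon}(\alpha(n) + 2)$, concluding the proof of the claim.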

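Proof of claim, IRG. The number of nodes $w$ with weight at most $n^{1/2-2\varepsilon}$ that belong to $\Gamma^{l+1}(s)$ is at most $\sum_{v \in \Gamma^{l}(s),\ \rho_v < n^{1/2-2\varepsilon}} \sum_{w \in V} \rho_w X_{v,w}$, where $X_{v,w} = 1$ with probability $f\left(\frac{\rho_v\rho_w}{M}\right) = O\left(\frac{\rho_v\rho_w}{M}\right)$ because $\rho_v\rho_w < n^{1-\varepsilon}$. Moreover,

$$E\left[\sum_{v \in \Gamma^{l}(s),\ \rho_v < n^{1/2-2\varepsilon}} \sum_{w \in V} \rho_w X_{v,w}\right] = O\left(r^{l}(s)\,\frac{\sum_{v \in V} \rho_v^2}{n}\right) = r^{l}(s)\,\alpha(n),$$

where $\alpha(n) = O(1)$ if $\lambda$ has finite second moment, and $\alpha(n) = O\left(n^{\frac{3-\beta}{2}}\right)$ if $\lambda$ is power law with $2 < \beta < 3$.

By Hoeffding inequality,

$$\Pr\left(\sum_{v \in \Gamma^{l}(s),\ \rho_v < n^{1/2-2\varepsilon}} \sum_{w \in V} \rho_w X_{v,w} - r^{l}(s)\alpha(n) \ge r^{l}(s)\alpha(n)\right) \le n\,\frac{r^{l}(s)\alpha(n)}{r^{l}(s)\,n^{1/2-2\varepsilon}} \le n^{-\varepsilon}.$$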

This concludes the proof.

This claim lets us conclude the proof of the lemma.

Proof of Theorem 4. Let $D_s^{i} = \sum_{v \in \Gamma^{i}(s)} \deg(v)$, $D_t^{j} = \sum_{w \in \Gamma^{j}(t)} \deg(w)$, and let us suppose that we have visited until level $l_s$ from $s$, until level $l_t$ from $t$, and that $D_s^{l_s}, D_t^{l_t} > n^{1/2+2\varepsilon}$. If this situation never occurs, by Theorem 18, the total number of visited edges is at most $O(\log n)\,n^{1/2+2\varepsilon} = O(n^{1/2+3\varepsilon})$, and the conclusion follows. Otherwise, again by Theorem 18, the number of edges visited in the two BFS trees before levels $l_s$ and $l_t$ is $O(n^{1/2+3\varepsilon})$. Furthermore, by Lemma 15, $r^{l_s}(s), r^{l_t}(t) > n^{1/2+2\varepsilon}$. We claim that, without loss of generality, we can assume $r^{l_s-1}(s) < \varepsilon r^{l_s}(s)$, to apply Lemma 17. Indeed, if $r^{l_s-1}(s)$ is too big, we iteratively decrease $l_s$ until we find a neighborhood verifying $r^{l_s}(s) > (1-\varepsilon')r^{l_s-1}(s)$. This process can last at most $O(\log n)$ steps, and hence it is stopped at a point $l_s$ such that $r^{l_s}(s) > n^{1/2+2\varepsilon}(1-\varepsilon')^{O(\log n)} \ge n^{1/2+\varepsilon'}$ if $\varepsilon'$ is small enough. Similarly, we can suppose without loss of generality that $r^{l_t}(t) > (1-\varepsilon')r^{l_t-1}(t)$. By Lemma 17, $d(s,t) \le l_s + l_t + 2$, and the number of nodes needed to conclude the BFS is at most $D_s^{l_s} + D_t^{l_t}$ (note that, if we extend twice the visit from $s$, it means that $D_s^{l_s+1} < D_t^{l_t}$). By Lemma 15, $D_s^{l_s} \le n^{\varepsilon} r^{l_s}(s)$, and by Lemma 19 this value is at most $n^{1/2+3\varepsilon}$ if $\lambda$ has finite second moment, and $n^{\frac{4-\beta}{2}+3\varepsilon}$ if $\lambda$ is power law with $2 < \beta < 3$. We conclude that the total number of visited nodes is at most $n^{1/2+3\varepsilon} + D_s^{l_s} + D_t^{l_t} \le n^{1/2+3\varepsilon} + r^{l_s}(s) + r^{l_t}(t) \le n^{1/2+4\varepsilon}$ (resp., $n^{\frac{4-\beta}{2}+4\varepsilon}$) if $\lambda$ has finite second moment (resp., if $\lambda$ is power law with $2 < \beta < 3$). The theorem follows by changing the value of $\varepsilon$.
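The quantity bounded in this proof is the number of edges scanned by a bidirectional BFS that always extends the side with the smaller degree sum ($D_s^{l_s}$ versus $D_t^{l_t}$). The following Python sketch illustrates that balancing rule on an undirected graph. It is a simplified illustration, not the authors' implementation: the function name, the adjacency-dictionary input, and the tie-breaking rule (ties extend the visit from $s$) are assumptions.

```python
def balanced_bidirectional_bfs(adj, s, t):
    """Distance between s and t in an undirected graph, plus scanned edges.

    adj maps every node to a list of its neighbors (hypothetical input
    format). Returns (distance, visited_edges); distance is None when s
    and t are disconnected.
    """
    if s == t:
        return 0, 0
    dist_s, dist_t = {s: 0}, {t: 0}
    frontier_s, frontier_t = [s], [t]
    visited_edges = 0
    while frontier_s and frontier_t:
        # Balancing rule: extend the side whose last level has the
        # smaller total degree (assumed tie-break: extend from s).
        deg_s = sum(len(adj[v]) for v in frontier_s)
        deg_t = sum(len(adj[v]) for v in frontier_t)
        from_s = deg_s <= deg_t
        frontier = frontier_s if from_s else frontier_t
        dist, other = (dist_s, dist_t) if from_s else (dist_t, dist_s)
        next_frontier, best = [], None
        for v in frontier:
            for w in adj[v]:
                visited_edges += 1
                if w in other:
                    # The two visits touch: record the candidate length.
                    cand = dist[v] + 1 + other[w]
                    best = cand if best is None else min(best, cand)
                elif w not in dist:
                    dist[w] = dist[v] + 1
                    next_frontier.append(w)
        if best is not None:
            # The whole level was scanned, so the minimum is the distance.
            return best, visited_edges
        if from_s:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None, visited_edges


# Toy example: a path 0-1-2-3-4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(balanced_bidirectional_bfs(path, 0, 4))
```

The second returned value is the count of scanned adjacency entries, which plays the role of the "visited edges" bounded by Theorem 4.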


F Detailed Experimental Results

Table 1: Detailed experimental results (undirected graphs). Empty values correspond to graphs on which the algorithm needed more than 1 hour.

| Graph | Iterations KADABRA | Iterations RK | Iterations ABRA-Aut | Iterations ABRA-1.2 | Time (s) KADABRA | Time (s) RK | Time (s) ABRA-Aut | Time (s) ABRA-1.2 | Edges KADABRA |
|---|---|---|---|---|---|---|---|---|---|
| λ = 0.005 | | | | | | | | | |
| advogato | 64427 | 126052 | 174728 | 185998 | 0.193 | 11.450 | 9.557 | 10.498 | 261.2 |
| as20000102 | 115797 | 126052 | 18329844 | 4126626 | 0.231 | 6.990 | 611.584 | 136.764 | 377.6 |
| ca-GrQc | 61611 | 146052 | 142982 | 129165 | 0.126 | 5.574 | 3.500 | 2.839 | 353.4 |
| ca-HepTh | 31735 | 146052 | 121587 | 129165 | 0.222 | 14.921 | 7.389 | 8.168 | 9.9 |
| C elegans | 69729 | 146052 | 204634 | 185998 | 0.132 | 6.876 | 5.693 | 5.261 | 270.7 |
| com-amazon.all | 40711 | 166052 | 69708 | 74747 | 0.340 | 122.020 | 12.011 | 11.849 | 21.9 |
| dip20090126 MAX | 156552 | 166052 | | | 1.374 | 34.595 | | | 15354.9 |
| D melanogaster | 51227 | 126052 | 144680 | 154998 | 0.123 | 19.253 | 15.061 | 16.882 | 520.8 |
| email-Enron | 74745 | 146052 | 257989 | 267838 | 0.280 | 79.296 | 101.529 | 106.278 | 1408.0 |
| HC-BIOGRID | 78804 | 146052 | 245780 | 223198 | 0.177 | 7.751 | 7.534 | 6.951 | 713.2 |
| Homo sapiens | 60060 | 146052 | 156973 | 154998 | 0.151 | 32.078 | 23.716 | 24.449 | 643.8 |
| hprd pp | 59125 | 146052 | 151499 | 154998 | 0.127 | 18.323 | 13.425 | 13.458 | 456.4 |
| Mus musculus | 92081 | 146052 | 504669 | 385688 | 0.168 | 4.058 | 7.723 | 6.083 | 226.6 |
| oregon1 010526 | 114829 | 126052 | 6798931 | 2865712 | 0.228 | 13.281 | 442.370 | 185.711 | 681.6 |
| oregon2 010526 | 115764 | 126052 | 5714183 | 2865712 | 0.236 | 15.823 | 452.554 | 229.234 | 822.2 |
| λ = 0.010 | | | | | | | | | |
| advogato | 19811 | 31513 | 47076 | 48243 | 0.081 | 2.804 | 2.576 | 2.788 | 258.2 |
| as20000102 | 29062 | 31513 | 2688614 | 1070372 | 0.071 | 1.777 | 88.886 | 35.049 | 377.3 |
| ca-GrQc | 18535 | 36513 | 37529 | 33501 | 0.049 | 1.417 | 0.987 | 0.753 | 350.6 |
| ca-HepTh | 13761 | 36513 | 31721 | 33501 | 0.188 | 3.771 | 2.078 | 2.275 | 10.0 |
| C elegans | 19888 | 36513 | 54327 | 48243 | 0.048 | 1.803 | 1.586 | 1.483 | 269.4 |
| com-amazon.all | 14641 | 41513 | 18007 | 19386 | 0.312 | 31.004 | 5.196 | 7.623 | 21.5 |
| dip20090126 MAX | 39314 | 41513 | | | 0.395 | 8.578 | | | 15301.7 |
| D melanogaster | 15136 | 31513 | 37219 | 40202 | 0.063 | 4.983 | 3.891 | 4.715 | 519.9 |
| email-Enron | 21637 | 36513 | 65392 | 69471 | 0.198 | 19.877 | 24.997 | 27.296 | 1387.2 |
| HC-BIOGRID | 22924 | 36513 | 62413 | 57892 | 0.052 | 1.979 | 1.989 | 1.906 | 712.5 |
| Homo sapiens | 20273 | 36513 | 41006 | 40202 | 0.085 | 7.876 | 6.442 | 6.636 | 642.7 |
| hprd pp | 18403 | 36513 | 39994 | 40202 | 0.074 | 4.348 | 4.097 | 3.714 | 456.4 |
| Mus musculus | 25146 | 36513 | 130384 | 100040 | 0.061 | 1.055 | 1.965 | 1.718 | 223.9 |
| oregon1 010526 | 30514 | 31513 | 1104167 | 743313 | 0.087 | 3.254 | 70.383 | 47.740 | 683.3 |
| oregon2 010526 | 29117 | 31513 | 954515 | 743313 | 0.088 | 3.983 | 73.942 | 59.103 | 822.1 |
| λ = 0.015 | | | | | | | | | |
| advogato | 9570 | 14006 | 21027 | 22204 | 0.050 | 1.428 | 1.227 | 1.299 | 261.0 |
| as20000102 | 13035 | 14006 | 705483 | 492651 | 0.047 | 0.776 | 22.939 | 16.136 | 377.6 |
| ca-GrQc | 8668 | 16228 | 17419 | 15419 | 0.031 | 0.637 | 0.493 | 0.361 | 345.8 |
| ca-HepTh | 7524 | 16228 | 15002 | 15419 | 0.167 | 1.641 | 0.939 | 1.050 | 11.5 |
| C elegans | 10956 | 16228 | 25233 | 22204 | 0.034 | 0.782 | 0.740 | 0.732 | 267.6 |
| com-amazon.all | 8228 | 18451 | | 15419 | 0.301 | 13.814 | | 7.785 | 21.9 |
| dip20090126 MAX | 17578 | 18451 | | | 0.203 | 3.851 | | | 15197.2 |
| D melanogaster | 9350 | 14006 | 17229 | 18503 | 0.053 | 2.216 | 1.904 | 2.182 | 519.3 |
| email-Enron | 11209 | 16228 | 29134 | 31974 | 0.170 | 8.845 | 10.510 | 12.423 | 1367.4 |
| HC-BIOGRID | 12694 | 16228 | 28805 | 26645 | 0.043 | 0.858 | 0.946 | 0.947 | 708.6 |
| Homo sapiens | 10142 | 16228 | 18491 | 18503 | 0.072 | 3.717 | 3.076 | 3.061 | 640.4 |
| hprd pp | 10659 | 16228 | 17969 | 18503 | 0.056 | 1.919 | 1.719 | 1.752 | 451.5 |
| Mus musculus | 11825 | 16228 | 59756 | 46043 | 0.033 | 0.458 | 0.906 | 0.812 | 222.8 |
| oregon1 010526 | 13662 | 14006 | 426845 | 342118 | 0.056 | 1.522 | 26.420 | 21.871 | 681.4 |
| oregon2 010526 | 13024 | 14006 | 333638 | 342118 | 0.060 | 1.773 | 26.070 | 27.298 | 833.6 |
| λ = 0.020 | | | | | | | | | |
| advogato | 5874 | 7879 | 11993 | 12915 | 0.054 | 0.710 | 0.665 | 0.765 | 260.3 |
| as20000102 | 7436 | 7879 | 312581 | 238814 | 0.037 | 0.441 | 10.066 | 7.819 | 376.2 |
| ca-GrQc | 5313 | 9129 | 9939 | 10762 | 0.032 | 0.356 | 0.293 | 0.268 | 347.9 |
| ca-HepTh | 5115 | 9129 | 8708 | 8968 | 0.191 | 0.891 | 0.694 | 0.611 | 10.5 |
| C elegans | 7172 | 9129 | 14871 | 12915 | 0.030 | 0.439 | 0.436 | 0.439 | 263.5 |
| com-amazon.all | 5467 | 10379 | 12232 | 10762 | 0.331 | 7.683 | 4.338 | 5.459 | 17.9 |
| dip20090126 MAX | 9966 | 10379 | | | 0.148 | 2.165 | | | 15188.3 |
| D melanogaster | 5610 | 7879 | 10201 | 10762 | 0.056 | 1.236 | 1.265 | 1.306 | 520.9 |
| email-Enron | 7458 | 9129 | 16443 | 15498 | 0.174 | 4.916 | 6.102 | 6.034 | 1371.7 |
| HC-BIOGRID | 8459 | 9129 | 17406 | 15498 | 0.026 | 0.505 | 0.602 | 0.582 | 716.6 |
| Homo sapiens | 6292 | 9129 | 10481 | 10762 | 0.064 | 1.944 | 1.672 | 1.814 | 644.8 |
| hprd pp | 6611 | 9129 | 10501 | 10762 | 0.050 | 1.089 | 0.930 | 1.050 | 449.8 |
| Mus musculus | 7227 | 9129 | 31634 | 26782 | 0.026 | 0.255 | 0.507 | 0.532 | 221.0 |
| oregon1 010526 | 7733 | 7879 | 220948 | 199011 | 0.051 | 0.863 | 13.584 | 12.989 | 679.2 |
| oregon2 010526 | 7381 | 7879 | 152242 | 165842 | 0.059 | 1.031 | 11.676 | 13.290 | 836.0 |
| λ = 0.025 | | | | | | | | | |
| advogato | 3883 | 5043 | 7439 | 7110 | 0.052 | 0.450 | 0.421 | 0.468 | 263.4 |
| as20000102 | 4829 | 5043 | 130506 | 157779 | 0.033 | 0.285 | 4.097 | 5.108 | 373.5 |
| ca-GrQc | 3982 | 5843 | 6427 | 5925 | 0.028 | 0.242 | 0.180 | 0.162 | 342.1 |
| ca-HepTh | 3773 | 5843 | 6016 | 5925 | 0.176 | 0.573 | 0.374 | 0.416 | 11.8 |
| C elegans | 4477 | 5843 | 9557 | 8532 | 0.025 | 0.292 | 0.293 | 0.293 | 266.6 |
| com-amazon.all | 4059 | 6643 | 58995 | 14745 | 0.338 | 4.744 | 9.644 | 7.217 | 21.3 |
| dip20090126 MAX | 6457 | 6643 | | | 0.125 | 1.397 | | | 15193.8 |
| D melanogaster | 3993 | 5043 | 6279 | 7110 | 0.056 | 0.793 | 0.827 | 0.870 | 522.6 |
| email-Enron | 4576 | 5843 | 11001 | 12287 | 0.574 | 3.289 | 3.888 | 4.705 | 1381.5 |
| HC-BIOGRID | 5940 | 5843 | 11109 | 10239 | 0.029 | 0.321 | 0.414 | 0.404 | 714.0 |
| Homo sapiens | 4796 | 5843 | 7109 | 7110 | 0.077 | 1.245 | 1.154 | 1.215 | 647.2 |
| hprd pp | 5071 | 5843 | 6772 | 7110 | 0.052 | 0.687 | 0.579 | 0.647 | 446.3 |
| Mus musculus | 4477 | 5843 | 18626 | 17694 | 0.026 | 0.168 | 0.302 | 0.385 | 219.8 |
| oregon1 010526 | 5027 | 5043 | 92520 | 109568 | 0.058 | 0.516 | 5.762 | 7.014 | 681.0 |
| oregon2 010526 | 4763 | 5043 | 86287 | 91306 | 0.050 | 0.638 | 7.140 | 7.420 | 847.5 |
| λ = 0.030 | | | | | | | | | |
| advogato | 3256 | 3502 | 5521 | 5090 | 0.048 | 0.361 | 0.335 | 0.322 | 260.6 |
| as20000102 | 3388 | 3502 | 122988 | 94140 | 0.029 | 0.199 | 3.899 | 3.182 | 378.7 |
| ca-GrQc | 2981 | 4057 | 4686 | 4241 | 0.025 | 0.169 | 0.145 | 0.175 | 344.7 |
| ca-HepTh | 2992 | 4057 | 4022 | 4241 | 0.190 | 0.435 | 0.286 | 0.341 | 7.9 |
| C elegans | 3707 | 4057 | 6905 | 6108 | 0.026 | 0.198 | 0.218 | 0.217 | 265.9 |
| com-amazon.all | 3157 | 4613 | 39917 | 12668 | 0.330 | 3.631 | 8.491 | 6.852 | 17.5 |
| dip20090126 MAX | 4499 | 4613 | 12373086 | | 0.300 | 0.972 | 1958.083 | | 15199.0 |
| D melanogaster | 2893 | 3502 | 4883 | 5090 | 0.052 | 0.562 | 0.620 | 0.807 | 510.4 |
| email-Enron | 3619 | 4057 | 7321 | 7330 | 0.172 | 2.735 | 2.724 | 2.806 | 1399.7 |
| HC-BIOGRID | 3883 | 4057 | 7499 | 7330 | 0.024 | 0.367 | 0.316 | 0.307 | 720.8 |
| Homo sapiens | 3322 | 4057 | 4982 | 5090 | 0.066 | 0.897 | 0.842 | 0.877 | 654.2 |
| hprd pp | 3355 | 4057 | 5028 | 5090 | 0.048 | 0.478 | 0.458 | 0.503 | 448.8 |
| Mus musculus | 3806 | 4057 | 14290 | 10556 | 0.033 | 0.127 | 0.237 | 0.233 | 221.4 |
| oregon1 010526 | 3542 | 3502 | 85854 | 78450 | 0.052 | 0.366 | 5.402 | 5.039 | 675.7 |
| oregon2 010526 | 3355 | 3502 | 61841 | 65375 | 0.048 | 0.509 | 4.972 | 5.302 | 822.8 |


Table 2: Detailed experimental results (directed graphs). Empty values correspond to graphs on which the algorithm needed more than 1 hour.

| Graph | Iterations KADABRA | Iterations RK | Iterations ABRA-Aut | Iterations ABRA-1.2 | Time (s) KADABRA | Time (s) RK | Time (s) ABRA-Aut | Time (s) ABRA-1.2 | Edges KADABRA |
|---|---|---|---|---|---|---|---|---|---|
| λ = 0.005 | | | | | | | | | |
| as-caida20071105 | 103488 | 146052 | 546951 | 462826 | 0.253 | 35.652 | 96.312 | 85.201 | 1066.4 |
| cfinder-google | 137313 | 146052 | | | 0.820 | 14.190 | | | 554.4 |
| cit-HepTh | 98054 | 166052 | 481476 | 462826 | 0.579 | 22.651 | 38.339 | 37.720 | 5773.1 |
| ego-gplus | 37862 | 66052 | | 2388093 | 0.136 | 6.266 | | 11.912 | 1.9 |
| ego-twitter | 37125 | 66052 | | 154998 | 0.178 | 6.181 | | 4.804 | 2.3 |
| freeassoc | 41602 | 166052 | 89424 | 89697 | 0.116 | 9.384 | 1.036 | 0.997 | 223.5 |
| lasagne-spanishbook | 112266 | 146052 | 8918751 | 4126626 | 0.250 | 17.374 | 687.815 | 318.784 | 552.8 |
| opsahl-openflights | 73744 | 146052 | 200164 | 185998 | 0.179 | 6.191 | 5.165 | 4.849 | 431.1 |
| p2p-Gnutella31 | 39193 | 166052 | 81335 | 89697 | 0.254 | 50.542 | 10.213 | 10.662 | 162.1 |
| polblogs | 71423 | 126052 | 387278 | 321406 | 0.174 | 1.165 | 3.522 | 3.017 | 190.3 |
| soc-Epinions1 | 58223 | 146052 | 109607 | 107637 | 0.671 | 100.516 | 62.524 | 62.167 | 671.9 |
| subelj-cora-cora | 68112 | 186052 | 180740 | 185998 | 0.185 | 19.012 | 8.464 | 8.873 | 440.4 |
| subelj-jdk-jdk | 42361 | 146052 | 84549 | 89697 | 0.110 | 2.955 | 0.230 | 0.257 | 51.5 |
| subelj-jung-j-jung-j | 43637 | 126052 | 84225 | 89697 | 0.216 | 2.397 | 0.238 | 0.211 | 45.9 |
| wiki-Vote | 47003 | 126052 | 100153 | 107637 | 0.131 | 5.916 | 2.990 | 3.219 | 162.4 |
| λ = 0.010 | | | | | | | | | |
| as-caida20071105 | 30382 | 36513 | 132997 | 120048 | 0.135 | 8.902 | 22.251 | 20.315 | 1066.1 |
| cfinder-google | 34452 | 36513 | | | 0.156 | 3.664 | | | 553.2 |
| cit-HepTh | 27203 | 41513 | 117633 | 120048 | 0.255 | 5.654 | 8.803 | 9.677 | 5798.8 |
| ego-gplus | 13123 | 16513 | | 4602412 | 0.085 | 1.584 | | 22.510 | 2.3 |
| ego-twitter | 13310 | 16513 | | 83366 | 0.086 | 1.518 | | 3.500 | 2.2 |
| freeassoc | 13222 | 41513 | 23586 | 23264 | 0.080 | 2.335 | 0.238 | 0.227 | 220.7 |
| lasagne-spanishbook | 32527 | 36513 | 1366576 | 1070372 | 0.101 | 4.339 | 104.916 | 83.610 | 553.4 |
| opsahl-openflights | 22473 | 36513 | 52196 | 48243 | 0.059 | 1.475 | 1.348 | 1.339 | 432.0 |
| p2p-Gnutella31 | 13101 | 41513 | 21567 | 23264 | 0.192 | 12.950 | 2.677 | 2.831 | 162.1 |
| polblogs | 22286 | 31513 | 101466 | 83366 | 0.046 | 0.298 | 1.078 | 0.834 | 190.6 |
| soc-Epinions1 | 17061 | 36513 | 28493 | 27917 | 0.320 | 27.194 | 16.516 | 15.974 | 659.5 |
| subelj-cora-cora | 23078 | 46513 | 47936 | 48243 | 0.128 | 4.797 | 1.988 | 2.101 | 432.4 |
| subelj-jdk-jdk | 14047 | 36513 | 22038 | 23264 | 0.066 | 0.734 | 0.099 | 0.075 | 52.2 |
| subelj-jung-j-jung-j | 14894 | 36513 | 22266 | 23264 | 0.064 | 0.696 | 0.113 | 0.083 | 46.4 |
| wiki-Vote | 17380 | 31513 | 26352 | 27917 | 0.088 | 1.446 | 0.792 | 0.870 | 155.7 |
| λ = 0.015 | | | | | | | | | |
| as-caida20071105 | 14157 | 16228 | 55049 | 55252 | 0.477 | 3.963 | 8.518 | 8.914 | 1059.6 |
| cfinder-google | 15400 | 16228 | | | 0.123 | 1.666 | | | 558.1 |
| cit-HepTh | 13002 | 18451 | 47035 | 46043 | 0.232 | 2.529 | 3.807 | 3.766 | 5883.0 |
| ego-gplus | 7205 | 7340 | | 2118317 | 0.080 | 0.710 | | 12.808 | 2.2 |
| ego-twitter | 7403 | 7340 | 1958981 | 114573 | 0.082 | 0.704 | 14.021 | 5.304 | 2.3 |
| freeassoc | 7095 | 18451 | 10956 | 10707 | 0.297 | 1.072 | 0.115 | 0.110 | 222.0 |
| lasagne-spanishbook | 14542 | 16228 | 437041 | 410542 | 0.068 | 1.936 | 34.098 | 33.153 | 552.8 |
| opsahl-openflights | 11550 | 16228 | 24433 | 22204 | 0.034 | 0.649 | 0.643 | 0.648 | 433.9 |
| p2p-Gnutella31 | 7227 | 18451 | 10002 | 10707 | 0.190 | 5.732 | 1.317 | 1.444 | 157.1 |
| polblogs | 10296 | 14006 | 46648 | 38369 | 0.029 | 0.136 | 0.516 | 0.435 | 189.5 |
| soc-Epinions1 | 9273 | 16228 | 13571 | 12849 | 0.450 | 12.115 | 7.661 | 7.629 | 662.0 |
| subelj-cora-cora | 11297 | 20673 | 20940 | 22204 | 0.502 | 2.135 | 0.937 | 1.073 | 445.6 |
| subelj-jdk-jdk | 8360 | 14006 | 10045 | 10707 | 0.052 | 0.288 | 0.080 | 0.049 | 51.6 |
| subelj-jung-j-jung-j | 8712 | 16228 | 10319 | 10707 | 0.046 | 0.312 | 0.068 | 0.042 | 45.6 |
| wiki-Vote | 8668 | 14006 | 12406 | 12849 | 0.408 | 0.659 | 0.380 | 0.429 | 152.6 |
| λ = 0.020 | | | | | | | | | |
| as-caida20071105 | 9086 | 9129 | 31242 | 32139 | 0.104 | 2.226 | 4.954 | 5.087 | 1064.2 |
| cfinder-google | 8745 | 9129 | | | 0.353 | 0.946 | | | 551.9 |
| cit-HepTh | 8679 | 10379 | 27755 | 32139 | 1.249 | 1.442 | 2.225 | 2.684 | 5758.0 |
| ego-gplus | 4785 | 4129 | | 1478684 | 0.081 | 0.395 | | 9.234 | 2.6 |
| ego-twitter | 4950 | 7879 | | 138201 | 0.083 | 0.743 | | 5.079 | 2.4 |
| freeassoc | 4268 | 10379 | 6509 | 6227 | 0.065 | 0.609 | 0.078 | 0.073 | 216.4 |
| lasagne-spanishbook | 8338 | 9129 | 294793 | 286577 | 0.058 | 1.074 | 22.405 | 22.468 | 555.0 |
| opsahl-openflights | 7392 | 9129 | 14202 | 12915 | 0.029 | 0.364 | 0.390 | 0.391 | 432.3 |
| p2p-Gnutella31 | 4697 | 10379 | 5700 | 6227 | 0.190 | 3.162 | 0.695 | 0.816 | 156.7 |
| polblogs | 6325 | 7879 | 25593 | 22318 | 0.023 | 0.076 | 0.283 | 0.252 | 188.4 |
| soc-Epinions1 | 5489 | 9129 | 7686 | 7473 | 0.457 | 6.738 | 4.506 | 4.335 | 651.8 |
| subelj-cora-cora | 6325 | 11629 | 12437 | 12915 | 0.500 | 1.203 | 0.571 | 0.520 | 450.8 |
| subelj-jdk-jdk | 5456 | 9129 | 6070 | 6227 | 0.191 | 0.192 | 0.062 | 0.044 | 52.3 |
| subelj-jung-j-jung-j | 5643 | 9129 | | 6227 | 0.217 | 0.176 | | 0.045 | 46.6 |
| wiki-Vote | 4939 | 7879 | 7125 | 7473 | 0.075 | 0.368 | 0.221 | 0.259 | 152.2 |
| λ = 0.025 | | | | | | | | | |
| as-caida20071105 | 5723 | 5843 | 21020 | 21233 | 0.022 | 1.465 | 3.129 | 3.340 | 1093.4 |
| cfinder-google | 6275 | 5843 | | | 0.019 | 0.648 | | | 758.0 |
| cit-HepTh | 5206 | 6643 | 15915 | 21233 | 0.034 | 0.940 | 1.351 | 1.891 | 6130.5 |
| ego-gplus | 2989 | 5043 | | 4200646 | 0.013 | 0.485 | | 20.309 | 2.6 |
| ego-twitter | 2958 | 2643 | | 157779 | 0.012 | 0.248 | | 6.291 | 2.4 |
| freeassoc | 2804 | 6643 | 4285 | 4114 | 0.009 | 0.399 | 0.061 | 0.058 | 261.5 |
| lasagne-spanishbook | 5409 | 5043 | 129999 | 131482 | 0.013 | 0.592 | 10.040 | 10.221 | 626.1 |
| opsahl-openflights | 4557 | 5843 | 10116 | 8532 | 0.009 | 0.236 | 0.290 | 0.267 | 561.3 |
| p2p-Gnutella31 | 3069 | 6643 | 3931 | 4114 | 0.043 | 2.149 | 0.590 | 0.663 | 176.8 |
| polblogs | 3880 | 5043 | 15986 | 14745 | 0.007 | 0.049 | 0.185 | 0.176 | 241.9 |
| soc-Epinions1 | 3689 | 5843 | 5060 | 4937 | 0.188 | 4.158 | 2.798 | 2.791 | 888.1 |
| subelj-cora-cora | 5264 | 7443 | 7699 | 8532 | 0.020 | 0.781 | 0.360 | 0.408 | 436.5 |
| subelj-jdk-jdk | 3201 | 5843 | 9428 | 4937 | 0.008 | 0.122 | 0.065 | 0.036 | 57.2 |
| subelj-jung-j-jung-j | 3168 | 5043 | 13471 | 5925 | 0.007 | 0.098 | 0.057 | 0.045 | 57.7 |
| wiki-Vote | 3265 | 5043 | 4566 | 4937 | 0.009 | 0.241 | 0.137 | 0.178 | 174.7 |
| λ = 0.030 | | | | | | | | | |
| as-caida20071105 | 3956 | 4057 | 12696 | 15202 | 0.017 | 1.029 | 1.973 | 2.434 | 1285.2 |
| cfinder-google | 4419 | 4057 | | | 0.013 | 0.412 | | | 770.5 |
| cit-HepTh | 4062 | 4613 | 13172 | 12668 | 0.033 | 0.672 | 1.195 | 1.059 | 6131.6 |
| ego-gplus | 2434 | 1835 | | 4330990 | 0.009 | 0.188 | | 21.395 | 3.1 |
| ego-twitter | 2270 | 1835 | 98839 | 135562 | 0.008 | 0.174 | 4.909 | 5.510 | 2.2 |
| freeassoc | 2105 | 4613 | 3008 | 3534 | 0.006 | 0.285 | 0.101 | 0.091 | 250.7 |
| lasagne-spanishbook | 3820 | 4057 | 158028 | 94140 | 0.010 | 0.487 | 12.564 | 7.855 | 656.8 |
| opsahl-openflights | 3450 | 4057 | 6556 | 6108 | 0.007 | 0.165 | 0.184 | 0.195 | 481.4 |
| p2p-Gnutella31 | 2367 | 4613 | 2874 | 2945 | 0.036 | 1.412 | 0.422 | 0.445 | 166.5 |
| polblogs | 3567 | 3502 | 11357 | 8796 | 0.007 | 0.036 | 0.151 | 0.122 | 207.9 |
| soc-Epinions1 | 2659 | 4057 | 3585 | 3534 | 0.312 | 3.211 | 2.186 | 2.046 | 918.3 |
| subelj-cora-cora | 3790 | 5169 | 5681 | 5090 | 0.016 | 0.564 | 0.272 | 0.265 | 422.6 |
| subelj-jdk-jdk | 2425 | 4057 | 25575 | 5090 | 0.006 | 0.097 | 0.100 | 0.064 | 57.4 |
| subelj-jung-j-jung-j | 2436 | 3502 | 43584 | 5090 | 0.006 | 0.079 | 0.140 | 0.059 | 57.0 |
| wiki-Vote | 2633 | 3502 | 3467 | 3534 | 0.006 | 0.188 | 0.148 | 0.149 | 188.2 |

G Wikipedia and IMDB Results

In this section, we report our results on the Wikipedia citation network, and on all snapshots of the IMDB actors collaboration network. In the ranking column, we report one number if the position in the ranking is guaranteed with probability 0.9, otherwise we report a lower and an upper bound, which hold with the same probability.
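One natural way to turn confidence intervals on betweenness into the rank intervals of the ranking column is sketched below in Python. This is an illustration under stated assumptions, not necessarily the procedure used by KADABRA: the function name is hypothetical, and the rule assumes that every node missing from the input has an upper bound below all the lower bounds listed. The sample values are the lower and upper bounds of Table 3.

```python
def rank_intervals(bounds):
    """Map each name to (best_rank, worst_rank) from (lower, upper) bounds."""
    ranks = {}
    for v, (lo_v, up_v) in bounds.items():
        # Nodes that are certainly above v: lower bound beats v's upper bound.
        surely_above = sum(
            1 for u, (lo_u, _) in bounds.items() if u != v and lo_u > up_v
        )
        # Nodes that may be above v: upper bound beats v's lower bound.
        maybe_above = sum(
            1 for u, (_, up_u) in bounds.items() if u != v and up_u > lo_v
        )
        ranks[v] = (surely_above + 1, maybe_above + 1)
    return ranks


# (lower bound, upper bound) pairs copied from Table 3.
table3 = {
    "United States": (0.046278, 0.048084),
    "France": (0.019522, 0.020701),
    "United Kingdom": (0.017983, 0.019115),
    "England": (0.016348, 0.017428),
    "Poland": (0.012092, 0.012486),
    "Germany": (0.011930, 0.012321),
    "India": (0.009683, 0.010518),
    "World War II": (0.008870, 0.009265),
    "Russia": (0.008660, 0.009053),
    "Italy": (0.008650, 0.009045),
    "Canada": (0.008624, 0.009018),
    "Australia": (0.008620, 0.009013),
}

for name, (best, worst) in rank_intervals(table3).items():
    label = str(best) if best == worst else f"{best}-{worst}"
    print(f"{label}) {name}")
```

A node gets a single rank when its interval is separated from all the others, and a range when intervals overlap, as for Poland and Germany in Table 3.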

We remark that, as for the IMDB database, the top-k betweenness centralities of a single snapshot of a similar graph (hollywood-2009 in [10]) have been previously computed exactly, with one week of computation on a 40-core machine [47].

G.1 The Results on the IMDB Graph

In 2014, the most central actor is Ron Jeremy, who is listed in the Guinness Book of World Records for “Most Appearances in Adult Films”, with more than 2000 appearances. Among his non-adult ones, we mention The Godfather Part III, Ghostbusters, Crank: High Voltage and Family Guy³. His topmost centrality in the actor collaboration network has been previously observed by similar experiments on betweenness centrality [47]. Indeed, around 3 actors out of 100 in the IMDB database played in adult movies, which explains why the high number of appearances of Ron Jeremy both in the adult and non-adult film industry raises his betweenness to the top.

³ The latter is a TV series, and TV series are not taken into account in our data.

The second most-central actor is Lloyd Kaufman, who is best known as a co-founder of Troma Entertainment Film Studio and as the director of many of their feature films, including the cult movie The Toxic Avenger. His high betweenness score is likely due to his central role in the low-budget independent film industry.

The third “actor” is the historical German dictator Adolf Hitler, since his appearances in historical footage that was re-used in several movies (e.g. in The Imitation Game) are credited by IMDB as cameo roles. Indeed, he appears among the topmost actors since the 1984 snapshot, being the first one in the 1989 and 1994 ones, and during those years many movies about World War II were produced.

Observe that the betweenness centrality measure on our graph does not discriminate between important and marginal roles. For example, the actress Bess Flowers, who appears among the top actors in the snapshots from 1959 to 1979, rarely played major roles, but she appeared in over 700 movies in her 41-year career.

G.2 The Results on the Wikipedia Graph

All topmost pages in the betweenness centrality ranking, except for World War II, are countries. This is not surprising if we consider that, for most topics (such as important people or events), the corresponding Wikipedia page refers to their geographical context (since it mentions the country of origin of the given person or where a given event took place). It is also worth noting the correlation between the high centrality of the World War II Wikipedia page and that of Adolf Hitler in the IMDB graph.

Interestingly, a similar ranking is obtained by considering the closeness centrality measure in the inverse graph, where a link from page p1 to page p2 exists if a link to page p1 appears in page p2 [8]. However, in contrast with the results in [8] when edges are oriented in the usual way, the pages about specific years do not appear in the top ranking. We note that the betweenness centrality of a node in a directed graph does not change if the orientation of all edges is flipped.
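The invariance of betweenness under edge reversal can be checked numerically on a small directed graph. The Python sketch below computes unnormalized betweenness by brute force from shortest-path counts and compares a graph with its reverse. The function names and the toy graph are assumptions made for illustration, and the quadratic-per-pair loop is only suitable for tiny inputs.

```python
from collections import deque


def shortest_path_counts(graph, src):
    """BFS from src: distances and number of shortest paths to each node."""
    dist, sigma = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def reverse(graph):
    """Flip the orientation of every edge."""
    rev = {v: [] for v in graph}
    for v, nbrs in graph.items():
        for w in nbrs:
            rev[w].append(v)
    return rev


def betweenness(graph):
    """Unnormalized betweenness: sum over ordered pairs (s, t) of the
    fraction of shortest s-t paths passing through each inner node."""
    rev = reverse(graph)
    fwd = {s: shortest_path_counts(graph, s) for s in graph}
    bwd = {t: shortest_path_counts(rev, t) for t in graph}
    bc = {v: 0.0 for v in graph}
    for s in graph:
        dist_s, sigma_s = fwd[s]
        for t in graph:
            if t == s or t not in dist_s:
                continue
            dist_t, sigma_t = bwd[t]
            for v in graph:
                if v == s or v == t or v not in dist_s or v not in dist_t:
                    continue
                if dist_s[v] + dist_t[v] == dist_s[t]:
                    bc[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return bc


# Toy directed graph (hypothetical): node -> list of out-neighbors.
g = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [0]}
original, flipped = betweenness(g), betweenness(reverse(g))
print(all(abs(original[v] - flipped[v]) < 1e-12 for v in g))
```

The check works because reversing all edges maps every shortest path from s to t to a shortest path from t to s through the same inner nodes, so each ordered pair contributes the same amount.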

Finally, the most important page is the United States, confirming a common conjecture. Indeed, in http://wikirank.di.unimi.it/, it is shown that the United States is the center according to harmonic centrality, and many other measures. Further evidence for this conjecture comes from the Six Degrees of Wikipedia game (http://thewikigame.com/6-degrees-of-wikipedia), where a player is asked to go from one page to the other following the smallest possible number of links: a hard variant of this game forces the player not to pass through the United States page, which is considered to be central. Our results thus confirm that the conjecture is indeed true for the betweenness centrality measure.


Table 3: The top-k betweenness centralities of the Wikipedia graph computed by KADABRA with δ = 0.1 and λ = 0.0002.

| Ranking | Wikipedia page | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | United States | 0.046278 | 0.047173 | 0.048084 |
| 2 | France | 0.019522 | 0.020103 | 0.020701 |
| 3 | United Kingdom | 0.017983 | 0.018540 | 0.019115 |
| 4 | England | 0.016348 | 0.016879 | 0.017428 |
| 5-6 | Poland | 0.012092 | 0.012287 | 0.012486 |
| 5-6 | Germany | 0.011930 | 0.012124 | 0.012321 |
| 7 | India | 0.009683 | 0.010092 | 0.010518 |
| 8-12 | World War II | 0.008870 | 0.009065 | 0.009265 |
| 8-12 | Russia | 0.008660 | 0.008854 | 0.009053 |
| 8-12 | Italy | 0.008650 | 0.008845 | 0.009045 |
| 8-12 | Canada | 0.008624 | 0.008819 | 0.009018 |
| 8-12 | Australia | 0.008620 | 0.008814 | 0.009013 |

Table 4: The top-k betweenness centralities of a snapshot of the IMDB collaboration network taken at the end of 1939 (69011 nodes), computed by KADABRA with δ = 0.1 and λ = 0.0002.

| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Meyer, Torben | 0.022331 | 0.022702 | 0.023049 |
| 2 | Roulien, Raul | 0.021361 | 0.021703 | 0.022071 |
| 3 | Myzet, Rudolf | 0.014229 | 0.014525 | 0.014747 |
| 4 | Sten, Anna | 0.013245 | 0.013460 | 0.013723 |
| 5 | Negri, Pola | 0.012509 | 0.012768 | 0.012943 |
| 6-7 | Jung, Shia | 0.012250 | 0.012379 | 0.012509 |
| 6-7 | Ho, Tai-Hau | 0.012195 | 0.012324 | 0.012454 |
| 8 | Goetzke, Bernhard | 0.010721 | 0.010978 | 0.011201 |
| 9-10 | Yamamoto, Togo | 0.010095 | 0.010224 | 0.010354 |
| 9-10 | Kamiyama, Sojin | 0.010087 | 0.010215 | 0.010344 |


| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Meyer, Torben | 0.018320 | 0.018724 | 0.019136 |
| 2 | Kamiyama, Sojin | 0.012629 | 0.012964 | 0.013308 |
| 3-4 | Jung, Shia | 0.010751 | 0.010901 | 0.011053 |
| 3-4 | Ho, Tai-Hau | 0.010704 | 0.010854 | 0.011005 |
| 5 | Myzet, Rudolf | 0.010365 | 0.010514 | 0.010666 |
| 6-7 | Sten, Anna | 0.009778 | 0.009928 | 0.010080 |
| 6-7 | Goetzke, Bernhard | 0.009766 | 0.009915 | 0.010066 |
| 8 | Yamamoto, Togo | 0.009108 | 0.009327 | 0.009539 |
| 9 | París, Manuel | 0.008649 | 0.008859 | 0.009108 |
| 10 | Hayakawa, Sessue | 0.007916 | 0.008158 | 0.008369 |



| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Meyer, Torben | 0.016139 | 0.016679 | 0.017236 |
| 2 | Kamiyama, Sojin | 0.012351 | 0.012822 | 0.013312 |
| 3 | París, Manuel | 0.011104 | 0.011552 | 0.011861 |
| 4 | Yamamoto, Togo | 0.010342 | 0.010639 | 0.011086 |
| 5-6 | Jung, Shia | 0.008926 | 0.009120 | 0.009318 |
| 5-6 | Goetzke, Bernhard | 0.008567 | 0.008762 | 0.008962 |
| 7-9 | Paananen, Tuulikki | 0.008147 | 0.008341 | 0.008539 |
| 7-9 | Sten, Anna | 0.007969 | 0.008164 | 0.008363 |
| 7-9 | Mayer, Ruby | 0.007967 | 0.008162 | 0.008362 |
| 10-12 | Ho, Tai-Hau | 0.007538 | 0.007732 | 0.007930 |
| 10-12 | Hayakawa, Sessue | 0.007399 | 0.007593 | 0.007792 |
| 10-12 | Haas, Hugo (I) | 0.007158 | 0.007352 | 0.007552 |

Table 7: The top-k betweenness centralities of a snapshot of the IMDB collaboration network taken at the end of 1954 (120430 nodes), computed by KADABRA with δ = 0.1 and λ = 0.0002.

| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Meyer, Torben | 0.013418 | 0.013868 | 0.014334 |
| 2 | Kamiyama, Sojin | 0.010331 | 0.010726 | 0.011089 |
| 3-4 | Ertugrul, Muhsin | 0.009956 | 0.010141 | 0.010331 |
| 3-4 | Jung, Shia | 0.009643 | 0.009826 | 0.010013 |
| 5-6 | Singh, Ram (I) | 0.008657 | 0.008841 | 0.009030 |
| 5-6 | Paananen, Tuulikki | 0.008383 | 0.008567 | 0.008755 |
| 7-9 | París, Manuel | 0.007886 | 0.008070 | 0.008257 |
| 7-10 | Goetzke, Bernhard | 0.007802 | 0.007987 | 0.008176 |
| 7-10 | Yamaguchi, Shirley | 0.007531 | 0.007716 | 0.007905 |
| 8-10 | Hayakawa, Sessue | 0.007473 | 0.007657 | 0.007845 |


| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1-2 | Singh, Ram (I) | 0.010683 | 0.010877 | 0.011075 |
| 1-2 | Frees, Paul | 0.010372 | 0.010566 | 0.010763 |
| 3 | Meyer, Torben | 0.009478 | 0.009821 | 0.010235 |
| 4-5 | Jung, Shia | 0.008623 | 0.008816 | 0.009013 |
| 4-5 | Ghosh, Sachin | 0.008459 | 0.008651 | 0.008847 |
| 6-7 | Myzet, Rudolf | 0.007085 | 0.007278 | 0.007476 |
| 6-7 | Yamaguchi, Shirley | 0.006908 | 0.007101 | 0.007299 |
| 8 | de Cordova, Arturo | 0.006391 | 0.006582 | 0.006778 |
| 9-11 | Kamiyama, Sojin | 0.005861 | 0.006054 | 0.006254 |
| 9-12 | Paananen, Tuulikki | 0.005810 | 0.006003 | 0.006202 |
| 9-12 | Flowers, Bess | 0.005620 | 0.005813 | 0.006012 |
| 10-12 | París, Manuel | 0.005442 | 0.005635 | 0.005835 |



| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Frees, Paul | 0.013140 | 0.013596 | 0.014067 |
| 2 | Meyer, Torben | 0.007279 | 0.007617 | 0.007856 |
| 3-4 | Harris, Sam (II) | 0.006813 | 0.006967 | 0.007124 |
| 3-5 | Myzet, Rudolf | 0.006696 | 0.006849 | 0.007005 |
| 4-5 | Flowers, Bess | 0.006422 | 0.006572 | 0.006726 |
| 6 | Kong, King (I) | 0.005909 | 0.006104 | 0.006422 |
| 7 | Yuen, Siu Tin | 0.005114 | 0.005264 | 0.005420 |
| 8 | Miller, Marvin (I) | 0.004708 | 0.004859 | 0.005015 |
| 9-12 | de Cordova, Arturo | 0.004147 | 0.004299 | 0.004457 |
| 9-18 | Haas, Hugo (I) | 0.003888 | 0.004039 | 0.004197 |
| 9-18 | Singh, Ram (I) | 0.003854 | 0.004004 | 0.004160 |
| 9-18 | Kamiyama, Sojin | 0.003848 | 0.003999 | 0.004155 |
| 10-18 | Sauli, Anneli | 0.003827 | 0.003978 | 0.004135 |
| 10-18 | King, Walter Woolf | 0.003774 | 0.003923 | 0.004078 |
| 10-18 | Vanel, Charles | 0.003716 | 0.003867 | 0.004024 |
| 10-18 | Kowall, Mitchell | 0.003684 | 0.003834 | 0.003990 |
| 10-18 | Holmes, Stuart | 0.003603 | 0.003752 | 0.003907 |
| 10-18 | Sten, Anna | 0.003582 | 0.003733 | 0.003890 |


| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Frees, Paul | 0.010913 | 0.011446 | 0.012005 |
| 2-3 | Yuen, Siu Tin | 0.006157 | 0.006349 | 0.006547 |
| 2-3 | Tamiroff, Akim | 0.006097 | 0.006291 | 0.006490 |
| 4-6 | Meyer, Torben | 0.005675 | 0.005869 | 0.006069 |
| 4-7 | Harris, Sam (II) | 0.005639 | 0.005830 | 0.006027 |
| 4-8 | Rubener, Sujata | 0.005427 | 0.005618 | 0.005815 |
| 5-8 | Myzet, Rudolf | 0.005253 | 0.005444 | 0.005641 |
| 6-8 | Flowers, Bess | 0.005136 | 0.005328 | 0.005526 |
| 9-10 | Kong, King (I) | 0.004354 | 0.004544 | 0.004741 |
| 9-10 | Sullivan, Elliott | 0.004208 | 0.004398 | 0.004596 |



| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Frees, Paul | 0.008507 | 0.008958 | 0.009295 |
| 2 | Chen, Sing | 0.007734 | 0.008056 | 0.008507 |
| 3 | Welles, Orson | 0.006115 | 0.006497 | 0.006903 |
| 4-5 | Loren, Sophia | 0.005056 | 0.005221 | 0.005392 |
| 4-7 | Rubener, Sujata | 0.004767 | 0.004933 | 0.005106 |
| 5-8 | Harris, Sam (II) | 0.004628 | 0.004795 | 0.004967 |
| 5-8 | Tamiroff, Akim | 0.004625 | 0.004790 | 0.004962 |
| 6-10 | Meyer, Torben | 0.004382 | 0.004548 | 0.004720 |
| 8-12 | Flowers, Bess | 0.004259 | 0.004425 | 0.004598 |
| 8-12 | Yuen, Siu Tin | 0.004229 | 0.004397 | 0.004571 |
| 9-12 | Carradine, John | 0.004026 | 0.004192 | 0.004364 |
| 9-12 | Myzet, Rudolf | 0.003984 | 0.004151 | 0.004325 |


| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Chen, Sing | 0.007737 | 0.008220 | 0.008647 |
| 2 | Frees, Paul | 0.006852 | 0.007255 | 0.007737 |
| 3-5 | Welles, Orson | 0.004894 | 0.005075 | 0.005263 |
| 3-6 | Carradine, John | 0.004623 | 0.004803 | 0.004989 |
| 3-6 | Loren, Sophia | 0.004614 | 0.004796 | 0.004985 |
| 4-6 | Rubener, Sujata | 0.004284 | 0.004464 | 0.004651 |
| 7-17 | Tamiroff, Akim | 0.003516 | 0.003696 | 0.003885 |
| 7-17 | Meyer, Torben | 0.003479 | 0.003657 | 0.003844 |
| 7-17 | Quinn, Anthony (I) | 0.003447 | 0.003626 | 0.003815 |
| 7-17 | Flowers, Bess | 0.003446 | 0.003625 | 0.003815 |
| 7-17 | Mitchell, Gordon (I) | 0.003417 | 0.003596 | 0.003785 |
| 7-17 | Sullivan, Elliott | 0.003371 | 0.003551 | 0.003740 |
| 7-17 | Rietty, Robert | 0.003368 | 0.003547 | 0.003735 |
| 7-17 | Tanba, Tetsuro | 0.003360 | 0.003537 | 0.003724 |
| 7-17 | Harris, Sam (II) | 0.003331 | 0.003510 | 0.003699 |
| 7-17 | Lewgoy, Jose | 0.003223 | 0.003402 | 0.003590 |
| 7-17 | Dalio, Marcel | 0.003185 | 0.003364 | 0.003553 |



| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Chen, Sing | 0.007245 | 0.007716 | 0.008218 |
| 2-4 | Welles, Orson | 0.005202 | 0.005391 | 0.005587 |
| 2-4 | Frees, Paul | 0.005174 | 0.005363 | 0.005559 |
| 2-5 | Hitler, Adolf | 0.004906 | 0.005094 | 0.005290 |
| 4-6 | Carradine, John | 0.004744 | 0.004932 | 0.005127 |
| 5-7 | Mitchell, Gordon (I) | 0.004418 | 0.004606 | 0.004802 |
| 6-8 | Jurgens, Curd | 0.004169 | 0.004356 | 0.004551 |
| 7-8 | Kinski, Klaus | 0.003938 | 0.004123 | 0.004318 |
| 9-12 | Rubener, Sujata | 0.003396 | 0.003585 | 0.003785 |
| 9-12 | Lee, Christopher (I) | 0.003391 | 0.003576 | 0.003771 |
| 9-12 | Loren, Sophia | 0.003357 | 0.003542 | 0.003738 |
| 9-12 | Harrison, Richard (II) | 0.003230 | 0.003417 | 0.003614 |


| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1-2 | Hitler, Adolf | 0.005282 | 0.005467 | 0.005658 |
| 1-3 | Chen, Sing | 0.005008 | 0.005192 | 0.005382 |
| 2-4 | Carradine, John | 0.004648 | 0.004834 | 0.005027 |
| 3-4 | Harrison, Richard (II) | 0.004515 | 0.004697 | 0.004887 |
| 5-6 | Welles, Orson | 0.004088 | 0.004271 | 0.004462 |
| 5-9 | Mitchell, Gordon (I) | 0.003766 | 0.003948 | 0.004139 |
| 6-9 | Kinski, Klaus | 0.003691 | 0.003874 | 0.004065 |
| 6-11 | Lee, Christopher (I) | 0.003610 | 0.003793 | 0.003984 |
| 6-11 | Frees, Paul | 0.003582 | 0.003766 | 0.003960 |
| 8-13 | Jurgens, Curd | 0.003306 | 0.003486 | 0.003676 |
| 8-13 | Pleasence, Donald | 0.003299 | 0.003479 | 0.003670 |
| 10-13 | Mitchell, Cameron (I) | 0.003105 | 0.003285 | 0.003476 |
| 10-13 | von Sydow, Max (I) | 0.002982 | 0.003161 | 0.003350 |



| Ranking | Actor | Lower bound | Estimated betweenness | Upper bound |
|---|---|---|---|---|
| 1 | Hitler, Adolf | 0.005227 | 0.005676 | 0.006164 |
| 2-6 | Harrison, Richard (II) | 0.003978 | 0.004165 | 0.004362 |
| 2-6 | von Sydow, Max (I) | 0.003884 | 0.004069 | 0.004264 |
| 2-7 | Lee, Christopher (I) | 0.003718 | 0.003907 | 0.004106 |
| 2-7 | Carradine, John | 0.003696 | 0.003883 | 0.004079 |
| 2-7 | Chen, Sing | 0.003683 | 0.003871 | 0.004068 |
| 4-10 | Jeremy, Ron | 0.003336 | 0.003524 | 0.003722 |
| 7-11 | Pleasence, Donald | 0.003253 | 0.003439 | 0.003637 |
| 7-11 | Rey, Fernando (I) | 0.003234 | 0.003420 | 0.003617 |
| 7-15 | Smith, William (I) | 0.003012 | 0.003199 | 0.003397 |
| 8-15 | Welles, Orson | 0.002885 | 0.003072 | 0.003271 |
| 10-15 | Mitchell, Gordon (I) | 0.002851 | 0.003036 | 0.003232 |
| 10-15 | Kinski, Klaus | 0.002705 | 0.002890 | 0.003087 |
| 10-15 | Mitchell, Cameron (I) | 0.002671 | 0.002858 | 0.003058 |
| 10-15 | Quinn, Anthony (I) | 0.002640 | 0.002826 | 0.003026 |


Ranking  Actor                    Lower bound  Estimated betweenness  Upper bound
1)       Jeremy, Ron              0.007380     0.007913               0.008484
2)       Hitler, Adolf            0.004601     0.005021               0.005480
3-4)     Lee, Christopher (I)     0.003679     0.003849               0.004028
3-4)     von Sydow, Max (I)       0.003604     0.003775               0.003953
5-6)     Harrison, Richard (II)   0.003041     0.003211               0.003390
5-7)     Carradine, John          0.002943     0.003114               0.003296
6-11)    Chen, Sing               0.002662     0.002834               0.003018
7-14)    Rey, Fernando (I)        0.002569     0.002740               0.002922
7-14)    Smith, William (I)       0.002559     0.002729               0.002910
7-14)    Pleasence, Donald        0.002556     0.002725               0.002906
7-14)    Sutherland, Donald (I)   0.002449     0.002617               0.002796
8-14)    Quinn, Anthony (I)       0.002307     0.002476               0.002658
8-14)    Mastroianni, Marcello    0.002271     0.002440               0.002621
8-14)    Saxon, John              0.002251     0.002420               0.002602



Ranking  Actor                    Lower bound  Estimated betweenness  Upper bound
1)       Jeremy, Ron              0.010653     0.011370               0.012136
2)       Hitler, Adolf            0.005333     0.005840               0.006396
3-4)     von Sydow, Max (I)       0.003424     0.003608               0.003802
3-4)     Lee, Christopher (I)     0.003403     0.003587               0.003781
5-6)     Kier, Udo                0.002898     0.003081               0.003275
5-8)     Keitel, Harvey (I)       0.002646     0.002828               0.003023
6-12)    Hopper, Dennis           0.002424     0.002607               0.002804
6-16)    Smith, William (I)       0.002322     0.002504               0.002700
7-17)    Sutherland, Donald (I)   0.002241     0.002422               0.002617
7-23)    Carradine, David         0.002149     0.002329               0.002526
7-23)    Carradine, John          0.002147     0.002328               0.002524
7-23)    Harrison, Richard (II)   0.002054     0.002234               0.002430
8-23)    Sharif, Omar             0.002043     0.002222               0.002418
8-23)    Steiger, Rod             0.001988     0.002165               0.002358
8-23)    Quinn, Anthony (I)       0.001974     0.002151               0.002344
8-23)    Depardieu, Gerard        0.001966     0.002148               0.002346
9-23)    Sheen, Martin            0.001913     0.002093               0.002291
10-23)   Rey, Fernando (I)        0.001866     0.002044               0.002238
10-23)   Kane, Sharon             0.001857     0.002038               0.002237
10-23)   Pleasence, Donald        0.001859     0.002037               0.002232
10-23)   Skarsgard, Stellan       0.001848     0.002026               0.002221
10-23)   Mueller-Stahl, Armin     0.001789     0.001969               0.002166
10-23)   Hong, James (I)          0.001780     0.001957               0.002152


Ranking  Actor                    Lower bound  Estimated betweenness  Upper bound
1)       Jeremy, Ron              0.010531     0.011237               0.011991
2)       Hitler, Adolf            0.005500     0.006011               0.006568
3-4)     Kaufman, Lloyd           0.003620     0.003804               0.003997
3-4)     Kier, Udo                0.003472     0.003654               0.003845
5-6)     Lee, Christopher (I)     0.003056     0.003240               0.003435
5-8)     Carradine, David         0.002866     0.003050               0.003245
6-8)     Keitel, Harvey (I)       0.002659     0.002840               0.003034
6-9)     von Sydow, Max (I)       0.002532     0.002713               0.002907
8-13)    Hopper, Dennis           0.002237     0.002419               0.002616
9-15)    Skarsgard, Stellan       0.002153     0.002333               0.002529
9-15)    Depardieu, Gerard        0.002001     0.002181               0.002377
9-15)    Hauer, Rutger            0.001894     0.002074               0.002271
9-15)    Sutherland, Donald (I)   0.001875     0.002054               0.002250
10-15)   Smith, William (I)       0.001811     0.001990               0.002186
10-15)   Dafoe, Willem            0.001805     0.001986               0.002186


Table 19: The top-k betweenness centralities of a snapshot of the IMDB collaboration network taken in 2014 (1797446 nodes), computed by KADABRA with δ = 0.1 and λ = 0.0002.

Ranking  Actor                    Lower bound  Estimated betweenness  Upper bound
1)       Jeremy, Ron              0.009360     0.010058               0.010808
2)       Kaufman, Lloyd           0.005936     0.006492               0.007100
3)       Hitler, Adolf            0.004368     0.004844               0.005373
4-6)     Kier, Udo                0.003250     0.003435               0.003631
4-6)     Roberts, Eric (I)        0.003178     0.003362               0.003557
4-6)     Madsen, Michael (I)      0.003120     0.003305               0.003501
7-9)     Trejo, Danny             0.002652     0.002835               0.003030
7-9)     Lee, Christopher (I)     0.002551     0.002734               0.002931
7-12)    Estevez, Joe             0.002350     0.002534               0.002732
9-17)    Carradine, David         0.002116     0.002296               0.002492
9-17)    von Sydow, Max (I)       0.002023     0.002206               0.002405
9-17)    Keitel, Harvey (I)       0.001974     0.002154               0.002352
10-17)   Skarsgard, Stellan       0.001945     0.002125               0.002323
10-17)   Dafoe, Willem            0.001899     0.002080               0.002279
10-17)   Hauer, Rutger            0.001891     0.002071               0.002269
10-17)   Depardieu, Gerard        0.001763     0.001943               0.002142
10-17)   Rochon, Debbie           0.001745     0.001926               0.002126

