
[IEEE 2014 Sixth International Conference on Communication Systems and Networks (COMSNETS) - Bangalore, India (2014.01.6-2014.01.10)]

Online Network Inference under Dynamic Cascade Updates: A Node-Centric Approach

Srinivas Karthik IISc, Bangalore

Rama Kumar Pasumarthi IBM India Research Lab

[email protected], [email protected]

Ayush Choure Amazon.com

[email protected]

Vinayaka Pandit IBM India Research Lab

[email protected]

Abstract-Network inference is the process of inferring the structure of an unknown underlying network, based on observations of the propagation of different contagions through the network. All existing works consider the setting in which the information of the different propagations is available to the computation at the beginning. We introduce the problem of online network inference, in which the propagation information is revealed dynamically in batches. We present a new greedy heuristic that is amenable to online extension and derive two online inference algorithms. We present extensive experimental results that show the computational gains the online algorithms provide without losing much on the accuracy of the inferences.

I. INTRODUCTION

Diffusion models are essential for studying the propagation of infection or information on an underlying network. They are a popular tool in diverse domains like epidemiology [20], viral marketing [22], social networks [10], [8], and information propagation in social media [1], [6], [16]. There is a vast literature based on diffusion models, focused on understanding the influence that the structure of the network has on the patterns of propagation.

Recently, there has been considerable focus on the inverse problem of inferring the structure of the unknown underlying graph based on observations of the propagation of different contagions. The propagation of each contagion is observed in terms of a cascade, which essentially records the set of nodes that adopted the particular contagion and the times at which they adopted it. Typically, the network inference step is a precursor to important applications like influence maximization, identifying targets for viral marketing, etc.

There has been a flurry of work on network inference using observations of cascades [6], [16], [7], [17], [19], [18], [21], [4] based on the independent cascade (IC) model [8]. From an algorithmic viewpoint, these approaches fall into two categories: greedy approaches [6], [17] and those based on convex optimization [16], [7], [21]. The greedy approaches are more closely related to our work; they make greedy choices iteratively based on different likelihood formulations of observing the cascade data given an inferred network. The convex optimization approaches typically formulate optimization programs for maximizing the likelihood of the cascade data given the inferred network and solve them using suitable optimization techniques.

All of the above works assume an offline setting, i.e., the entire cascade data is available at the beginning of the computation. To the best of our knowledge, the setting of online network inference, introduced in this paper, where the cascade information is revealed dynamically in batches, has not been considered before.

978-1-4799-3635-9/14/$31.00 ©2014 IEEE

A. Motivation for Online Network Inference

There are scenarios where the cascade information comes in the form of dynamic updates and the network inference needs to be a continuous process consistent with the information received hitherto. We describe an example scenario called innovation jams.

Innovation jam is a concept popularised by IBM to enable collaborative innovation. It is a specially orchestrated crowdsourcing event within an organization or a set of cooperating organizations to harness the expertise and insight of informal innovation networks of people, which are often hidden from the rigid organizational structure.

Participation in the jams is voluntary and typically encouraged by recognizing the contributions of the participants. Currently, the recognitions are simplistic, like "early birds", "most frequent contributors", etc. But this misses out on the basic thesis of leveraging the hidden innovation networks. Since there are a large number of threads of discussion (similar to tags in Twitter) and the contributors are typically known, the discussion on a thread represents a cascade of the thread's idea through the innovation network (for instance, people typically exchange thoughts on emails and instant messengers before posting publicly). Inferring the hidden innovation network from the cascades can be used for more effective recognition, like "most influential contributors" based on the strength of influence over other participants, rather than simplistic measures like frequency.

The example of innovation jam still resembles the offline case. But the most important aspects of the jams are the following: (i) they are very short-duration events, say, a few hours to a few days at most, and (ii) the recognition of the contributors has to be done visibly and periodically. For instance, if a jam is run for a few hours, the service might want to regularly update the list of most influential contributors. While the scale of the participation is not internet scale, it is still significant. Therefore, for every periodic update, one cannot afford to either ignore historical contributions or redo the network inference from scratch. The computation has to be incremental, discarding only those inferences that need to be revisited based on the dynamic update. We call this online network inference under dynamic updates.

B. Our Contributions

The main contributions of our work are:


• We introduce the problem of online network inference under dynamic batch updates. Our formulation of the problem is based on a goal to achieve a balance between the accuracy of the incrementally inferred networks upon dynamic updates and the extent of discarded inferences.

• We present two different online inference algorithms and extensive experimental results on synthetic and real-life datasets and investigate all the issues involved in the trade-off between performance and runtime.

II. NETWORK INFERENCE UNDER DYNAMIC UPDATES

The models of dynamic cascade information and online inference have not been considered in the literature. The formulation developed here is one of our main contributions. Mathematical models of diffusion, first studied in a seminal work by Kermack and McKendrick [9], are the main tools in studying information propagation. Our setting corresponds to the Susceptible-Infected-Recovered (SIR) model.

A. Cascade Model

Cascades are temporal sequences of infected nodes over a hidden directed network G* = (V, E*), where V is the set of nodes and E* is the set of edges between the nodes (here * denotes the ground truth). If there is an edge in E* from a node u ∈ V to a node v ∈ V, then u is said to be a parent of v and v is said to be a descendant of u. For a node u, the set of its descendants is denoted by D(u) and the set of its parents by P(u). Each cascade is essentially a contagion which originates by infecting some node in the network; the infection then traverses the network by spreading from infected parents to uninfected descendants over a period of time. In what follows, we describe how the cascade of a single contagion propagates as per the independent cascade model used in [6].

A cascade begins with the contagion first infecting an arbitrary node v0 at time t_{v0}. We call v0 the seed of the cascade. Suppose a node v gets infected at t_v. The infection spreads from the infected node v to each node w ∈ D(v) which is uninfected at t_v as follows: (i) v's attempt to infect w succeeds with probability β, (ii) if it succeeds, then v picks an incubation time Δ_{v,w} from the incubation time distribution, and (iii) the infection time proposed by v to w is T_{v,w} = t_v + Δ_{v,w}. If an infected node v fails in its attempt to infect w, we define T_{v,w} to be ∞. Note that the attempt of infection from an infected v to an uninfected node w ∈ D(v) is made only once.

A node v ≠ v0 gets infected if the infection attempt made by any of its infected parents succeeds. The time at which it gets infected is given by t_v = min_{u ∈ IP(v)} {T_{u,v}}, where IP(v) denotes the set of infected parents of v. Let I(t) denote the set of nodes infected before time t. The cascade comes to a halt at a time t_end when the following property holds: the infection attempts along all the edges from I(t_end) to V \ I(t_end) have failed.
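The propagation process just described can be sketched as a small discrete-event simulation. This is an illustrative sketch, not the authors' code: the function and variable names are ours, and the incubation-time sampler is passed in so that any incubation time distribution can be plugged in.

```python
def generate_cascade(descendants, beta, sample_incubation, seed, rng):
    """Simulate one cascade under the independent cascade model described
    above: each newly infected node v makes a single infection attempt on
    every descendant w still uninfected at t_v, succeeding with probability
    beta; on success, w is proposed the time T_{v,w} = t_v + Delta_{v,w},
    and w adopts the earliest successful proposal."""
    proposed = {seed: 0.0}   # earliest proposed infection time per node
    infected = {}            # node -> actual infection time t_v
    while proposed:
        # Process proposals in time order, so "uninfected at t_v" is exact.
        v = min(proposed, key=proposed.get)
        t_v = proposed.pop(v)
        infected[v] = t_v
        for w in descendants.get(v, []):
            if w in infected:
                continue                    # w was already infected
            if rng.random() < beta:         # single attempt, succeeds w.p. beta
                t = t_v + sample_incubation(rng)
                if w not in proposed or t < proposed[w]:
                    proposed[w] = t         # keep the earliest proposal
    return infected                         # the cascade: {node: t_v}
```

For example, on the chain a → b → c with β = 1 and exponential incubation times, every run infects all three nodes with strictly increasing infection times.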

In this paper, we work with three popular incubation time distributions used in previous works, based on evidence observed in [2], [13]:

Power-law: f(Δ_{u,v}; α) ∝ Δ_{u,v}^(-α)

Exponential: f(Δ_{u,v}; α) ∝ e^(-Δ_{u,v}/α)

Rayleigh: f(Δ_{u,v}; α) ∝ (Δ_{u,v}/α²) e^(-Δ_{u,v}²/(2α²))

B. Dynamic Cascade Information

Cascade Information: The input to the network inference problem is a set of independently occurring cascades over the same underlying network. We denote the ith cascade by C_i. The set of nodes infected by C_i is denoted by V(C_i). The cascade information for C_i is given by the set {(v, t_v) : v ∈ V(C_i)}. For ease of notation, we use C_i to denote the cascade information as well. Since a node v can be infected by multiple cascades, we denote its infection time in C_i by t_v^i. If v is not infected in C_i, t_v^i is defined to be ∞. We denote the set of all cascades by C.

In the dynamic cascade model, time is divided into multiple epochs B_j = (T_j, T_{j+1}] for j ≥ 0. For a cascade C_i, its information in the epoch B_j is denoted by C_i^j = {(v, t_v^i) : v ∈ V(C_i) and T_j < t_v^i ≤ T_{j+1}}. We denote the set of all epochs by B. The dynamic information model reveals the information in batches. In batch j, the information corresponding to the jth epoch is released, i.e., C_i^j for every C_i ∈ C; this collection is denoted by C^j:

C^j = ∪_{C_i ∈ C} C_i^j
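The epoch decomposition can be made concrete in a few lines. A sketch with illustrative names (not from the paper's code): each cascade is a dict of node → infection time, and `boundaries` is the sorted list [T_0, T_1, …] of epoch endpoints.

```python
def batch_cascades(cascades, boundaries):
    """Split cascade information into per-epoch batches: batch j collects,
    for every cascade C_i, the infections whose times fall in the epoch
    B_j = (T_j, T_{j+1}]."""
    batches = []
    for j in range(len(boundaries) - 1):
        lo, hi = boundaries[j], boundaries[j + 1]
        batch = {}
        for i, info in cascades.items():
            part = {v: t for v, t in info.items() if lo < t <= hi}
            if part:
                batch[i] = part   # C_i^j is nonempty in this epoch
        batches.append(batch)
    return batches
```

The union of all batches recovers the full cascade information, matching the definition of C^j above.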

C. Online Network Inference

In [6], offline network inference is defined as follows: Given the cascade information for each C_i ∈ C, β, and the incubation distribution function, infer the graph H containing at most k edges s.t. H = argmax_{|G| ≤ k} L(C|G). Here, L(C|G) denotes the function which evaluates to the likelihood of observing the cascade information C if the underlying network is G. Further, L(C|G) itself is given by

L(C|G) = ∏_{C_i ∈ C} L(C_i|G)

where L(C_i|G) is the likelihood of observing C_i if G* = G.

In online network inference, the goal is to infer a network H_j at the end of each epoch B_j so as to maximize L(∪_{l ≤ j} C^l | H_j). At the same time, the online computation cannot afford to completely discard the decisions made in arriving at H_{j-1}.

Let H_j denote the network inferred by the computation at the end of the jth batch:

H_j = argmax_{|G| ≤ k} L(∪_{l ≤ j} C^l | G)

Due to the new information in C^{j+1}, the computation has to discard some of the old inferences judiciously. In the offline setting, all those old inferences invalidated by newer data are discarded. The online version gives an approximate solution to the offline problem by reusing H_{j-1} as much as possible while minimizing the number of discarded inferences, thus saving time.

Comment: In this paper, we develop a heuristic approach for the online inference problem. Quantifying the extent of reuse of previous computations (say, using H_{j-1} for computing H_j) in a theoretical framework is left open for future work.

A novel algorithmic approach is needed for online network inference because the existing algorithms [16], [7], [21] are either inherently offline or are based on greedy approaches to maximize likelihood which are not amenable to incremental computation.

III. NODE-CENTRIC APPROACH FOR ONLINE INFERENCE

As explained in Section II, the problem with an approach that attempts to approximate the likelihood of observed information conditioned on the inferred graph is that the new information can affect the likelihood computations in a manner that makes it difficult to identify all the past decisions that are invalidated. Therefore, we take a different approach that makes it easy to determine the past decisions that need to be discarded.

We think of the task of the algorithm as that of inferring the parents of every node v ∈ V (this is similar in spirit to [17]). At any point of the inference process, we associate a measure with each node v that indicates how much the parent neighborhood of v still needs to be inferred in order to explain its occurrence in the cascades that it appears in. At each point in decision making, we first pick the node whose measure is highest and then pick a parent which provides the highest evidence of being one of the parents (based on cascade information). Because of this, when a new batch appears, it is easy to identify the nodes for which the greedy measure used in the past computation changes: it is precisely those nodes which appear in the current batch and for which some parent has already been inferred.

We use i to denote the index of cascades and j to denote the index of epochs (we shall specifically highlight if we deviate from this norm). At any point in time, we denote the inferred graph by G^I. For a node v, we denote the set of its parents in G^I by P^I(v) and its descendants by D^I(v).

We need some notation to denote the parent-descendant relationships suggested by observed cascades. With respect to a cascade C_i and a node v that appears in C_i, we denote the set of potential parents of v when C_i is viewed in isolation by PP_i(v), where PP_i(v) = {u ∈ C_i : t_u^i < t_v^i}. For a node v, we denote the set of potential parents as per the set of all cascades by PP(v), i.e., PP(v) = ∪_{C_i ∈ C} PP_i(v).

We need some notation to indicate the set of cascades relevant from the viewpoint of nodes and edges. For a node u, we denote the set of all cascades containing u by C(u). Similarly, for an edge (u, v), we denote the set of all cascades in which both u and v appear, with u preceding v, by C(u, v). We first present the offline algorithm to bring out the main ideas of our approach and then extend it to the online setting.

A. Offline Algorithm

As explained before, the key aspect of our approach is that it tries to pick (parent, descendant) relationships in such a way that, for any v ∈ V and C_i ∈ C(v), the picked parent set P^I(v) is such that: when C_i is viewed in isolation, there is a high likelihood of the infection reaching v at t_v^i from one of the nodes in P^I(v). Conceptually, this is where we depart from likelihood computations (which take all cascades into account), and this turns out to be a limiting factor as seen in the experiments. To formalize this, we need some more notation.

For two nodes u and v appearing in a cascade C_i such that t_u^i < t_v^i, we define f_i(u, v) as follows: f_i(u, v) = f(t_v^i | t_u^i; α). For a given cascade C_i, L_i(u, v) is the probability of u being the parent of v among all the possibilities suggested by the cascade C_i alone; this is defined only when both u and v appear in C_i and t_u^i < t_v^i:

L_i(u, v) = f_i(u, v) / Σ_{u' ∈ PP_i(v)} f_i(u', v)    (1)

Note that L_i(u, v) is defined in such a way that Σ_{u ∈ PP_i(v)} L_i(u, v) = 1. This leads us to the definition of another main quantity, Req_i(v), which denotes the extent to which the occurrence of the node v in C_i still needs to be explained (or the extent of "requirement" of another parent for this node in the cascade):

Req_i(v) = 1 − Σ_{u ∈ P^I(v) ∩ PP_i(v)} L_i(u, v)

Req(v) denotes the extent to which v's occurrence across the different cascades still requires to be explained. Note that initially Req(v) = 1 for all v. We define Req(v) such that it achieves a balance between two quantities: the average of its Req_i(v) values and max_{C_i ∈ C(v)} Req_i(v). It is computed as

Req(v) = Avg_i(Req_i(v)) · Max_i(Req_i(v))
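To make the definitions concrete, here is a small sketch computing L_i(u, v), Req_i(v), and Req(v). The helper names are ours, not the paper's; `f` is the incubation density f(Δ; α) and a cascade is represented as a dict of node → infection time.

```python
def parent_probs(cascade, v, f):
    """L_i(u, v): normalize f_i(u, v) = f(t_v - t_u) over the potential
    parents PP_i(v) = {u : t_u < t_v}, so the weights sum to 1. Returns {}
    for a seed node, which has no potential parents."""
    raw = {u: f(cascade[v] - cascade[u])
           for u in cascade if cascade[u] < cascade[v]}
    total = sum(raw.values())
    return {u: w / total for u, w in raw.items()} if total else {}

def req(cascades, v, inferred_parents, f):
    """Req(v) = Avg_i(Req_i(v)) * Max_i(Req_i(v)), where
    Req_i(v) = 1 - sum of L_i(u, v) over already-inferred parents u
    that are potential parents in cascade C_i."""
    per_cascade = []
    for info in cascades.values():
        if v not in info:
            continue
        L = parent_probs(info, v, f)
        per_cascade.append(1.0 - sum(L[u] for u in inferred_parents if u in L))
    if not per_cascade:
        return 0.0
    return (sum(per_cascade) / len(per_cascade)) * max(per_cascade)
```

With no inferred parents, every Req_i(v) is 1 and hence Req(v) = 1, matching the initialization above; once all potential parents are inferred, Req(v) drops to 0.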

Our node-centric greedy approach first picks the node which has the highest requirement; after that, it greedily picks a parent which provides the highest explanation across C(v). The formal algorithm for picking the k edges of G^I is given in Algorithm 1, henceforth referred to as NOGA (Node-based Offline Greedy Algorithm). The main advantages of this algorithm are that (i) it is nearly as accurate as the best known network inference algorithm, NETINF, but much faster, and (ii) it is amenable to easy extension to the online setting.

Algorithm 1: NOGA

1  Initialization: P^I(v) = ∅ and Req(v) = 1 for all v ∈ V; G^I = ∅;
2  while (|G^I| ≤ k) do
3      Pick the vertex v with the highest Req(v), breaking ties arbitrarily;
4      For u ∈ PP(v) \ P^I(v), set W(u, v) = 0;
5      for u ∈ PP(v) \ P^I(v) do
6          for C_i ∈ C(u, v) do
7              update W(u, v) as W(u, v) += L_i(u, v);
8          end
9      end
10     Let u = argmax_{u' ∈ PP(v) \ P^I(v)} W(u', v);
11     Add u to P^I(v) and (u, v) to G^I;
12     Update Req(v) as per definition;
13 end
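The greedy loop of Algorithm 1 can be prototyped in a few dozen lines. This is our own sketch, not the authors' implementation; the L_i and Req() helpers are inlined, and nodes with no remaining un-inferred potential parents are skipped when picking v (a corner case the pseudocode leaves implicit).

```python
def parent_probs(cascade, v, f):
    # L_i(u, v): single-cascade parent probabilities, summing to 1
    raw = {u: f(cascade[v] - cascade[u])
           for u in cascade if cascade[u] < cascade[v]}
    total = sum(raw.values())
    return {u: w / total for u, w in raw.items()} if total else {}

def noga(cascades, k, f):
    """Sketch of NOGA: repeatedly pick the node v with the highest Req(v),
    then add the candidate parent u maximizing
    W(u, v) = sum over C_i in C(u, v) of L_i(u, v)."""
    nodes = {v for info in cascades.values() for v in info}
    parents = {v: set() for v in nodes}

    def req(v):
        # Req(v) = Avg_i(Req_i(v)) * Max_i(Req_i(v))
        rs = [1.0 - sum(L[u] for u in parents[v] if u in L)
              for info in cascades.values() if v in info
              for L in [parent_probs(info, v, f)]]
        return (sum(rs) / len(rs)) * max(rs) if rs else 0.0

    def weights(v):
        # W(u, v) accumulated over all cascades containing both u and v
        W = {}
        for info in cascades.values():
            if v in info:
                for u, w in parent_probs(info, v, f).items():
                    if u not in parents[v]:
                        W[u] = W.get(u, 0.0) + w
        return W

    edges = []
    while len(edges) < k:
        scored = [(req(v), v) for v in nodes if weights(v)]
        if not scored:
            break                 # nothing left to explain
        _, v = max(scored)        # highest Req(), ties broken arbitrarily
        W = weights(v)
        u = max(W, key=W.get)     # argmax_u W(u, v)
        parents[v].add(u)
        edges.append((u, v))
    return edges
```

On cascades generated over the chain a → b → c with exponential incubation times, the sketch recovers both ground-truth edges, since the temporally closer parent receives the larger L_i weight.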

B. Algorithms for Online Network Inference

We now present the modifications to the offline algorithm when the information for a batch B_j arrives. The modifications are


such that the result of the online computation is essentially the same as the offline computation. The online computation is designed to save on discarded decisions. As before, G^I will denote the current inferred graph. With each edge in G^I, we associate the Req() value of the child node at the time the edge was included in G^I. We remember the Req() value of the last node picked for explanation before the arrival of the jth batch and denote it by ReqBar.

Consider the information C^j that arrives in epoch j. If there is a node v that appears in some cascade in C^j but has never appeared in any cascade before B_j, then it just naturally gets included in the computation, as Req() for that node is defined only after its arrival.

If a node v appears in some cascade C_i during the jth batch and it has a parent in G^I, then its parent information needs to be evaluated again, as f(t_v^i | t_u^i; α) is now defined for its parents in C_i. So, our strategy is as follows. We discard all the parents of any node v that satisfies the condition that v appears in some cascade in C^j and P^I(v) ≠ ∅. After that, we pick edges exactly in the same greedy manner as the offline Algorithm 1 until the Req() of the last node selected is below ReqBar. When we stop, if there are more than k edges, then we just select the top k edges based on the Req() value at which they were picked. The formal algorithm is given in Algorithm 2 and referred to as SIMNoGA (online Simulation of NOGA).

1  G^I = H_{j-1}, and for all v ∈ V, Req(v) is as per definition at the end of B_{j-1};
2  Information maintained: for every edge (u, v) included in G^I, we associate the Req(v) at the time of including (u, v). Further, ReqBar denotes the least Req() value associated with the edges in G^I;
3  Let RM denote the set of all nodes v such that v appears in C^j and P^I(v) ≠ ∅. Set P^I(v) = ∅ for all v ∈ RM (G^I is correspondingly changed);
4  Update Req() for nodes in RM and new nodes in C^j; LastReq = 1;
   repeat
5      Pick the vertex v with the highest Req(v), breaking ties arbitrarily;
6      LastReq = Req(v);
7      For u ∈ PP(v) \ P^I(v), set W(u, v) = 0;
8      for u ∈ PP(v) \ P^I(v) do
9          for C_i ∈ C(u, v) do
10             update W(u, v) as W(u, v) += L_i(u, v);
11         end
12     end
13     Let u = argmax_{u' ∈ PP(v) \ P^I(v)} W(u', v);
14     Add u to P^I(v) and (u, v) to G^I;
15     Update Req(v) as per definition;
16 until ((|G^I| ≥ k) ∧ (LastReq ≤ ReqBar));
17 Retain the top k edges in G^I based on the Req() value of the child node at the time they were included;
18 Output G^I as H_j;

Algorithm 2: Online update when batch j arrives (SIMNoGA).
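The invalidation step (Step 3) is the crux of the online update: only nodes touched by the new batch can have stale inferences. A minimal sketch with illustrative names not taken from the paper's code (`inferred_parents` maps v → P^I(v), `batch` maps cascade id → {node: infection time}):

```python
def discard_invalidated(inferred_parents, batch):
    """Step 3 of SIMNoGA: any node v that appears in some cascade of the
    new batch C^j and already has inferred parents must have those
    inferences discarded, since its parent likelihoods may change.
    Mutates inferred_parents and returns the set RM of affected nodes."""
    updated = {v for info in batch.values() for v in info}
    rm = {v for v in updated if inferred_parents.get(v)}
    for v in rm:
        inferred_parents[v] = set()   # drop all (u, v) edges from G^I
    return rm
```

Nodes that appear in the batch but have no inferred parents yet (new arrivals) are left untouched; the greedy loop then re-explains exactly the nodes in RM.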

It can be easily seen that the result of the online computation H_j at the end of B_j, for all j ≥ 1, is exactly the same as the network inferred if the offline Algorithm 1 were given the entire information up to B_j, i.e., ∪_{j' ≤ j} C^{j'}, as its input.

The advantage of SIMNoGA is that it simulates NOGA exactly. However, it discards as many inferences as needed in order to be faithful to the offline algorithm; this could potentially be all the inferences so far if the new epoch contains infections for all nodes. We now present another online algorithm designed to reduce the number of discarded inferences. However, its output could differ from that of the offline algorithm. It can be used in settings where the response upon each update has to be near real time, at the cost of accuracy.

Instead of deleting P^I(v) for v ∈ RM (as in Step 3 of Algorithm 2), we do the following. For all the nodes that appear in C^j, even if they have appeared in G^I, we set their Req() to 1. We then greedily pick nodes for explanation and their parents (as in Steps 5 to 16 of Algorithm 2). We do this until the Req() value for all the nodes falls below 1. After this, we execute Step 17 to narrow down H_j and move on to the next batch. We call this the ONHEUS (Online Heuristic) algorithm.

IV. EMPIRICAL EVALUATION

In this section, we present the details of the experiments we have conducted for evaluating the efficacy of our approach.

The NETINF algorithm [6] is the best algorithm for solving the offline network inference problem in the continuous-time setting. We take NETINF to be the "gold standard" for accuracy. We conduct experiments to show that: (i) NOGA is comparable to NETINF in accuracy, (ii) SIMNoGA simulates NOGA exactly while being faster than NOGA and NETINF, and (iii) ONHEUS provides significant savings in runtime compared to SIMNoGA. We also study the trade-off between accuracy and runtime between SIMNoGA and ONHEUS.

A. Details of Datasets

We use all the major synthetic datasets used in evaluating NETINF [6], generated using the same code that has been made available online¹. The details of the different synthetic datasets follow:

• Ground Truth: Experiments are run on two classes of synthetic networks which model the structure of directed social networks: Kronecker [12] and Forest Fire [14]. In the Kronecker class of networks, we consider three types of networks: a core-periphery network [15] with parameter matrix [0.9, 0.5; 0.5, 0.3], a hierarchical network [3] with parameter matrix [0.9, 0.1; 0.1, 0.9], and an Erdős–Rényi network [5] with parameter matrix [0.5, 0.5; 0.5, 0.5]. All generated graphs have 512 nodes and 1024 edges.

• Cascade Generation: We generated cascades according to the generative model described in Section II-A. We generated the incubation times using Weibull [23] and Power-law distributions. The Weibull distribution has a shape parameter k and reduces to the Exponential distribution when k = 1 and the Rayleigh distribution when k = 2. We have used different values of β, but present results for β = 0.2 and β = 0.5, which are in the range used in [6].

• Number of Cascades: We generated cascades such that, for at least 95% of the edges, the infection crosses them in at least one cascade.

         | Weibull:k=1   | Weibull:k=1.6 | Weibull:k=2   | Power Law
         | NOGA  NETINF  | NOGA  NETINF  | NOGA  NETINF  | NOGA  NETINF
    CP   | 0.80  0.82    | 0.76  0.80    | 0.77  0.85    | 0.42  0.81
    HI   | 0.97  0.96    | 0.99  0.99    | 0.99  0.99    | 0.7?  0.99
    RND  | 0.88  0.92    | 0.90  0.93    | 0.88  0.91    | 0.66  0.89
    FF   | 0.84  0.84    | 0.84  0.84    | 0.83  0.82    | 0.72  0.77

TABLE I. COMPARISON OF BREAK-EVEN POINTS FOR NETINF AND NOGA FOR β = 0.5

¹ http://snap.stanford.edu/netinf/

1) Real Dataset: Digg² is a social news website with several users linked by follower/followee relationships, where each user can upvote a story. The dataset made available by Lerman and Ghosh [11] has about 1 million users and 3500 stories, collected over a period of one month in 2009. Each story corresponds to a cascade in our model, where users get influenced by the activity of their friends to upvote. We construct a ground-truth network of 800 nodes and 5000 edges using the data of 2000 cascades, by considering the neighborhoods of only those nodes which satisfy the sample complexity lower bound presented in [17] on the minimum number of observations required to reliably reconstruct neighborhoods.

B. Empirical Results

We now present the results of our experiments from the viewpoint of the objectives of the empirical evaluation mentioned in the beginning of this section.

NOGA Versus NETINF Study: We use the Break Even Point (BEP) to measure the accuracy of the algorithms. BEP is the point at which precision equals recall; the higher the break-even point, the better the accuracy. In Figure 1, we also present example Precision-Recall curves for NETINF and NOGA. In the figure, the continuous line corresponds to NOGA and the dotted line corresponds to NETINF. From Figure 1, Table I, and Table II, we can see that the BEPs of NETINF and NOGA are close to each other. NETINF is more consistent, and NOGA performs poorly on the Power-law incubation distribution when the transmission probability is also high. This is due to the following facts: (i) in this setting, the cascades are quite long both in terms of time and number of nodes, and (ii) since the Power-law has considerable mass even at long incubation times, the Req() values do not improve as smoothly as they do in exponential families. One could remedy this by using an alternative Req() formulation for this setting, but we have decided to use the same formulation for consistency. This does highlight an area of improvement for our approach. Table III and Table IV present the running times of NETINF and NOGA for different β settings. We see that NOGA in most cases takes at most half the time of NETINF, sometimes even less. The first and fourth rows of Table V show the relative performance of the two approaches on the Digg dataset described in Section IV-A. We observe the same pattern of accuracy and runtime performance.
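Since precision equals recall exactly when the number of predicted edges equals |E*|, the BEP of a ranked edge list reduces to precision at the top-|E*| prefix. A small sketch of this computation (our own helper, assuming the inference algorithm outputs edges in ranked order):

```python
def break_even_point(ranked_edges, true_edges):
    """Break Even Point: with n predictions, precision = hits/n and
    recall = hits/|E*|, so they coincide at n = |E*|. The BEP is therefore
    the fraction of correct edges among the top-|E*| ranked predictions."""
    k = len(true_edges)
    top = ranked_edges[:k]
    hits = sum(1 for e in top if e in true_edges)
    return hits / k
```

For instance, if 3 of the top 4 predicted edges are correct against a 4-edge ground truth, the BEP is 0.75.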

NOGA Versus SIMNOGA Study: We show that the outputs of NOGA and SIMNOGA are the same; we have verified this


Fig. 1. Precision-Recall plots for (a) Core-Periphery, (b) Hierarchical, and (c) Random networks with the Weibull incubation time model (β = 0.2, k = 1.6; β = 0.5, k = 1; β = 0.5, k = 2).

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          NOGA    NETINF    NOGA    NETINF    NOGA    NETINF    NOGA    NETINF
CP        0.91    0.91      0.92    0.93      0.93    0.94      0.88    0.93
HI        0.97    0.97      0.98    0.98      0.97    0.97      0.96    0.96
RND       0.92    0.92      0.93    0.93      0.91    0.92      0.90    0.91
FF        0.90    0.89      0.82    0.81      0.85    0.84      0.80    0.85

TABLE II. COMPARISON OF BREAK EVEN POINTS FOR NETINF AND NOGA FOR β = 0.2

in every experiment. To study the gain in performance, we repeated all the experiments as follows: we divided the cascade information into ten epochs, picking the epoch intervals uniformly, to mimic incremental updates of cascades. At the end of each epoch, we ran NOGA and added up the total running time. For SIMNOGA, we added the time it takes between successive epochs. Tables VI and VII show the gain in running time without any loss of accuracy. The savings are at least 50% and can be as much as 80%.
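The timing protocol above can be sketched as follows; `infer` and `update` are hypothetical placeholders for full NOGA inference and the incremental SIMNOGA update, respectively, and the epoch split is a plain partition of the cascade list.

```python
import time

def split_into_epochs(cascades, n_epochs):
    """Partition the cascade list into n_epochs roughly equal batches,
    mimicking cascade information arriving in dynamic updates."""
    size = -(-len(cascades) // n_epochs)  # ceiling division
    return [cascades[i:i + size] for i in range(0, len(cascades), size)]

def total_offline_time(epochs, infer):
    """NOGA-style protocol: after each epoch, re-run inference on all
    cascades seen so far and sum the running times."""
    total, seen = 0.0, []
    for batch in epochs:
        seen.extend(batch)
        start = time.perf_counter()
        infer(seen)                      # full re-inference from scratch
        total += time.perf_counter() - start
    return total

def total_incremental_time(epochs, update):
    """SIMNOGA-style protocol: only the work done between successive
    epochs (processing the new batch) is counted."""
    total, state = 0.0, None
    for batch in epochs:
        start = time.perf_counter()
        state = update(state, batch)     # incremental update only
        total += time.perf_counter() - start
    return total
```

The offline total grows with the cumulative number of cascades re-processed at each epoch, while the incremental total depends only on the per-batch work, which is the source of the observed savings.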

SIMNOGA Versus ONHEUS Study: We studied the relative accuracy and computational gains of SIMNOGA and ONHEUS on the same set of four epochs of data used for the NOGA versus SIMNOGA comparison. Tables VIII and IX show the gain in run time and the loss in accuracy for ONHEUS in comparison to SIMNOGA when β = 0.5. Tables X and XI show the same when β = 0.2. We see that whenever ONHEUS saves run time by discarding inference, its accuracy suffers.

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          NOGA    NETINF    NOGA    NETINF    NOGA    NETINF    NOGA    NETINF
CP        1m 35s  3m 20s    1m 32s  3m 10s    1m 40s  3m 25s    1m 2s   2m 40s
HI        3.69s   6.95s     4.18s   8.80s     2.84s   4.74s     2.51s   4.89s
RND       50s     3m 37s    1m 3s   4m 40s    1m 31s  5m 57s    30s     3m 32s
FF        2.60s   2.48s     1.61s   1.59s     1.37s   1.41s     2.48s   2.52s

TABLE III. RUNNING TIME COMPARISON FOR NETINF AND NOGA FOR β = 0.5

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          NOGA    NETINF    NOGA    NETINF    NOGA    NETINF    NOGA    NETINF
CP        2.64s   10.85s    3.05s   11.64s    2.74s   11.54s    2.38s   11.18s
HI        1.18s   0.55s     1.11s   0.62s     1.24s   0.97s     1.1s    0.51s
RND       1.36s   1.01s     1.45s   1.11s     0.44s   0.95s     1.33s   1.08s
FF        0.84s   0.73s     1.17s   1.18s     1.4s    1.8s      0.92s   1.0s

TABLE IV. RUNNING TIME COMPARISON FOR NETINF AND NOGA FOR β = 0.2


          Break-Even Point    Run Time
SIMNOGA   0.608               41m
ONHEUS    0.475               26m
NETINF    0.578               48m

TABLE V. COMPARISON OF PRECISION, RECALL AND RUNNING TIMES ON DIGG DATASET

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          NOGA    SIMNOGA   NOGA    SIMNOGA   NOGA    SIMNOGA   NOGA    SIMNOGA
CP        10.56s  4.87s     12.2s   5.77s     10.96s  2.94s     9.52s   3.82s
HI        4.72s   1.20s     5.24s   1.32s     4.96s   1.26s     4.4s    1.85s
RND       5.44s   1.39s     5.8s    1.48s     5.76s   1.47s     5.32s   2.22s
FF        -       0.86s     5.48s   1.95s     5.6s    1.45s     3.68s   1.64s

TABLE VI. RUNNING TIME COMPARISON OF NOGA AND SIMNOGA FOR β = 0.2

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          NOGA    SIMNOGA   NOGA    SIMNOGA   NOGA    SIMNOGA   NOGA    SIMNOGA
CP        6m 20s  1m 33s    6m 8s   1m 34s    6m 20s  1m 37s    5m 4s   1m 30s
HI        -       3.5s      16.72s  5.68s     14.76s  3.88s     10.04s  4.89s
RND       6m 4s   1m 37s    4m 12s  1m 10s    3m 20s  53.27s    2m      36.85s
FF        10.4s   2.81s     6.44s   2.37s     5.48s   1.42s     9.92s   3.46s

TABLE VII. RUNNING TIME COMPARISON OF NOGA AND SIMNOGA FOR β = 0.5

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS
CP        0.80    0.76      0.76    0.73      0.76    0.73      0.43    0.48
HI        -       0.97      0.99    0.99      0.99    0.99      0.92    0.70
RND       0.88    0.86      0.90    0.88      0.88    0.86      0.66    0.61
FF        0.84    0.83      0.84    0.81      0.83    0.83      0.72    0.71

TABLE VIII. COMPARISON OF BREAK EVEN POINTS FOR SIMNOGA AND ONHEUS FOR β = 0.5

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS
CP        100.0s  85.0s     94.0s   79.0s     97.0s   81s       90.0s   82.0s
HI        3.50s   2.78s     7.43s   5.90s     4.23s   3.74s     4.89s   2.46s
RND       97.0s   84.0s     70.0s   57.16s    53.27s  44.24s    36.85s  35.72s

TABLE IX. RUNNING TIME COMPARISON OF SIMNOGA VS. ONHEUS FOR β = 0.5

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS
CP        0.91    0.91      0.92    0.91      0.93    0.92      0.88    0.82
HI        0.97    0.90      0.98    0.96      0.97    0.97      0.96    0.62
RND       0.92    -         0.93    0.93      0.91    0.90      0.90    0.86

TABLE X. COMPARISON OF BREAK EVEN POINTS FOR SIMNOGA AND ONHEUS FOR β = 0.2

          Weibull k=1       Weibull k=1.6     Weibull k=2       Power Law
          SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS    SIMNOGA ONHEUS
CP        4.87s   2.34s     5.77s   4.54s     3.68s   2.88s     3.82s   3.22s
HI        1.20s   0.80s     1.18s   0.75s     1.27s   0.89s     1.85s   0.94s
RND       1.39s   1.13s     1.41s   0.91s     1.40s   0.90s     2.22s   0.93s

TABLE XI. RUNNING TIME COMPARISON OF SIMNOGA VS. ONHEUS FOR β = 0.2

REFERENCES

[1] E. Adar and L. Adamic. Tracking information epidemics in blogspace. In Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on, pages 207-214. IEEE, 2005.

[2] A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207-211, 2005.

[3] A. Clauset, C. Moore, and M. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98-101, 2008.

[4] N. Du, L. Song, A. Smola, and M. Yuan. Learning networks of heterogeneous influence. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 2789-2797, 2012.

[5] P. Erdős and A. Rényi. On the evolution of random graphs. In Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17-61, 1960.

[6] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):21, 2012.

[7] M. Gomez-Rodriguez and B. Schölkopf. Submodular inference of diffusion networks from multiple trees. In Proceedings of the International Conference on Machine Learning (ICML), 2012.

[8] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137-146. ACM, 2003.

[9] W. Kermack and A. McKendrick. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London, 115:700-721, 1927.

[10] T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila. Finding effectors in social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1059-1068. ACM, 2010.

[11] K. Lerman and R. Ghosh. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. In Proceedings of the Fourth International Conference on Weblogs and Social Media (ICWSM), 2010.

[12] J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using Kronecker multiplication. In ICML, pages 497-504, 2007.

[13] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Patterns of cascading behavior in large blog graphs. In Proceedings of the SIAM Conference on Data Mining (SDM), 2007.

[14] J. Leskovec, J. M. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, pages 177-187, 2005.

[15] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In WWW, pages 695-704, 2008.

[16] S. Myers and J. Leskovec. On the convexity of latent social network inference. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2010.

[17] P. Netrapalli and S. Sanghavi. Learning the graph of epidemic cascades. In ACM SIGMETRICS Performance Evaluation Review, pages 211-222, 2012.

[18] M. Gomez-Rodriguez, J. Leskovec, and B. Schölkopf. Structure and dynamics of information pathways in online media. In Proceedings of the sixth ACM international conference on Web search and data mining (WSDM), pages 23-32, 2013.

[19] T. Snowsill, N. Fyson, T. De Bie, and N. Cristianini. Refining causality: who copied from whom? In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pages 466-474, 2011.

[20] J. Wallinga and P. Teunis. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology, 160(6):509-516, 2004.

[21] L. Wang, S. Ermon, and J. Hopcroft. Feature-enhanced probabilistic models for diffusion network inference. In ECML/PKDD, pages 499-514, 2012.

[22] D. Watts and P. Dodds. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4):441-458, 2007.

[23] W. Weibull. A statistical distribution function of wide applicability. Journal of Applied Mechanics, 18(3):293-297, 1951.