odeN: Simultaneous Approximation of Multiple Motif Counts in Large Temporal Networks
Ilie Sarpe
Department of Information Engineering, University of Padova, Padova, Italy
Fabio Vandin
Department of Information Engineering, University of Padova, Padova, Italy
ABSTRACT
Counting the number of occurrences of small connected subgraphs, called temporal motifs, has become a fundamental primitive for the analysis of temporal networks, whose edges are annotated with the time of the event they represent. One of the main complications in studying temporal motifs is the large number of motifs that can be built even with a limited number of vertices or edges. As a consequence, since in many applications motifs are employed for exploratory analyses, the user needs to iteratively select and analyze several motifs that represent different aspects of the network, resulting in an inefficient, time-consuming process. This problem is exacerbated in large networks, where the analysis of even a single motif is computationally demanding. As a solution, in this work we propose and study the problem of simultaneously counting the number of occurrences of multiple temporal motifs, all corresponding to the same (static) topology (e.g., a triangle). Given that for large temporal networks computing the exact counts is unfeasible, we propose odeN, a sampling-based algorithm that provides an accurate approximation of all the counts of the motifs. We provide analytical bounds on the number of samples required by odeN to compute rigorous, probabilistic, relative approximations. Our extensive experimental evaluation shows that odeN enables the approximation of the counts of motifs in temporal networks in a fraction of the time needed by state-of-the-art methods, and that it also reports more accurate approximations than such methods.
CCS CONCEPTS
• Mathematics of computing → Probabilistic algorithms; • Theory of computation → Graph algorithms analysis.
KEYWORDS
temporal motifs, sampling algorithm, temporal networks, randomized algorithm
ACM Reference Format:
Ilie Sarpe and Fabio Vandin. 2021. odeN: Simultaneous Approximation of Multiple Motif Counts in Large Temporal Networks. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM '21), November 1-5, 2021, Virtual Event, QLD, Australia. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3459637.3482459
CIKM '21, November 1-5, 2021, Virtual Event, QLD, Australia. © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM '21), November 1-5, 2021, Virtual Event, QLD, Australia, https://doi.org/10.1145/3459637.3482459.
1 INTRODUCTION
Networks are ubiquitous representations that model a wide range of real-world systems, such as social networks [9], citation networks [10], biological systems [12], and many others [32]. One of the most fundamental primitives in network analysis is the mining of motifs [30, 31, 42] (or graphlets [7, 36]), which requires counting the occurrences of small connected subgraphs of $k$ nodes. Motifs represent key building blocks of networks, and they provide useful insights in a wide range of applications such as network classification [29, 43], network clustering [3], and community detection [2].
Modern networks contain rich information about their edges or nodes [8, 20, 39, 50] in addition to graph structure. One of the most important pieces of information is the time at which the interactions, represented by edges, occur. Networks for which such information is available are called temporal [15, 16]; novel insights about the underlying dynamics of the systems can be uncovered by the analysis of such networks [22-24]. In recent years, many primitives [17, 21, 34, 41] have been proposed as counterparts, in temporal networks, to the study of subgraph patterns for nontemporal networks, with each primitive capturing different temporal aspects of a network. One of the most important such primitives is the study of temporal motifs [34]. Temporal motifs are small connected subgraphs with $k$ nodes and $\ell$ edges occurring with a prescribed order within a time interval of duration $\delta$. Temporal motifs describe the patterns shaping interactions over the network, e.g., networks from similar domains tend to have similar temporal motif counts [34], and their analysis is useful in many applications, e.g., anomaly detection [4], network classification [45], and social networks [6].
The temporal dimension poses several challenges in the analysis of motifs. A major challenge is the large number of temporal motifs that can be built even with a limited number of vertices and edges. For example, even considering directed (and connected) temporal motifs with only 3 vertices and 3 edges, there are 32 such motifs. In several domains, when motifs are studied in the exploratory analysis of a temporal network, it is almost impossible for the data analyst to know a priori which motif is the most interesting and useful. In social networks, a set of 3 vertices represents the smallest nontrivial community, and different temporal motifs with 3 vertices describe different patterns of interactions in such a community. Hence, studying all such motifs can provide novel insights on the interactions within such communities. In network classification, considering the counts of all the 32 motifs with 3 vertices and 3 edges leads to models with improved accuracy [45].
However, since state-of-the-art approaches for general temporal motifs only allow the analysis of one motif at a time, the user needs to iteratively select and analyze the various motifs, resulting in an inefficient and time-consuming process, in particular for large networks.
arXiv:2108.08734v1 [cs.SI] 19 Aug 2021
In this paper, we define and study the problem of simultaneously counting the occurrences of various temporal motifs. In particular, we consider all motifs corresponding to the same static target template (e.g., all triangles, see Fig. 1a). This problem is extremely challenging, since computing the count of even a single temporal motif is NP-Hard in general [26], with existing state-of-the-art approaches having complexity exponential in the number of edges of the motif to obtain even a single motif's count [26, 40, 47].
The task of counting temporal motifs is hindered by the sheer size of modern datasets and, therefore, scalable techniques are needed to deal with such amounts of data. Since exact approaches [13, 27, 34] are impractical, rigorous and efficient approximation algorithms providing tight guarantees are needed. In this work we develop odeN, a sampling algorithm that provides a high quality approximation for the problem of counting multiple temporal motifs with the same static topology. Our main contributions are as follows:
• We propose the motif template counting problem, where, given a temporal network, a $k$-node target template graph $H$, the number $\ell$ of edges of each temporal motif, and a bound $\delta$ on the duration of the temporal motifs, the problem requires outputting all the counts of the temporal motifs whose static topology corresponds to $H$ and that have exactly $\ell$ temporal edges, occurring within $\delta$-time.
• We propose odeN, a randomized sampling algorithm providing a high quality approximation for the motif template counting problem. odeN's approach is to sample a set of motif occurrences, ensuring that they all share the same static topology $H$. Thus, odeN takes advantage of the constraint that all motifs must share a common target template $H$, aggregating the computation of all motif counts in a sample. odeN's approximation, as in other data mining applications, is controlled by two parameters $\varepsilon, \eta$, which control respectively the quality and the confidence of the approximations.
• We show a tight and efficiently computable bound on the number of samples required by odeN for the approximation to be within $\varepsilon$ error with confidence $> 1 - \eta$ for all temporal motifs' counts.
• We perform large scale experiments using datasets with up to billions of temporal edges, showing that odeN requires a fraction of the time required by state-of-the-art approximation algorithms for single motif counts, and that it reports sharper estimates. We then provide a parallel implementation of odeN displaying almost linear speedup in many configurations. We also show how odeN provides novel insights on the dynamics of a real-world temporal network.
2 PRELIMINARIES
In this section we introduce the basic notions that we will use throughout the work, and we define the computational problem of counting multiple temporal motifs sharing a common target template graph. We start by defining temporal networks.
Definition 2.1. A temporal network is a pair $T = (V, E)$ where $V = \{v_1, \dots, v_n\}$ and $E = \{(x, y, t) : x, y \in V, x \neq y, t \in \mathbb{R}^+\}$ with $|V| = n$ and $|E| = m$.
[Figure 1 image. Panel (a): a temporal network $T$ with timestamped edges and a target template $H$. Panel (b): a temporal motif on nodes $v_1, v_2, v_3$ with ordering $\sigma = \langle (v_1, v_2), (v_3, v_1), (v_2, v_3) \rangle$ and timestamps $t_1, t_2, t_3$. Panel (c): four candidate edge sequences among nodes 2, 5, and 6.]
Figure 1: (1a): Motif template counting problem overview: given a temporal network and a (static) target template, compute the counts of all temporal motifs that map on the template. (1b): Temporal motif, with $k = 3$, $\ell = 3$, and its ordering $\sigma$. (1c): Sequences of edges of the network in (1a) among nodes $\{2, 5, 6\}$ that map topologically on the motif in (1b). For $\delta = 15$ only the green sequence is a $\delta$-instance of the motif, since the timestamps respect $\sigma$ and $t'_\ell - t'_1 = 20 - 6 \le \delta$. The red sequences are not $\delta$-instances, since they do not respect such constraint or do not respect the ordering $\sigma$.
Given $(x, y, t) \in E$, we say that $t$ is the timestamp of the directed edge $(x, y)$. Given a temporal network $T$, by ignoring the timestamps of its edges we obtain the associated undirected projected static network, defined as follows.
Definition 2.2. The undirected projected static network of a temporal network $T = (V, E)$ is the pair $G_T = (V, E_T)$ that is an undirected network, such that $E_T = \{\{x, y\} : (x, y, t) \in E\}$.
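As a minimal sketch of Definition 2.2, assuming temporal edges are stored as `(x, y, t)` tuples (the function name and the toy edges below are illustrative and not taken from the paper), the projection can be computed in one pass while also grouping the temporal edges by the static edge they map on:

```python
from collections import defaultdict

def undirected_static_projection(temporal_edges):
    """Return (E_T, buckets): E_T is the set of undirected static edges and
    buckets maps each static edge {x, y} to the temporal edges that project on it."""
    buckets = defaultdict(list)
    for x, y, t in temporal_edges:
        # (x, y, t) and (y, x, t') both project on the same undirected edge {x, y}
        buckets[frozenset((x, y))].append((x, y, t))
    return set(buckets), dict(buckets)

# Hypothetical toy network, used only for illustration.
edges = [(1, 2, 3), (2, 1, 8), (2, 5, 6), (5, 6, 20)]
static_edges, buckets = undirected_static_projection(edges)
```

Keeping the buckets around is a convenience of this sketch: it makes it cheap to later collect all temporal edges that lie on a given static subgraph.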
We will often use the term static network to denote a network whose edges are without timestamps. Next we introduce the definition of temporal motifs as defined by Paranjape et al. [34], which are small, connected subgraphs representing patterns of interest.
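Definition 2.3. A $k$-node $\ell$-edge temporal motif $M$ is a pair $M = (\mathcal{K}, \sigma)$ where $\mathcal{K} = (V_\mathcal{K}, E_\mathcal{K})$ is a directed and weakly connected multigraph where $V_\mathcal{K} = \{v_1, \dots, v_k\}$, $E_\mathcal{K} = \{(x, y) : x, y \in V_\mathcal{K}, x \neq y\}$ s.t. $|V_\mathcal{K}| = k$ and $|E_\mathcal{K}| = \ell$, and $\sigma$ is an ordering of $E_\mathcal{K}$.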
Note that a $k$-node $\ell$-edge temporal motif $M = (\mathcal{K}, \sigma)$ is also identified by the sequence $\langle (x_1, y_1), \dots, (x_\ell, y_\ell) \rangle$ of edges ordered according to $\sigma$; we will often use such representation for a motif $M$ (see Fig. 1b for an example). Given a $k$-node $\ell$-edge temporal motif $M$, the values of $k$ and $\ell$ are determined by $V_\mathcal{K}$ and $E_\mathcal{K}$. We will therefore use the term temporal motif, or simply motif, when $k$ and $\ell$ are clear from context. Given a temporal motif $M = ((V_\mathcal{K}, E_\mathcal{K}), \sigma)$, we denote with $G_u[M]$ the undirected graph corresponding to the underlying undirected graph structure of the multigraph $\mathcal{K}$ of $M$, that is $G_u[M] = (V_\mathcal{K}, E^u_M)$ where $E^u_M = \{\{x, y\} : (x, y) \lor (y, x) \in E_\mathcal{K}\}$ (i.e., $E^u_M$ is the set of undirected edges associated to the multiset $E_\mathcal{K}$). Notice that directed edges of the form $(x, y), (y, x)$ as well as multiple directed edges $(x, y), (x, y), \dots$ from $E_\mathcal{K}$ are represented by the same undirected edge $\{x, y\}$ in $E^u_M$.
For a fixed temporal motif $M$, we are interested in identifying its realizations in $T$ appearing within at most $\delta$-time duration, as captured by the following definition.
Definition 2.4. Given a temporal network $T = (V, E)$ and $\delta \in \mathbb{R}^+$, a time ordered sequence $S = \langle (x'_1, y'_1, t'_1), \dots, (x'_\ell, y'_\ell, t'_\ell) \rangle$ of $\ell$ unique temporal edges from $T$ is a $\delta$-instance of the temporal motif $M = \langle (x_1, y_1), \dots, (x_\ell, y_\ell) \rangle$ if:
(1) there exists a bijection $f$ on the vertices such that $f(x'_i) = x_i$ and $f(y'_i) = y_i$, $i = 1, \dots, \ell$; and
(2) the edges of $S$ occur within $\delta$ time, i.e., $t'_\ell - t'_1 \le \delta$.
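The two conditions of Definition 2.4 can be checked directly on a candidate sequence. The sketch below is illustrative only: the function name and the toy edges are hypothetical, and it assumes that "time ordered" means strictly increasing timestamps, which the definition above does not state explicitly.

```python
def is_delta_instance(seq, motif, delta):
    """seq: list of temporal edges (x, y, t); motif: list of directed edges (x, y)
    listed in the order given by sigma. Returns True if seq is a delta-instance."""
    if len(seq) != len(motif) or len(set(seq)) != len(seq):
        return False  # need exactly ell unique temporal edges
    times = [t for _, _, t in seq]
    # Assumption of this sketch: timestamps must be strictly increasing.
    if any(a >= b for a, b in zip(times, times[1:])):
        return False
    if times[-1] - times[0] > delta:
        return False  # condition (2): all edges within delta time
    fwd, bwd = {}, {}  # network node -> motif node, and its inverse
    for (x, y, _), (mx, my) in zip(seq, motif):
        for u, mu in ((x, mx), (y, my)):
            # condition (1): the node mapping must be a consistent bijection
            if fwd.setdefault(u, mu) != mu or bwd.setdefault(mu, u) != u:
                return False
    return True

# Motif of Fig. 1b written with symbolic node names.
motif = [("v1", "v2"), ("v3", "v1"), ("v2", "v3")]
# Hypothetical sequences among nodes 2, 5, 6 (not the exact edges of Fig. 1c).
good = [(2, 6, 6), (5, 2, 9), (6, 5, 20)]
wrong_order = [(2, 6, 6), (6, 5, 9), (5, 2, 20)]
```

With `delta = 15` the first sequence satisfies both conditions, while the second has the right topology but violates the ordering.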
Exploring different values of $\delta$ in the above definition often leads to different insights on the temporal network that may be discovered through the analysis of the motifs [1, 15, 21, 33]. Note that in a $\delta$-instance of the temporal motif $M = (\mathcal{K}, \sigma)$ the edge timestamps must be sorted according to the ordering $\sigma$ (see Fig. 1c for an example). In fact, $\sigma$ plays a key role in defining a temporal motif, with different orderings of the same multigraph $\mathcal{K}$ reflecting diverse dynamic properties captured by the motif.
For a given directed multigraph $\mathcal{K}$ with $|E_\mathcal{K}| = \ell$ edges, in general not all the $\ell!$ orderings of its edges define distinct temporal motifs. We therefore introduce the following equivalence relation.
Definition 2.5. Let $M_1$ and $M_2$ be two temporal motifs. Let $M_1 = \langle (x^1_1, y^1_1), \dots, (x^1_\ell, y^1_\ell) \rangle$ and $M_2 = \langle (x^2_1, y^2_1), \dots, (x^2_\ell, y^2_\ell) \rangle$ be the sequences of edges of $M_1$ and $M_2$, respectively. We say that $M_1$ and $M_2$ are not distinct (denoted with $M_1 \cong_\sigma M_2$) if there exists a bijection $g$ on the vertices such that $g(x^1_i) = x^2_i$ and $g(y^1_i) = y^2_i$, $i = 1, \dots, \ell$.
We provide an example of the definition above in Figure 2. Given two networks (undirected or temporal) $G, G'$ we say that $G' = (V', E')$ is a subgraph of $G = (V, E)$ (denoted with $G' \subseteq G$) if $V' \subseteq V$ and $E' \subseteq E$. Note that we require a subgraph to be edge induced. To conclude the preliminary notions, we recall the definition of static graph isomorphism.
Definition 2.6. Given two graphs $G = (V_G, E_G)$ and $H = (V_H, E_H)$ we say that the two graphs are isomorphic, denoted with $G \sim H$, if and only if there exists a bijection $f : V_G \mapsto V_H$ on the vertices such that $e = (u, v) \in E_G \iff e' = (f(u), f(v)) \in E_H$.
Let $\mathcal{U}(M, \delta) = \{I : I \text{ is a } \delta\text{-instance of } M\}$ be the set of (all) $\delta$-instances of the motif $M$ in $T$. The count of $M$ is $C_M(\delta) = |\mathcal{U}(M, \delta)|$, denoted with $C_M$ when $\delta$ is clear from the context.
Given a static undirected graph $H$, which we call the target template, we are interested in solving the problem of computing the number of $\delta$-instances of all temporal motifs with $\ell$ edges and all corresponding to the same static graph $H$. More formally, given the target template $H = (V_H, E_H)$, which is a simple and connected graph, and $\ell \ge |E_H|$, $\ell \in \mathbb{Z}^+$, let $\mathcal{M}(H, \ell)$ be the set of distinct temporal motifs with $\ell$ edges whose underlying undirected graph structure corresponds to $H$, that is $\mathcal{M}(H, \ell)$ contains motifs $M_i = ((V^i_\mathcal{K}, E^i_\mathcal{K}), \sigma_i)$, $i = 1, 2, \dots$, such that i) $G_u[M_i] \sim H$; ii) $|E^i_\mathcal{K}| = \ell$; and iii) $M_i \not\cong_\sigma M_j$, $\forall i \neq j$.
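[Figure 2 image. Left: motifs $M_1$ (on nodes $x, y, z$) and $M_2$ (on nodes $x', y', z'$) with edge timestamps $t_1, t_2, t_3$, related by $\cong_\sigma$. Right: motifs $M_1$ and $M_3$ (on nodes $x', y', z'$), related by $\not\cong_\sigma$.]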
Figure 2: (Left): The two motifs are not distinct: let $M_1 = \langle (y, x), (y, z), (x, z) \rangle$ and $M_2 = \langle (x', z'), (x', y'), (z', y') \rangle$ be the sequences corresponding to $M_1$ and $M_2$, then the function $g : V^1_\mathcal{K} \mapsto V^2_\mathcal{K}$ defined by $g(x) = z'$, $g(y) = x'$, $g(z) = y'$ preserves both the topology and the ordering as from Definition 2.5. (Right): The two motifs are distinct since there is no map $g : V^1_\mathcal{K} \mapsto V^3_\mathcal{K}$ preserving both the topology and ordering.
Let us explain intuitively the constraints above. First, $H$ imposes a constraint on the undirected static topology that the temporal motifs of interest (which are directed subgraphs) should have. That is, it requires all the motifs to have the same underlying graph structure ($G_u[M]$), which must be isomorphic to $H$. This is a useful way to represent multiple related temporal motifs. For example, in social network analysis, by fixing $H$ as an undirected triangle we consider in $\mathcal{M}(H, \ell)$ all temporal motifs that characterize the communication between groups of three friends (i.e., each motif will represent a different form of communication among all such groups [34]). The second constraint requires each motif $M_i \in \mathcal{M}(H, \ell)$ to have exactly $\ell \ge |E_H|$ edges, with $\ell$ provided in input by the user. Fixing the parameter $\ell$ is motivated by the fact that motifs with different values of $\ell$ (even with the same target template structure $H$) reflect different patterns of interaction (e.g., a group of friends that exchanges $\ell = 3$ or $\ell = 4$ messages). As we will show empirically in Section 5.4, such counts vary significantly with $\ell$ for fixed $H$ and $\delta$. Finally, the third constraint ensures that we only count distinct motifs, i.e., motifs representing different patterns.
We now define the motif template counting problem.
Problem 1. Motif template counting problem. Given a temporal network $T$, a static undirected target graph $H = (V_H, E_H)$, $\ell \in \mathbb{Z}^+$, $\ell \ge |E_H|$, and a parameter $\delta \in \mathbb{R}^+$, find the counts $C_{M_i}(\delta)$ of motifs $M_i \in \mathcal{M}(H, \ell)$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$ in $T$.
We now provide an example of the different motifs to be counted for different values of $\ell$ with a fixed target template $H$.
Example 2.7. Let $H = (\{v_1, v_2\}, \{\{v_1, v_2\}\})$, that is, the target template is an edge. Let $e_1 = (v_1, v_2)$ and $e_2 = (v_2, v_1)$. By varying $\ell \in \{2, 3\}$ the motifs in $\mathcal{M}(H, \ell)$, for which we want to compute the counts, are: $M_1 = \langle e_1, e_1 \rangle$ and $M_2 = \langle e_1, e_2 \rangle$ for $\ell = 2$ (i.e., $|\mathcal{M}(H, 2)| = 2$) while $M_1 = \langle e_1, e_1, e_1 \rangle$, $M_2 = \langle e_1, e_2, e_1 \rangle$, $M_3 = \langle e_1, e_2, e_2 \rangle$, $M_4 = \langle e_1, e_1, e_2 \rangle$ for $\ell = 3$ (i.e., $|\mathcal{M}(H, 3)| = 4$).
Since solving the counting problem exactly is NP-Hard in general (see footnote 1) even for one single temporal motif, we aim at providing high-quality approximations to the motif counts as follows.
Problem 2. Motif template approximation problem. Given the input parameters of Problem 1 and additional parameters $\varepsilon \in \mathbb{R}^+$, $\eta \in (0, 1)$, compute approximations $C'_{M_i}(\delta)$ of counts $C_{M_i}(\delta)$ of motifs $M_i \in \mathcal{M}(H, \ell)$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, such that $\mathbb{P}[\exists i \in \{1, \dots, |\mathcal{M}(H, \ell)|\} : |C'_{M_i}(\delta) - C_{M_i}(\delta)| \ge \varepsilon C_{M_i}(\delta)] \le \eta$, that is $C'_{M_i}(\delta)$ is a relative $\varepsilon$-approximation to the count $C_{M_i}(\delta)$ with probability $\ge 1 - \eta$ for all $i = 1, \dots, |\mathcal{M}(H, \ell)|$ simultaneously.
Footnote 1: The hardness depends on the topology of the motif. For example, for triangles and single edges there exist polynomial-time algorithms, even if they are impracticable on very large networks. Interestingly, counting temporal star-shaped motifs is NP-Hard [26], while on static networks such motifs can be counted in polynomial time.
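The set $\mathcal{M}(H, \ell)$ that Problems 1 and 2 range over can be enumerated by brute force for small templates, which reproduces the sizes given in Example 2.7. The sketch below is not the procedure used by odeN (which avoids generating candidates); it simply lists all edge sequences on $H$ and keeps one canonical representative per class of Definition 2.5. The function name and the canonical form (lexicographic minimum over all vertex relabelings) are assumptions of this sketch.

```python
from itertools import permutations, product

def distinct_motifs(template_edges, ell):
    """Brute-force M(H, ell): template_edges lists the undirected edges of H
    as tuples of comparable node labels, ell is the number of motif edges."""
    nodes = sorted({v for e in template_edges for v in e})
    undirected = {frozenset(e) for e in template_edges}
    # Each undirected template edge can be used in both directions.
    directed = [(x, y) for e in template_edges for (x, y) in (e, e[::-1])]
    relabelings = [dict(zip(nodes, p)) for p in permutations(nodes)]
    canonical = set()
    for seq in product(directed, repeat=ell):
        if {frozenset(e) for e in seq} != undirected:
            continue  # G_u[M] must use every edge of H at least once
        # Two sequences are not distinct iff they share the same canonical form.
        canonical.add(min(tuple((r[x], r[y]) for x, y in seq) for r in relabelings))
    return sorted(canonical)

edge_template = [(1, 2)]
triangle_template = [(1, 2), (2, 3), (1, 3)]
```

The cost grows as $(2|E_H|)^\ell \cdot k!$, so this is only meant for checking small cases such as a single edge or a triangle.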
3 RELATED WORKS
Much work has been done on enumerating and approximating $k$-node motifs in (nontemporal) networks. We refer the interested reader to the surveys [38, 48]. However, such works cannot be easily adapted to temporal motifs since they do not properly account for the temporal information [15, 34]. Many different definitions of temporal networks and temporal patterns have been proposed: here we will focus only on those works that are relevant for our work; the interested reader may refer to [15, 16, 18, 28] for a more general overview.
Our work builds on the work of Paranjape et al. [34], which first introduced the definition of temporal motif used here and the problem of counting single temporal motifs. The authors provided a general algorithm for counting a single temporal motif by enumerating all the subsequences of edges that map on a single static subgraph. Their approach is not feasible on large datasets since it requires exhaustive enumeration of all subgraphs of the undirected projected static network $G_T$ that are isomorphic to the target template $H$. The authors also proposed efficient algorithms and data structures for counting 3-node 3-edge motifs, which may be used for the exact counting subroutines within odeN's sampling framework. In addition to the algorithmic contributions, the authors also showed that networks from similar domains tend to exhibit similar temporal motif counts. They also showed how motif counts can provide significant insights on the communication patterns in many networks, highlighting the importance of studying temporal motifs in temporal networks.
Other exact algorithms have been proposed for the problem of counting a single motif, or for slightly different problems. Mackey et al. [27] presented a backtracking algorithm for counting a single temporal motif that can be used for any motif. Boekhout et al. [6] developed exact algorithms for counting temporal motifs in multi-layer temporal networks (i.e., each edge is a tuple $(x, y, t, l)$ with $l$ denoting the layer of each edge); they also discuss efficient data structures for counting 4-node 4-edge motifs, which may also be adapted for the exact counting subroutines in our sampling framework odeN. Being exact, both such algorithms do not scale on massive datasets due to large time and memory requirements.
Several approximation algorithms have been proposed in recent years for estimating the count of a single motif. Liu et al. [26] proposed a temporal-partition based sampling approach. Wang et al. [47] introduced a sampling-based algorithm that selects temporal edges with a fixed probability specified by the user. Lastly, Sarpe and Vandin [40] proposed PRESTO, an algorithm based on uniform sampling of small windows of the temporal network $T$. All such sampling algorithms can be used to analyze a single temporal motif but become inefficient as the number of motifs to be counted grows, such as in Problem 2. In fact, they cannot leverage the additional information that all motifs $M_1, \dots, M_{|\mathcal{M}(H, \ell)|}$ must share a common static topology isomorphic to $H$. As stated in Section 1, when analysing a temporal network it is hard to know a priori which motif represents important functions for the network; therefore one often relies on testing all possible orderings $\sigma$ over one fixed target template $H$ for fixed $\ell, \delta$ [34, 45] (as in Prob. 1), resulting in a time-consuming and inefficient procedure. Our approach instead supports the direct analysis of multiple temporal motifs, enabling the study of hundreds of temporal motifs on massive networks in a very limited time.
4 ODEN
In this section we present odeN, our algorithm to address the motif template approximation problem (Prob. 2). We start in Section 4.1 with an overview of odeN. We then describe the algorithm in Section 4.2, analyze its time complexity in Section 4.3 and its theoretical guarantees, including an efficiently computable bound on the number of samples required to obtain the desired probabilistic guarantees, in Section 4.4.
4.1 Overview of odeN
Our algorithm odeN estimates the counts of motifs in $\mathcal{M}(H, \ell)$. The main idea is to avoid the explicit generation of all the motifs $M_i \in \mathcal{M}(H, \ell)$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$ to count them one at a time, as is required by existing algorithms that approximate a single motif count. odeN instead leverages the fact that the topology of all motifs must be isomorphic to the target template $H$, by reusing the computation while estimating the motif counts.
An overview of the main strategy adopted by our algorithm is presented in Figure 3. Given the input parameters of Problem 2, where $H$ is the target template, the idea behind our procedure is to consider the undirected static projected graph $G_T$ of the input temporal network $T$ and proceed as follows: i) find a set of subgraphs in the static graph $G_T$ that are isomorphic to $H$ by first sampling an edge $e_j$ of $G_T$ with some probability $p_{e_j}$, where $p_{e_j}$ depends, potentially, on $e_j$ and the temporal network $T$, and then enumerating all subgraphs of $G_T$ isomorphic to $H$ and containing $e_j$; ii) for each such subgraph, consider the corresponding temporal subgraph and compute all the counts of the subsequences of $\ell$ edges occurring within $\delta$-time in such temporal subgraph; iii) for each such subsequence identified, find the corresponding motif in $\mathcal{M}(H, \ell)$ of which the subsequence is a $\delta$-instance, and update a count for each motif identified; iv) weight each motif count appropriately in order to maintain an unbiased estimate of global motif counts; v) repeat steps i)-iv) for a sufficient number of iterations to guarantee the desired $(\varepsilon, \eta)$-approximation (see Problem 2).
4.2 Algorithm Description
Figure 3: Overview of odeN's approximation strategy. Let $H$ be a triangle, and $\ell = 3$, $\delta = 40$. odeN first collects the static projected network $G_T$, then samples an edge $e_j \in G_T$ randomly ($e_j = \{1, 2\}$ in the figure) and enumerates all the subgraphs of $G_T$ isomorphic to $H$ containing $e_j$. For each subgraph it collects the corresponding temporal network, counts the $\delta$-instances of the motifs, and combines the different counts to obtain unbiased estimates of motif counts. This procedure is repeated to obtain concentrated estimates.
odeN is described in Algorithm 1. It first computes $G_T = (V, E_T)$, the undirected projected static graph of $T$ (line 1), and initializes $C_{estimates}$ (line 2), used to store the estimates of motif counts, which are used to compute the estimators $C'_{M_i}$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$. Then it repeats $s$ times (line 3) the following procedure: i) pick a random edge $e_j$ from $G_T$ (line 4) according to some probability distribution over the edges of $E_T$; ii) enumerate all the subgraphs $h$ of $G_T$ such that $h \sim H$ and $e_j \in h$ (line 5); note that this enumeration step is local to $e_j$; iii) for each such $h$ (line 6), collect the corresponding temporal graph, i.e., all edges in $T$ for which their static projected edge is an edge of $h$ (line 7), sort the sequence of edges of such graph by increasing timestamps and apply some pruning criteria (lines 8-9); iv) if the sequence is not pruned, then update the estimates of the number of $\delta$-instances of each temporal motif by calling the routine FastUpdate (line 10). FastUpdate features an efficient implementation of the general algorithm by Paranjape et al. [34], for which we devised efficient encodings of the motifs within integers through bitwise operations. Such function updates $C_{estimates}$ in order to maintain for each motif the count that will be used to output its unbiased estimate (see Appendix B). Let $C_{M_i}(e)$ be the number of $\delta$-instances in $T$ of $M_i$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, whose undirected projected static network contains edge $e \in G_T$. FastUpdate updates the estimate of the counts for each motif $M_i$ by summing its unbiased estimate obtained at the $j$-th iteration (i.e., $X^j_{M_i} = C_{M_i}(e_j) / (|E_H| p_{e_j})$). Once the procedure is repeated $s$ times, for each motif $M_i \in \mathcal{M}(H, \ell)$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, odeN computes the final estimate $C'_{M_i} = \frac{1}{s} \sum_{j=1}^{s} X^j_{M_i}$, where $X^j_{M_i} = \frac{1}{|E_H|} \sum_{e \in G_T} C_{M_i}(e) X^j_e / p_e$ is the estimate obtained at the $j$-th iteration (with $X^j_e$ being a Bernoulli random variable denoting if edge $e \in G_T$ is sampled at the $j$-th iteration, s.t. $\mathbb{P}[X^j_e = 1] = p_e$), and outputs it together with the motif (we output $M_i$ over the node-set $V_H$) (lines 12-13). We show in Lemma 4.1 that odeN outputs unbiased estimates for all the motif counts.
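The weighting $C_{M_i}(e_j) / (|E_H| p_{e_j})$ can be checked numerically on a toy network. The sketch below is not odeN's implementation: it assumes $H$ is a triangle with $\ell = 3$, replaces the local enumeration and FastUpdate with brute force over all triples of temporal edges, assumes strictly increasing timestamps, and uses hypothetical function names and data. It verifies with exact rational arithmetic that the expectation of the single-iteration estimate equals the exact count of every motif.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations

def canonical(seq):
    """Canonical form of a directed edge sequence under vertex relabeling (Def. 2.5)."""
    nodes = sorted({v for e in seq for v in e})
    relabelings = (dict(zip(nodes, p)) for p in permutations(range(len(nodes))))
    return min(tuple((r[x], r[y]) for x, y in seq) for r in relabelings)

def triangle_counts(temporal_edges, delta, through=None):
    """Brute-force counts of 3-edge triangle delta-instances, keyed by motif.
    If `through` is a static edge, only instances whose projection contains it count."""
    counts = Counter()
    by_time = sorted(temporal_edges, key=lambda e: e[2])
    for trio in combinations(by_time, 3):
        static = {frozenset(e[:2]) for e in trio}
        times = [e[2] for e in trio]
        if len(static) != 3 or len(set().union(*static)) != 3:
            continue  # the three edges do not form a triangle
        if not times[0] < times[1] < times[2] or times[2] - times[0] > delta:
            continue  # not time ordered, or not within delta
        if through is not None and through not in static:
            continue
        counts[canonical([e[:2] for e in trio])] += 1
    return counts

# Hypothetical toy temporal network.
T = [(1, 2, 1), (2, 3, 4), (3, 1, 6), (1, 2, 7), (3, 2, 9),
     (1, 3, 12), (3, 4, 13), (4, 1, 15), (2, 1, 30)]
delta, num_template_edges = 20, 3

weight = Counter(frozenset(e[:2]) for e in T)
p = {e: Fraction(w, len(T)) for e, w in weight.items()}  # distribution (4)

exact = triangle_counts(T, delta)
expected = Counter()
for e, pe in p.items():
    for motif, c in triangle_counts(T, delta, through=e).items():
        # probability of sampling e, times the estimate produced when e is sampled
        expected[motif] += pe * (Fraction(c, num_template_edges) / pe)
```

Each $\delta$-instance is reachable from each of the $|E_H| = 3$ static edges of its triangle, which is why dividing by $|E_H|$ removes the overcounting.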
Algorithm 1: odeN
Input: $T = (V, E)$, $H = (V_H, E_H)$, $\delta$, $s$, $\ell$
Output: $(M_i, C'_{M_i})$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, where $C'_{M_i}$ is an estimate of $C_{M_i}$ for the motifs in $\mathcal{M}(H, \ell)$.
1  $G_T = (V, E_T) \leftarrow$ UndirectedStaticProjection($T$)
2  $C_{estimates} \leftarrow \{\}$
3  for $j \leftarrow 1$ to $s$ do
4      $e_j = \{x_j, y_j\} \leftarrow$ RandomEdge($p(e) : e \in E_T$)
5      $\mathcal{H} \leftarrow \{h \subseteq G_T : h \sim H, \{x_j, y_j\} \in h\}$
6      foreach $h \in \mathcal{H}$ do
7          $S \leftarrow \{(x, y, t), (y, x, t) \in E : \{x, y\} \in h\}$
8          SortInPlace($S$)   ▷ By increasing timestamps
9          if *Pruning criteria are not met* then
10             FastUpdate($\delta$, $S$, $C_{estimates}$, $p(e_j)$, $H$)
11 foreach $(M, X_M) \in C_{estimates}$ do
12     $C'_M \leftarrow X_M / s$
13     output $(M, C'_M)$
We briefly discuss the pruning criteria used in line 9. Given a candidate temporal graph $S$ for which $G_S \sim H$ holds, we check in linear time if $S$ can contain a $\delta$-instance of a motif or not: since $S$ is already sorted by increasing timestamps (see line 8), we efficiently check if there are at least $\ell$ edges within $\delta$-time. If not, then we prune the sequence (since by definition a $\delta$-instance of a motif with $k$ nodes and $\ell$ edges must have $\ell$ edges occurring within $\delta$-time). We thus avoid calling the subroutine FastUpdate, which has an exponential complexity in general (see Section 4.3), on $S$.
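The linear-time pruning test can be sketched as a scan over windows of $\ell$ consecutive timestamps. This is one possible reading of the criterion described above, with a hypothetical function name; the paper does not spell out its implementation.

```python
def may_contain_instance(sorted_times, ell, delta):
    """Return True if some ell consecutive timestamps of the sorted sequence
    span at most delta, i.e. the sequence cannot be pruned."""
    return any(
        sorted_times[i + ell - 1] - sorted_times[i] <= delta
        for i in range(len(sorted_times) - ell + 1)
    )

# Hypothetical timestamp lists, already sorted as after line 8 of Algorithm 1.
dense = [1, 5, 30, 31, 33]
sparse = [1, 10, 20, 30]
```

Because the timestamps are sorted, if any $\ell$ edges fit within $\delta$ then so do $\ell$ consecutive ones, which is why checking consecutive windows suffices.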
We now discuss the probability distribution used to sample a random edge $e_j$ from $G_T$ (line 4). Due to space constraints, the subroutine FastUpdate, which updates the motif estimates at each iteration (line 10), and the algorithms employed for the static enumeration are described in Appendix B (Sections B.1 and B.2).
Since our final estimate is an average over $s$ samples of the variables $X^j_{M_i}$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, $j = 1, \dots, s$, and given that $X^j_{M_i}$ is an unbiased estimate (see Lemma 4.1), the final estimate is also a consistent estimator (i.e., it converges to $C_{M_i}$ as $s \to \infty$) if each edge has a positive probability of being sampled (see footnote 2). Thus any probability mass function assigning positive probabilities to edges can be adopted. We considered different distributions over the edges of $E_T$:
(1) Uniform: $p_e = 1/|E_T|$, $e \in E_T$;
(2) Static degree based: $p_e = d(e) / (\sum_{e' \in E_T} d(e'))$, $e \in E_T$, where $d(e = \{x, y\}) = d(x) + d(y)$ is the degree of the edge as sum of the degrees of its nodes $x, y \in V$ in $G_T$;
(3) Temporal degree based: $p_e = d(e) / (\sum_{e' \in E_T} d(e'))$ with $d(e = \{x, y\}) = |\{t : \exists (x, z, t) \lor (z, x, t) \in E\}| + |\{t : \exists (z, y, t) \lor (y, z, t) \in E, z \neq x\}|$, $e \in E_T$;
(4) Temporal edge weight based: $p_{e = \{x, y\}} = |\{(x, y, t), (y, x, t) \in E\}| / m$, $e \in E_T$.
We empirically found the distribution (4) to be the fastest to converge for a small number $s$ of iterations, thus we use it in our analysis. We observe that many other candidate distributions can be designed (e.g., combining two of those already listed with weights $\xi, 1 - \xi$, $\xi \in (0, 1)$), making our framework extremely versatile.
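As an illustration of distribution (4), the following sketch computes $p_e$ from the temporal edges and draws one static edge from it. The function names are hypothetical, and the use of `random.Random.choices` is a choice of this sketch rather than something prescribed by the paper; it assumes the temporal edges are unique, so that the weights sum to $m$.

```python
import random
from collections import Counter

def temporal_edge_weight_distribution(temporal_edges):
    """Distribution (4): p_e is the fraction of temporal edges mapping on {x, y}."""
    m = len(temporal_edges)
    weight = Counter(frozenset((x, y)) for x, y, _ in temporal_edges)
    return {e: w / m for e, w in weight.items()}

def random_edge(p, rng):
    """Draw one static edge according to the probability mass p."""
    edges = list(p)
    return rng.choices(edges, weights=[p[e] for e in edges], k=1)[0]

# Hypothetical toy network.
edges = [(1, 2, 3), (2, 1, 8), (2, 5, 6), (5, 6, 20)]
p = temporal_edge_weight_distribution(edges)
sampled = random_edge(p, random.Random(0))
```

Every static edge has at least one temporal edge, so each $p_e$ is positive, as required for consistency of the estimator.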
We conclude by summarizing some nice properties of our algorithm: 1) it computes the estimates only for the temporal motifs occurring in the input temporal network $T$ (except for the very impractical case where the motifs in $\mathcal{M}(H, \ell)$ all have zero counts) without generating all the possible candidates, while existing sampling techniques require first generating all the candidates and then executing the algorithms on such candidates, even for motifs with zero counts; 2) it takes advantage of the constraint that all motifs share the same underlying topology ($H$), saving computation when estimating the different counts; 3) it is trivially parallelizable: all the $s$ iterations can be executed in parallel; 4) it can easily use most of the fast state-of-the-art subgraph enumeration algorithms developed for the exact subgraph isomorphism problem (see Appendix B.2).
Footnote 2: More formally it is only necessary to assign to each $\delta$-instance a known positive sampling probability.
4.3 Time Complexity
In this section we briefly describe the time complexity of odeN. odeN needs to compute the probabilities $p(e)$ of edges in advance, which requires an $O(|E_T|)$ preprocessing step. Interestingly, this step does not depend on the target template $H$, so it can be reused for different target templates. One of the most expensive steps in Algorithm 1 is the local enumeration to identify the set $\mathcal{H}$, which in general requires exponential time (line 5). For specific topologies this step can be implemented very efficiently with symmetry breaking conditions and min-degree expansion. For example, if $H$ is a triangle, this enumeration "local" to $e_j = \{x_j, y_j\}$ can be done in $O(\min(d_{x_j}, d_{y_j}))$ time. Let $|\mathcal{H}^*|$ be the maximum cardinality of a set of subgraphs isomorphic to $H$ and adjacent to an edge in $G_T$. Let $|S^*|$ denote the maximum cardinality of a set $S$ collected (in line 7) by our algorithm odeN. Sorting $S^*$ requires $O(|S^*| \log |S^*|)$ time. The subroutine FastUpdate has a complexity dominated by $O((|S^*| + \ell) |E_H|^\ell)$ (see [34] and App. B.1 for more details). So overall the complexity of our procedure is $O(|E_T| + s(\zeta_{enum} + |\mathcal{H}^*| (|S^*| \log(|S^*|) + |E_H|^\ell (|S^*| + \ell))))$, where $\zeta_{enum}$ is the time required by the static enumerator used as subroutine to compute the set $\mathcal{H}^*$. Such step in general is exponential in the number of edges $|E_T|$ and depends on the exact technique used as subroutine. The final complexity accounts for the cycle (in line 3) that is repeated $s$ times. The parallel version of our algorithm, which executes the cycle of line 3 in parallel on the $c$ processing units available, leads to a time complexity of $O(|E_T| + \frac{s}{c} (\zeta_{enum} + |\mathcal{H}^*| (|S^*| \log(|S^*|) + |E_H|^\ell (|S^*| + \ell))))$.
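For the triangle case mentioned above, a minimal sketch of the local enumeration is to scan the neighbours of the lower-degree endpoint and test membership in the other endpoint's neighbour set. It assumes $G_T$ is stored as a dictionary of hash sets (an assumption of this sketch, with a hypothetical function name), so each membership test takes expected constant time.

```python
def triangles_through_edge(adj, x, y):
    """Enumerate the triangles of the static graph that contain edge {x, y}.
    adj maps each node to the set of its neighbours in G_T."""
    # Iterate over the endpoint with the smaller degree: O(min(d_x, d_y)) lookups.
    small, large = (x, y) if len(adj[x]) <= len(adj[y]) else (y, x)
    return [frozenset((x, y, z)) for z in adj[small] if z != large and z in adj[large]]

# Hypothetical static graph: two triangles {1, 2, 3} and {1, 3, 4} sharing edge {1, 3}.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```

For larger templates a general subgraph enumerator would be needed in place of this special case.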
4.4 Theoretical Guarantees
In this section we present the theoretical guarantees provided by odeN. All proofs are provided in Appendix D.
Recall that our algorithm outputs, for each motif $M_i \in \mathcal{M}(H,\ell)$, $i = 1, \dots, |\mathcal{M}(H,\ell)|$, the following estimate:

$$C'_{M_i} = \frac{1}{s}\sum_{j=1}^{s} X^{j}_{M_i} = \frac{1}{s\,|E_H|}\sum_{j=1}^{s}\sum_{e \in G_T} C_{M_i}(e)\,\frac{X_e}{p_e}.$$

The following shows that such estimates are unbiased estimates of $C_{M_i}$, $i = 1, \dots, |\mathcal{M}(H,\ell)|$.
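The estimator just defined can be checked exactly on a small input. The sketch below is an illustration under stated assumptions, not part of odeN: the toy weights and instance counts are invented, one static edge is sampled per iteration, and the sampling law is assumed to be proportional to the number of temporal edges mapped onto each static edge. It computes the expectation of a single-iteration estimate with exact rational arithmetic and compares it with the true count.

```python
from fractions import Fraction

# Hypothetical toy input: number of temporal edges mapped onto each static
# edge of G_T, and delta-instance counts of one motif on two triangles.
weights = {"a": 2, "b": 3, "c": 1, "d": 1, "e": 2}
instances_per_triangle = {("a", "b", "c"): 4, ("b", "d", "e"): 2}
E_H = 3  # number of edges of the target template (a triangle)

m = sum(weights.values())
# Assumed sampling law: p_e proportional to the weight of e.
p = {e: Fraction(w, m) for e, w in weights.items()}

# C(e): number of delta-instances whose static projection contains e.
count_on_edge = {e: 0 for e in weights}
for triangle, count in instances_per_triangle.items():
    for e in triangle:
        count_on_edge[e] += count

true_count = sum(instances_per_triangle.values())


def single_iteration_estimate(e):
    """Value of the estimate when edge e is the one sampled."""
    return Fraction(count_on_edge[e], E_H) / p[e]


# Exact expectation over the choice of the sampled edge.
expectation = sum(p[e] * single_iteration_estimate(e) for e in weights)
print(expectation, true_count)
```

The division by $|E_H|$ compensates for each instance being reachable from every one of its static edges, which is why the expectation matches the true count for any strictly positive sampling law.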
Lemma 4.1. For each motif-count pair $(M_i, C'_{M_i})$ reported in output by odeN, $C'_{M_i}$ is an unbiased estimate of $C_{M_i}$, that is, $\mathbb{E}[C'_{M_i}] = C_{M_i}$.
Let $\alpha = \min_{\{x,y\} \in E_T} \{|\{(x,y,t), (y,x,t) \in E\}|\}$, i.e., the minimum number of temporal edges of $T$ that map on an edge in $G_T$. We now give an upper bound to the variance of the estimates provided by Algorithm 1 for each motif reported in output.
Lemma 4.2. For each motif-count pair $(M_i, C'_{M_i})$ reported in output by odeN, it holds that

$$\mathrm{Var}[C'_{M_i}] \le \frac{C_{M_i}^2}{s}\left(\frac{m}{\alpha\,|E_H|} - 1\right).$$
To give a bound on the number $s$ of samples required by odeN to output an $\varepsilon$-approximation that holds on all motifs in output with probability $> 1 - \eta$, we combine Bennett's inequality [5], an advanced result on the concentration of sums of independent random variables as reported in [40], with a union bound, obtaining the following main result.
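Theorem 4.3. Let $s$ be the number of iterations of odeN, let $\varepsilon \in \mathbb{R}^+$, and $\eta \in (0, 1)$. If

$$s \ge \left(\frac{m}{\alpha\,|E_H|} - 1\right) \frac{1}{(1+\varepsilon)\ln(1+\varepsilon) - \varepsilon}\, \ln\!\left(\frac{2\,|\mathcal{M}(H,\ell)|}{\eta}\right)$$

then

$$\mathbb{P}\big[\exists\, i \in \{1, \dots, |\mathcal{M}(H,\ell)|\} : |C'_{M_i} - C_{M_i}| \ge \varepsilon\, C_{M_i}\big] \le \eta.$$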
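The bound of Theorem 4.3 is straightforward to evaluate numerically. The following is a minimal sketch; the function name `sample_bound` and its argument names are hypothetical, and the example values at the end (in particular $\alpha = 1$) are assumptions chosen for illustration rather than figures reported by the paper.

```python
import math


def sample_bound(m, alpha, num_template_edges, num_motifs, eps, eta):
    """Smallest integer s satisfying the inequality of Theorem 4.3."""
    if eps <= 0 or not (0 < eta < 1):
        raise ValueError("eps must be positive and eta must lie in (0, 1)")
    # Denominator (1 + eps) ln(1 + eps) - eps from Bennett's inequality.
    h = (1 + eps) * math.log1p(eps) - eps
    variance_factor = m / (alpha * num_template_edges) - 1
    # The union bound over all motifs contributes the ln(2 |M| / eta) term.
    raw = variance_factor * math.log(2 * num_motifs / eta) / h
    return max(1, math.ceil(raw))


# Illustrative call: 826K temporal edges, triangle template (3 edges),
# 8 motifs, eps = 1, eta = 0.1, and an assumed alpha = 1.
print(sample_bound(826_000, 1, 3, 8, 1.0, 0.1))
```

The bound grows only logarithmically with the number of motifs, so estimating all motifs in $\mathcal{M}(H,\ell)$ at once costs little more, in terms of samples, than estimating a single one.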
5 EXPERIMENTAL EVALUATION
We implemented odeN and tested it on several large datasets (see Section 5.1 for details on setup and data). Our experimental evaluation has the following goals: compare odeN with state-of-the-art algorithms for approximating motif counts (Section 5.2); evaluate the scalability of a simple parallel implementation of odeN (Section 5.3); provide a case study highlighting the usefulness of odeN (Section 5.4) for analyzing real-world temporal networks.
5.1 Setup, and Datasets
We briefly describe the setup and the large-scale datasets used in our experimental evaluation.
We implemented our algorithm odeN in C++20 and compiled it under gcc 9.3 with optimization flag enabled (implementation available at https://github.com/VandinLab/odeN); additional details on the implementation are in Appendix C. We compared odeN with four different baselines, denoted as PRESTO-A (PR-A), PRESTO-E (PR-E) [40], LS [26], and ES [47]. We used the original implementations available from the authors. We performed all experiments under Ubuntu 20.04 on a machine with 64 cores, Intel Xeon E5-2698 2.3GHz, running each algorithm single-threaded and with 300GB of maximum RAM allowed.
The datasets used in our experimental evaluation are reported in Table 1, which shows the number of nodes and edges of $T$, the precision of the timestamps, the timespan of the network, the number $|E_T|$ of undirected edges in the corresponding undirected projected static network $G_T$, the maximum degree $d_{\max}$ of a node in $G_T$, and the maximum number $w_{\max}$ of temporal edges that are mapped on the same static edge in $G_T$. The datasets are from different domains: SO is a network that models interactions from the Stack-Overflow platform [34], BI is a network of Bitcoin transactions [26], RE is a network built from comments on the platform Reddit [26], and EC is a bipartite temporal network built from IPv4 packets exchanged between Chicago and Seattle [40]. See the original papers for more details on the networks and the processes they model.
When measuring the running times for the various algorithms we exclude the time to read the dataset. Since ES's implementation supports only values of $\ell$ up to 4, we do not report results for ES and $\ell > 4$. Unless otherwise stated we used $\delta = 86400$ for SO and RE, $\delta = 43200$ on BI, and $\delta = 50000$ on EC, as done in previous works [26, 34, 47]. Since all algorithms used in our comparison have different
odeN: Simultaneous Approximation of Multiple Motif Counts in Large Temporal Networks. CIKM '21, November 1–5, 2021, Virtual Event, QLD, Australia
Table 1: Datasets used and their statistics. See Section 5.1 for details on the statistics reported.

Name   n        m       |E_T|    d_max   w_max   Precision   Timespan
SO     2.58M    47.9M   28.1M    44K     594     sec         2774 (days)
BI     48.1M    113M    84.3M    2.4M    24.2K   sec         2585 (days)
RE     8.40M    636M    435.3M   0.3M    165K    sec         3687 (days)
EC     11.16M   2.32B   66.8M    0.3M    3.8M    μ-sec       62.0 (mins)
parameters and only odeN counts multiple motifs simultaneously, we used the following procedure to choose the parameters. For a given target template $H$ and $\ell$, we ran PRESTO-A, PRESTO-E, LS, and ES for each motif in $\mathcal{M}(H,\ell)$ with fixed parameters, and computed their running time as the sum of the running times required by the single motifs in $\mathcal{M}(H,\ell)$. We then fixed the parameters of odeN so that its running time would be at most the same as the other methods, or close to it. All the parameters used in the experiments (including sample sizes) are reported with the source code. To extract the exact counts of motifs we used a modified version of the algorithm by Mackey et al. [27]. We do not report the running times of this algorithm since, even though it employs parallelism, it still runs several orders of magnitude slower than approximate approaches.
5.2 Approximation Quality and Running Time
In this section we compare the quality of the estimates and the running times of odeN and the baseline sampling approaches.
To evaluate the quality of the approximations we used the MAPE (Mean Absolute Percentage Error) metric over ten executions of each algorithm and parameter configuration. The MAPE is computed as follows: let $C'_{M_i}$ be the estimate of $C_{M_i}$, $i = 1, \dots, |\mathcal{M}(H,\ell)|$, returned by an algorithm; then the relative error of such estimate is $|C'_{M_i} - C_{M_i}| / C_{M_i}$. The MAPE is the average over the ten runs of the relative errors, in percentage. On each of the ten runs we also measured the running time of each algorithm, for which we report the arithmetic mean.
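The metric described above can be written in a few lines. This is a sketch with hypothetical names (`mape_per_motif`) and invented toy numbers; it returns one MAPE value per motif, averaging the relative errors over the runs.

```python
def mape_per_motif(exact_counts, estimates_per_run):
    """MAPE (in %) of each motif, averaged over the runs.

    exact_counts[i] is the exact count of motif i (assumed non-zero);
    estimates_per_run[r][i] is the estimate of motif i in run r.
    """
    num_runs = len(estimates_per_run)
    result = []
    for i, exact in enumerate(exact_counts):
        total_relative_error = sum(
            abs(run[i] - exact) / exact for run in estimates_per_run
        )
        result.append(100.0 * total_relative_error / num_runs)
    return result


# Hypothetical toy data: two motifs, two runs.
exact = [100, 200]
runs = [[110, 190], [90, 220]]
print(mape_per_motif(exact, runs))
```

Motifs with an exact count of zero would need separate handling, since the relative error is undefined for them.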
We first discuss the quality of the estimates for different datasets when $H$ is a triangle and $\ell \in \{4, 5\}$. For $\ell = 4$ there are $|\mathcal{M}(H,\ell)| = 96$ triangles, while for $\ell = 5$, $|\mathcal{M}(H,\ell)|$ is 800. So as $\ell$ increases the approximation task becomes more challenging, due to the exponential growth of the number of motifs. We also observe that, to the best of our knowledge, such a large number of temporal motifs was never tested before on large datasets due to the limitations of existing algorithms, while, as we will show, odeN renders the approximation task practical even on hundreds of motifs.
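The motif counts quoted in this section can be reproduced by brute force. The sketch below rests on an assumed reading of the definition: a motif is an ordered sequence of $\ell$ directed edges whose undirected projection is exactly $H$, and two sequences are the same motif if one is obtained from the other by relabeling nodes. The function `count_motifs` and the template constants are hypothetical names, not part of odeN.

```python
from itertools import permutations, product


def count_motifs(static_edges, num_temporal_edges):
    """Number of distinct ordered directed edge sequences projecting onto H."""
    nodes = sorted({v for edge in static_edges for v in edge})
    directed = list(static_edges) + [(v, u) for u, v in static_edges]
    required = {frozenset(edge) for edge in static_edges}
    relabelings = [dict(zip(nodes, perm)) for perm in permutations(nodes)]

    canonical_forms = set()
    for seq in product(directed, repeat=num_temporal_edges):
        # Keep only sequences that use every static edge of H at least once.
        if {frozenset(edge) for edge in seq} != required:
            continue
        # Canonical form: lexicographically smallest relabeled sequence.
        canonical_forms.add(
            min(tuple((r[u], r[v]) for u, v in seq) for r in relabelings)
        )
    return len(canonical_forms)


TRIANGLE = [(0, 1), (1, 2), (0, 2)]
SQUARE = [(0, 1), (1, 2), (2, 3), (3, 0)]
EDGE = [(0, 1)]

print(count_motifs(TRIANGLE, 4), count_motifs(TRIANGLE, 5))
```

Under this reading the enumeration yields 96 and 800 for triangles with $\ell = 4$ and $\ell = 5$, and the same function gives 48 for four-edge squares and 8 for four-edge single-edge motifs, matching the values used in the experiments.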
The results on the SO dataset are shown in Figure 4a. odeN provides much sharper estimates than state-of-the-art sampling techniques for single motif estimations on motifs $M_1, \dots, M_{|\mathcal{M}(H,\ell)|}$: the relative error on $\ell = 4$-edge triangles is bounded by 5%, and for $\ell = 5$-edge triangles (where $|\mathcal{M}(H,\ell)| = 800$) the relative error is bounded by 12%, while state-of-the-art algorithms report much less accurate estimates, with twice the relative error of odeN, on each configuration. We report the running times to obtain such estimates in Table 2. Interestingly, odeN is more than 3× faster with $\ell = 4$ than any sampling algorithm and 1.7× faster with $\ell = 5$. For the other datasets, since extracting all the exact counts for $\ell > 4$ is extremely time consuming, requiring up to months of computation,
[Figure 4: plots omitted. Each panel compares PRESTO-A, PRESTO-E, LS, ES (where available), and odeN; the y-axis reports the % relative error (MAPE) on a logarithmic scale.]

Figure 4: Approximation error on different datasets. (4a): SO dataset, $H$ is a triangle, for $\ell = 4$ (left) and $\ell = 5$ (right). (4b): $H$ is a triangle, $\ell = 4$, BI dataset (left) and RE dataset (right). (4c): EC dataset, $H$ is an edge, $\ell = 4$ (left); SO dataset, $H$ is a square, $\ell = 4$ (right).
we will not discuss the approximation qualities for $\ell = 5$ (since we do not have the exact counts to evaluate them).
On dataset BI (Figure 4b left) odeN provides more concentrated estimates for the $|\mathcal{M}(H,\ell)| = 96$ triangles than all other algorithms except ES, which also has a smaller running time than odeN. This may be related to the static graph structure of BI, which has some very high-degree nodes (see Table 1). Therefore odeN may sample edges with very high-degree nodes, introducing overcounting in its estimates. Nonetheless, for higher values of $\ell$ this issue is amortized over the growing number of motifs $|\mathcal{M}(H,\ell)|$.
On dataset RE (Figure 4b right) the estimates by odeN are all within 13% of relative error and improve significantly over state-of-the-art sampling algorithms, by up to one order of magnitude in precision. Such estimates were notably obtained with significantly smaller running time than state-of-the-art sampling algorithms, improving up to 2× over the running time of ES and 1.4× over PRESTO (as reported in Table 2).
Finally, on the EC dataset, which is a bipartite temporal network with more than 2 billion edges, we evaluated the approximation qualities with $H$ being an edge and $\ell = 4$ (for which $|\mathcal{M}(H,\ell)| = 8$). Such motifs have fundamental importance in the analysis of temporal networks since they can be seen as building blocks [16,
Table 2: Running times (in seconds) to obtain the results in Figure 4 (results are shown following the order in Figure 4). Under H we report the topology of H used: T for triangles, E for edges, and S for squares. "-" denotes not applicable, while "✗" denotes out of RAM.

Dataset   ℓ   H   PR-A      PR-E      LS        ES        odeN
SO        4   T   533.4     537.7     555.5     567.2     174.4
SO        5   T   4405      4408      4390      -         2515
BI        4   T   2048.6    2065.2    2754.6    1602.9    1948.9
RE        4   T   9787.1    10165.8   14289.7   13172.3   6814.9
EC        4   E   2581.5    3014.9    2981.9    ✗         1234.3
SO        4   S   15613.7   16718.7   14344.6   26118.3   4517.9
49]. We report the results on such motifs in Figure 4c (left) (ES is not shown since it did not terminate with the allowed memory budget). The estimates of odeN are well concentrated and within 20% of relative error, while other sampling approaches provide approximations with a relative error up to 90% or more. Moreover, odeN's results were obtained with a speedup of at least 2× over all the other sampling algorithms, rendering the approximation task feasible in a small amount of time on very large temporal networks.
To illustrate the advantage of odeN over existing state-of-the-art exact and approximation algorithms, we compared the various algorithms on dataset SO when $H$ is set to be a square and $\ell = 4$, for which $|\mathcal{M}(H,\ell)| = 48$. As [47] observed, among the 4-edge square motifs there are 16 motifs that do not grow as a single component (i.e., their orderings start with $\langle (1,2)\,(3,4) \cdots \rangle$). Estimating the counts of such motifs is particularly hard for most of the current state-of-the-art sampling algorithms since they generate a large number of partial matchings, while this aspect does not impact odeN. The results are shown in Figure 4c (right). odeN provides tight approximations under 9% of relative error for all four-edge square motifs, while other sampling algorithms fail to provide sharp estimates for some of the motifs. Surprisingly, as shown in Table 2, to obtain such estimates odeN required less than 1.3 hours of computation while the exact computation of the counts required more than two weeks; odeN is at least 3× faster than all the other sampling algorithms, and 5.4× faster than ES.
Overall, these results show that our algorithm odeN achieves much more precise estimates within a significantly smaller running time than state-of-the-art sampling algorithms when estimating the counts $C_{M_1}, \dots, C_{M_{|\mathcal{M}(H,\ell)|}}$ for different values of $\ell$ and different topologies of the target template $H$ (see Problem 1 in Section 2).
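The speedups discussed in this section follow directly from the running times in Table 2. The sketch below simply divides each baseline time by odeN's time; the dictionary layout and the helper name `speedups` are hypothetical, while the numbers are copied from Table 2 (missing or failed runs are stored as `None`).

```python
# Running times in seconds from Table 2, keyed by (dataset, ell, topology).
TABLE2 = {
    ("SO", 4, "T"): {"PR-A": 533.4, "PR-E": 537.7, "LS": 555.5, "ES": 567.2, "odeN": 174.4},
    ("SO", 5, "T"): {"PR-A": 4405, "PR-E": 4408, "LS": 4390, "ES": None, "odeN": 2515},
    ("BI", 4, "T"): {"PR-A": 2048.6, "PR-E": 2065.2, "LS": 2754.6, "ES": 1602.9, "odeN": 1948.9},
    ("RE", 4, "T"): {"PR-A": 9787.1, "PR-E": 10165.8, "LS": 14289.7, "ES": 13172.3, "odeN": 6814.9},
    ("EC", 4, "E"): {"PR-A": 2581.5, "PR-E": 3014.9, "LS": 2981.9, "ES": None, "odeN": 1234.3},
    ("SO", 4, "S"): {"PR-A": 15613.7, "PR-E": 16718.7, "LS": 14344.6, "ES": 26118.3, "odeN": 4517.9},
}


def speedups(row):
    """Ratio baseline time / odeN time for every baseline with a valid time."""
    return {
        name: time / row["odeN"]
        for name, time in row.items()
        if name != "odeN" and time is not None
    }


for key, row in TABLE2.items():
    ratios = speedups(row)
    print(key, {name: round(value, 2) for name, value in ratios.items()})
```

A ratio below 1 means the baseline was faster than odeN, as happens for ES on the BI dataset.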
5.3 Parallel Implementation
In this section we briefly describe the advantages of a simple parallel implementation of Algorithm 1. As discussed in Section 4.2, the for cycle (from line 3) can be trivially parallelized; we therefore implemented this strategy through a thread pooling design pattern.
We describe the results obtained with $H$ set to be a triangle, $\ell = 4$, and on the dataset SO; similar results are observed for other datasets. We tested the speedup achieved with $\omega \in \{2, 4, 8, 16\}$ threads over the sequential implementation. Let $T_\omega$ be the average running time with $\omega$ threads over ten executions of odeN with fixed
[Figure 5: plots omitted. Both panels plot the speedup over the sequential implementation against the number of threads (2, 4, 8, 16). Left: $s \in \{1 \cdot 10^6, 2 \cdot 10^6, 3 \cdot 10^6\}$. Right: $\delta \in \{43200, 86400, 129600\}$.]

Figure 5: Speed-up of odeN's parallel implementation. (Left): Varying $s$ and fixed $\delta$; (Right): Varying $\delta$ and fixed $s$.
parameters, with $T_1$ being the average time for running the algorithm sequentially. We report the value of $T_1 / T_\omega$, $\omega \in \{2, 4, 8, 16\}$, i.e., the speedup over the sequential implementation. Fig. 5 (Left) shows the speedup across different values of the sample size $s$, with $\delta = 86400$. We observe an almost linear speedup up to 4 threads and then a slightly worse performance, especially for small sample sizes, which may be related to the time needed to process each sample. Fig. 5 (Right) shows how the speedup changes for $s = 2 \cdot 10^6$ and different values of $\delta$. We note that odeN does not seem to be impacted by the value of $\delta$, and always attains similar performance. As captured by our analysis in Section 4.3, the algorithm does not reach a fully linear speedup since we did not parallelize the computation of the sampling probabilities $p(e)$, $e \in E_T$. As a remark, our parallel implementation is not optimized, and more advanced parallel strategies may substantially increase its speedup.
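The effect of the non-parallelized preprocessing can be illustrated with Amdahl's law, which is the model implied by a running time of the form "sequential part plus parallel part divided by the number of threads". The sketch below is only a model: the function name `modeled_speedup` and the sequential fractions used are hypothetical values, not measurements from the experiments.

```python
def modeled_speedup(sequential_fraction, threads):
    """Amdahl's law: speedup when only (1 - sequential_fraction) parallelizes."""
    if not (0.0 <= sequential_fraction <= 1.0) or threads < 1:
        raise ValueError("invalid sequential fraction or thread count")
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / threads)


# Hypothetical sequential fractions (share of time spent computing p(e)).
for fraction in (0.0, 0.05, 0.10):
    row = [round(modeled_speedup(fraction, t), 2) for t in (2, 4, 8, 16)]
    print(f"sequential fraction {fraction:.2f}: {row}")
```

Even a small sequential share caps the achievable speedup well below the number of threads, which is consistent with the qualitative behaviour described above.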
5.4 A Case Study
In this section we illustrate how counting multiple motifs, corresponding to the same target template $H$, with odeN can be used to extract useful insights from a temporal network. We consider a real-world activity network from Facebook [46]. In this network, each node represents a user and a temporal edge $(u, v, t)$ indicates that user $u$ posted on $v$'s wall at time $t$ (see the original publication [46] for more details). The network contains information collected from September 2006 to January 2009. After removing self-loops, the network has $n = 45.7$K nodes, $m = 826$K temporal edges, and $|E_T| = 179$K static (undirected) edges. We will first show how analyzing the motif counts obtained with odeN provides complementary insights to those in [46], which relied mostly on static analyses. We then conclude by discussing how the counts of the network evolve by varying only the parameter $\ell$ (i.e., fixing $H$ and $\delta$), showing that such counts differ surprisingly across values of this parameter.
In the original paper [46], the authors partitioned the Facebook network in nine different snapshots (obtaining nine projected static networks), with each snapshot spanning 90 days of interactions in the network. The authors observed that consecutive snapshots have small resemblance, i.e., on average only 45% of the edges are preserved through consecutive snapshots. The authors also observed that despite this difference all the snapshots have similar, almost invariant, structural properties in terms of their clustering coefficient, average degree distribution, and others. We used odeN (with
$\varepsilon = 1$, $\eta = 0.1$) to compare the temporal networks associated to the snapshots by computing the counts of the 8 temporal motifs in $\mathcal{M}(H, \ell = 3)$ with $H$ being a triangle and $\delta = 86400$ (1 day). On each snapshot, after extracting the motif counts, we computed for each motif $M$ its normalized count on the snapshot as $C_M / \sum_{i=1}^{8} C_{M_i}$. The results are reported in Fig. 6a (see Appendix E for a visual representation of the motifs). Interestingly, even if in [46] the authors highlight small resemblance through different snapshots, the counts of the motifs are stable across the different snapshots, especially by looking at the first three and the last two snapshots. Surprisingly, on snapshots 6 and 7, which correspond to the period of observation of mid-2008, we observe a significant variation in the motif counts w.r.t. the previous months. This is the period where the authors of [46] observed a change in Facebook's interface (that led to a drop in the growth of the network) that seems to be correlated to the variation in the motif counts. Even more surprisingly, this aspect is not captured by a static analysis of the snapshots as performed in [46]. Thus, our temporal motif analysis through odeN is able to capture a variation in the growth of the network that the static analysis cannot highlight. (We discuss how the motifs and their counts can be used to characterize the activity on the network in Appendix E.)
We then analyzed how the different motif counts of the whole network change by varying the parameter $\ell$. We fixed $H$ to be a triangle and ran odeN with $\varepsilon = 1$, $\eta = 0.1$, $\delta = 86400$. The results are shown in Figure 6b. We observe that the counts of $M_1, \dots, M_{|\mathcal{M}(H,\ell)|}$ vary significantly by increasing $\ell$. For $\ell = 3$ almost all the motifs have the same counts, while for larger $\ell$ there are some motifs with very high counts (i.e., overrepresented) and some other motifs that are underrepresented. Overall the highest counts range from $10^4$ to $10^6$ from $\ell = 3$ up to $\ell = 6$. To understand if these counts increase only by chance, we performed a widely used statistical test (e.g., [11, 22]) by computing the $Z$-scores of the different motif counts under the following null model [31]. We generated 500 random networks by the timeline shuffling random model [11], which redistributes all the timestamps by fixing the directed projected static network. For each motif $M_i$, $i = 1, \dots, |\mathcal{M}(H,\ell)|$, we computed a $Z$-score that is defined as follows: let $C_{M_i}$ be the count of the motif in the original network and let $C^{j}_{M_i}$ be its count on the $j$-th random network, $j \in \{1, \dots, 500\}$. The $Z$-score is computed as

$$Z_{M_i} = \frac{C_{M_i} - \frac{1}{500}\sum_{j=1}^{500} C^{j}_{M_i}}{\mathrm{std}(C^{1}_{M_i}, \dots, C^{500}_{M_i})},$$

where $\mathrm{std}(\cdot)$ denotes the standard deviation. The results are in Fig. 6c, and they show that the counts in Fig. 6b are very significant and not due to random fluctuations (higher $Z$-scores indicate that such motif counts are significantly more frequent in $T$ than in the randomly permuted networks). Interestingly, the $Z$-scores in Figure 6c follow a similar law to the counts in Figure 6b, with the highest $Z$-scores increasing significantly every time $\ell$ increases. Notably the highest $Z$-scores of motifs with $\ell = 6$ are more than 3 orders of magnitude larger than the $Z$-scores of motifs with $\ell = 3$. (We discuss some of the significant motifs in Appendix E.)
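The $Z$-score defined above can be computed as in the following sketch. The function name `z_score` and the toy numbers are hypothetical; the text does not say whether the population or the sample standard deviation is used, so the population version (`statistics.pstdev`) is an assumption here.

```python
import statistics


def z_score(observed_count, null_counts):
    """Z-score of an observed motif count against counts on random networks."""
    mean = statistics.fmean(null_counts)
    # Assumption: population standard deviation; swap in statistics.stdev
    # for the sample version.
    std = statistics.pstdev(null_counts)
    if std == 0:
        raise ValueError("null counts have zero spread; Z-score undefined")
    return (observed_count - mean) / std


# Hypothetical toy data: one observed count and four null-model counts
# (the experiments above use 500 random networks).
print(z_score(30, [10, 12, 8, 10]))
```

With 500 null networks the difference between the two standard deviation conventions is negligible, but it matters for very small samples like the toy one above.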
6 CONCLUSIONS
In this work we introduced odeN, our algorithm to obtain rigorous, high-quality, probabilistic approximations of the counts of multiple motifs with the same static topology in large temporal
[Figure 6: plots omitted. (a) Normalized count of motifs M1 to M8 on each of the nine temporal network snapshots. (b) Motif counts (logarithmic scale, motifs sorted by count) for $\ell = 3, 4, 5, 6$. (c) $Z$-scores of the motif counts (logarithmic scale, motifs sorted by $Z$-score) for $\ell = 3, 4, 5, 6$.]

Figure 6: (6a): Counts of the motifs in $\mathcal{M}(H, 3)$ with $H$ a triangle on each temporal network corresponding to one snapshot in [46]. (6b): Counts on the full Facebook network with varying $\ell$. (6c): $Z$-scores of the motif counts with varying $\ell$.
networks. Our experimental evaluation shows that odeN makes it possible to analyze several motifs in large networks in a fraction of the time required by state-of-the-art approaches. We believe that odeN will be of practical interest in the analysis of temporal networks, complementing many of the existing tools and helping in understanding complex networked systems and their patterns.
There are several interesting directions for future research, including devising better edge probability distributions for odeN and choosing such a distribution based on the characteristics of the dataset, since different datasets can have very different temporal edge distributions (e.g., with skewed behaviours [40]) and, thus, there may not exist a unique distribution that is effective for all temporal networks. Another direction of future research is the derivation of improved bounds for the number of samples required by odeN, using for example statistical learning theory concepts, such as pseudodimensions or Rademacher averages.
ACKNOWLEDGMENTS
This work was supported, in part, by MIUR of Italy, under PRIN Project n. 20174LF3T8 AHeAD, and grant L. 232 (Dipartimenti di Eccellenza), and by the U. of Padova project "SID 2020: RATED-X".
REFERENCES
[1] Paolo Bajardi, Alain Barrat, Fabrizio Natale, Lara Savini, and Vittoria Colizza. 2011. Dynamical Patterns of Cattle Trade Movements. PLoS ONE 6, 5 (may 2011), e19869. https://doi.org/10.1371/journal.pone.0019869
[2] V. Batagelj and M. Zaversnik. 2003. An O(m) Algorithm for Cores Decomposition of Networks. Advances in Data Analysis and Classification, 2011. Volume 5, Number 2, 129-145 (Oct. 2003). arXiv:cs.DS/cs/0310049
[3] Jeffrey Baumes, Mark K. Goldberg, Mukkai S. Krishnamoorthy, Malik Magdon-Ismail, and Nathan Preston. 2005. Finding communities by clustering a graph into overlapping subgraphs. In AC 2005, Proceedings of the IADIS International Conference on Applied Computing, Algarve, Portugal, February 22-25, 2005, Volume 1, Nuno Guimarães and Pedro T. Isaías (Eds.). IADIS, 97–104.
[4] Caleb Belth, Xinyi Zheng, and Danai Koutra. 2020. Mining Persistent Activity in Continually Evolving Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM. https://doi.org/10.1145/3394486.3403136
[5] George Bennett. 1962. Probability Inequalities for the Sum of Independent Random Variables. J. Amer. Statist. Assoc. 57, 297 (mar 1962), 33–45. https://doi.org/10.1080/01621459.1962.10482149
[6] Hanjo D Boekhout, Walter A Kosters, and Frank W Takes. 2019. Efficiently counting complex multilayer temporal motifs in large-scale networks. Computational Social Networks 6, 1 (2019), 1–34.
[7] Marco Bressan, Stefano Leucci, and Alessandro Panconesi. 2019. Motivo. Proceedings of the VLDB Endowment 12, 11 (jul 2019), 1651–1663. https://doi.org/10.14778/3342263.3342640
[8] Matteo Ceccarello, Carlo Fantozzi, Andrea Pietracaprina, Geppino Pucci, and Fabio Vandin. 2017. Clustering uncertain graphs. Proceedings of the VLDB Endowment 11, 4 (dec 2017), 472–484. https://doi.org/10.1145/3186728.3164143
[9] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11. ACM Press. https://doi.org/10.1145/2020408.2020579
[10] Ying Ding. 2011. Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics 5, 1 (jan 2011), 187–203. https://doi.org/10.1016/j.joi.2010.10.008
[11] Laetitia Gauvin, Mathieu Génois, Márton Karsai, Mikko Kivelä, Taro Takaguchi, Eugenio Valdano, and Christian L. Vestergaard. 2018. Randomized reference models for temporal networks. (June 2018). arXiv:physics.soc-ph/1806.04032
[12] M. Girvan and M. E. J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (jun 2002), 7821–7826. https://doi.org/10.1073/pnas.122653799
[13] Saket Gurukar, Sayan Ranu, and Balaraman Ravindran. 2015. COMMIT. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM. https://doi.org/10.1145/2723372.2737791
[14] Wook-Shin Han, Jinsoo Lee, and Jeong-Hoon Lee. 2013. Turboiso: towards ultrafast and robust subgraph isomorphism search in large graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias (Eds.). ACM, 337–348. https://doi.org/10.1145/2463676.2465300
[15] Petter Holme and Jari Saramäki. 2012. Temporal networks. Physics Reports 519, 3 (oct 2012), 97–125. https://doi.org/10.1016/j.physrep.2012.03.001
[16] Petter Holme and Jari Saramäki (Eds.). 2019. Temporal Network Theory. Springer International Publishing. https://doi.org/10.1007/978-3-030-23495-9
[17] Y. Hulovatyy, H. Chen, and T. Milenković. 2015. Exploring the structure and function of temporal networks with dynamic graphlets. Bioinformatics 31, 12 (jun 2015), i171–i180. https://doi.org/10.1093/bioinformatics/btv227
[18] Ali Jazayeri and Christopher C Yang. 2020. Motif discovery algorithms in static and temporal networks: A survey. Journal of Complex Networks 8, 4 (aug 2020). https://doi.org/10.1093/comnet/cnaa031
[19] Alpár Jüttner and Péter Madarasi. 2018. VF2++ - An improved subgraph isomorphism algorithm. Discret. Appl. Math. 242 (2018), 69–81. https://doi.org/10.1016/j.dam.2018.02.018
[20] Chrysanthi Kosyfaki, Nikos Mamoulis, Evaggelia Pitoura, and Panayiotis Tsaparas. 2018. Flow Motifs in Interaction Networks. (Oct. 2018). arXiv:cs.SI/1810.08408
[21] Lauri Kovanen, Márton Karsai, Kimmo Kaski, János Kertész, and Jari Saramäki. 2011. Temporal motifs in time-dependent networks. Journal of Statistical Mechanics: Theory and Experiment 2011, 11 (nov 2011), P11005. https://doi.org/10.1088/1742-5468/2011/11/p11005
[22] L. Kovanen, K. Kaski, J. Kertesz, and J. Saramaki. 2013. Temporal motifs reveal homophily, gender-specific patterns, and group talk in call sequences. Proceedings of the National Academy of Sciences 110, 45 (oct 2013), 18070–18075. https://doi.org/10.1073/pnas.1307941110
[23] Rohit Kumar and Toon Calders. 2018. 2SCENT. Proceedings of the VLDB Endowment 11, 11 (jul 2018), 1441–1453. https://doi.org/10.14778/3236187.3269460
[24] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. 2006. Structure and evolution of online social networks. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '06. ACM Press. https://doi.org/10.1145/1150402.1150476
[25] Jinsoo Lee, Wook-Shin Han, Romans Kasperovics, and Jeong-Hoon Lee. 2012. An In-depth Comparison of Subgraph Isomorphism Algorithms in Graph Databases. Proc. VLDB Endow. 6, 2 (2012), 133–144. https://doi.org/10.14778/2535568.2448946
[26] Paul Liu, Austin R. Benson, and Moses Charikar. 2019. Sampling Methods for Counting Temporal Motifs. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM '19). Association for Computing Machinery, New York, NY, USA, 294–302. https://doi.org/10.1145/3289600.3290988
[27] Patrick Mackey, Katherine Porterfield, Erin Fitzhenry, Sutanay Choudhury, and George Chin Jr. 2018. A Chronological Edge-Driven Approach to Temporal Subgraph Isomorphism. (Jan. 2018). arXiv:cs.DS/1801.08098
[28] Naoki Masuda and Renaud Lambiotte. 2016. A Guide to Temporal Networks. WORLD SCIENTIFIC (EUROPE). https://doi.org/10.1142/q0033
[29] Tijana Milenković and Nataša Pržulj. 2008. Uncovering Biological Network Function via Graphlet Degree Signatures. Cancer Informatics 6 (jan 2008), CIN.S680. https://doi.org/10.4137/cin.s680
[30] R. Milo. 2002. Network Motifs: Simple Building Blocks of Complex Networks. Science 298, 5594 (oct 2002), 824–827. https://doi.org/10.1126/science.298.5594.824
[31] R. Milo. 2004. Superfamilies of Evolved and Designed Networks. Science 303, 5663 (mar 2004), 1538–1542. https://doi.org/10.1126/science.1089167
[32] Mark Newman. 2010. Networks. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199206650.001.0001
[33] Pietro Panzarasa, Tore Opsahl, and Kathleen M. Carley. 2009. Patterns and dynamics of users' behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology 60, 5 (may 2009), 911–932. https://doi.org/10.1002/asi.21015
[34] Ashwin Paranjape, Austin R Benson, and Jure Leskovec. 2017. Motifs in temporal networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 601–610.
[35] Noujan Pashanasangi and C. Seshadhri. 2019. Efficiently Counting Vertex Orbits of All 5-vertex Subgraphs, by EVOKE. CoRR abs/1911.10616 (2019). https://doi.org/10.1145/3336191.3371773 arXiv:1911.10616
[36] N. Przulj. 2007. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (jan 2007), e177–e183. https://doi.org/10.1093/bioinformatics/btl301
[37] Xuguang Ren and Junhu Wang. 2015. Exploiting Vertex Relationships in Speeding up Subgraph Isomorphism over Large Graphs. Proc. VLDB Endow. 8, 5 (2015), 617–628. https://doi.org/10.14778/2735479.2735493
[38] Pedro Ribeiro, Pedro Paredes, Miguel EP Silva, David Aparicio, and Fernando Silva. 2019. A survey on subgraph counting: concepts, algorithms and applications to network motifs and graphlets. arXiv preprint arXiv:1910.13011 (2019).
[39] Ryan A. Rossi, Nesreen K. Ahmed, Aldo Carranza, David Arbour, Anup Rao, Sungchul Kim, and Eunyee Koh. 2021. Heterogeneous Graphlets. ACM Transactions on Knowledge Discovery from Data 15, 1 (jan 2021), 1–43. https://doi.org/10.1145/3418773
[40] Ilie Sarpe and Fabio Vandin. 2021. PRESTO: Simple and Scalable Sampling Techniques for the Rigorous Approximation of Temporal Motif Counts. SIAM International Conference on Data Mining (2021). https://doi.org/10.1137/1.9781611976700.17
[41] Alice C. Schwarze and Mason A. Porter. 2020. Motifs for processes on networks. (July 2020). arXiv:physics.soc-ph/2007.07447
[42] Shai S. Shen-Orr, Ron Milo, Shmoolik Mangan, and Uri Alon. 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31, 1 (apr 2002), 64–68. https://doi.org/10.1038/ng881
[43] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009 (JMLR Proceedings), David A. Van Dyk and Max Welling (Eds.), Vol. 5. JMLR.org, 488–495. http://proceedings.mlr.press/v5/shervashidze09a.html
[44] Shixuan Sun, Xibo Sun, Yulin Che, Qiong Luo, and Bingsheng He. 2020. RapidMatch: a holistic approach to subgraph query processing. Proceedings of the VLDB Endowment 14 (2020), 176–188. https://doi.org/10.14778/3425879.3425888
[45] Kun Tu, Jian Li, Don Towsley, Dave Braines, and Liam D. Turner. 2019. gl2vec. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM. https://doi.org/10.1145/3341161.3342908
[46] Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM workshop on Online social networks - WOSN '09. ACM Press. https://doi.org/10.1145/1592665.1592675
[47] Jingjing Wang, Yanhao Wang, Wenjun Jiang, Yuchen Li, and Kian-Lee Tan. 2020. Efficient Sampling Algorithms for Approximate Temporal Motif Counting. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ACM. https://doi.org/10.1145/3340531.3411862
[48] Shuo Yu, Yufan Feng, Da Zhang, Hayat Dino Bedru, Bo Xu, and Feng Xia. 2020. Motif discovery in networks: A survey. Computer Science Review 37 (2020), 100267.
[49] Qiankun Zhao, Yuan Tian, Qi He, Nuria Oliver, Ruoming Jin, and Wang-Chien Lee. 2010. Communication motifs. In Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10. ACM Press. https://doi.org/10.1145/1871437.1871694
[50] Bo Zong, Xusheng Xiao, Zhichun Li, Zhenyu Wu, Zhiyun Qian, Xifeng Yan, Ambuj K. Singh, and Guofei Jiang. 2015. Behavior query discovery in system-generated temporal graphs. Proceedings of the VLDB Endowment 9, 4 (dec 2015), 240–251. https://doi.org/10.14778/2856318.2856320
Table 3: Notation table.

Symbol                                          Description
$T = (V, E)$                                    Temporal network
$n, m$                                          Number of nodes and temporal edges of $T$
$G_T$                                           Undirected projected static network of $T$
$M_i$, $i \in [1, |\mathcal{M}(H, \ell)|]$      Motifs in $\mathcal{M}(H, \ell)$
$k$                                             Nodes in the motifs
$\ell$                                          Edges of the motifs
$\delta$                                        Duration limit of $\delta$-instances
$M = (\mathcal{K}, \sigma)$                     Motif as pair (multigraph, ordering)
$\mathcal{U}(M, \delta)$                        Set of $\delta$-instances of $M$ from $T$
$C_M$                                           Number of $\delta$-instances of $M$ in $T$
$G_u[M]$                                        Undirected graph associated to $\mathcal{K}$
$\mathcal{M}(H, \ell)$                          Set of distinct motifs with $\ell$ edges s.t. $G_u[M_i] \cong H$ for all $M_i \in \mathcal{M}(H, \ell)$
$H$                                             Static undirected target template
$V_H, E_H$                                      Set of nodes and edges of the target $H$
$C_{M_i}(e)$                                    Number of $\delta$-instances containing $e \in G_T$
$s$                                             Number of samples collected by odeN
$X_e$                                           Indicator variable denoting if $e \in G_T$ is sampled
$p_e, p(e)$                                     Probability of sampling edge $e \in G_T$
$X^j_{M_i}$                                     Estimate of motif $M_i$ obtained at odeN's $j$-th step
$C'_{M_i}$                                      Final odeN's estimate of $C_{M_i}$
$\varepsilon, \eta$                             Quality and confidence parameters
๐                                               Number of threads in odeN parallel
A NOTATION
The notation used throughout this work is summarized in Table 3.
B ODEN'S SUBROUTINES
B.1 FastUpdate and its Subroutines
We now discuss the FastUpdate routine, which is called in line 10 of Algorithm 1 to keep $C_{estimates}$ updated; the subroutine is shown in Algorithm 2. $C_{estimates}$ maintains the weighted counts of the motif sequences identified so far. To keep it updated, we first count the $\delta$-instances of $M_i$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, within the sampled temporal network $S$, and then rescale each count appropriately. The routine has two main features: i) an efficient adaptation of the algorithm by Paranjape et al. [34], and ii) an efficient encoding, within integers, of the sequences representing the motif occurrences, which allows for fast operations (comparisons to distinguish between different motifs, and fast updates to the data structures).
We now discuss how FastUpdate counts all the $\delta$-instances in $S$. First observe that we already know that $G_S \cong H$, and that $S$ can be rewritten as $S = \langle ((x_1, y_1), t_1), \dots, ((x_{|S|}, y_{|S|}), t_{|S|}) \rangle$. We first compute the set $E_{unique} = \{(x, y) : ((x, y), t) \in S\}$ and assign a unique identifier to each edge in $E_{unique}$ (lines 4-5). Then we run an efficient implementation of the algorithm by Paranjape et al. [34], which computes through dynamic programming the counts of all the subsequences of edges $(x, y)$ s.t. $(x, y, t) \in S$ that have length $\ell$ and occur within $\delta$ time (lines 6-10). Algorithm 3 shows our implementation of the subroutines needed to execute lines 6-10 (see the original paper [34] for full details and correctness). Intuitively, lines 6-10 of Algorithm 2 scan the input sequence $S$ linearly, maintaining in memory information about the edges within $\delta$ time from the one being processed. Through this scan the algorithm updates $counts$ so that it stores the counts of the sequences having at most $\ell$ edges over the set $E_{unique}$. When the loop in line 11 starts, $counts$ contains the counts of all the subsequences of $\ell$ edges from $S$ over the set $E_{unique}$. We highlight that we assign to each static edge of $S$ an ID of $b$ bits. This allows us to encode each sequence of $i = 1, \dots, \ell$ edges occurring within $\delta$ time in an integer of $i \cdot b$ bits through bitwise operations ("<<" denotes left shift and "|" denotes bitwise or), which allows for fast updates to $counts$.

Algorithm 2: FastUpdate
Input: δ, S, C_estimates, p(e_s), H
1   E_unique ← {(x, y) : (x, y, t) ∈ S}
2   edges ← {}, E_inv ← [], counts ← {}, start ← 1
3   id ← 0
4   foreach e ∈ E_unique do
5       E_inv[id] ← e, edges{e} ← id++
6   foreach (x, y, t) ∈ S do
7       while t − t_start > δ do
8           Decrement(edges[(x_start, y_start)], counts)
9           start ← start + 1
10      Increment(edges[(x, y)], counts)
11  foreach key q of length ℓ ∈ counts.keys do
12      M′ ← ReconstructMotif(q, E_inv)
13      if G_u[M′] ≅ H then
14          M_i ← EncodeAndClassifyMotif(M′)
15          X_{M_i} ← counts{q} / (|E_H| p(e_s))
16          X′_{M_i} ← C_estimates{M_i}
17          C_estimates{M_i} ← X′_{M_i} + X_{M_i}
To obtain the estimates of motifs $M_1, \dots, M_{|\mathcal{M}(H, \ell)|}$, for each sequence of $\ell$ edges identified we reconstruct in line 12 the corresponding graph, and thus the motif $M'$ that the sequence is an instance of (the multigraph is given by the edge IDs, while the ordering of the edges is given by the sequence itself). We then check whether $G_u[M']$ is isomorphic to $H$ (constraint (1) from Problem 1). If so, we encode the motif in a sequence of bits holding $2\ell$ node IDs, which allows us to classify it (line 14), i.e., to distinguish between distinct motifs (recall that we want $M_i$ and $M_j$ to be distinct motifs for $i \neq j$). The encoding is computed as follows: given $M' = \langle (x_1, y_1), \dots, (x_\ell, y_\ell) \rangle$, we assign to each node an incremental ID according to its first appearance in $M'$, and we obtain the final encoding as $\langle \mathrm{ID}(x_1)\mathrm{ID}(y_1) \dots \mathrm{ID}(x_\ell)\mathrm{ID}(y_\ell) \rangle$. It is easily seen that two motifs $M_1, M_2$ share the same encoding iff they are the same motif (i.e., they are not distinct), as desired, given that the motifs are directed and the definition of distinct motifs accounts for the order in which edges appear. We provide an example below.
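The first-appearance encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_motif` is hypothetical, and the sketch returns the IDs as a tuple of integers, whereas the text above packs them into a bit sequence.

```python
def encode_motif(edge_sequence):
    """Return the first-appearance encoding of a time-ordered list of directed edges."""
    node_ids = {}   # node -> incremental ID, assigned at its first appearance
    encoding = []
    for x, y in edge_sequence:
        for node in (x, y):
            if node not in node_ids:
                node_ids[node] = len(node_ids) + 1
            encoding.append(node_ids[node])
    return tuple(encoding)


# The sequence <(y, x), (y, z), (x, z)> gets ID(y) = 1, ID(x) = 2, ID(z) = 3.
print(encode_motif([("y", "x"), ("y", "z"), ("x", "z")]))  # (1, 2, 1, 3, 2, 3)
# Relabelling the nodes does not change the encoding ...
print(encode_motif([("a", "b"), ("a", "c"), ("b", "c")]))  # (1, 2, 1, 3, 2, 3)
# ... while reversing the direction of the last edge does.
print(encode_motif([("a", "b"), ("a", "c"), ("c", "b")]))  # (1, 2, 1, 3, 3, 2)
```

Returning a tuple rather than a concatenated string keeps the encoding unambiguous even if a motif had ten or more nodes.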
Example B.1. Let us consider $M_1$, $M_2$, and $M_3$ from Figure 2. Consider $M_1 = \langle (y, x), (y, z), (x, z) \rangle$; by assigning an incremental ID to each node according to its first appearance in $M_1$ we get ID($y$) = 1, ID($x$) = 2, ID($z$) = 3, so the final encoding of $M_1$ is ⟨121323⟩. Following a similar procedure, the encoding of $M_2$ is ⟨121323⟩, while the encoding of $M_3$ is ⟨121332⟩. The encodings of $M_1$ and $M_2$ coincide, while they differ from that of $M_3$, as desired.

Algorithm 3: Subroutines of FastUpdate
Function Increment(id, counts)
1   foreach q ∈ SortByDecLength(counts.keys) do
2       if q.length < ℓ then
3           κ ← (q << b) | id
4           counts[κ] ← counts[κ] + counts[q]
5   counts[id] ← counts[id] + 1
Function Decrement(id, counts)
6   counts[id] ← counts[id] − 1
7   foreach q ∈ SortByIncLength(counts.keys) do
8       if q.length < ℓ − 1 then
9           κ ← (id << (q.length · b)) | q
10          counts[κ] ← counts[κ] − counts[q]
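The sliding-window scan of lines 6-10 of Algorithm 2, together with the Increment and Decrement subroutines of Algorithm 3, can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `count_sequences` is hypothetical, keys are stored as `(length, code)` pairs so that the length of a bit-encoded sequence stays unambiguous, and the input is assumed to be sorted by time with distinct timestamps and to be used with at least two edges per sequence.

```python
from collections import defaultdict


def count_sequences(temporal_edges, delta, ell):
    """Count the ordered sequences of ell edges whose timestamps span at most delta.

    temporal_edges: list of (x, y, t) sorted by t, with distinct timestamps.
    ell: number of edges per sequence (assumed >= 2).
    Returns a dict mapping each sequence of static edges to its count.
    """
    unique = sorted({(x, y) for x, y, _ in temporal_edges})
    edge_id = {e: i for i, e in enumerate(unique)}
    bits = max(1, (len(unique) - 1).bit_length())  # bits per edge ID
    counts = defaultdict(int)                      # (length, code) -> count

    def increment(eid):
        # Longer prefixes first, so each prefix is extended by eid only once.
        for length, code in sorted(counts, reverse=True):
            if length < ell:
                counts[(length + 1, (code << bits) | eid)] += counts[(length, code)]
        counts[(1, eid)] += 1

    def decrement(eid):
        # Remove the short sequences that start with the edge leaving the window.
        counts[(1, eid)] -= 1
        for length, code in sorted(counts):
            if length < ell - 1:
                counts[(length + 1, (eid << (length * bits)) | code)] -= counts[(length, code)]

    start = 0
    for x, y, t in temporal_edges:
        while t - temporal_edges[start][2] > delta:
            xs, ys, _ = temporal_edges[start]
            decrement(edge_id[(xs, ys)])
            start += 1
        increment(edge_id[(x, y)])

    # Decode the integer keys of length ell back into sequences of static edges.
    mask = (1 << bits) - 1
    result = {}
    for (length, code), c in counts.items():
        if length == ell and c > 0:
            ids = [(code >> (bits * k)) & mask for k in reversed(range(ell))]
            result[tuple(unique[i] for i in ids)] = c
    return result


# Only the first three edges lie within delta = 5 of each other,
# so each of the three ordered pairs among them is counted once.
print(count_sequences([(1, 2, 1), (2, 3, 2), (1, 2, 3), (2, 3, 10)], delta=5, ell=2))
```

Counts of sequences shorter than `ell` always describe the current window, while counts of length `ell` only accumulate, which is why the decrement step never touches them.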
After this step we update the global data structure $C_{estimates}$ by adding to each motif's estimate its count in $S$ divided by $|E_H| \, p(e_s)$, where $p(e_s)$ is the probability of sampling edge $e_s$ (lines 15-17); we prove in Section 4.4 that this is the correct weighting scheme to output an unbiased estimate.
B.2 Exact Subgraph Enumeration
In this section we briefly discuss the algorithms for subgraph enumeration that can be adapted to our Algorithm 1 (in line 5). Unfortunately, we cannot easily use the algorithms for extracting $k$-node motifs mentioned in Section 3 as is, since they do not provide the local enumeration step required by odeN.

In fact, the problem most related to the exact enumeration we require is the labelled query graph matching problem. In this setting one is given a labelled query graph $H = (V_H, E_H, L_H)$ and a labelled graph $G = (V, E, L)$ (where labels can be colours, for example, see [25]); $L$ may be defined on either edges or vertices. The problem requires finding all the subgraphs $h' \subseteq G$ isomorphic to $H$, which may be either induced or not but must preserve the labelling (i.e., if $(x, y) \in E$ is mapped to $(x', y') \in E_H$ then $(L(x), L(y)) = (L_H(x'), L_H(y'))$). To explain how we take advantage of the algorithms developed for this problem, we need to introduce the following definitions (adapted from [35]).

Definition B.2. Let $H = (V_H, E_H)$ be an undirected graph. An automorphism is a bijection $f : V_H \to V_H$ such that $(x, y) \in E_H$ iff $(f(x), f(y)) \in E_H$.

Definition B.3. Let $H = (V_H, E_H)$ be an undirected graph. We say that two edges $e = (x, y), e' = (x', y') \in E_H$ belong to the same edge-orbit iff there exists an automorphism that maps $e$ to $e'$.
In order to adapt the algorithms for the labelled query graph matching problem we proceed as follows:
1) colour the nodes of $G_T$ with a fixed colour (say red);
2) once $e_s \in G_T$ has been sampled, colour its endpoint nodes with a different colour (say blue), and call $L_{G_T}$ the map resulting from these two steps;
3) compute the different edge-orbits of the pattern $H$ (by enumerating the automorphisms of $H$) and, for each edge-orbit, choose an edge, colour its endpoint nodes with the same colour assigned to the endpoints of $e_s$, and keep on the other nodes the same colour as in $G_T$; call this map $L_H$;
4) run an algorithm for the labelled query graph matching problem with graph $G_T = (V_T, E_T, L_{G_T})$ and pattern $H = (V_H, E_H, L_H)$;
5) the desired subgraphs ($\mathcal{H}$) are the union of the results of the enumeration steps over the different edge-orbits.
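Step 3 relies on the edge-orbits of the pattern. As a minimal sketch, assuming a brute-force enumeration of the automorphisms (reasonable only because the pattern has few nodes), they could be computed as follows; the function name `edge_orbits` is hypothetical and not part of odeN.

```python
from itertools import permutations


def edge_orbits(nodes, edges):
    """Group the edges of a small undirected graph into edge-orbits."""
    edge_set = {frozenset(e) for e in edges}
    # Brute force: a permutation of the nodes is an automorphism
    # iff it maps the edge set onto itself.
    automorphisms = []
    for image in permutations(nodes):
        f = dict(zip(nodes, image))
        if {frozenset((f[x], f[y])) for x, y in edges} == edge_set:
            automorphisms.append(f)
    orbits, seen = [], set()
    for e in sorted(edge_set, key=sorted):
        if e not in seen:
            # The orbit of e is the set of its images under all automorphisms.
            orbit = {frozenset(f[v] for v in e) for f in automorphisms}
            seen |= orbit
            orbits.append(orbit)
    return orbits


# A triangle has a single edge-orbit; a path on four nodes has two
# (the two outer edges, and the middle edge).
print(len(edge_orbits([1, 2, 3], [(1, 2), (2, 3), (1, 3)])))          # 1
print(len(edge_orbits([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])))       # 2
```

For a triangle or a square all edges fall in one orbit, so a single labelled query suffices in step 4.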
C IMPLEMENTATION DETAILS
In this section we provide additional implementation details, complementing the description of Section 5.1.

In our implementation, we used two main structures. First, an adjacency list$^3$, which allows querying for an edge between $u, v \in V$ in $O(\log(\min(d_u, d_v)))$ time. Second, a hashmap that stores, for each static directed edge, the timestamps of the temporal edges that map onto it, leading to $O(1)$ complexity for retrieving the timestamps of a static edge in $G_T$. These structures are initialized in $O(1)$ time per processed temporal edge while loading the dataset, given the number of nodes $n$. Many state-of-the-art algorithms exist for the local enumeration of motifs (e.g., [14, 37, 44]); in our code we provide a general algorithm based on VF2++ [19]. However, instead of using the general procedure described in Section B.2, in our tests we relied on a simple algorithm that locally enumerates the subgraphs isomorphic to $H$ that contain an edge $e = \{x, y\}$: for triangles the algorithm runs in $O(\min(d_x, d_y) \log(n))$ time, while when $H$ is a square it runs in $O(\min(d_x, d_y) \, d_{max} \log(n))$ time, with $d_{max}$ the maximum degree of a node in $G_T$.
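The two structures and the local triangle enumeration can be illustrated with a short Python sketch. It is not the SNAP-based implementation: the names `load` and `triangles_containing` are hypothetical, and it assumes hash sets for adjacency, so edge queries take expected constant time instead of the logarithmic lookups described above.

```python
from collections import defaultdict


def load(temporal_edges):
    """Build the static undirected adjacency and the per-edge timestamp map."""
    adj = defaultdict(set)          # node -> neighbours in the undirected projection
    timestamps = defaultdict(list)  # directed static edge (x, y) -> its timestamps
    for x, y, t in temporal_edges:
        adj[x].add(y)
        adj[y].add(x)
        timestamps[(x, y)].append(t)
    return adj, timestamps


def triangles_containing(adj, x, y):
    """Return the nodes w such that {x, y, w} is a triangle of the static graph."""
    # Scan the neighbours of the endpoint with the smaller degree.
    small, large = (x, y) if len(adj[x]) <= len(adj[y]) else (y, x)
    return sorted(w for w in adj[small] if w != large and w in adj[large])


adj, timestamps = load([(1, 2, 10), (2, 3, 11), (3, 1, 12), (2, 1, 13),
                        (2, 4, 14), (4, 1, 15), (4, 5, 16)])
print(triangles_containing(adj, 1, 2))  # [3, 4]
print(timestamps[(1, 2)], timestamps[(2, 1)])  # [10] [13]
```

Scanning only the neighbourhood of the lower-degree endpoint is what yields the $\min(d_x, d_y)$ factor in the running time.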
D PROOFS
In this section we provide the proofs not included in the main text. First we recall that $C_{M_i}(e)$ is the number of $\delta$-instances of motif $M_i$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, from $T$ whose undirected projected static network contains edge $e \in G_T$, i.e.,

$$C_{M_i}(e) = \sum_{h \subseteq G_T,\ h \cong H \,:\, e \in h} |\mathcal{U}(h, M_i)|, \quad e \in G_T,$$

where $\mathcal{U}(h, M_i)$ is the set of $\delta$-instances of motif $M_i$ whose static projected graph is $h \subseteq G_T$. Based on the above, it is simple to see that the following holds for each motif $M_i$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$:

$$\sum_{e \in G_T} C_{M_i}(e) = |E_H| \, C_{M_i}.$$

This relation is the key to proving the unbiasedness of the estimates provided by odeN, as we show next.
Proof of Lemma 4.1. First let us consider the expectation of $X^j_{M_i}$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, $j = 1, \dots, s$:

$$\mathbb{E}[X^j_{M_i}] = \mathbb{E}\left[\frac{1}{|E_H|} \sum_{e \in G_T} \frac{C_{M_i}(e) X_e}{p_e}\right] = \frac{1}{|E_H|} \sum_{e \in G_T} \frac{C_{M_i}(e)\, \mathbb{E}[X_e]}{p_e} = C_{M_i},$$

where we used the linearity of expectation and the facts that $\mathbb{E}[X_e] = p_e$, $e \in G_T$, and $\sum_{e \in G_T} C_{M_i}(e) = |E_H| \, C_{M_i}$; thus $X^j_{M_i}$, $i = 1, \dots, |\mathcal{M}(H, \ell)|$, $j = 1, \dots, s$, are unbiased estimates of $C_{M_i}$. Combining this result with the definition of $C'_{M_i}$ we obtain

$$\mathbb{E}[C'_{M_i}] = \mathbb{E}\left[\frac{1}{s} \sum_{j=1}^{s} X^j_{M_i}\right] = \frac{1}{s} \sum_{j=1}^{s} \mathbb{E}[X^j_{M_i}] = \frac{s \, C_{M_i}}{s} = C_{M_i}$$

by the linearity of expectation. □

$^3$ We used the one provided by SNAP: https://github.com/snap-stanford/snap; more efficient implementations can also be adopted, improving the overall running times.
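The unbiasedness argument can be checked exactly on a toy input. The sketch below is an illustration under stated assumptions: the per-edge counts and sampling probabilities are invented (two triangles sharing an edge, with 4 and 2 $\delta$-instances respectively), and the function names are hypothetical. It computes the exact mean and variance of the single-sample estimator $C_{M_i}(e) / (|E_H| \, p_e)$, and compares the variance with the bound $C_{M_i}^2 \left(\frac{m}{\alpha |E_H|} - 1\right)$ of Lemma 4.2, using the tightest valid value $m / \alpha = 1 / \min_e p_e$.

```python
from fractions import Fraction


def estimator_moments(inst_per_edge, probs, num_template_edges):
    """Exact mean and variance of the estimate obtained from one sampled edge."""
    mean, second = Fraction(0), Fraction(0)
    for e, p in probs.items():
        value = Fraction(inst_per_edge[e]) / (num_template_edges * p)
        mean += p * value
        second += p * value * value
    return mean, second - mean * mean


def variance_bound(count, probs, num_template_edges):
    """Bound C^2 (m / (alpha |E_H|) - 1), with m / alpha set to 1 / min_e p_e."""
    m_over_alpha = 1 / min(probs.values())
    return count * count * (m_over_alpha / num_template_edges - 1)


# Hypothetical toy data: triangles {1,2,3} and {2,3,4} share the edge (2, 3).
inst_per_edge = {(1, 2): 4, (1, 3): 4, (2, 3): 6, (2, 4): 2, (3, 4): 2}
probs = {(1, 2): Fraction(1, 10), (1, 3): Fraction(2, 10), (2, 3): Fraction(4, 10),
         (2, 4): Fraction(2, 10), (3, 4): Fraction(1, 10)}

mean, var = estimator_moments(inst_per_edge, probs, num_template_edges=3)
print(mean, var, variance_bound(6, probs, 3))  # 6 22/3 84
```

The mean equals the true count 6 for any choice of positive probabilities, while the variance depends on them; here it is well below the worst-case bound.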
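Proof of Lemma 4.2. We need to bound the variance of the estimate $C'_{M_i}$. First we rewrite the estimator:

$$C'_{M_i} = \frac{1}{s} \sum_{j=1}^{s} \frac{1}{|E_H|} \sum_{e \in G_T} \frac{C_{M_i}(e) X_e}{p_e} = \frac{1}{s} \sum_{j=1}^{s} X^j_{M_i}.$$

Since the $s$ variables $X^j_{M_i}$, $j \in [1, s]$, are independent (edges are drawn independently at each iteration of the outer for loop in Algorithm 1) and identically distributed, it holds $\mathrm{var}(C'_{M_i}) = \mathrm{var}\left(\frac{1}{s} \sum_{j=1}^{s} X^j_{M_i}\right) = \frac{1}{s} \mathrm{var}(X_{M_i})$; we thus only need to compute the variance of the variable $X_{M_i}$. Let us recall that $\mathrm{var}(X_{M_i}) = \mathbb{E}[X^2_{M_i}] - \mathbb{E}[X_{M_i}]^2 = \mathbb{E}[X^2_{M_i}] - C^2_{M_i}$ by the previous lemma. We will now bound $\mathbb{E}[X^2_{M_i}]$:

$$\mathbb{E}[X^2_{M_i}] = \mathbb{E}\left[\frac{1}{|E_H|^2} \sum_{e_1 \in G_T} \sum_{e_2 \in G_T} \frac{C_{M_i}(e_1) C_{M_i}(e_2) X_{e_1} X_{e_2}}{p_{e_1} p_{e_2}}\right] = \frac{1}{|E_H|^2} \sum_{e_2 \in G_T} C^2_{M_i}(e_2) \frac{1}{p_{e_2}} \le \frac{1}{|E_H|^2} \sum_{e_2 \in G_T} C^2_{M_i}(e_2) \frac{m}{\alpha}$$

$$= \frac{m}{\alpha |E_H|^2} \sum_{e_2 \in G_T} C^2_{M_i}(e_2) \overset{(1.)}{\le} \frac{m}{\alpha |E_H|^2} |E_H| \, C^2_{M_i} = \frac{m \, C^2_{M_i}}{\alpha |E_H|},$$

where we used the linearity of expectation, the fact that $\mathbb{E}[X_{e_1} X_{e_2}] = p_{e_1}$ if $e_1 = e_2$ and is 0 otherwise, and the lower bound $p_e \ge \alpha / m$, $\forall e \in G_T$, on the minimum sampling probability, for $\alpha$ defined as in Section 4.4. In (1.) we used the fact that $C_{M_i}(e) = \lambda_e C_{M_i}$, $e \in G_T$, $\lambda_e \in [0, 1]$; then

$$\sum_{e_2 \in G_T} C^2_{M_i}(e_2) = \sum_{e_2 \in G_T} \lambda^2_{e_2} C^2_{M_i} \le C^2_{M_i} \sum_{e_2 \in G_T} \lambda_{e_2} = |E_H| \, C^2_{M_i},$$

since $\lambda_{e_2} \in [0, 1]$ and, further, $\sum_{e \in G_T} \lambda_e = |E_H|$ by $\sum_{e \in G_T} \lambda_e C_{M_i} = |E_H| \, C_{M_i}$. Thus the variance of $X_{M_i}$ is bounded by

$$\mathrm{var}(X_{M_i}) \le \frac{m \, C^2_{M_i}}{\alpha |E_H|} - C^2_{M_i} = C^2_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right).$$

Combining everything together we obtain $\mathrm{var}(C'_{M_i}) \le \frac{C^2_{M_i}}{s} \left(\frac{m}{\alpha |E_H|} - 1\right)$, concluding the proof. □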
Proof of Theorem 4.3. Let us fix $M_i$, $i \in [1, |\mathcal{M}(H, \ell)|]$. We first show a bound on the probability $\mathbb{P}[|C'_{M_i} - C_{M_i}| \ge \varepsilon C_{M_i}]$. We want to derive this bound through the application of Bennett's inequality to the summation $\frac{1}{s} \sum_{j=1}^{s} X^j_{M_i}$. We already know that $\mathbb{E}[X^j_{M_i}] = C_{M_i}$ and $\mathbb{E}[(X^j_{M_i} - C_{M_i})^2] \le C^2_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right) = v_j^2$ for $j = 1, \dots, s$. Moreover, it holds

$$X^j_{M_i} = \frac{1}{|E_H|} \sum_{e \in G_T} \frac{C_{M_i}(e) X_e}{p_e} \le \frac{m}{\alpha |E_H|} \sum_{e \in G_T} C_{M_i}(e) X_e \le \frac{m \, C_{M_i}}{\alpha |E_H|},$$

where we used $p_e \ge \alpha / m$, the fact that exactly one edge is sampled at each iteration (so that $\sum_{e \in G_T} X_e = 1$), and $C_{M_i}(e) \le C_{M_i}$.

As argued by [40], Bennett's inequality holds even if we only have an upper bound on the variance of the estimates. Therefore, let us compute the quantities needed to apply Bennett's bound (see [40] for the statement). Clearly $B = C_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right)$, combining what we already showed with the unbiasedness of $X^j_{M_i}$; moreover, the variance term in Bennett's inequality can be bounded by $v_j^2$, since the bound $v_j^2$ is the same for each $j \in [1, s]$. Then, with $t = \varepsilon C_{M_i}$,

$$\frac{v_j^2}{B^2} = \frac{C^2_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right)}{C^2_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right)^2} = \frac{1}{\frac{m}{\alpha |E_H|} - 1}$$

and also

$$\frac{t B}{v_j^2} = \frac{\varepsilon C_{M_i} \, C_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right)}{C^2_{M_i} \left(\frac{m}{\alpha |E_H|} - 1\right)} = \varepsilon.$$

Combining everything together, by Bennett's inequality we obtain

$$\mathbb{P}\left(\left|\frac{1}{s} \sum_{j=1}^{s} X^j_{M_i} - C_{M_i}\right| \ge \varepsilon C_{M_i}\right) \le 2 \exp\left(-\frac{s}{\frac{m}{\alpha |E_H|} - 1} \, h(\varepsilon)\right). \qquad (1)$$

Now, let $A_i$ = "$|C'_{M_i} - C_{M_i}| \ge \varepsilon C_{M_i}$", $i = 1, \dots, |\mathcal{M}(H, \ell)|$; namely, $A_i$ is the event that the estimate of motif $M_i$ is at distance at least $\varepsilon C_{M_i}$ from $C_{M_i}$. We already showed that inequality (1) holds for $\mathbb{P}[A_i]$ for an arbitrary $A_i$, so

$$\mathbb{P}\left(\bigcup_{i=1}^{|\mathcal{M}(H, \ell)|} A_i\right) \le \sum_{i=1}^{|\mathcal{M}(H, \ell)|} \mathbb{P}[A_i] \le |\mathcal{M}(H, \ell)| \; 2 \exp\left(-\frac{s}{\frac{m}{\alpha |E_H|} - 1} \, h(\varepsilon)\right) \le \eta,$$

combining the union bound and the choice of $s$ as in the statement. □
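Solving the last inequality for $s$ gives $s \ge \left(\frac{m}{\alpha |E_H|} - 1\right) \frac{\ln(2 |\mathcal{M}(H, \ell)| / \eta)}{h(\varepsilon)}$. The sketch below evaluates this expression numerically. It is only an illustration: it assumes that $h$ is the usual function of Bennett's inequality, $h(x) = (1 + x)\ln(1 + x) - x$ (its definition is not restated in this appendix), the function names are hypothetical, and the parameter values in the usage lines are invented.

```python
import math


def bennett_h(x):
    """h(x) = (1 + x) ln(1 + x) - x, assumed to be the h of Bennett's inequality."""
    return (1 + x) * math.log(1 + x) - x


def failure_bound(s, m, alpha, num_template_edges, num_motifs, eps):
    """Right-hand side of the union bound: |M(H, l)| * 2 * exp(-s h(eps) / scale)."""
    scale = m / (alpha * num_template_edges) - 1
    return 2 * num_motifs * math.exp(-s * bennett_h(eps) / scale)


def sample_size(m, alpha, num_template_edges, num_motifs, eps, eta):
    """Smallest integer s for which failure_bound(...) is at most eta."""
    scale = m / (alpha * num_template_edges) - 1  # requires m > alpha * |E_H|
    return math.ceil(scale * math.log(2 * num_motifs / eta) / bennett_h(eps))


# Hypothetical parameters: 10^6 temporal edges, alpha = 1, triangle template,
# 8 motifs, eps = 0.5, eta = 0.1.
s = sample_size(10**6, 1.0, 3, 8, eps=0.5, eta=0.1)
print(s, failure_bound(s, 10**6, 1.0, 3, 8, eps=0.5))
```

The sample size grows only logarithmically with the number of motifs and with $1/\eta$, which is what makes approximating all motifs of a template simultaneously affordable.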
E CASE STUDY - MOTIF ANALYSIS
Figure 7: Graphical representation of the motifs in Figure (6a).
Motifs on the Snapshots of the Facebook Network. Thanks to the analysis of Section 5.4 we are able to characterize the user behaviour on the Facebook network of wall posts by looking at different motifs (topology and orderings) and their counts. We first show in Figure 7 the motifs corresponding to the labels of Figure (6a) in Section 5.4. Then, let $H = \{v_1, v_2, v_3\}$ be a triangle. The most frequent motifs (i.e., those with the highest normalized counts on each snapshot) seem to share a common pattern: a first node ($v_3$), after posting on $v_1$'s (or $v_2$'s) wall, triggers $v_1$ (or $v_2$) to post on the remaining node's wall, with $v_1$ also posting on that node's wall to close the triangle, as captured by motifs $M_3$, $M_7$ and $M_8$. Observe that by identifying the users that most often act as $v_3$ in the occurrences of such frequent motifs, one is able to identify, for example, the nodes most engaged in spreading information over the Facebook network in a short period of time (recall that we set $\delta$ to one day). Not surprisingly, motif $M_5$ is the least frequent one, since its occurrences require node $v_2$ to post on $v_3$'s wall before receiving the post from $v_2$, and therefore without being "triggered" by that node, which received the post from $v_3$. Interestingly, without considering the order in which the edges of such patterns occur we would not be able to distinguish between the most frequent motifs and the least frequent ones, since, for example, $M_4$ and $M_5$ have the same static directed graph structure but very different counts on the different snapshots of the Facebook network.
Figure 8: Graphical representation of the 4 motifs with highest (top) and lowest (bottom) $Z$-scores in Figure (6c) for $\ell = 6$. For each motif we report the exact count (which we computed for this representation) and, in brackets, the relative error of the approximation obtained with odeN; we additionally report the $Z$-score of each motif as obtained in Section 5.4 (i.e., by using only odeN).

Panel         Exact count (relative error)    Z-score
1 (top)       1188894 (7.1%)                  1496494
2 (top)       1072282 (7.1%)                  1215602
3 (top)       1018769 (7.1%)                  1215069
4 (top)       1110062 (7.1%)                  1165630
5 (bottom)    8825 (2.5%)                     666
6 (bottom)    6170 (2.4%)                     800
7 (bottom)    5535 (2.5%)                     907
8 (bottom)    3890 (2.3%)                     908
Motifs with varying $\ell$ - Frequent vs Infrequent. In this section we briefly discuss, and show visually, the motifs with the highest and lowest $Z$-scores obtained in Section 5.4 on the Facebook wall post network for $\ell = 6$. The motifs are reported in Figure 8, with the top 4 motifs ranked by $Z$-score at the top and the 4 motifs with the lowest $Z$-scores at the bottom. Note that the top 4 motifs share a similar structure, both temporal and topological. Interestingly, in the original paper [46] the authors noted that very few pairs of nodes exchanged more than 5 messages (the median being 2). The most frequent temporal motifs seem to involve a pair of highly active nodes (which exchanged many messages between them, i.e., more than 4) and a third node that is reached by this pair. Unfortunately, we do not have the original posts, so we cannot better understand the information captured by such frequent motifs, but it is striking that the top 4 motifs all share similar properties, especially in the orderings of their edges. Additionally, triangles involving nodes that are pairwise very active seem to be the rarest type of interaction, as captured by the 4 motifs with the lowest $Z$-scores, reported in Figure 8 (bottom).