joint search by social and spatial proximity · 1 joint search by social and spatial proximity...

1

Joint Search by Social and Spatial ProximityKyriakos Mouratidis, Jing Li, Yu Tang, and Nikos Mamoulis

Abstract—The diffusion of social networks introduces new challenges and opportunities for advanced services, especially sowith their ongoing addition of location-based features. We show how applications like company and friend recommendation couldsignificantly benefit from incorporating social and spatial proximity, and study a query type that captures these two-fold semantics.We develop highly scalable algorithms for its processing, and enhance them with elaborate optimizations. Finally, we use realsocial network data to empirically verify the efficiency and efficacy of our solutions.

F

1 INTRODUCTIONThe emergence of social networks (SNs) brings anew era in the organization and browsing of onlineinformation. Manufacturers and service providers arebecoming increasingly interested in exploiting pop-ular SNs to promote their products and services.Recently, Microsoft’s search engine (Bing) has inte-grated social information from Facebook to returnweb pages that are popular among the friends of users[1]. Studies like [2] have investigated the influencebetween users of SNs and quantified the probability ofa user performing an action (e.g., purchase a product)after his/her friend(s) did. Current text search systemshave also incorporated social influence into queryprocessing by taking into account friend relationshipsfor the ranking of documents/objects [3], [4].

On the other hand, location-based services are anindispensable feature in SNs. This fact becomes in-creasingly prominent as the number of users whoaccess SN applications on mobile devices is growingsteadily. The most popular SN, Facebook, includes aset of location-based features, while others (such asFoursquare) are explicitly based on the managementof user locations. Motivated by this trend, we investi-gate the integration of social and spatial informationin a single query.

Consider a service like badoo.com, where a user u1who is looking for company to have lunch or watcha movie, may browse the profiles of nearby users andinvite them to join him/her. Existing systems applya traditional k-nearest neighbor query [5], potentiallywith some binary conditions (regarding age, sex, etc),to provide u1 with the profiles of users in the vicinity.While recommended users are indeed near u1 geo-graphically, his/her true preferences of companionswould be better captured if SN information was also

• K. Mouratidis is with the School of Information Systems, SingaporeManagement University.E-mail: [email protected]

• J. Li, Y. Tang, and N. Mamoulis are with the Department of ComputerScience, University of Hong Kong.E-mail: [email protected], [email protected], [email protected]

u1

u2

u3

u4

u5

u1

u2

u

u4

u

(a) Spatial domain

u2

u1

u3

u4

u5

u1

u2

u

u4

u

(b) Social network

Fig. 1. Motivation example

taken into account. Assume, for example, that theusers’ Euclidean coordinates and social connectionsare as shown in Figures 1(a) and 1(b) respectively.The closest user to u1 in the spatial domain is u5.However, u4 might be a better match because helocates only slightly farther (compared to u5) but is“closer” in the social network. Conversely, the closestuser socially (u2) may be too far spatially. Therefore,to provide meaningful recommendations, both socialproximity and spatial proximity should be incorporatedinto the search.

In this paper we propose and study the social andspatial ranking query (SSRQ). SSRQ reports the top-kusers in the SN based on a ranking function that in-corporates social and spatial distance from the queryuser. Our key contributions are:

• We conduct the first study on a joint search bysocial and spatial user proximity.

• We propose a suite of processing methodologies,including a highly scalable and robust approachthat relies on indexing and social summaries.

• We equip the latter with sophisticated optimiza-tions, based on computation sharing, interme-diate result caching and an accuracy-enhancingstrategy that complements social summaries inproximity estimation.

• We use real SN data to experimentally evaluateour algorithms.

2

2 RELATED WORK

2.1 Social Influence and Proximity MeasuresThe influence between two users captures the proba-bility that one user follows the other’s actions. Theinfluence information stemming from SNs can im-prove marketing strategies, for instance, by recom-mending products to users based on the purchasesof their contacts [6]. Existing work focuses primarilyon finding the top-k most influential users from agraph of influence scores [2] or learning the influencescores based on users’ past propagation of actions[7], [8]. Recently, [9] proposed an approach to directlyobtain the top-k most influential users from historicaldata, without the intermediate step of constructing aninfluence graph.

Many measures are proposed for computing theinfluence between two users (vertices) in a socialgraph. Simple measures rely either on the shortestpath distance or on vertex neighborhoods – i.e., thesocial proximity of two users may be defined as theinverse of their shortest path distance in the SN [3], [4]or as the number of their common friends [10]. Sophis-ticated proximity measures involve a combination ofinfinite sums over the ensemble of all paths betweentwo vertices and their common neighbors (e.g., Katzmeasure, rooted PageRank, escape probability); [11] isan extensive survey on this subject.

2.2 Multiple-domain SearchObjects associated with multiple domain attributeshave attracted considerable research interest. Webpages with geographic information and Flickr photoswith geo-tags require query processing on both thespatial and textual domains. Spatial keyword search [12]retrieves objects that are not only close to users inphysical space but which also match a set of key-words. [13] proposed the IR-tree data structure, whichextends the R-tree with inverted files. This index canbe used to efficiently support novel types of spatio-textual queries (e.g., [14]). A similar data structureappears in [15] to support a reverse form of spatialkeyword search.

Location-based SNs, such as Foursquare, Whrrl andGowalla, record users’ location history (e.g., check-ins). In [16], Scellato et al. observe that in a location-based social network, 30% of new friends made areplace friends, i.e., individuals who have visited thesame places. Hence, they build a supervised learningframework which predicts new friend links based onthe number of common contacts and common check-ins. In [17], Ye et al. observe that if a place is visitedby friends of a user, this place is probably of higherinterest to the user. Also, if a user has many nearbycheck-ins with another, one’s preferred places mayalso be appealing to the other. These two factors,combined with potential geographic influence amongthe places themselves, are used to make location

recommendations to SN users. The result in [17] isa set of places of interest, while in our problem theresult comprises k other SN users.

Both aforementioned studies measure social prox-imity between users as the number of their commonfriends. In our case, social influence is extended tomore than two hops in the SN and defined accordingto shortest path distance in the social graph. Also, inthe spatial domain, both [17] and [16] consider histor-ical check-ins, whereas we only consider the currentlocations of users, targeting present-time applications.

Armenatzoglou et al. [18] propose a general frame-work for queries over geo-social network data. Theydecouple the storage of the social network data fromthe geographic data; queries are evaluated as in adistributed database. The system supports searchingfor crisp structural patterns that appear in the socialgraph, which are spatially ranked. For example, the“nearest friends” query finds the k friends of a useru who are closest to a given location q. Our problemand solutions are different, since we integrate rankingat both social and spatial dimensions.

In [19], Cho et al. analyze the location data of SNusers and notice that an individual’s periodic move-ments which may seem random, are actually likely tocorrelate with the movements of his/her social con-tacts. This leads to a model of human mobility basedon social links. Another related application is proxim-ity detection in SNs. The goal is to continuously reportto each user who, among his/her friends, are within acertain distance from the user’s current location. Theproblem was introduced in [20] in the context of aP2P network. Subsequent approaches include deadreckoning [21] and adaptive safe region techniques[22], as well as constraint detection formulations [23].Proximity detection considers only immediate friendsof users and a fixed radius around their locations. Incontrast, in SSRQ the result may include users at anunpredictable number of social hops and at variablespatial distances from the query user.

Bao et al. [24] propose a location-aware news-feed system. This enables users to browse spatiallyrelated messages from their friends or registered newssources. Unlike SSRQ (which selects users), that sys-tem filters news-feeds/messages. Also, it considersonly immediate (1-hop) friends and news sources.

2.3 Shortest Path and Distance Computation

A traditional type of graph search is shortest pathcomputation from a source to a target vertex. Dijk-stra’s algorithm starts from the source and iterativelyexpands the network using a priority heap, until thetarget is reached. To prune the search space and directthe graph expansion, A∗ algorithm prioritizes thevisiting order of nodes by estimating their distanceto the target. [25] introduces the landmark approachwhich selects a set of vertices as landmarks in the

3

graph and pre-computes distances from every vertexto each landmark. Given two vertices and their dis-tances to a specific landmark, the triangular inequalityproduces a lower bound on the distance between thetwo vertices. Using multiple landmarks, we derivean equal number of lower bounds, among which thetightest can be used to enhance A∗ search. [26] extendsthis approach using a hierarchy of landmarks.

An approach to compute approximate distancesbetween vertices in a graph is to construct oracleswhich provide constant query time while having lin-ear space requirements. Theoretical results on distanceoracles appear in [27], [28]. Sarma et al. [29] proposelandmark based oracles which guarantee the theo-retical result of [28] and experimentally outperform[27] and [28]. Distance oracles are not effective in ourproblem, which involves distance computations in asocial graph, because we require exact distances, notapproximate. Moreover, the theoretical error boundsof distance oracles are too loose and they are knownto be poorly suited for social networks [29]. In [30],Cheng et al. propose a 2-hop cover data structurewhich supports efficient distance queries for a generalgraph with O(|V ||E|1/2) space. It is inapplicable to oursetting because, for the density and scale of real SNs,its space requirements are prohibitive.

2.4 Top-k Processing

Our problem is related to top-k processing. A top-k query specifies a preference function f over them attributes of a dataset and retrieves the k tuplesthat minimize (or maximize) this function. A thoroughsurvey of top-k processing techniques is given in [31].Here we survey the threshold algorithm (TA) and itsvariants [32] due to their higher relevance to SSRQ.

Assume that there are m repositories (sorted lists),one for each of the data attributes. The repository forthe i-th attribute keeps all tuple identifiers sorted inascending order of the i-th attribute. Two types of ac-cess are possible on each repository, sorted and random.Sorted access allows serial retrieval of elements (i.e.,pairs of tuple identifier and its i-th attribute value) byiterative “get-next” operations, starting from the firstelement in the list, then moving to the second, etc. Onthe other hand, random access allows retrieving theattribute of any tuple in a repository directly.

TA requires that the preference function f is in-creasingly monotone on all m attributes. It probes (viasorted access) the repositories in a round-robin fash-ion. For each element pulled from a list, it computesthe f value of the corresponding tuple by fetching itsremaining m−1 attributes from the other repositoriesvia random accesses. It maintains an interim result ofthe top-k tuples seen so far. It also keeps a thresholdτ computed as the value of f over the last attributevalues pulled from each of the m repositories. Es-sentially, τ is a lower bound on the f value of any

non-encountered tuple further down the lists. TAterminates when τ is no smaller than any of the fvalues in the interim result (which is then reported asthe final result).

TA assumes that random access is possible. NRA isthe no random access version of the algorithm, whereonly sorted access is available on the repositories.The repositories are probed in round-robin order. Forevery encountered tuple, NRA maintains a lower andan upper bound of its f value. The lower bound (theupper bound) is computed by replacing the unseenattributes of the tuple with the last value pulled fromthe corresponding repository (the maximum possiblevalue in the corresponding repository). NRA termi-nates when the k smallest upper bounds among seentuples are no greater than the lower bound of anyother encountered tuple.

Another variant of TA is the combined algorithm(CA). TA assumes that random and sorted access havethe same cost. CA, instead, considers that randomaccess is costlier than sorted. It proceeds similarly toNRA, but it periodically performs one random access.Specifically, for every κ sorted accesses, one randomis made; κ is set to the ratio of random access cost tosorted access cost.

Bruno et al. [33] consider top-k queries in web-accessible databases. The data consists of a sortedlist and a set of random access lists. Since randomaccess is expensive, the authors propose that when anobject is encountered in the sorted list, only a selectedsubset of the random access lists is probed to refinethe object’s f value bounds.

3 PROBLEM SETTINGThe problem setting includes a set of users U andan undirected social graph G = (V,E). Each userui ∈ U has spatial coordinates in Euclidean space. Theusers may move dynamically; our system/query onlyconsiders their current (i.e., last reported) location.The social graph G includes a vertex vi ∈ V for everyuser ui ∈ U . We establish the convention that vertexvi corresponds to user ui, i.e., the mapping is impliedby the subscripts. We do not unify the two notationsto help distinguish between spatial and social context.Every edge (vi, vj) in E represents a friend relation-ship between users ui, uj and is associated with anumerical weight that indicates the strength of therelationship – the smaller the weight, the stronger thefriendship. In previous work, given the topology ofa social network, the weights are mined from pastpropagation of user actions [7], [8]. We make noassumption about the weights other than them beingpositive numbers. We consider that G is undirected,but our work extends to directed graphs easily.

3.1 Ranking FunctionWe define spatial proximity between users ui and uj astheir Euclidean distance d(ui, uj). On the other hand,

4

TABLE 1Frequently Used Notation

Notation ExplanationG(V,E) graph G with vertex set V and edge set Ed(ui, uj) Euclidean distance between ui and uj

p(vi, vj) graph distance between vi and vjα preference param. for social/spatial proximityk number of users to be reported by the queryR the result set of the queryfk the k-th (i.e., maximum) f value in RM number of landmarks usedmij graph distance between vi and j-th landmarks partitioning granularity in grid index of AIS

we measure social proximity between vertices vi andvj based on their shortest path distance in G, anddenote it as p(vi, vj). We use this formulation because(i) it is simple and (ii) it is demonstrated to effectivelycapture social proximity/influence [3], [4].

Following common practice in combining measure-ments from different domains, we apply a linear func-tion over the (normalized) social and spatial proximityto rank users [13], [34], [14]. Specifically, given a queryuser uq , the ranking of ui ∈ U is determined byfunction f as:

f(uq, ui) = α · p(vq, vi) + (1− α) · d(uq, ui) (1)

where α is a (user- or application-specified) real num-ber between 0 and 1 that determines the relativesignificance of proximity in the two domains. Thesmaller the value of f for a user, the more suit-able he/she is for uq . Note that our definition (andimplementation) uses normalized social and spatialproximities, by dividing d(uq, ui) and p(vq, vi) withthe maximum pairwise distance in Euclidean spaceand in the social graph respectively. For simplicity,we omit the denominators from the presentation.

3.2 Query Formulation

In this work, we propose the social and spatial rankingquery (SSRQ) where a user uq (or an application)provides parameter α and asks for the top-k userswho minimize function f , with respect to his/hercurrent location and social links. Formally:

Definition 1 (SSRQ): Given a set of users U , theunderlying social graph G, a query user uq ∈ U , anda preference parameter α, SSRQ returns the k users uin U − {uq} with the smallest f(uq, u) values.

That is, for every u′ /∈ R and u′ 6= uq it holds that

f(uq, u′) ≥ fk

where fk is the maximum (i.e., least preferable) rank-ing value f across all users in the result R of the query.In Table 1 we summarize the frequently used notation.

4 PRELIMINARY SOLUTIONS

We first present two simple solutions, namely SocialFirst Approach and Spatial First Approach; then we hy-bridize them into an elaborate solution called TwofoldSearch Approach.

4.1 One Domain Approach

Social First Approach. A preliminary approach forSSRQ processing is the Social First Algorithm (SFA).The main idea in SFA is to consider users in increas-ing social distance from the query user. To achievethis, SFA expands the social graph around vq usingDijkstra’s algorithm. For every encountered user (i.e.,for every vertex popped from Dijkstra’s search heap),it also computes the Euclidean distance from uq and,in turn, the f value. The first k users are placed inthe interim result R. For any subsequent user u, ifhis/her f value is smaller than the current fk (thek-th largest f score in R), he/she enters the interimresult (and evicts from it the user with the maximumf value). The termination condition of SFA is based onthe fact that the social distance of every un-processeduser is lower-bounded by that of the last vertex en-countered by Dijkstra’s algorithm. Therefore, if v is thelast vertex popped from Dijkstra’s heap, expressionθ = α · p(vq, v) lower-bounds the f value of everynon-encountered user. Hence, set R is guaranteed toinclude the correct result when θ ≥ fk, i.e., it is safefor SFA to terminate.

Spatial First Approach. Spatial First Approach (SPA)is another preliminary solution. It processes users inincreasing spatial distance from the query user. Forthis purpose, SPA uses an incremental nearest neighbor(NN) search in the Euclidean space. To efficientlyperform this search, a regular grid index is built onthe user locations and a branch-and-bound algorithmis used to retrieve the NNs; this combination is themost suitable for dynamic spatial data kept in mainmemory [35]. For every encountered user, SPA directlycalculates their social distance to the query user andinserts them into the interim result R if necessary,similar to SFA. If u is the last NN retrieved, expressionθ = (1−α)·d(uq, u) lower-bounds the f value of everynon-encountered user. Hence, set R is guaranteed tobe correct when θ ≥ fk.

Although SFA and SPA are intuitive and simple,they suffer from a major drawback. They are unawareof either the spatial distance or the social distance ofthe un-processed users, i.e., the value of θ relies solelyon either social or spatial information and, therefore,may be too loose. This shortcoming motivates thealgorithm described next.

4.2 Twofold Search Approach

In this section we describe an SSRQ processing ap-proach which performs concurrently a social and a

5

spatial search, thus termed twofold search algorithm(TSA). This twofold search equips TSA with two lowerbounds for un-processed users (one on their socialand the other on their spatial distance from uq), thusderiving a tighter overall bound on f and alleviatingthe main drawback of SFA and SPA. The social searcharound uq is performed by Dijkstra’s algorithm, sim-ilar to SFA. The second search is an incremental NNretrieval in the Euclidean space, similar to SPA. TSAexecutes in two phases.

In the first phase, the two searches proceed simulta-neously, by iteratively reporting the next closest userin their respective domain, and alternating with eachother in a round-robin manner. Whenever the socialsearch is invoked, the encountered user is evaluated,i.e., its f value is computed and checked againstthe current fk for potential inclusion into the interimresult R. Note that evaluation is fast in this case, be-cause Euclidean distance from uq is trivial to compute.In contrast, when the spatial search is invoked, theencountered user is either (i) ignored if he/she hasalready been encountered by the social search or (ii)placed in a candidate set Q. Set Q keeps users thatare only partially evaluated, because computing theirsocial distance from q requires expensive processing.The spatial and social search, due to their incrementalnature, can be seen as sorted lists (repositories). Inthis aspect, the first phase of TSA follows a hybridparadigm between TA and NRA (covered in Section2.4) in that both sorted and random access is possiblein the spatial domain, while only sorted accesses aremade in the social dimension.

Regarding the termination condition of the firstphase, let tp be the social distance of the last userencountered by the social search, and td be the Eu-clidean distance of the last reported spatial NN. Thef value of every user that is not encountered byany of the two searches is lower-bounded by valueθ = α · tp + (1 − α) · td. The first phase of TSA stopswhen θ ≥ fk. From the definition of bound θ it followsthat:

Lemma 1: The final query result may only includeusers that are either already in the interim result R orin the candidate set Q derived from the first phase ofTSA.

Based on Lemma 1, the second phase of TSA ig-nores any non-encountered users and aims at evalu-ating (or disqualifying) candidates in Q. The f valueof every candidate is lower-bounded by expressionθ′ = α · tp + (1 − α) · t′d where t′d is the Euclideandistance of the closest candidate to uq (in the spatialdomain), while tp is as previously defined. If θ′ ≥ fk,TSA can terminate. It is obvious that continuing theNN search in the spatial domain cannot affect θ′ andwould therefore be a waste of computations. Hence,in the second phase of the algorithm only the social searchcontinues.

In the second phase, whenever a vertex is output

u0

u1

u2

u3

u4

u5

u6

u7

u8

u9

u10

u11

uq

uq

u1

u2

u3

u1

u2

u3

u4

u4

Reuse this

path for u4

uq

u1

u2

u3

u4

uq

u1

u2

u3

u4

uq

u3

u2

u1

u4

oq

o2o1

o3o4

o5o6

o7

o8

o911

1

2

1

22 3

5

7

landmark4

oq

o1

o2

o3o4

o5

41

611

5

C1

C2C3

C4

5

4

1v1

v2v3

v4v5

1 1

3

1

vq

v6

v7

4

13

1 v8

d pu1 0.1 0.1u7 0.1 0.6u8 0.6 0.2u6 0.7 0.5u5 0.7 0.2u4 0.8 0.1u3 0.9 0.3u2 0.9 0.4

Fig. 2. TSA example

by the Dijkstra search, we perform the followinginvestigation. If the vertex does not belong to Q, it isignored (by Lemma 1 it cannot be part of the result).If the vertex is in Q, it is removed from Q, it isevaluated and (should its f value be smaller than fk)it is included into R. In either case, θ′ is updated toreflect the new tp. TSA terminates when θ′ ≥ fk.

Algorithm 1 outlines TSA. Lines 7-8 include animportant detail. In the first phase, it is possible thata candidate is encountered by Euclidean NN searchand subsequently discovered by social search too. Inthis case, it must be removed from Q, since it isfully evaluated. Leaving such candidates in Q wouldunnecessarily burden the second phase.

Algorithm 1 TSA(G, α, k, uq)//Input: G: the social graph// α: the preference parameter// k: the requested number of users// uq : the query user

1: Initialize Dijkstra and incremental NN search at uq

2: Initialize result set R = {}; candidate set Q = {}3: while Dijkstra’s heap is non-empty do4: Pop next vertex v; en-heap un-visited adj. vertices5: if f(uq, u) < fk then6: Update result R, value fk, and value tp7: if u ∈ Q then8: Remove v from Q

9: Fetch the next nearest neighbor unn of uq

10: if unn was not encountered by Dijkstra search then11: Insert unn into the candidate set Q12: Update value td to d(uq, unn)13: Set θ = α · tp + (1− α) · td14: if θ ≥ fk then Break . End of First Phase15: Set value t′d = minu∈Q d(uq, u)16: Set θ′ = α · tp + (1− α) · t′d17: while Q is not empty and θ′ < fk do18: Fetch the next vertex v from social search19: if u ∈ Q then . u is the user corresponding to v20: if f(uq, u) < fk then21: Update result R and value fk22: Remove u from Q and update value t′d23: Update value tp24: Update θ′ = α · tp + (1− α) · t′d25: Return R

TSA Example: Figure 2 illustrates eight usersu1, u2, ..., u8 and uq . It also includes a table with theEuclidean and social distances of these eight usersfrom uq , sorted on the former. The order of users in

6

ascending social distance is v1, v4, v8, v5, v3, v2, v6, v7.The figure shows only the subgraph of G that isrelated to our example of processing an SSRQ querywith k = 2 and α = 0.5.

TSA first accesses u1 in the social domain with fvalue 0.1, and places it into the interim result, i.e.,R = {u1}. Euclidean NN search fetches u1, whichis ignored because it was previously fully evaluated.TSA then discovers u4 (in the social domain) with fvalue 0.45 and sets R = {u1, u4}. Next, it encountersu7 in the Euclidean domain and inserts it into Q.The social search then fetches u8 with score 0.4 andreplaces u4 in R. The Euclidean search also retrievesu8, which is ignored. At this stage, tp = 0.2 andtd = 0.6, yielding θ = 0.4. On the other hand, fk = 0.4which is no larger than θ, and therefore the first phaseculminates.

The second phase starts with Q = {u7}. Currently,the lower bound θ′ (determined by the Euclideandistance of u7 and tp) is 0.15, i.e., smaller than fk.TSA continues the social search and iteratively visitsnew vertices until either u7 is found (and evaluated)or tp increases enough so that θ′ ≥ fk. In our example,the algorithm terminates when u7 is encountered bysocial search, replacing u8 in the result. TSA reportsR = {u1, u7}.

TSA with Quick Combine: Quick Combine [31] is apopular alternative to round-robin probing for rankedsearch. This heuristic decides which search (social orspatial) to probe next based on (i) an estimate of howrapidly the distances increase in each domain, and (ii)how large the preference coefficient (α and (1−α)) oneach domain is. The version of TSA that utilizes QuickCombine in its first phase is denoted as TSA-QC.

TSA with Landmarks: An enhancement to TSA ispossible if used in conjunction with the landmarkapproach. Specifically, in a pre-processing stage, anumber of vertices in G are chosen as landmarksusing the selection technique in [25] and their dis-tances from every other vertex are computed andrecorded. Before the second phase of TSA starts, weuse the landmark information to derive a lower boundof p(uq, u) for every candidate u ∈ Q. In turn, thisproduces a lower bound of the candidate’s f value(the Euclidean distance of u is already known). If thatlower bound is no smaller than fk, the candidate iseliminated from Q.

5 AGGREGATE INDEX SEARCH

Although TSA and its landmark-aided version utilizetighter bounds than SFA/SPA, they may still visitnumerous users who are close in the social graphbut far away in the spatial domain, and vice versa.The reason is that the two searches are oblivious ofeach other, and may be accessing completely differentusers. This motivates a new approach, called aggregate

index search (AIS), which summarizes both social andspatial information into the same index, and runs aunified search on it.

The index is a spatial access method that addi-tionally incorporates (aggregate) social information.Given an index node, we devise a mechanism thatprovides a lower bound for the f values of all un-derlying users. This bound is used in a branch-and-bound process to quickly identify users that are closein both domains. The approach incorporates a novelaggregation of landmark information to provide socialsummaries at index nodes, as well as optimized graphaccess techniques and adaptations of landmarks, tai-lored to the characteristics of SSRQ. We first describethe core of the approach, followed by optimizationsin its submodules.

5.1 Aggregate Index and Query ProcessingThe AIS index is a spatial data structure with em-bedded social information. It could use any spatialaccess method as a basis (e.g., an R-tree, a k-d-tree,etc). However, we choose a multi-level regular gridbecause (i) it supports fast location updates [36],[37] and (ii) it facilitates our branch-and-bound SSRQsearch (the latter being the reason we prefer it over asingle-level grid). Each index node is parent to s × snodes in the immediately lower level, where s isan integer parameter that determines the partitioninggranularity into child nodes. The lowest level containsleaf cells. Each leaf cell C holds the users that lie insideits spatial extent. Figure 3 illustrates an internal indexnode which is parent to s× s leaf cells, in an examplewhere s = 2. Note that the multi-level grid does nothave to be a tree, i.e., it does not necessarily have aroot. We may instead keep only a certain number ofits lowest levels1.

The social summaries kept in the index rely onlandmark information. AIS requires that a set oflandmarks is used and that each vertex vi ∈ Vis associated with a vector including its distancesfrom every landmark. Assuming that there are Mlandmarks, we denote the distance between vertex viand the j-th landmark as mij . The social summarykept with each cell consists of two vectors, m andm, both of length M . Consider first vector m. Its j-thelement is symbolized as m[j] and is the maximumpath distance between any user in cell C and the j-th landmark. Formally, m[j] = maxvi∈C mij . Similarly,the j-th element of vector m indicates the minimumpath distance between any user in C and the j-thlandmark, i.e., m[j] = minvi∈C mij . This informationis propagated upwards, setting the social summariesof internal index nodes according to the full set ofusers they cover.

1. This is the case in our experiments, where keeping the lowesttwo levels from a three-level hierarchy generally yields favorableperformance.

7

internal node

leaf cell

contains a set of

users

s = 2

Fig. 3. Internal and leaf cells in AIS index

To enable a branch-and-bound search in the indexwe need to derive a lower bound on the f valuesof users in a (internal or leaf) cell C. We first de-fine a lower bound on the spatial domain. Given aquery user uq , we denote as d(uq, C) the minimumEuclidean distance between uq and any point in C.If uq is inside C, then d(uq, C) = 0. In all othercases, d(uq, C) equals the distance between uq and theclosest point on the boundary of C. For example, inFigure 4(a) the minimum distance between u1 andthe illustrated cell is determined by the horizontalprojection line shown dashed. On the other hand,d(u2, C) equals the length of the diagonal dashed line.

Regarding the social domain, we show how theaggregate landmark information (vectors m and m)can be used to provide a lower bound on p(vq, vi) forevery vi ∈ C. Essentially, our technique extends thelandmark approach (which was originally proposedfor individual vertices) to groups of vertices, i.e., to allvertices under a cell C – to the best of our knowledge,it the first time this is attempted in the literature.

Lemma 2: Given a cell C with social vectors m andm, the following formula provides a lower boundfor the shortest path distance between any user in Cand the query user uq :

p(uq, C) = max1≤j≤M

m[j]−mqj if mqj < m[j]mqj − m[j] if mqj > m[j]0 otherwise

(2)Proof: Consider the j-th landmark and assume

that mqj < m[j]. For every ui ∈ C the triangularinequality suggests that p(vq, vi) ≥ |mij −mqj |. Sincemqj < m[j] ⇒ mqj < mij , the inequality becomesp(vq, vi) ≥ (mij −mqj). By definition, m[j] ≤ mij andthus p(vq, vi) ≥ (m[j]−mqj). As the second part of theinequality is constant for every ui ∈ C, we deducethat minui∈C p(vq, vi) ≥ (m[j] − mqj). The case formqj > m[j] is symmetric. On the other hand, whenm[j] ≤ mqj ≤ m[j], we can derive no lower boundbased on the j-th landmark. Finally, we may use themaximum (i.e., tightest) lower bound derived fromany landmark as a lower bound for minui∈C p(vq, vi).

Figure 4 illustrates a cell C containing three usersand the underlying social graph. Vertex v6 is chosen asthe single landmark (M = 1), and the graph distances

l0 = (2, 7)l1 = (1, 6)l2 = (2, 7)l3 = (1, 6)l4 = (0, 5)l5 = (1, 4)

l6 = (2, 3)l7 = (3, 3)l8 = (3, 2)l9 = (4, 2)l10= (4, 1)l11= (5, 0)

u4 u11 u4 u11u1

u3u4

u8 u9

u10

u0

u2

u5

u11

u6

u7

u0

u1

u2

u3

u4

u5

u6

u7

u8

u9

u10

u11

u1

u3u4

u8 u9

u10u1

u3u4

u8 u9

u10

+

(0,1)l

(4,6)l

(0,5)l

(1,6)l

(3,1)l

(4,2)l

u3

u4

u5

u1(uq)

uq

u3

u2

u9

ps = 4

u5

u1u6

(0,1)l d = 1

vq

v4

v3

v5

v9

2

v10

v6

1 35

23

1

landmark

v1

v2

v11

1

2

4

v7

v8

5

5

1 X 2 3 4

1

2

3

4

Y

u2

(a) Lower bound of f

oq

o2o1

o3o4

o5

o7

o8

o911

1

2

1

22 3

5

7

landmark4

oq

o1

o4 o5

o7

o8

o9

11

1

2

1

2

2

31

landmark

v7

v1v2 v3

v4v5

v6

12

(b) Social graph

Fig. 4. AIS bound example

of v3, v4, v5 are 4, 3, and 1 respectively. Hence, theaggregate information (social summary) kept for thiscell is m = 4 and m = 1. By Formula 2, we can directlyderive a lower bound of the social distance betweenany user in C and vq ≡ v1 (without accessing thespecific social or landmark information of the usersin C), i.e., p(v1, C) = 1, which in this example is astight as it would be if the exact landmark informationof individual users was accessed.

Combing the lower bound d(uq, C) for Euclideandistance and p(vq, C) for graph distance, we derive alower bound of the f value for any user in an (internalor leaf) cell C.

Theorem 1: Given a cell C with social vectors m andm, the following formula provides a lower bound forthe f value of any user in C:

MINF(uq, C) = α · p(vq, C) + (1− α) · d(uq, C) (3)

Proof: From Lemma 2 and the definition ofd(uq, C) it follows that f(uq, ui) ≥ MINF(uq, C) foreach ui ∈ C.

Theorem 1 and metric MINF pave the way forthe AIS processing algorithm. The search starts fromthe top level of the index. All cells in that level arepushed into a min-heap H with keys equal to theirMINF values. The head of the heap is iterativelypopped. Depending on the type of the popped itemwe distinguish three cases:• If the item is an internal index node, we push intoH all its child nodes with their individual MINFvalues as keys.

• If the item is a leaf cell C, we push into H allthe users ui ∈ C with key equal to α · p(vq, vi) +(1 − α) · d(uq, ui), where p(vq, vi) is the lowerbound of social distance p(vq, vi) derived from thelandmark information of vi.

• If the item is a user ui, we compute its exact socialdistance from vq (using a submodule described inSection 5.2) and update the interim result R if itsf -value is lower than fk.

The algorithm terminates when the head of the heaphas a key larger than or equal to the current fk.Algorithm 2 summarizes the process.

AIS benefits from combining spatial and social in-formation in the same index and promptly identifiesusers that lie nearby uq in both domains. In particular,

8

Algorithm 2 AIS(G, α, k, uq)//Input: G: the social graph// α: the preference parameter// k: the requested number of users// uq : the query user

1: Initialize an empty min-heap H2: Push into H all top-level index nodes with MINF as key3: while H is not empty and head’s key is less than fk do4: Pop the head item of H5: if popped item is an internal index node then6: for each child C of the node do7: Push C into H with key MINF(uq, C)

8: else if popped item is a leaf cell C then9: for each user u ∈ C do

10: Push u into H (key α·p(vq, v)+(1−α)·d(uq, u))11: else if popped item is a user u then12: Call a submodule to compute p(vq, v)13: if f(uq, u) < fk then14: Update result R and value fk15: Return R

it effectively eliminates nodes, cells and users thatare only close in the Euclidean space using the socialsummaries. On the other hand, for users that are onlyclose in the social graph, it avoids eagerly evaluatingthem.

The aggregate index supports efficient location up-dates. When a user ui moves, the update is dealt withas a deletion in the old cell and an insertion in thenew one2. We first remove ui from the user list ofthe old cell and update the cell’s social summaries– if a component in m or m is due to a landmarkdistance of vi, the component is recomputed over theremaining users in the cell. Regarding insertion intothe new cell, ui is added to the cell’s user list and thelandmark distances of vi are compared against vectorsm and m. If, say, the j-th landmark distance of vi islarger than the corresponding component of m, thelatter is set to mij . Symmetrically, if mij < m[j] weupdate m[j] to mij . Should there be an update in thesocial summary of either the old or the new cell ofui, it may recursively propagate to upper level nodesin a similar manner. Our index design is primarilyconcerned with location updates, as the positions ofSN users change much more frequently/dynamicallythan the topology of the network. To deal with thelatter (i.e., updates in G) batching could be used inconjunction with dynamic shortest path algorithms,so that landmark information can be incrementallymaintained [38], [39].

An important remark regards a key principle indesigning the index for AIS. The index, as describedabove, partitions the user set according to Euclideancoordinates. Since social summaries are vectors, itis possible to partition the user set (and thus forman index) in the combined social-spatial space. We

2. Note that if the user moves within his/her current cell, wesimply update his/her coordinates; no index maintenance is nec-essary.

attempted this approach with little success. We ob-served that when a space partitioning method is ap-plied to index the combined space, dead space (emptypartitions) tends to cripple performance. On the otherhand, data partitioning indices cannot effectively bal-ance the relative significance of the two domains (intheir bulk-loading and splitting mechanism) withoutprior knowledge of α and also lead to oblong boxesthat compromise performance. Finally, this combined-space approach (be it with a space or data partitioningindex) suffers from the dimensionality curse, needingto cope with M+2 dimensions. This imposes a seriouslimitation on the number of landmarks used.

5.2 Graph Search with Computation Sharing

In this section we describe how AIS computes socialdistances for users u popped from its search heap H ,i.e., we elaborate on Line 12 of Algorithm 2. First,we decide on the processing paradigm to derive thegraph distance. Next, we propose two approaches thatenable sharing computations (i.e., reusing informa-tion) among the different calls of the submodule forthe various evaluated users.

Let u be the user to be evaluated in Line 12, and vbe the corresponding vertex in the social graph. AISassumes that landmark information is available forall v ∈ V in order to build its index. We utilize thislandmark information to accelerate the computationof p(vq, v) too. That is, as described in Section 2.3, anA∗ search is applicable – the algorithm proceeds likeDijkstra, but en-heaps encountered vertices with a keyincremented by a (landmark-derived) underestimateof their distance to the target vertex. This tends tonarrow down the search area of the algorithm. Tofurther enhance performance, instead of a straight-forward A∗ execution from vq to the target vertex v,we follow the bidirectional search paradigm [25]. Theidea in this paradigm is to concurrently execute twoA∗ searches: one from vq to v (called forward search)and another from v to vq (reverse search). When thetwo searches meet, a complete path is derived, whichprovides a preliminary value for p(vq, v). This valuedoes not necessarily correspond to the shortest pathbut facilitates tightening the search; i.e., if the forwardor reverse search de-heaps a vertex with key largerthan or equal to this distance, the latter can be safelyoutput as the actual graph distance.

The issue is that the above technique is aimedfor vertex-to-vertex computations. In AIS instead, weneed to perform multiple graph distance calculationsfrom the same source vq to different target vertices.Directly applying the bidirectional approach wouldperform overlapping searches, i.e., it would unneces-sarily repeat part of the work multiple times. Considerfor instance Figure 5. Assume that in two consec-utive executions of Line 12 in Algorithm 2 we areto obtain the graph distances from vq to vertices v11

9

vq

1

2

1v1

forward search

1

1

2

1

1

12

5

3

v2

v3

v4 v5v6

v7

v8

v9 v10 v11

Fig. 5. Path caching

and v6 respectively. After the first bidirectional search(between vq and v11), the forward search accessesall vertices inside the dashed boundary. If anotherbidirectional search is applied between vq and v6,forward search starts from scratch and all verticesinside the boundary (i.e., v2, v3, v4) are visited again.

This observation motivates the idea to share compu-tations among different graph distance computations,and therefore save processing time. Before presentingspecific techniques to achieve this goal, we muststress that (unlike [25]) our bidirectional approachdoes not use A∗ in both directions. Specifically, whilethe reverse search is a landmark-based A∗ process,for the forward search we employ a plain Dijkstrasearch, without any aid from landmarks. The reasonwill become clear shortly.

We describe two complementary computation shar-ing approaches. The first is conceptually simple.

Distance caching: If the target vertex v was visitedby forward search previously, its exact distance hasalready been computed and can be reported directly.Continuing the example in Figure 5, if v4 happens tobe the next target vertex, its distance from vq is alreadyknown because the forward search between vq and v11has previously visited it (the distance of any vertexpopped from the Dijkstra heap is immediately de-rived). Similarly, if v belongs to a previously reportedshortest path, its distance from vq is also readilyavailable (when a vertex belongs to the shortest pathbetween a source and a target, its distance from eitheris directly deduced).

Forward heap caching: The second technique reusesthe search heap of the forward search. Instead ofterminating forward search and re-invoking it fromscratch for every target vertex, we maintain its heapcontents and re-use them between runs. That is, essen-tially the forward search only pauses when the graphdistance to a target vertex v is found and its state(i.e., its search heap) is maintained. When the distanceof the next target v′ is to be computed, the forwardsearch resumes from the point it stopped, using thealready populated heap.

Note that for this optimization to be possible,forward search must be implemented as a Dijkstraprocess. The rationale is that in Dijkstra’s algorithm

the keys used in the search heap are irrelevant tothe target vertex, and this exactly is the fact thatenables reusing the heap for different target vertices.In an A∗ implementation of forward search, the heapkeys would be incremented by (landmark-derived)distance bounds that depend on the specific targetvertex each time, making the heap useless for differenttarget vertices.

The distance computation submodule of AIS withall optimizations is outlined by procedure GraphDist(Algorithm 3). Hf is the search heap of forwardsearch. T is a table including previously computedshortest paths. Hf and T are global variables, i.e., theyare retained between the calls of GraphDist and dis-carded only when AIS (the calling process) terminates.

Algorithm 3 GraphDist(G, vq , v, Hf , T )//Input: G: the social graph// vq : the (vertex corresponding to the) query user// v: target vertex to compute the graph distance to// Hf : the min-heap of forward search// T : set of all previously computed shortest paths

1: if v was previously visited by forward search then2: Return the stored distance of v3: else if v appears in any path in T then4: Return the stored distance of v5: Initialize MinDist = +∞ and ShortestPath = {}6: Initialize an A∗ process at v for the reverse search7: while MinDist > head’s key in heap of rev. search do8: Fetch next vertex vf from forward search (from Hf )9: if vf was previously visited by reverse search then

10: if p(vq, vf ) + p(vf , v) < MinDist then11: Set MinDist = p(vq, vf ) + p(vf , v)12: Update ShortestPath accordingly13: Fetch next vertex vr from reverse search14: if vr was previously visited by forward search then15: if p(vq, vr) + p(vr, v) < MinDist then16: Set MinDist = p(vq, vr) + p(vr, v)17: Update ShortestPath accordingly18: Do not push nodes adj. to vr into rev. heap19: Store ShortestPath in T20: Return MinDist

5.3 Improving on Landmark Lower Bounds

Referring to the general AIS algorithm, as described inSection 5.1, vertices are evaluated in an order dictatedby a lower bound of their f values (see Line 10 inAlgorithm 2). This lower bound is derived in part bylandmark estimates of the social distance between vqand the vertices. It is a known fact that landmarksoften produce very loose lower bounds, which maylead AIS to evaluate target vertices that in reality lietoo far from vq in the social graph.

Consider for instance Figure 6, and assume that v5is used as the landmark. Vertices vq and v7 are almostequi-distant from the landmark, yielding a lowedbound p(vq, v7) = 1. This is a large underestimateof the actual distance and may lead in evaluating v7(via expensive search in G), although it is actually

10

vq

v2

v3

v4

v5

2

v6

v7

1 3

2

23

1

forward search

v1v8

v9

v10v11

v12

25

1

111

landmark

Fig. 6. Delayed evaluation example

too far from vq in the social space. Using a largenumber of landmarks could reduce the occurrenceof wide underestimates, but it is not a panacea; aswe show in the experiments, using many landmarkscould seriously harm overall performance.

Fortunately, the nature of our algorithm allows forinformation sharing that may alleviate this problem.Specifically, as AIS evaluates more users, the forwardsearch in its bidirectional submodule also proceeds.Let β be the key (graph distance) of the last vertexpopped in forward search. If the target vertex v inLine 11 of Algorithm 2 has not been visited by theforward search before, we are sure that its socialdistance from vq is at least β, i.e., β may serve asa lower bound of p(vq, v) which, actually, might betighter (larger) than the landmark-based p(vq, v). Ifthat is the case, we derive a new (larger) lower boundfor f(vq, v) (that is α · β + (1− α) · d(uq, u)) and pushv back into the AIS heap with the new bound as thekey. This may postpone the premature evaluation ofvertices due to wide landmark underestimates. Werefer to this technique as delayed evaluation strategy.

Consider again the example in Figure 6, and assumethat when v7 is popped from the heap of AIS the βvalue is 2. This means that the forward search (due tothe evaluation of previous target vertices) has reachedup to the boundary shown dashed. Instead of directlycomputing the actual graph distance of v7 (in Line 12of Algorithm 2), we detect that its landmark distanceis looser than β and re-insert it into the AIS heap withan updated key based on β.

Note that a vertex might be re-inserted into theAIS heap multiple times before it is actually eval-uated. This is the case when a re-inserted vertex ispopped anew, but the β value has meanwhile furtherincreased, leading to an even tighter lower bound forits f value. In this situation, the vertex is pushed intoH again with a new key. To incorporate the delayedevaluation strategy we need to add the followinginstructions right after Line 11 in Algorithm 2:

1: if key of u in H is less than α ·β+(1−α) ·d(uq, u) then2: if v not visited by forw. search nor exists in T then3: Push u back into H with key α·β+(1−α)·d(uq, u)4: Go to Line 3 (of Algorithm 2)

The second condition is to avoid re-inserting a ver-tex whose graph distance is readily available (becauseit has been visited by forward search, or because itbelongs to an already computed shortest path).

5.4 Graph Distance Pre-computationGiven that social search dominates the processing cost(in all approaches), pre-computing social distances be-tween vertices could possibly improve performance.Materializing all-pair social distances requires a pro-hibitive amount of storage; for the Foursquare graphin our experiments, which contains around 2 millionusers, we need roughly 16 TB to store all-pair dis-tances. To alleviate this problem, we could insteadmaterialize for each user the distances of the t so-cially closest vertices. To utilize the pre-computation,we replace SFA’s Dijkstra component with the pre-computed distance list. In case the algorithm exhauststhe list of t social neighbors (without terminating),it falls back to our best method, AIS. Note that pre-computation is applicable to SPA and TSA as well,but with limited success, because these algorithmsmay encounter a socially distant candidate (outsidethe pre-computed list) very early in their execution.

6 EXPERIMENTAL EVALUATION

In this section we experimentally evaluate the SSRQtechniques proposed in the paper. All methods wereimplemented in C++ and the experiments conductedon an Intel Core2Duo 2.66GHz CPU machine with 8GB memory, running on Ubuntu 10.04.

We use two real datasets, Gowalla and Foursquare.Table 2 provides some of their characteristics; the lastcolumn indicates their average vertex degree. Gowalla,obtained from snap.stanford.edu, contains 196K users.Foursquare, used in [40], [41], contains 1.88M users.Due to privacy constraints, the location records forsome users are unavailable. Thus, we only have accessto the historical positions of 54.4% of users in Gowallaand those of 60.3% of users in Foursquare.3 From thelocations available for a user, we assign him/her theone with the highest frequency of visits.

TABLE 2Data Statistics

Name |V | |E| # locations Deg.Gowalla 196,590 1,900,654 107,092 9.7

Foursquare 1,880,405 17,838,254 1,133,936 9.5

Neither of the social networks has explicit infor-mation about the edge weights. Based on a commonmethodology [2], [42], we derive this informationfrom the degrees of vertices incident to the edges. In-tuitively, the more the friends of a user, the looser theconnection to them, i.e., the larger the edge weight.

3. Users with no available location are considered infinitely faraway from any other user.

11

10 20 30 40 50k

1

2

3

4

5

6

7

8

9

10

hops

G. Avg. hop

G. Max. hop

F. Avg. hop

F. Max. hop

(a) Hop statistics

0.1 0.3 0.5 0.7 0.9α

10-2

10-1

Jacc

ard

rati

o

vs. social vs. spatial

(b) SSRQ vs. social/spatial NNs

Fig. 7. Insights into the nature of SSRQ query

Thus, edge weights are set proportionally to the prod-uct of degrees of the vertices (users) they connect, i.e.,the weight of edge (vi, vj) is set to deg(vi)·deg(vj)

max degree2 , wheredeg(vi) and deg(vj) are the degrees of vertices viand vj respectively, and max degree is the maximumvertex degree in the social graph.

Table 3 includes the tested value ranges for thequery and system parameters in our setup. In eachexperiment, unless otherwise stated, the parametersare set to the default values shown in the table. Dataand indices for all methods are kept in memory.The main performance factor in our evaluation isrun-time, i.e., query processing cost. We also reportthe pop ratio, computed as |Vpop|

|V | , where |Vpop| is thenumber of vertices popped from the search heaps ofthe methods. Importantly, the pop ratio measurementsare also indicative of performance (specifically, I/Ocost) in an alternative setting where the social graphis stored on the disk. Every reported measurement isthe average across 1,000 SSRQ random queries.

TABLE 3Query and System Parameters

Parameter Default Rangesize of result k 30 10, 20, 30, 40, 50

preference parameter α 0.3 0.1, 0.3, 0.5, 0.7, 0.9grid granularity s 10 5, 10, 15, 20, 25

We first study our data and the nature of SSRQ.In Figure 7(a) we record the number of hops (awayfrom vq) where the furthest SSRQ result is found. Weplot the AVG and MAX of these numbers (across the1,000 queries run for each k value tested). Prefix “F.”corresponds to Foursquare and “G.” to Gowalla. We seethat results may lie several hops away from vq , insome cases reaching up to 8 hops.

In Figure 7(b) we investigate the similarity (i) be-tween the SSRQ result and the k Euclidean NNs ofuq and (ii) between the SSRQ result and the k sociallyclosest users to vq . In either case, we compute theJaccard ratio, a standard measure of set similarity [43].Given two sets, it is defined as the cardinality of theirintersection divided by the cardinality of their union;its domain is 0 (unrelated sets) to 1 (identical sets). Weplot results for different α values in Foursquare, i.e.,we vary the relative importance of spatial and socialproximity. The Jaccard ratio is below 0.1 in all cases,indicating that the nature of SSRQ is very different

10 20 30 40 50k

0.00

0.05

0.10

0.15

0.20

0.25

0.30

runnin

g t

ime(s

ec)

SFA

SPA

TSA

TSA-QC

AIS

SFA-CH

SPA-CH

TSA-CH

(a) Run-time in Gowalla

10 20 30 40 50k

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

runnin

g t

ime(s

ec)

SFA

SPA

TSA

TSA-QC

AIS

SFA-CH

SPA-CH

TSA-CH

(b) Run-time in Foursquare

10 20 30 40 50k

10-2

10-1

100

101

Pop r

ati

o

SFA

SPA

TSA TSA-QC AIS

(c) Pop ratio in Gowalla

10 20 30 40 50k

10-2

10-1

100

101

Pop r

ati

o

SFA

SPA

TSA TSA-QC AIS

(d) Pop ratio in Foursquare

Fig. 8. Effect of k

from either social or spatial NN search. Results inGowalla are similar and omitted for brevity.

Next we compare the different SSRQ approaches.SFA and SPA are described in Section 4. TSA is thelandmark-aided version of the algorithm in Section 4.2(we disregard its non-landmark counterpart becauseit consistently performs worse). TSA-QC is the QuickCombine version of TSA. AIS is the best-performingversion of aggregate index search from Section 5 (itsdifferent flavors are evaluated later). We finetuned thelandmark-based methods with respect to M (numberof landmarks) and set it to 8.

Figure 8 investigates performance for different val-ues of k. Figures 8(a) and 8(b) show that processingin Gowalla is faster than Foursquare – the reason isthat SSRQ search in Gowalla, regardless of the al-gorithm chosen, generally reaches fewer users thanin Foursquare, as shown in Figure 7(a). The run-timeincreases with k, because the search area in both thesocial and the spatial domain expands for larger k.

Just for this experiment, in the run-time charts weinclude variants SFA-CH, SPA-CH and TSA-CH (ofSFA, SPA, TSA) where we replaced the Dijkstra-basedsocial distance computation module with the state-of-the-art pre-computation-based technique for shortestpath computation, CH, from [44]. These variants areslower than vanilla SFA/SPA/TSA, because (i) CHis better suited to low-degree graphs (such as planargraphs) and (ii) in vanilla methods, shortest paths areproduced incrementally (they all have vq as source),thus essentially sharing/reusing computations.

Turning to the relative performance of our algo-rithms, Figures 8(c) and 8(d) verify that SFA and SPAprocess more vertices than TSA, due to their loosertermination conditions. However, the difference inrun-time is not as wide as in pop ratio, the reasonbeing TSA’s overhead in maintaining two bounds(one spatial and one social). TSA-QC performs better

12

0.1 0.3 0.5 0.7 0.9α

0.00

0.05

0.10

0.15

0.20

runnin

g t

ime(s

ec)

SFA

SPA

TSA

TSA-QC

AIS


0.1 0.3 0.5 0.7 0.9α

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

runnin

g t

ime(s

ec)

SFA

SPA

TSA

TSA-QC

AIS


Fig. 9. Effect of α

than TSA in Gowalla when k is small but is slowerthan TSA in Foursquare in all settings. We suspect thatthis is due to the different social/spatial distributionsin the two datasets. AIS visits fewer than 6% and3% of the vertices in Gowalla and Foursquare respec-tively, while the other approaches visit more than 90%in most cases. This demonstrates that the aggregateindex search paradigm vastly reduces the number ofexpanded vertices (especially in large SNs), whichimplies significantly shorter processing time.

In Figure 9 we test different values of α, i.e., dif-ferent weighing of social versus spatial proximity.SFA examines vertices in increasing social distanceorder, which implies that for large α the first fewprocessed vertices are highly likely to already producethe result. TSA and TSA-QC are also more socially-led (than spatially), since their second phase reliesentirely on graph search, thus benefiting from a largeα. SPA, on the other hand, is spatially-led and henceits performance worsens with α, albeit only slightly.Importantly, our most advanced method, AIS, is ro-bust to α and retains its clear lead over alternatives.

Aggregate index search provides a flexible frame-work, which can be equipped with different tech-niques and optimizations. Here we evaluate its threemost representative versions: (i) AIS-BID is a directimplementation of Algorithm 2 using the bidirectionalsearch in [25] for graph distance computations andno other optimization; (ii) AIS− uses all optimizationsexcept the delayed evaluation strategy; and (iii) AIS usesall optimizations.

In Figure 10 we compare these algorithms for var-ious values of k. The behavior of AIS-BID demon-strates that, although the search technique proposedin [25] is more efficient than other flavors of bidirec-tional search, it is unable to yield favorable perfor-mance without our further enhancements. This fact isalso supported by the pop ratio charts. The compar-ison between AIS and AIS− reveals that the delayedevaluation mechanism improves performance, albeitto a moderate degree.

In Figure 11 we evaluate the pre-computation tech-nique (from Section 5.4) against AIS. We present run-time versus t, i.e., versus the number of cached so-cial neighbors per user. Pre-computation yields minorimprovements in the larger graph (Foursquare) butsignificant in the smaller one (Gowalla). The reason

10 20 30 40 50k

0.00

0.05

0.10

0.15

0.20

runnin

g t

ime(s

ec)

AIS-BID AIS− AIS


10 20 30 40 50k

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

runnin

g t

ime(s

ec)

AIS-BID AIS− AIS


10 20 30 40 50k

10-2

10-1

100

Pop r

ati

o

AIS-BID AIS− AIS

(c) Pop ratio in Gowalla

10 20 30 40 50k

10-2

10-1

100

Pop r

ati

o

AIS-BID AIS− AIS

(d) Pop ratio in Foursquare

Fig. 10. Effect of k on AIS versions

1K 2K 3K 4K 5K 6K 7K 8K 9K 10Kt

0.00

0.02

0.04

0.06

0.08

0.10

0.12

0.14

0.16ru

nnin

g t

ime(s

ec)

AIS AIS-Cache


1K 2K 3K 4K 5K 6K 7K 8K 9K 10Kt

0.30

0.31

0.32

0.33

0.34

0.35

runnin

g t

ime(s

ec)

AIS AIS-Cache


Fig. 11. Effect of pre-computation

is that, as shown in Figure 7(a), in Foursquare searchexpands more hops away from vq , thus being morelikely to reach a vertex outside the cache.

In Figure 12, we measure the effect of s, i.e., thegranularity of the grid index, on SPA, AIS-BID, AIS−

and AIS. Recall that a larger s implies more cells withsmaller size each. This parameter affects performancein two conflicting ways: (i) as s grows, more cells liein the vicinity of the query user and therefore morecomputations are needed to calculate distance boundsfor them; (ii) on the other hand, smaller grid cellsprovide more accurate summaries (be them Euclideanor social) about the underlying users, and increasethe effectiveness of pruning. Value s = 10 strikes afavorable balance between these factors, although themethods are not very sensitive to it.

5 10 15 20 25s

0.00

0.05

0.10

0.15

0.20

runnin

g t

ime(s

ec)

SPA AIS-BID AIS− AIS


5 10 15 20 25s

0.0

0.5

1.0

1.5

2.0

2.5

3.0

runnin

g t

ime(s

ec)

SPA AIS-BID AIS− AIS


Fig. 12. Effect of grid granularity

13

10 20 30 40 50k

0.0

0.5

1.0

1.5

2.0

2.5

3.0

runn

ing

time(

ms)

SFA

SPA

TSA

TSA-QC

AIS

(a) Effect of k

0.1 0.3 0.5 0.7 0.9α

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

runn

ing

time(

ms)

SFA

SPA

TSA

TSA-QC

AIS

(b) Effect of α

Fig. 13. Experiments with Twitter

For generality, in Figure 13 we use a real dataset,Twitter, with higher average degree than our defaultdatasets (its average degree is 57.7). It contains 124KTwitter users in Singapore who made geo-taggedtweets in 2013; a user’s location is derived fromhis/her latest tweet. The charts (versus k and α) showsimilar trends to our default datasets. A difference isthat the run-time increases less sharply with k, be-cause the larger degree implies that more candidates(users) are reachable with fewer hops from uq .

Finally, we generate synthetic data to examine pa-rameters we cannot control in the real SNs. In Fig-ure 14(a), we generate data with different correlationsbetween the social and spatial distances. We use thesocial distances derived from Foursquare, but assignto users artificial locations as follows. For each vq , wegenerate the spatial distance of user u from uq by for-mula d = ρ·p(vq, v)+ε, where ε is a random number inrange [−0.15, 0.15] and ρ is 1 (for positive-correlationdataset) or -1 (for negative-correlation dataset). Basedon the generated d (normalized in the [0,1] range), weplace the user at a random point on the circle withradius d from uq . We also generate a third dataset,where the spatial locations of users are randomlypermuted, so as to create a dataset with independentcorrelation between social and spatial proximity.

All algorithms require the shortest time when dataare positively correlated and the longest when theyare negatively correlated. In the positive correlationcase, users that are socially near vq tend to also lieclose by in the Euclidean space. Hence, search encoun-ters the top-k users early on and terminates quicker.The situation is reversed for negatively correlateddata, because socially near users are spatially far, andvice versa, implying that the top-k users tend to liefar from the query in either of the two domains, ifnot in both. AIS is the method of choice in all cases,furthermore demonstrating robustness to the type ofcorrelation between spatial and social distance.

In Figure 14(b), we show performance for differ-ent SN sizes. Starting with Foursquare as a basis,we extracted from it SNs of different sizes usingthe structure-preserving Forest Fire Sampling technique[45]. As the number of SN vertices is tripled from0.6M to 1.8M, the running time of all algorithmsincreases almost linearly, with AIS scaling much moregracefully than competitors.

positive independent negativecorrelation

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

runnin

g t

ime(s

ec)

SFA

SPA

TSA TSA-QC AIS

(a) Correlation

0.6M 1.2M 1.8Mdata size

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

runnin

g t

ime(s

ec)

SFA

SPA

TSA TSA-QC AIS

(b) Data size

Fig. 14. Experiments with synthetic data

7 CONCLUSION

We study a query type that captures proximity in thecombined social-spatial domain. Our most efficientalgorithm relies on an aggregate index that supportsestimates of combined proximity. Experiments on ac-tual social networks demonstrate that it is highlyscalable and robust. A direction for future work isjoint social and spatial processing on networks storedin a distributed manner.

ACKNOWLEDGMENTS

This work was funded by research grant 14-C220-SMU-004 from the Singapore Management UniversityOffice of Research under the Singapore Ministry ofEducation Academic Research Funding Tier 1 Grant,and by grant HKU 715413E from Hong Kong RGC.The authors would also like to thank Dr. MohamedSarwat for providing datasets and suggestions.

REFERENCES

[1] M. Helft, “Bing taps facebook data for fight with google,”http://bits.blogs.nytimes.com/2011/05/16/, May 2011.

[2] D. Kempe, J. M. Kleinberg, and E. Tardos, “Maximizing thespread of influence through a social network,” in KDD, 2003,pp. 137–146.

[3] M. V. Vieira, B. M. Fonseca, R. Damazio, P. B. Golgher, D. C.Reis, and B. A. Ribeiro-Neto, “Efficient search ranking in socialnetworks,” in CIKM, 2007, pp. 563–572.

[4] P. Yin, W.-C. Lee, and K. C. K. Lee, “On top-k social websearch,” in CIKM, 2010, pp. 1313–1316.

[5] D. Papadias, M. L. Yiu, N. Mamoulis, and Y. Tao, “Nearestneighbor queries in network databases,” in Encyclopedia of GIS,2008, pp. 772–776.

[6] M. Richardson and P. Domingos, “Mining knowledge-sharingsites for viral marketing,” in KDD, 2002, pp. 61–70.

[7] K. Saito, R. Nakano, and M. Kimura, “Prediction of informa-tion diffusion probabilities for independent cascade model,”in KES (3), 2008, pp. 67–75.

[8] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan, “Learninginfluence probabilities in social networks,” in WSDM, 2010,pp. 241–250.

[9] A. Goyal, F. Bonchi, and L. Lakshmanan, “A data-basedapproach to social influence maximization,” PVLDB, vol. 5,no. 1, pp. 73–84, 2011.

[10] M. E. J. Newman, “Clustering and preferential attachment ingrowing networks,” in Physical Review Letters E, 2001.

[11] D. Liben-Nowell and J. M. Kleinberg, “The link-predictionproblem for social networks,” JASIST, vol. 58, no. 7, pp. 1019–1031, 2007.

[12] I. D. Felipe, V. Hristidis, and N. Rishe, “Keyword search onspatial databases,” in ICDE, 2008, pp. 1483–1484.

[13] G. Cong, C. S. Jensen, and D. Wu, “Efficient retrieval of thetop-k most relevant spatial web objects,” PVLDB, vol. 2, no. 1,pp. 337–348, 2009.

14

[14] X. Cao, G. Cong, C. S. Jensen, and B. C. Ooi, “Collective spatialkeyword querying,” in SIGMOD Conference, 2011, pp. 373–384.

[15] J. Lu, Y. Lu, and G. Cong, “Reverse spatial and textual knearest neighbor search,” in SIGMOD Conference, 2011, pp.349–360.

[16] S. Scellato, A. Noulas, and C. Mascolo, “Exploiting placefeatures in link prediction on location-based social networks,”in KDD, 2011, pp. 1046–1054.

[17] M. Ye, P. Yin, W.-C. Lee, and D. L. Lee, “Exploiting geograph-ical influence for collaborative point-of-interest recommenda-tion,” in SIGIR, 2011, pp. 325–334.

[18] N. Armenatzoglou, S. Papadopoulos, and D. Papadias, “Ageneral framework for geo-social query processing,” PVLDB,vol. 6, no. 10, pp. 913–924, 2013.

[19] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mo-bility: user movement in location-based social networks,” inKDD, 2011, pp. 1082–1090.

[20] A. Amir, A. Efrat, J. Myllymaki, L. Palaniappan, andK. Wampler, “Buddy tracking - efficient proximity detectionamong mobile friends,” Pervasive and Mobile Computing, vol. 3,no. 5, pp. 489–511, 2007.

[21] G. Treu, T. Wilder, and A. Kupper, “Efcient proximity detectionamong mobile targets with dead reckoning,” in MOBIWAC,2006, pp. 75–83.

[22] M. L. Yiu, L. H. U, S. Saltenis, and K. Tzoumas, “Efficient prox-imity detection among mobile users via self-tuning policies,”PVLDB, vol. 3, no. 1, pp. 985–996, 2010.

[23] Z. Xu and H.-A. Jacobsen, “Adaptive location constraint pro-cessing,” in SIGMOD Conference, 2007, pp. 581–592.

[24] J. Bao, M. F. Mokbel, and C.-Y. Chow, “Geofeed: A locationaware news feed system,” in ICDE, 2012, pp. 54–65.

[25] A. V. Goldberg and C. Harrelson, “Computing the shortestpath: A* search meets graph theory,” in SODA, 2005, pp. 156–165.

[26] H.-P. Kriegel, P. Kroger, M. Renz, and T. Schmidt, “Hierarchicalgraph embedding for efficient query processing in very largetraffic networks,” in SSDBM, 2008, pp. 150–167.

[27] J. Matousek, “On the distortion required for embedding finitemetric spaces into normed spaces,” Israel Journal of Mathemat-ics, vol. 93, no. 1, pp. 333–344, 1996.

[28] M. Thorup and U. Zwick, “Approximate distance oracles,” J.ACM, vol. 52, no. 1, pp. 1–24, 2005.

[29] A. D. Sarma, S. Gollapudi, M. Najork, and R. Panigrahy, “Asketch-based distance oracle for web-scale graphs,” in WSDM,2010, pp. 401–410.

[30] J. Cheng and J. X. Yu, “On-line exact shortest distance queryprocessing,” in EDBT, 2009, pp. 481–492.

[31] I. F. Ilyas, G. Beskales, and M. A. Soliman, “A survey of top-kquery processing techniques in relational database systems,”ACM Comput. Surv., vol. 40, no. 4, 2008.

[32] R. Fagin, A. Lotem, and M. Naor, “Optimal aggregation algo-rithms for middleware,” J. Comput. Syst. Sci., vol. 66, no. 4, pp.614–656, 2003.

[33] N. Bruno, L. Gravano, and A. Marian, “Evaluating top-kqueries over web-accessible databases,” in ICDE, 2002, pp.369–380.

[34] X. Cao, G. Cong, and C. S. Jensen, “Retrieving top-k prestige-based relevant spatial web objects,” PVLDB, vol. 3, no. 1, pp.373–384, 2010.

[35] K. Mouratidis, M. Hadjieleftheriou, and D. Papadias, “Concep-tual partitioning: An efficient method for continuous nearestneighbor monitoring,” in SIGMOD Conference, 2005, pp. 634–645.

[36] D. V. Kalashnikov, S. Prabhakar, and S. E. Hambrusch, “Mainmemory evaluation of monitoring queries over moving ob-jects,” Distributed and Parallel Databases, vol. 15, no. 2, pp. 117–135, 2004.

[37] M. F. Mokbel, X. Xiong, and W. G. Aref, “Sina: Scalable incre-mental processing of continuous queries in spatio-temporaldatabases,” in SIGMOD Conference, 2004, pp. 623–634.

[38] P. Narvaez, K.-Y. Siu, and H.-Y. Tzeng, “New dynamic algo-rithms for shortest path tree computation,” IEEE/ACM Trans.Netw., vol. 8, no. 6, pp. 734–746, 2000.

[39] E. P. F. Chan and Y. Yang, “Shortest path tree computation indynamic graphs,” IEEE Trans. Computers, vol. 58, no. 4, pp.541–557, 2009.

[40] M. Sarwat, J. Bao, A. Eldawy, J. J. Levandoski, A. Magdy, andM. F. Mokbel, “Sindbad: a location-based social networkingsystem,” in SIGMOD Conference, 2012, pp. 649–652.

[41] J. J. Levandoski, M. Sarwat, A. Eldawy, and M. F. Mokbel,“Lars: A location-aware recommender system,” in ICDE, 2012,pp. 450–461.

[42] W. Chen, C. Wang, and Y. Wang, “Scalable influence max-imization for prevalent viral marketing in large-scale socialnetworks,” in KDD, 2010, pp. 1029–1038.

[43] M. Levandowsky and D. Winter, “Distance between sets,”Nature, vol. 234, pp. 34–35, 1971.

[44] L. Wu, X. Xiao, D. Deng, G. Cong, A. D. Zhu, and S. Zhou,“Shortest path and distance queries on road networks: Anexperimental evaluation,” PVLDB, vol. 5, no. 5, pp. 406–417,2012.

[45] J. Leskovec and C. Faloutsos, “Sampling from large graphs,”in KDD, 2006, pp. 631–636.

Kyriakos Mouratidis is an Associate Pro-fessor at Singapore Management University.He received his BSc from the Aristotle Uni-versity of Thessaloniki, Greece, and his PhDfrom the Hong Kong University of Scienceand Technology. His main research area isspatial databases, with a focus on continuousqueries, road networks and spatial optimiza-tion problems. He has also worked on pref-erence queries, wireless broadcasting sys-tems, and certain database privacy topics.

Jing Li received the BSc degree in computerscience and engineering from Nanjing Uni-versity, and the PhD degree from the De-partment of Computer Science, University ofHong Kong. His research interests includedata privacy, query processing in spatio-temporal databases, graph databases, anddata mining in textual data streams.

Yu Tang is a MPhil student at the Depart-ment of Computer Science, The Universityof Hong Kong. His research interests aremainly in theoretical and system aspects ofdatabases and data management.

Nikos Mamoulis received a diploma inComputer Engineering and Informatics fromthe University of Patras, Greece, and aPhD in Computer Science from the HongKong University of Science and Technol-ogy. He is currently an Associate Profes-sor at the Department of Computer Science,University of Hong Kong. He has been in-volved in the organization of several work-shops/conferences, and serves as associateeditor for IEEE TKDE and VLDB Journal.

joint search by social and spatial proximity · 1 joint search by social and spatial proximity...

Documents