
Hong Kong Baptist University

DOCTORAL THESIS

Efficient group queries in location-based social networks

Li, Yafei

Date of Award: 2015


General rights

Copyright and intellectual property rights for the publications made accessible in HKBU Scholars are retained by the authors and/or other copyright owners. In addition to the restrictions prescribed by the Copyright Ordinance of Hong Kong, all users and readers must also observe the following terms of use:

• Users may download and print one copy of any publication from HKBU Scholars for the purpose of private study or research
• Users cannot further distribute the material or use it for any profit-making activity or commercial gain
• To share publications in HKBU Scholars with others, users are welcome to freely distribute the permanent URL assigned to the publication

Download date: 11 Jun, 2022

https://scholars.hkbu.edu.hk/en/studentTheses/d8982641-6846-44d7-9e96-58d588768ea6

Efficient Group Queries in Location-based Social Networks

Yafei LI

A thesis submitted in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

Principal Supervisor: Professor Jianliang XU

Hong Kong Baptist University

June 2015

Declaration

I declare that this thesis has been composed by myself under the guidance of my principal

supervisor, Professor Jianliang XU. This thesis has not previously been included in any thesis,

dissertation or report submitted to any institution for a degree, diploma or other qualifica-

tion. All sources of information have been acknowledged by means of references to the

relevant publications.

Signature:

Date: June 2015


Abstract

Nowadays, with the rapid development of GPS-equipped mobile devices, location-based

social networks have been emerging to bridge the gap between the physical world and

online social networking services. Various types of data, such as personal locations,

check-ins, microblogs and social relations, have been available in location-based social

networks. Efficiently managing and analyzing such data to meet users’ daily query requirements

has become a challenging task. Among all the existing works on location-based

social networks, group query is one of the most important research topics. In this thesis,

we investigate query techniques for location-based services in social networking applica-

tions. Specifically, considering a location-based social network, we study spatial-aware

interest group queries, geo-social k-cover group queries, and social-aware ridesharing

group queries.

Firstly, we study spatial-aware interest group queries in location-based social networks.

Recently, most location-based social networks have released check-in services that

allow users to share their visiting locations with their friends. These locations, considered

as spatial objects, are usually associated with a few tags that describe the features of those

locations. Utilizing such information, we propose a new type of Spatial-aware Interest

Group (SIG) query that retrieves a user group of size k where each user is interested in the

query keywords and the users are close to each other in the Euclidean space. We prove this

query problem is NP-complete, and develop two efficient algorithms IOAIR and DOAIR

based on the IR-tree for the processing of SIG queries. We also validate the performance

efficiency of the proposed query processing algorithms by empirical evaluation.

Secondly, we study the problem of geo-social k-cover group queries for collaborative


spatial computing. In this problem, we propose a novel type of geo-social queries, called

Geo-Social K-Cover Group (GSKCG) query, which is based on spatial containment and

a new modeling of social relationships. Intuitively, given a set of spatial query points

and an underlying social network, a GSKCG query finds a minimum user group in which

the members satisfy certain social relationship and their associated regions can jointly

cover all the query points. Despite its practical usefulness, the GSKCG query problem

is NP-complete. We consequently explore a set of effective pruning strategies to derive

an efficient algorithm for finding the optimal solution. Moreover, we design a novel

index structure tailored to our problem to further accelerate query processing. Extensive

experiments demonstrate that our algorithm achieves desirable performance on real-life

datasets.

Thirdly, we study the problem of social-aware ridesharing group queries. With the

deep penetration of smartphones and geo-locating devices, ridesharing is envisioned as a

promising solution to transportation-related problems such as congestion and air pollution

for metropolitan cities. Despite the potential to provide significant societal and environ-

mental benefits, ridesharing has not so far been as popular as expected. Notable barriers

include the social discomfort and safety concerns when traveling with strangers. To over-

come these barriers, in this thesis, we propose a new type of Social-aware Ridesharing

Group (SaRG) query which retrieves a group of riders by taking into account their social

connections besides traditional spatial proximities. Because the SaRG query problem is

NP-hard, we design an efficient algorithm with a set of powerful pruning techniques to

tackle this problem. We also present several incremental strategies to accelerate the search

speed by reducing repeated computations. Moreover, we propose a novel index tailored

to the proposed problem to further speed up query processing. Experimental

results on real datasets show that our proposed algorithms achieve desirable performance.

The work in this thesis shows that the proposed group query processing techniques are effective,

which would facilitate the wider deployment of such query services in real applications.

Keywords: Location-based services, Query processing, Group queries, Indexing, Spatial

database, Location-based social networks, Social constraints, Ridesharing.


Acknowledgements

I would like to express my deep gratitude to my principal supervisor, Prof. Jianliang XU,

for his great patience, inspiring guidance and constructive suggestions in my studies and

research work over these years. He brought me into this challenging research area

and shared insightful experiences with me. I would like to thank my co-supervisor, Dr.

Weifeng SU, for his continuous encouragement and support in life. I would also like

to thank other supervisors, Dr. Rui CHEN, Dr. Haibo HU, Dr. Byron CHOI for their

good suggestions on my research studies.

I would like to thank my colleagues for their direct and indirect help. In particular, I

should mention Mr. Lei CHEN, Mr. Qian CHEN, Mr. Cheng XU, Mr. Zhe FAN, Mr.

Peipei YI, Dr. Xin LIN, Dr. Qijun ZHU, Dr. Dingming WU, Dr. Yun PENG, among

many others.

Finally, I take this special occasion to thank my father Guoming LI and my mother

Xiuqin GAO for raising and supporting me for so many years. I also wish to thank my

dear Yanli ZENG and other family members for their full understanding and support

over these years. Without them, I would never have come this far.


Table of Contents

Declaration
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Spatial-aware Interest Group Queries
  1.2 Geo-Social K-Cover Group Queries
  1.3 Social-aware Ridesharing Group Queries
  1.4 Thesis Organization

Chapter 2 Related Works
  2.1 Spatial query processing
  2.2 Social query processing
  2.3 Spatial keyword query processing
  2.4 Geo-Social query processing
  2.5 Ridesharing query processing

Chapter 3 Spatial-aware Interest Group Queries in Location-based Social Networks
  3.1 Problem Definition
  3.2 Proposed Approaches
    3.2.1 Preliminary: IR-Tree
    3.2.2 Overview
    3.2.3 Interest Oriented Algorithm
    3.2.4 Diameter Oriented Algorithm
  3.3 Performance Evaluation
    3.3.1 Datasets and Queries
    3.3.2 Setup
    3.3.3 Experimental Results
  3.4 Summary

Chapter 4 Geo-Social K-Cover Group Queries for Collaborative Spatial Computing
  4.1 Problem Formulation
  4.2 Algorithm Design
    4.2.1 Basic Algorithm
    4.2.2 Basic Pruning
    4.2.3 Diameter Based Pruning
    4.2.4 Access Order Based Pruning
  4.3 Hybrid Indexing
    4.3.1 SaR-tree
    4.3.2 Enhanced SaR-tree
    4.3.3 GSKCG Query Processing
    4.4.1 Datasets and Queries
    4.4.2 Setup
  4.5 Summary

Chapter 5 Towards Social-aware Ridesharing Group Query Services
  5.1 Problem Formulation
  5.2 Algorithm Design
    5.2.1 RSGExplorer Algorithm
    5.2.2 Quota Available Strategy
    5.2.3 Group Diameter Strategy
    5.2.4 k-plex Based Strategy
    5.2.5 Query Processing
  5.3 Incremental Strategies
    5.3.1 Incremental Computation of Core Numbers
    5.3.2 Social Diameter-based Bounding
    5.3.3 Neighbor-based Bounding
  5.4 Hybrid Index
    5.4.1 SIR-tree
    5.4.2 Query Processing
    5.5.1 Experimental Settings
  5.6 Summary

Chapter 6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography

Curriculum Vitae


List of Tables

3.1 Summary of notations
3.2 Example Interest Vector
3.3 Dataset Properties
4.2 Dataset properties
5.2 Survey results (216 participants)
5.3 Access indexes of users in Figure 5.4
5.4 Dataset properties


List of Figures

1.1 A framework of the social-aware ridesharing system
1.2 An example of Slugging
3.1 An example of SIG query
3.2 Tree Index Structure
3.3 Example of Theorem 3.4
3.4 Distance between u1 and its neighbors
3.5 Constructing G4(u1, u11)
3.6 Varying k
3.7 Varying α
3.8 Varying k on Dianping (α = 0.9)
3.9 Varying the number of query tags
3.10 Varying Buffer Size
3.11 Varying the Number of Users
4.1 An example of a location-based social network for GSKCG query
4.2 Branch and bound search tree
4.3 Sorted user list ListP
4.4 A sample SaR-tree
4.5 Example of CBRs in an SaR-tree
4.6 A sample LBSN for constructing CBR
4.7 Constructing user u’s internal CBRs
4.8 Constructing a user u’s external CBRs
4.9 Running time vs. k value
4.10 Running time vs. number of query points
4.11 Running time vs. query point coverage
4.12 Running time under multiple familiar regions
4.13 Pruning capabilities of different schemes
4.14 Size of query results
4.15 Running time vs. network size
4.16 Quality comparison of the returned groups
5.1 Numbers of potential social groups of size 5
5.2 An example of an SaRG query
5.3 Branch and bound search tree
5.4 An example of SaRG query
5.5 An example of SIR-tree
5.6 Running time vs. group size
5.7 Running time vs. k value
5.8 Running time vs. the number of riders
5.9 Pruning abilities of different schemes
5.10 Travel cost vs. k or s


Chapter 1

Introduction

With the rapid development of location-aware mobile devices, ubiquitous Internet access

and social computing technologies, a large volume of users’ personal data, such as

locations, check-ins, microblogs, tweets, and social connections, has emerged

and become readily accessible from various location-based social networks (e.g., Twitter,

Facebook). Moreover, the amount of such data is growing explosively. For example, by

July 2014, the number of Twitter users had reached 500 million and the average

number of tweets per day had exceeded 58 million;¹ the total number of monthly active

Facebook users had reached 1.3 billion and the average number of messages

sent on Facebook every 20 minutes had exceeded 640 million.² Hence, how to efficiently

manage these data to satisfy users’ daily query requirements is a crucial task. Among all

the existing studies on location-based social networks, group queries have a number

of practical applications (e.g., activity planning [37, 68], product promotion [40], travel

recommendation [60, 61], ridesharing [16, 44]). In this thesis, we study group query

techniques for location-based services in location-based social networks.

¹ http://www.statisticbrain.com/twitter-statistics/

² http://www.statisticbrain.com/facebook-statistics/

A group query is issued to a database when a user wants to find a set of objects or

users satisfying some required query constraints. Specifically, consider a spatial database

in which each object is coupled with several tags to indicate its features. A group query

usually takes as input a set of query keywords and a query point, and returns a set of spatial

objects that can fully or collaboratively cover all the query keywords and that are close to the

query point. Given a social network, a typical group query requests a set of users, in which

the connections among them satisfy some specific social constraints (e.g., minimum ac-

quaintance). Currently, location-based social networks, such as Foursquare and Facebook

Places, are bridging the gap between the physical world and the online social networking

services through acquired user locations. User-generated data from these location-based

social networks is usually mixed with more than one data type (e.g., check-ins, tweets or

microblogs coupled with locations, social relations, and trajectories). Group queries are

subsequently evolving with some novel forms over such data. Therefore, efficient query

processing techniques need to be developed to tackle these newly generated group query

problems.

A few recent works have investigated processing techniques for group queries in

location-based social networks [10,39,40,42,44,45,50,67–69,72]. The representative

studies on geo-textual group queries [10,42] consider collective spatial keyword queries

based on the objects’ spatial distance and keyword coverage. However, they do not con-

sider the social factor, such as users’ interests reflected by their check-ins on these spatial

objects. Little work has studied finding a group by considering the group members’

interests and their spatial distances. While the studies [39, 67, 68] consider geo-social

group queries which intend to find a group of attendees satisfying the given social and

spatial distance constraints, these queries do not fully exploit new search possibilities

brought by social computing technologies. For example, finding a group of collaborative

workers, whose service regions can jointly cover the given spatial tasks, is an important

geo-social query problem for collaborative spatial computing. However, such practically

useful queries on spatial containment and social relations have not been covered by the

current existing works. Besides, another type of group queries, namely ridesharing group

query, is treated as a promising approach to resolve the transportation-related problems

in metropolitan cities, such as traffic congestion and air pollution. However, existing

works [44, 45, 50, 69, 72] from both industry and academia just focus on the coordina-

tion of ridesharing trips and schedules. They do not consider the trust issue in forming


ridesharing groups, which may make the ridesharing unsafe and uncomfortable.

In this thesis, we propose several novel group queries for location-based social net-

works to fill the current research gap stated above. Specifically, we have selected three

representative group queries, namely, spatial-aware interest group queries, geo-social k-

cover group queries, and social-aware ridesharing group queries. The main challenge of

processing the proposed group queries lies in the hardness of these problems. We prove that

the proposed group queries are NP-complete or NP-hard problems. Therefore, designing

efficient algorithms to tackle these hard problems requires non-trivial efforts. In this the-

sis, we develop several efficient query processing algorithms with a number of pruning

strategies. We also design several efficient index structures to accelerate the search speed.

1.1 Spatial-aware Interest Group Queries

The first part of this thesis is focused on efficient spatial-aware interest group queries in

location-based social networks. Currently, most location-based social networks have released

check-in services that allow users to share the locations they visit with their friends.

These locations, considered as spatial objects, are usually associated with a few tags that

describe the features of those locations, e.g., spatial object ‘Starbucks’ with tags ‘food’,

‘beverage’, and ‘coffee’. If a user checks in at the spatial object ‘Starbucks’, the user may be

interested in ‘food’, ‘beverage’, or ‘coffee’. These voluntary check-in actions reflecting

the users’ interests can benefit many applications. Utilizing such information, Chapter 3

proposes a new type of query, the Spatial-aware Interest Group (SIG) query, which retrieves a user

group of size k where each user is interested in the query keywords and the users are close

to each other in the Euclidean space.

Existing works in the literature have considered group queries in location-based social

networks. The works in [41,68] aim to find a group of attendees close to a rally point while ensuring that

the selected attendees have good social relationships, so as to create a good atmosphere for the

activity. The work in [67] aims to find the activity time and attendees with the minimum total social

distance to the initiator. The works in [37, 38] search for a group of experts whose skills cover all the

requirements and whose communication cost among group members is low. Different from


existing work, the SIG query retrieves a user group of size k that maximizes a ranking

function combining the diameter of the group (i.e., the distance between the farthest pair

of users) and the group’s interest in the query keywords.
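To make the idea concrete, the following is a minimal brute-force sketch of an SIG-style query in Python. It is not the IOAIR or DOAIR algorithm, and the exact ranking function is defined only in Chapter 3; here we assume, purely for illustration, a score of the form alpha * (average interest) + (1 - alpha) * (1 - diameter / max_dist). The names `sig_query_bruteforce`, `alpha`, and `max_dist`, as well as the sample users, are hypothetical.

```python
from itertools import combinations
from math import dist


def group_diameter(points):
    """Largest pairwise Euclidean distance among the given locations."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def sig_query_bruteforce(users, keywords, k, alpha=0.5, max_dist=1.0):
    """Enumerate every group of size k and return (best_score, best_group).

    users maps a user id to (location, interest), where interest maps a tag
    to a weight in [0, 1]. The scoring formula is an assumed stand-in for
    the thesis's ranking function, not a reproduction of it.
    """
    best_score, best_group = float("-inf"), None
    for group in combinations(sorted(users), k):
        locations = [users[u][0] for u in group]
        # Average, over members, of the mean interest in the query keywords.
        interest = sum(
            sum(users[u][1].get(t, 0.0) for t in keywords) / len(keywords)
            for u in group
        ) / k
        # Normalised spatial closeness: 1 means all members are co-located.
        closeness = 1.0 - group_diameter(locations) / max_dist
        score = alpha * interest + (1 - alpha) * closeness
        if score > best_score:
            best_score, best_group = score, group
    return best_score, best_group


# Hypothetical toy data: three coffee lovers and one movie lover.
users = {
    "u1": ((0.0, 0.0), {"coffee": 0.9}),
    "u2": ((0.1, 0.0), {"coffee": 0.8}),
    "u3": ((0.9, 0.9), {"coffee": 1.0}),
    "u4": ((0.0, 0.1), {"movie": 0.7}),
}
score, group = sig_query_bruteforce(users, ["coffee"], k=2, alpha=0.5, max_dist=2.0)
```

On this toy input the sketch prefers the pair u1 and u2: u3 has the highest interest but is far away, and u4 is close but not interested in the keyword. Exhaustive enumeration is exponential in k, which is why index-based pruning is needed in practice.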

SIG queries are useful in many scenarios. For example, consider that a company wants

to hold promotion campaigns in some regions. The company is interested in identifying

the regions containing potential customers who are interested in the features (query key-

words) of the product promoted. Another example is for interest-based group gathering.

Query keyword ‘movie’ may find a group of nearby people who are movie lovers, while

query keyword ‘NBA’ could retrieve a group of nearby people who like playing basketball.

Note that the group size in these queries is usually constrained due to limited venue

capacity and/or financial budget. Chapter 3 is dedicated to the efficient query processing

techniques for this kind of query.

1.2 Geo-Social K-Cover Group Queries

The second part of this thesis is devoted to efficient geo-social k-cover group queries

for collaborative spatial computing. The convergence of location data and social data

has enabled a new computing paradigm that explicitly combines both location and social

factors to generate useful computational results for either business or social good. We

use the term collaborative spatial computing to represent this emerging paradigm. The

idea of collaborative spatial computing has been widely used in various domains. One

of the most important applications of collaborative spatial computing in location-based

social networks is geo-social queries, which are attracting increasing interest from both

industrial and academic communities.

The study of geo-social queries is still in its infancy. The pioneering studies [25,39,41,

68] typically consider geo-social queries that take as inputs a set of mobile users, a query

location point and certain social acquaintance constraint and that return a set of users with

the minimum location distance while satisfying the social constraint. While being useful

in some applications (e.g., activity planning), these queries do not fully exploit new search

possibilities brought by geo-social data. In Chapter 4, we propose a novel type of geo-social

query, called the Geo-Social K-Cover Group (GSKCG) query, which is based on

spatial containment and a new modeling of social relationships. Intuitively, given a set of

spatial query points and an underlying social network, a GSKCG query finds a minimum

user group in which the members satisfy certain social relationship and their associated

regions can jointly cover all the query points.
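The query semantics can be sketched with a small exhaustive search in Python. This is not the pruning-based algorithm or the SaR-tree index of Chapter 4. It assumes, for illustration only, that each user's associated region is an axis-aligned rectangle and that the social constraint requires every member to be acquainted with at least k other members of the group; the function and variable names are hypothetical.

```python
from itertools import combinations


def covers(region, point):
    """True if the rectangle (xmin, ymin, xmax, ymax) contains the point."""
    xmin, ymin, xmax, ymax = region
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax


def gskcg_bruteforce(regions, friends, query_points, k):
    """Return a smallest feasible group (as a tuple), or None if none exists.

    Groups are tried in increasing size, so the first feasible group found
    has minimum size. Feasible means: the members' regions jointly cover all
    query points, and each member has at least k friends inside the group
    (an assumed form of the social constraint).
    """
    users = sorted(regions)
    for size in range(1, len(users) + 1):
        for group in combinations(users, size):
            members = set(group)
            covered = all(
                any(covers(regions[u], q) for u in group) for q in query_points
            )
            if not covered:
                continue
            if all(len(friends.get(u, set()) & members) >= k for u in group):
                return group
    return None


# Hypothetical toy instance.
regions = {
    "a": (0, 0, 2, 2),
    "b": (2, 0, 4, 2),
    "c": (0, 0, 4, 1),
    "d": (3, 1, 5, 3),
}
friends = {"a": {"b", "d"}, "b": {"a", "d"}, "c": set(), "d": {"a", "b"}}
query_points = [(1, 1.5), (3, 1.5)]

group_k1 = gskcg_bruteforce(regions, friends, query_points, k=1)
group_k2 = gskcg_bruteforce(regions, friends, query_points, k=2)
```

With k = 1 the pair a and b suffices. Raising k to 2 forces a third member, d, into the group even though d adds no new coverage, which illustrates how the social constraint interacts with the minimum-size objective.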

GSKCG queries have applications in a wide range of location-based services. Some

of them are listed as follows: 1) Travel recommendation: To recommend a self-drive tour

of a few points of interest (POIs) (e.g., [60,61]), a GSKCG query helps to find a minimal

group of tourists who are collectively familiar with these POIs (e.g., in terms of weather,

accommodation safety, road conditions, and traffic laws) so as to reduce accident risks

and who have relatively tight social relations in order to make the tour more trustful and

more harmonious. The minimum group size makes it easier for all group members to

reach a consensus in subsequent planning. 2) Spatial task outsourcing: Given a set of

spatial tasks, each associated with a spatial location, one needs to distribute them to a set

of workers, each having a service region. To successfully accomplish the tasks, the service

regions of the selected workers should cover all spatial tasks’ locations, and the workers

are expected to have good collaborative relationships so that the tasks can be efficiently

performed. A GSKCG query directly addresses this worker selection problem in spatial

task outsourcing. In practice, the size of the group of selected workers should be minimum

to minimize employment cost. 3) Collaborative team organization: GSKCG queries are

useful for marketing and promotion agencies. For example, in an agency, each agent has

several familiar market areas and several good collaborators. If a company wants to hire

a marketing team to promote its products in some market areas, a GSKCG query finds

a good team that covers all promotion locations and that is cohesive while causing the

minimum cost for the company. As another example, a community organization can resort

to a GSKCG query to find a minimal group of investigators to conduct a questionnaire

survey in several sites. The returned group will be jointly familiar with all the sites and

have a good collaborative atmosphere in order to efficiently deliver, collect and analyze

the questionnaires.


Compared with the SIG query problem, the GSKCG query problem is more challeng-

ing because of the complex social relations introduced. We solve this problem in Chapter

4.

1.3 Social-aware Ridesharing Group Queries

The third part of this thesis is devoted to efficient social-aware ridesharing group

queries. Nowadays, there is tremendous unused transportation capacity worldwide in the

form of unoccupied seats in private cars. Not only would filling some of these seats reduce

smog, carbon emissions, and fuel consumption, but it also could create opportunities for

increasing local social capital. Ridesharing is a natural and practical approach to make

use of these unoccupied seats and is envisioned as a promising solution to alleviating

transportation-related problems (e.g., traffic congestion, air pollution) in metropolitan cities.

As reported in a recent study [16], the potential traffic reduction in a city could be

as high as 31-59% if users are willing to share a ride with people whose travel patterns

are similar. Moreover, ridesharing can reduce travel expenses for both drivers and riders.

There have been some existing works on the ridesharing problem from both indus-

try and academia with a focus on coordination of ridesharing trips and schedules. Given

a driver’s origin and destination, a ridesharing system returns the driver a set of riders

by considering the trip and schedule similarity. Generally, current works can be catego-

rized into three types: i) static ridesharing [1, 4, 5, 44, 62, 66] which refers to the scenario

where the requests of drivers and riders are known in advance; ii) dynamic rideshar-

ing [24, 32, 50, 72] where riders and drivers continuously enter and leave the system and

are matched up in real time or on short notice; iii) trust-conscious ridesharing [1, 16]

which addresses the trust issue in ridesharing. Chapter 5 is concerned with trust-conscious

ridesharing. Existing approaches include the adoption of reputation-based systems and

profile checking by linking with social networks like Facebook. However, these attempts

cannot remove the major barriers in current ridesharing systems such as social discomfort

and safety concerns when traveling with strangers. Little work studies ridesharing by taking

social relations into consideration. Although [16] considers ridesharing with friends

or friends of friends, this kind of trust-conscious ridesharing is not very practical, as will

be shown in the social model analysis (elaborated in Chapter 5). Thus, these existing solutions

cannot be applied to the social-aware ridesharing problem considered in Chapter 5.

Figure 1.1: A framework of the social-aware ridesharing system (diagram labels: Riders, Drivers, Ridesharing Service Provider, Social Network; ride requests, ride offers, allocate riders, allocate drivers, engage, social relations)

In Chapter 5, we propose a new type of ridesharing queries, called Social-aware

Ridesharing Group (SaRG) queries, which are based on trip matching and social acquaintance.

Broadly, as illustrated in Figure 1.1, our proposed ridesharing system consists of

three parties: (i) riders (or passengers who want to participate in ridesharing), (ii) drivers

(or private car owners who offer ridesharing), and (iii) ridesharing service provider (RSP)

(the server in charge of the arrangement of ridesharing). The riders submit ride requests

to the RSP, while the drivers send in ride offers. In other words, a ride offer provided

by a driver forms an SaRG query; the riders who submitted ride requests form the data

space (or search space); the RSP arranges the best ride matches of ridesharing by jointly

considering trip matching, social connections as well as the capacity of a car. Designing

efficient matching algorithms for the RSP is the most important task to make the system

work effectively. Note that, in our problem, the RSP hosts a set of active ride requests

(expired requests might be dropped and re-submitted). Once a ride offer arrives from

a driver, the RSP will match the most suitable riders to the driver. A ridesharing group is

composed of a driver and the most suitable riders.

In our ridesharing system, we adopt a simple yet popular form of ridesharing called

Slugging [44]. Slugging assumes that the driver’s trip is fixed and that the riders would

walk to the origin location of the driver’s trip, board at the departure time, alight at the


driver’s destination, and then walk to their own destinations. The idea of Slugging is

illustrated in Figure 1.2.
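Under the Slugging model described above, the extra effort a rider incurs is the two walking legs. The following Python sketch computes that cost, assuming for illustration that locations are points in the plane and walking distance is Euclidean; the thesis's actual travel cost definition is given in Chapter 5, and the name `slugging_walk_cost` is hypothetical.

```python
from math import dist


def slugging_walk_cost(rider_origin, rider_dest, driver_origin, driver_dest):
    """Total walking distance of a rider who joins a driver's fixed trip.

    The rider walks from their own origin to the driver's origin, rides the
    whole trip, then walks from the driver's destination to their own
    destination. Euclidean distance is an illustrative assumption.
    """
    walk_to_pickup = dist(rider_origin, driver_origin)
    walk_from_dropoff = dist(driver_dest, rider_dest)
    return walk_to_pickup + walk_from_dropoff


# Hypothetical coordinates: the rider walks 5 units to board and 4 units after alighting.
cost = slugging_walk_cost(
    rider_origin=(0, 0),
    rider_dest=(10, 4),
    driver_origin=(3, 4),
    driver_dest=(10, 0),
)
```

Because the driver's trip never changes, this cost depends only on the rider and the driver, not on which other riders share the car, which keeps the spatial side of the matching simple.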

Figure 1.2: An example of Slugging

The consideration of social factors in ridesharing brings several new research chal-

lenges. First, how to capture and model social constraints for the purpose of ridesharing

is a fundamental issue. Second, the social relationship may not be incremental in nature

(e.g., the acquaintance constraint among the users of a ridesharing group may not hold

after the removal of one user). As such, the social-aware ridesharing problem becomes

more challenging. Indeed, as we shall prove in Chapter 5, the SaRG query problem is NP-

hard, and therefore how to design an efficient algorithm to retrieve the optimal answer to

an SaRG query is the focus of Chapter 5. Our key insight is that in practical settings

an SaRG query possesses some intrinsic properties (e.g., the number of seats in a car is

usually small; the riders who are far away from the trip origin cannot be candidates of a

ridesharing group), which make the problem tractable.

1.4 Thesis Organization

The rest of this thesis is organized as follows. In Chapter 2, we present the related works

that are relevant to this thesis. In particular, we highlight the research works which are

closely related to our contributions in this thesis. We study the spatial-aware interest group

queries in location-based social networks in Chapter 3. In Chapter 4, we detail the geo-

social k-cover group queries for collaborative spatial computing. Chapter 5 presents the

social-aware ridesharing group queries. Finally, we summarize our contributions made in

this thesis and discuss the possible directions for the future work in Chapter 6.

Chapter 2

Related Works

In the first chapter, we have discussed the importance of group queries in location-based

social networks and proposed three important research problems. In this chapter, we

survey the existing works that are closely relevant to our proposed research problems.

2.1 Spatial query processing

Spatial query processing using R-tree and its variants has been extensively studied over

the past three decades. The existing works have studied various types of queries, including

k-nearest-neighbor queries [31, 34, 35, 49, 52], range queries [48, 59], and closest-pair

queries [22, 30, 57].

As a pioneering study on spatial query processing, Roussopoulos et al. [52] presented

an efficient branch-and-bound R-tree traversal algorithm to search for the nearest neighbor

object to a query point, and then extended it to the k-nearest-neighbor search. Meanwhile,

Katayama et al. [34] proposed a new index structure named SR-tree, which integrates

bounding spheres and bounding rectangles for high-dimensional nearest neighbor

queries. However, there are significant overlaps among the minimum bounding rectangles

(MBRs) in both R-tree and SR-tree, and these overlaps result in weak pruning efficiency. To

overcome this weakness, a novel index structure R*-tree was presented in [31] to reduce

the overlapping MBRs. Based on R*-tree index structure, Hjaltason et al. [31] proposed

an incremental algorithm to efficiently search the nearest neighbor. Kolahdouzan and

Shahabi [35] presented a novel approach using first-order Voronoi diagrams to efficiently

evaluate k-nearest-neighbor queries in spatial network databases. Moreover, Papadias et

al. [49] extended the concept of the nearest neighbor query by considering a group of

points which aims to find a set of data points with the smallest sum of distances to all the

query points, and proposed various pruning heuristics to efficiently process such group

nearest-neighbor queries.
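As an illustration of the group nearest-neighbor idea described above, the following brute-force sketch ranks data points by their summed distance to all query points. It is only a baseline under the assumption of Euclidean distance and contains none of the pruning heuristics of [49]; the function name is hypothetical.

```python
import math

def group_nearest_neighbors(data_points, query_points, k=1):
    """Return the k data points with the smallest sum of distances to all query points.

    Brute-force baseline for illustration; it scans every data point.
    """
    def total_distance(p):
        return sum(math.dist(p, q) for q in query_points)

    return sorted(data_points, key=total_distance)[:k]

# Hypothetical data: three candidate points and two query points.
result = group_nearest_neighbors([(0, 0), (5, 5), (2, 2)], [(1, 1), (3, 3)], k=2)
```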

For the spatial range query processing, Tao et al. [59] studied the range search on mul-

tidimensional uncertain data. They presented a novel concept of “probabilistically con-

strained rectangle”, which supports effective pruning/validation of nonqualifying/qualifying

data. They also developed a new index structure called U-tree for minimizing the query

overhead. Pagel et al. [48] proposed a probabilistic model for user-defined window

queries, and characterized the efficiency of spatial data structures in terms of the expected

number of data bucket accesses needed to perform a window query.

The closest-pair query is another important query type in spatial databases. Corral et al. [30]

presented non-incremental recursive and iterative branch-and-bound algorithms for k-closest

pair queries. Hjaltason and Samet [22] proposed an incremental algorithm

based on priority queues for distance join queries. Shin et al. [57] suggested adaptive mul-

tistage and plane-sweep techniques for K-distance join queries and incremental distance

join queries.

Our work on SIG queries can be seen as extending the R-tree to handle queries with

mixed spatial and keyword information, which retrieves a set of users who satisfy the

mixed spatial and interest constraints.

2.2 Social query processing

There have been some studies on group and team queries over social networks with the

goal of finding a user group with a certain social relationship. Social groups or teams

are usually cohesive subgraphs formed by users with acquaintance relations. Their ac-

quaintance levels can be measured by several classical graph models, such as clique [29],

k-core [55], and k-plex [47]. The clique model idealizes cohesiveness to such an extent that

cliques seldom exist in real-life social networks and are difficult to compute. Both k-core and

k-plex are degree-based models. However, finding a k-plex is NP-complete since it restricts

the subgraph size, while k-core relaxes this restriction and can be computed in time linear

in the number of edges.
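The linear-time behaviour of the k-core model comes from a simple peeling procedure: repeatedly remove any vertex whose remaining degree is below k. The sketch below is a standard textbook formulation, not code from the thesis; the adjacency-dictionary representation and the function name `k_core` are assumptions.

```python
from collections import deque

def k_core(adjacency, k):
    """Return the vertex set of the k-core of an undirected graph.

    adjacency maps each vertex to the set of its neighbors (assumed symmetric).
    Each vertex is queued at most once and each edge is examined a constant
    number of times, so the running time is linear in the graph size.
    """
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    removed = set()
    queue = deque(v for v, d in degree.items() if d < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        for w in adjacency[v]:
            if w not in removed:
                degree[w] -= 1
                # w has just dropped below the threshold, so schedule its removal.
                if degree[w] == k - 1:
                    queue.append(w)
    return set(adjacency) - removed

# Hypothetical graph: a triangle a-b-c with a pendant vertex d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
core2 = k_core(graph, 2)
```

In this example the pendant vertex is peeled away for k = 2, while no vertex survives for k = 3.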

Group and team queries have been studied in the context of social networks [12, 27,

76], including social-temporal queries [67], and expert collaboration queries [37, 38]. In

detail, Yang et al. [67] proposed a social-temporal group query to find a group of activ-

ity attendees with the minimum total social distance to the query issuer. They proposed

two efficient algorithms, SGSelect and STGSelect, which include effective pruning tech-

niques and employ the idea of pivot time slots to substantially reduce the query processing

time. Lappas et al. [37] and Li et al. [38] studied the problem of expert team formation,

which aims to find a group of experts who cover all required skills while minimizing the

communication cost among them.

In this thesis, we use k-core to model users’ social relations, which is different from

the previous studies. In addition, our proposed queries, GSKCG and SaRG, take into

consideration the spatial factor.

2.3 Spatial keyword query processing

Recently, spatial queries have been extended to incorporate text keywords, known as spa-

tial keyword queries in the literature. Zhou et al. [73] proposed a hybrid index structure

to handle both textual and spatial queries. They studied the performance of hybrid index

structures that integrate text indexes and spatial indexes for location-based web search.

This work opens a stream of research topics on spatial keyword search. Cong et al. [18]

presented a new indexing scheme called IR-tree, which integrates the R-tree and inverted

files for location-aware top-k object retrieval. Rocha et al. [51] proposed top-k

spatial keyword queries on road networks, where the distance between the query location

and a spatial object is the length of the shortest path. An efficient method based on a new

hybrid index, the cell-keyword conscious B+-tree, was proposed by Cong et al. [19] to process

top-k queries on trajectory databases. However, these works all assume a static query location

at a snapshot. They cannot keep a mobile user continuously aware of the k spatial

web objects that best match a query with respect to location and text relevancy. Based

on these practical query requirements, Wu et al. [64] studied the efficient processing of

continuously moving top-k spatial keyword (MkSK) queries. They proposed two effi-

cient methods for computing safe zones that guarantee correct results at any time with the

minimum communication cost.

The difference between the top-k query and the k-nearest-neighbor query lies in whether

the keywords in the query are used as a soft or hard constraint [42]. Felipe et al. [23]

considered how to find the k nearest neighbors of the query location, with each object in

the result containing the set of keywords issued in the query. Lu et al. [43] proposed a

hybrid index tree called IUR-tree to efficiently process reverse spatial textual k-nearest-

neighbor queries, which find the objects that take the query object as one of their k most

spatial-textual similar objects.

Different from the works on top-k and k-nearest-neighbor queries presented above,

Fan et al. [26] studied the problem of spatio-textual range queries on a new kind of spatio-

textual data named regions-of-interest (ROIs). They developed textual-based and grid-

based filtering algorithms to efficiently find a set of objects that have large overlap with

the query region and high textual similarity. It is the first work that considers the queries

on the spatial object with the spatial region and textual properties.

However, in some cases, a spatial object may not cover all the query keywords, which

may lead to empty solutions. Thus, several works proposed the aggregate spatial keyword

search to tackle this problem. It returns a set of spatial objects that collaboratively cover

all the query keywords. Zhang et al. [70] studied an m-closest-keyword (mCK) query

that finds a set of spatially closest objects covering m specified keywords. Cao et al. [10]

proposed a collective spatial keyword query that retrieves a group of nearby spatial objects

to collectively cover the specified keywords. The techniques they presented can solve these

problems, but collaborative spatial keyword search is usually NP-hard or NP-complete,

which results in inferior query processing performance. Compared

with the above two works, Long et al. [42] presented a more efficient and exact algorithm

to tackle such problems by adopting a distance owner-driven approach.

It is noteworthy that, unlike these previous studies, the proposed SIG query in this the-

sis explores the relationship between users’ locations and interests in the query keywords

and searches the k-size maximum interest group on location-based applications.

2.4 Geo-Social query processing

Efficiently processing queries that consider both spatial and social constraints has attracted

increasing attention recently. One main stream is to mine users' location and social network

data to find the relationships between the users and their locations. Studies [13, 54] have shown

that users with short social distances usually live geographically close.

Yet query processing research in this direction is still in its infancy. Liu et al. [41]

proposed the circle-of-friend query to find minimal-diameter social groups. Shi et al. [56]

presented a model by considering both spatial information and the social relationships

between users who visit the clustered places. They extended the density-based clustering

paradigm and applied it on places which are visited by users of a geo-social network.

Armenatzoglou et al. [3] proposed a general framework that offers flexible data manage-

ment and algorithmic design for Geo-Social Network (GeoSN) queries. Their architecture

segregates the social, geographical and query processing modules. Each GeoSN query is

processed via a transparent combination of primitive queries issued to the social and ge-

ographical modules. Yang et al. [68] proposed a socio-spatial group query to select a

group of nearby attendees with a tight social relationship. They designed a new index

structure called Social R-tree to integrate the users’ social relationships into an R-tree for

efficient query processing. This index is different from our proposed Enhanced SaR-tree

in Chapter 4 in that it is used to reduce the checking states during the enumeration. Zhu

et al. [39] presented a new family of geo-social group queries with minimum acquain-

tance constraint (GSGQs), and also designed a new index structure named SaR-tree to

accelerate the GSGQs queries. However, the SaR-tree cannot be directly adopted by our

GSKCG queries due to our regional spatial factor which differs from the point spatial

factor in [39].

Unlike the studies [41, 68] that aim to minimize the spatial distance among selected

users, our GSKCG query aims to find a group of users whose associated regions jointly

cover all query points, a brand new spatial constraint with important real-life applications.

Moreover, we use a different model k-core to measure the level of social acquaintance, a

more reasonable measure for practical use.

2.5 Ridesharing query processing

We survey the ridesharing query processing techniques from the following three aspects:

static ridesharing, dynamic ridesharing, and trusted ridesharing problems.

Most of the early studies considered static ridesharing, which refers to ridesharing

where the requests of drivers and riders are known in advance. We classify static

ridesharing into the following categories: slugging, carpooling, and dial-a-ride.

Slugging [5] is one particular form of ridesharing where passengers walk to the origin of the

driver’s trip, board at the departure time, debark at the driver’s destination and then walk

to their own destinations. Ma and Wolfson [44] studied slugging from a computational

perspective using a graph abstraction. Carpooling is another representative application

of ridesharing for daily commutes, where private car drivers declare their availability for

pick-up and later bringing back riders. The main issue in carpooling is about the as-

signment of riders to drivers and the identification of each driver’s route to minimize

the travel cost. Small-size carpooling problems can be solved by using linear programming

techniques [7, 9]. To deal with large-size problems, many heuristic algorithms have been

proposed [1, 62]. More recently, Yan and Chen [66] employed a time-space network

flow technique to develop a model for the many-to-many carpooling system with multiple

vehicle and person types. They developed a solution algorithm based on Lagrangian

relaxation. In the dial-a-ride problem (DARP), no private car is involved and the transportation

is carried out by public vehicles (such as taxis) that provide a shared service.

Users formulate requests by specifying the origin and destination locations. The aim

is to design a minimum-cost set of vehicle routes accommodating all requests under a

number of spatial-temporal constraints. Earlier works on DARP can be found in a survey [21].

DARP is NP-hard in general. Only problems that involve a small number of

vehicles and ride requests can be solved exactly, often by integer

programming techniques [20]. For large-scale DARP, heuristics are still the most popular

methods [4, 21, 65]. These approaches usually have two phases, where the first one is to

obtain an initial schedule and the second one is to improve the solution by local

search.

Enabled by recent mobile technologies, dynamic ridesharing services have been gain-

ing increasing attention (e.g., [24, 32, 72]). In dynamic ridesharing systems, riders and

drivers continuously enter and leave the system; dynamic ridesharing algorithms match

them up in real time on short notice. Existing works can be broadly classified into two

categories: centralized and distributed. Centralized real-time ridesharing relies on a cen-

tral service provider to perform all operations for ridesharing. A recent survey on the

optimization techniques for centralized dynamic ridesharing can be found in [2]. Vari-

ous optimization objectives (e.g., minimizing system-wide vehicle miles or travel time)

and spatial-temporal constraints (with desired departure/arrival time or spatial proximity

requirements) have been considered. The work in [50] proposed an opportunistic user interface

to support centralized rideshare planning whilst preserving location privacy. The work in [32] is the

latest to model a centralized real-time ridesharing problem with service guarantee.

It proposed several novel kinetic tree-based algorithms that are better suited to

dynamic request scheduling and on-the-fly route adjustment. The drawback of

centralized ridesharing is its lack of scalability, especially when the volume of ridesharing

requests is large. To address this issue, distributed ridesharing solutions have been developed

(e.g., [24, 72]). The work in [24] proposed a dynamic taxi-sharing algorithm based on peer-to-peer

communications and distributed coordination. The work in [72] presented a distributed ridesharing

service based on a new geometry matching algorithm to shorten the waiting time for passengers

and to avoid traffic jams. However, all these works considered only the participants'

itineraries and time schedule constraints in rideshare assignment. They cannot be ap-

plied to the social-aware rideshare assignment problem, which imposes complex social

constraints.

A few recent works address the trust issue in ridesharing [1, 16].

Suggested approaches include the adoption of reputation-based systems and

profile checking by linking with social networks like Facebook [1]. Both of these ap-

proaches entail significant involvement from participants. In [16], Cici et al. suggested

grouping participants who are friends or friends of friends in the assessment of the potential

benefits of ridesharing. However, as indicated by our user study result, such simple

social constraints can be either too restrictive or too relaxed to be practical for realistic

ridesharing systems.

Chapter 3

Spatial-aware Interest Group Queries

in Location-based Social Networks

Currently, most of the location-based social networks release check-in services that al-

low users to share their visiting locations with friends. These checked-in spatial objects

usually reflect the users’ interests. Moreover, the interests and locations of users are es-

sential for activity planning and product promotion. Based on such data, in this chapter,

we study the spatial-aware interest group queries in location-based social networks. The

rest of this chapter is organized as follows. Section 3.1 presents the problem definition.

Section 3.2 presents two efficient algorithms based on IR-tree for the processing of SIG

queries. Section 3.3 shows the empirical study of our proposed algorithms on two real

datasets. Section 3.4 summarizes this chapter.

3.1 Problem Definition

In this section, we give some preliminaries and provide the problem statement, followed

by an example to elaborate the problem defined. Table 3.1 summarizes the notations used

throughout this chapter.

Let D be a set of spatial objects. Each spatial object p is associated with a set of tags p.Γ. Let U be a set of users. Each user u ∈ U is a triple (id, λ, ν), where id is the user's identifier, λ is the user's location, and ν is a vector of the user's interests for the tags that are associated with the spatial objects checked in by the user. The interest value for a set of tags is defined in Definition 3.1. We may use interest and interest value interchangeably.

Table 3.1: Summary of notations

Notation                             Definition
$D$                                  a set of spatial objects
$U$                                  a set of users
$I(u, T)$                            the interest of user $u$ on the tag set $T$
$G_k$                                a group of size $k$
$I(G_k, q.T)$                        the group interest on the query keywords $q.T$
$D(G_k)$                             the diameter of a group $G_k$
$G_k(u_i)$                           a group including $u_i$
$\mathcal{G}_k(u_i)$                 a group set in which all the groups include $u_i$
$\mathit{rank}^u_q(G_k(u_i))$        the ranking upper bound of group $G_k(u_i)$
$C(u_i, \|u_i, u_j\|)$               a circle centered at $u_i$ with radius $\|u_i, u_j\|$

Definition 3.1. Let $D_u$ be the set of spatial objects checked in by user $u$ and let $D_t$ be the set of spatial objects that are associated with tag $t$. Function $\mathit{Count}(u, p)$ returns the number of times user $u$ has checked in at spatial object $p$. User $u$'s interest value for tag $t$ is computed as:¹

$$I(u, t) = \frac{\sum_{p \in D_u \wedge p \in D_t} \mathit{Count}(u, p)}{\sum_{p \in D_u} \mathit{Count}(u, p)}. \tag{3.1.1}$$

Given a set of tags $T$, if a user's interest value for every tag $t \in T$ is positive, we say the user fully covers $T$. Specifically, the interest value of a user $u$ for a tag set $T$ is defined as:

$$I(u, T) = \sum_{t \in T} I(u, t). \tag{3.1.2}$$

As an example, Table 3.2 shows an interest vector. Higher values indicate higher

interest. For example, the user is more interested in ‘hotel’ than ‘sport’. The user’s

interest value for tags ‘hotel’ and ‘sport’ is 0.36+0.2=0.56.

Table 3.2: Example Interest Vector

Tag              movie   airport   hotel   music   sport
Interest Value   0.10    0.20      0.36    0.14    0.20
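The interest values of Definition 3.1 can be computed directly from check-in counts. The sketch below is a minimal illustration; the dictionary-based data layout, the function names, and the sample check-ins are all assumptions and do not come from the thesis datasets.

```python
def interest_for_tag(checkins, object_tags, tag):
    """I(u, t): share of the user's check-ins that fall on objects tagged with `tag`.

    checkins maps a spatial object to Count(u, p); object_tags maps a spatial
    object to its tag set (both layouts are assumptions of this sketch).
    """
    total = sum(checkins.values())
    if total == 0:
        return 0.0
    matched = sum(c for p, c in checkins.items() if tag in object_tags.get(p, set()))
    return matched / total

def interest_for_tags(checkins, object_tags, tags):
    """I(u, T): sum of the per-tag interest values (Equation 3.1.2)."""
    return sum(interest_for_tag(checkins, object_tags, t) for t in tags)

def fully_covers(checkins, object_tags, tags):
    """True if the user's interest value is positive for every tag in `tags`."""
    return all(interest_for_tag(checkins, object_tags, t) > 0 for t in tags)

# Hypothetical user with five check-ins at three objects.
checkins = {"grand_hotel": 3, "city_gym": 1, "airport_t1": 1}
object_tags = {"grand_hotel": {"hotel"}, "city_gym": {"sport"}, "airport_t1": {"airport"}}
value = interest_for_tags(checkins, object_tags, {"hotel", "sport"})
```

With these hypothetical counts, three of five check-ins are at a hotel and one of five at a sport venue, so the combined interest for the two tags is 3/5 + 1/5.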

¹ The definition of user interest based on check-in counts is adopted due to its simplicity. We can also use other models, such as user ratings or likes/dislikes, to quantify the user interest; no modification to the query processing algorithm is needed.

We follow existing works [11, 46, 63] to define a ranking function as a weighted sum of the normalized group interest and group diameter.² It ranks a user group $G_k$ of size $k$ with regard to a query $q$, denoted by $\mathit{rank}_q(G_k)$:

$$\mathit{rank}_q(G_k) = \alpha \frac{I(G_k, q.T)}{I_{max}(q.T)} + (1 - \alpha)\left(1 - \frac{D(G_k)}{D_{max}}\right). \tag{3.1.3}$$

Here $q.T$ is a set of query keywords that belong to the tag space of the dataset, and $I(G_k, q.T)$ is the group interest on the query keywords $q.T$, which is defined as the minimum interest of the users in the group if $G_k$ fully covers $q.T$, i.e.,

$$I(G_k, q.T) = \begin{cases} \min\{I(u, q.T) \mid u \in G_k\} & \text{if the users in } G_k \text{ jointly fully cover } q.T \\ 0 & \text{otherwise} \end{cases} \tag{3.1.4}$$

and $D(G_k)$ is the diameter of group $G_k$, i.e., the Euclidean distance between the farthest pair of users in the group,

$$D(G_k) = \max\{\|u_i.\lambda, u_j.\lambda\| \mid u_i, u_j \in G_k\}, \tag{3.1.5}$$

where $\|u_i.\lambda, u_j.\lambda\|$ is the Euclidean distance between two users. Parameter $\alpha \in [0, 1]$ is used to balance the group interest and the group diameter.
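To make the ranking function concrete, the following sketch evaluates Equations 3.1.3 to 3.1.5 for a small group. The user representation (a dictionary with `loc` and `interest` keys) and all function names are assumptions for illustration. The sketch also assumes that "jointly fully cover" means every query keyword has positive interest from at least one group member, and it takes the normalizers $I_{max}(q.T)$ and $D_{max}$ as given inputs.

```python
import math
from itertools import combinations

def user_interest(user, keywords):
    """I(u, q.T): sum of the user's interest values over the query keywords."""
    return sum(user["interest"].get(t, 0.0) for t in keywords)

def group_interest(group, keywords):
    """I(G_k, q.T): minimum member interest if the group jointly covers the keywords, else 0."""
    covered = all(any(u["interest"].get(t, 0.0) > 0 for u in group) for t in keywords)
    if not covered:
        return 0.0
    return min(user_interest(u, keywords) for u in group)

def diameter(group):
    """D(G_k): largest pairwise Euclidean distance between group members."""
    return max((math.dist(a["loc"], b["loc"]) for a, b in combinations(group, 2)),
               default=0.0)

def rank(group, keywords, alpha, i_max, d_max):
    """rank_q(G_k) as in Equation 3.1.3; i_max and d_max are the normalizers."""
    return (alpha * group_interest(group, keywords) / i_max
            + (1 - alpha) * (1 - diameter(group) / d_max))

# Hypothetical two-user group.
u1 = {"loc": (0, 0), "interest": {"coffee": 0.6, "tea": 0.2}}
u2 = {"loc": (3, 4), "interest": {"coffee": 0.4}}
score = rank([u1, u2], {"coffee", "tea"}, alpha=0.5, i_max=1.0, d_max=10.0)
```

In this hypothetical case the group interest is the smaller of the two members' interests (0.4) and the diameter is 5, so the score combines 0.5 × 0.4 with 0.5 × (1 − 5/10).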

Definition 3.2. A Spatial-aware Interest Group (SIG) query q = (T, k) consists of two

parameters: a set of keywords q.T and the size k of the requested user group. It re-

trieves a user group of size k where each user is interested in the query keywords and the

users are close to each other in the Euclidean space, meaning that the ranking function

(Equation 3.1.3) is maximized.

Example 3.1. Figure 3.1 illustrates an example SIG query with three different values of α in the ranking function (Equation 3.1.3). The circles, squares, and triangles in the figure depict the locations of a set of users. Given an SIG query q, the sizes of those shapes indicate the user interests for a set of query keywords q.T: the bigger the size, the higher the user interest. Query q requests a user group of size 4 that maximizes the ranking function. The gray circles are the result group when α = 0, i.e., only the group diameter is considered. The gray squares are the result group when α = 0.5. The gray triangles represent the query result when α = 1, i.e., only the group interest is considered.

Figure 3.1: An example of SIG query (result groups shown for α = 0, α = 0.5, and α = 1)

² Note that the group interest and diameter factors are not directly comparable. In order to treat these two factors fairly, we use the global maximum group diameter $D_{max}$ and maximum group interest $I_{max}(q.T)$ to normalize them so that both are kept in the same value domain [0, 1].

Theorem 3.1. The SIG query problem is NP-complete.

Proof. We establish the hardness by a reduction from a classical NP-complete problem,

namely the minimum set cover problem (MSC). An instance of the MSC problem consists

of a universe set U = {e1, e2, . . . , en}, a collection of sets S = {S1, S2, . . . , Sm}, where

Si is a subset of U and an integer k. The decision problem of MSC is to find whether

there is a subset S ′ ⊆ S, such that |S ′| ≤ k and the union of S ′ fully covers U .

Given an instance of MSC, we construct an instance of SIG query q = (T, k) on a set

of users. Each element ei in U is a keyword ti in q.T , each set Si is a user ui, and the

elements in Si are ui’s interests (keywords). We set the value α of the SIG query to 1. We

remark that Imax(q.T ) is a constant under this setting. Thus, maximizing rankq(Gk) is

equivalent to maximizing I(Gk, q.T ).

Suppose that we have a PTIME algorithm A that returns the query answer Gk =

{u′1, u′2, . . . , u′k} of an SIG query. There are two cases. Case 1: If q.T is fully covered

by the interests of Gk, then {S ′1, S ′2, . . . , S ′k} fully covers U and its size is k. Therefore,

a solution of the MSC is found. Case 2: If Gk does not fully cover q.T , then there does

not exist another group G′k, such that the interests of G′k fully cover q.T . Otherwise, G′k

would be returned as an answer as in Case 1. Therefore, with such a Gk, one can conclude

that the MSC instance does not have a solution. By using A, the MSC problem is solved

in PTIME, which is impossible unless P = NP. Therefore, unless P = NP, there does not

exist a PTIME algorithm A that can solve the SIG query problem.

3.2 Proposed Approaches

In this section, we present two efficient algorithms, namely Interest Oriented Algorithm

(IOAIR) and Diameter Oriented Algorithm (DOAIR), for the processing of SIG queries

based on the IR-tree [18]. Section 3.2.1 introduces the index structure IR-tree. Sec-

tion 3.2.2 presents the basic ideas of the two algorithms. Sections 3.2.3 and 3.2.4 elaborate

the two algorithms.

3.2.1 Preliminary: IR-Tree

We adopt the IR-tree index structure [18], where users are considered as spatial objects,

users’ locations and interest vectors are considered as the locations and documents of

objects, respectively. Figure 3.2(a) and Figure 3.2(b) show the users' locations and their

corresponding IR-tree. The IR-tree is essentially an R-tree augmented with inverted files. The

leaf nodes in the IR-tree contain a number of entries of the form (u, u.λ), where u refers

to a user and u.λ is the MBR (minimum bounding rectangle) of the user’s location. Each

leaf node also includes a pointer to an inverted file that indexes the interest vectors of the

users stored in the node.

Each non-leaf node in the IR-tree includes several entries in the form of (ch,mbr),

where ch is the identifier of a child node and mbr is the MBR of all rectangles in the

child nodes. Each non-leaf node also includes a pointer to an inverted file that indexes the

pseudo interest vectors of the entries stored in the node. The pseudo interest vector of an

entry contains all the tags that appear in its child nodes. The interest value for each tag is

the maximum value in the child nodes.
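The construction of a pseudo interest vector described above amounts to a per-tag maximum over the child entries. The sketch below illustrates this with hypothetical interest values; the function names are assumptions, and the summed bound in `interest_upper_bound` is one simple way such a vector could be used for pruning, not necessarily the exact test used in the thesis.

```python
def pseudo_interest_vector(child_vectors):
    """Combine child interest vectors: every tag that appears in any child is kept
    with the maximum interest value found among the children."""
    pseudo = {}
    for vector in child_vectors:
        for tag, value in vector.items():
            pseudo[tag] = max(pseudo.get(tag, 0.0), value)
    return pseudo

def interest_upper_bound(pseudo_vector, keywords):
    """Upper bound on I(u, q.T) for any user below the entry (sum of per-tag maxima)."""
    return sum(pseudo_vector.get(t, 0.0) for t in keywords)

# Hypothetical leaf node holding two users.
leaf_users = [{"coffee": 0.6, "tea": 0.5}, {"coffee": 0.7}]
node_vector = pseudo_interest_vector(leaf_users)
bound = interest_upper_bound(node_vector, {"coffee", "tea"})
```

Because each per-tag value is a maximum over the subtree, no user under the entry can exceed the bound, which is what makes safe pruning possible.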

We remark that the user locations (e.g., when they refer to home/office addresses) may not be frequently updated. When the user location changes, the IR-tree should be updated accordingly to support efficient query processing. Fortunately, this can be well handled by the embedded updating mechanism of IR-tree, whose efficiency has been demonstrated in [18].

Figure 3.2: Tree Index Structure. (a) R-tree; (b) IR-tree

3.2.2 Overview

To avoid enumerating all possible groups, algorithms IOAIR and DOAIR construct groups

in special orders. If the ranking score of the best group found so far is higher than the upper

bounds on the ranking scores of the unseen groups, that group is returned as

the result. The derivation of the upper bound on the ranking score of a group is shown in

Theorem 3.2.

Theorem 3.2. Let $D_{min}$ be the distance between the closest pair of users in the dataset. An upper bound on the ranking score of a group $G_k(u_i)$ of size $k$ containing user $u_i$ is

$$\mathit{rank}^u_q(G_k(u_i)) = \alpha \frac{I(u_i, q.T)}{I_{max}(q.T)} + (1 - \alpha)\left(1 - \frac{D_{min}}{D_{max}}\right). \tag{3.2.6}$$

Proof. According to Equation 3.1.4, we have $I(G_k(u_i), q.T) \le I(u_i, q.T)$. And since $D_{min} \le D(G_k(u_i))$, we derive

$$\begin{aligned}
\mathit{rank}^u_q(G_k(u_i)) &= \alpha \frac{I(u_i, q.T)}{I_{max}(q.T)} + (1 - \alpha)\left(1 - \frac{D_{min}}{D_{max}}\right) \\
&\ge \alpha \frac{I(G_k(u_i), q.T)}{I_{max}(q.T)} + (1 - \alpha)\left(1 - \frac{D(G_k(u_i))}{D_{max}}\right) \\
&= \mathit{rank}_q(G_k(u_i)).
\end{aligned}$$
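A small numeric check of the bound in Theorem 3.2 is sketched below. The function name and all numbers are hypothetical; the point is only that replacing the group interest by the member's own interest and the group diameter by $D_{min}$ can never lower the score.

```python
def rank_upper_bound(member_interest, alpha, i_max, d_min, d_max):
    """rank^u_q(G_k(u_i)) from Equation 3.2.6 for a user with interest I(u_i, q.T)."""
    return alpha * member_interest / i_max + (1 - alpha) * (1 - d_min / d_max)

# Hypothetical setting: u_i has interest 0.8, the closest pair of users is 1 apart,
# and the farthest pair is 10 apart.
bound = rank_upper_bound(0.8, alpha=0.5, i_max=1.0, d_min=1.0, d_max=10.0)

# A hypothetical group containing u_i with group interest 0.4 and diameter 5.
actual = 0.5 * 0.4 / 1.0 + (1 - 0.5) * (1 - 5.0 / 10.0)
```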

3.2.3 Interest Oriented Algorithm

Interest Oriented Algorithm (IOAIR) classifies groups in terms of users' interests. Let set $\mathcal{G}_k$ contain all possible user groups of size $k$. Let set $\mathcal{G}_k(u_i)$ cover all the groups of size $k$ that contain user $u_i$ and have the same level of interest as $u_i$, i.e., $\forall G \in \mathcal{G}_k(u_i)\,(I(G, q.T) = I(u_i, q.T))$. Obviously, $\bigcup_{u_i \in U} \mathcal{G}_k(u_i) = \mathcal{G}_k$. Algorithm IOAIR follows the descending order of the user interest and iteratively constructs the group $G_k(u_i)$ with the maximum ranking score in $\mathcal{G}_k(u_i)$. If the ranking score of the currently constructed group $G_k(u_i)$ is higher than the upper bound on the ranking score of the next group $G_k(u_{i+1})$ (termination condition), the currently found group is returned as the result. The correctness of the termination condition is guaranteed by Lemma 3.1. The correctness of algorithm IOAIR is guaranteed by Theorem 3.3.

Lemma 3.1. Let $S = \{u_1, u_2, \cdots, u_n\}$ be a sorted list of users in descending order of their interests. If $\mathit{rank}_q(G_k(u_i)) > \mathit{rank}^u_q(G_k(u_j))$, we have $\mathit{rank}_q(G_k(u_i)) > \mathit{rank}_q(G_k(u_m))$, where $k \le i < j \le m \le n$.

Proof. For $j \le m \le n$, we have $I(u_j, q.T) \ge I(u_m, q.T)$. According to Equation 3.2.6, we derive $\mathit{rank}^u_q(G_k(u_j)) \ge \mathit{rank}^u_q(G_k(u_m))$. Hence, based on Theorem 3.2, we get

$$\mathit{rank}_q(G_k(u_i)) > \mathit{rank}^u_q(G_k(u_j)) \ge \mathit{rank}^u_q(G_k(u_m)) \ge \mathit{rank}_q(G_k(u_m)).$$

Theorem 3.3. Algorithm IOAIR finds the correct answer to an SIG query.

Proof. We prove it by contradiction. Assume that given an SIG query $q$, algorithm IOAIR returns $G$ as the result. Now suppose there exists $G'$ with the maximum ranking score such that $\mathit{rank}_q(G) < \mathit{rank}_q(G')$. Let $u_i$ be the user with the minimum interest in $G$ and $u'_i$ be the user with the minimum interest in $G'$. Hence, $G$ and $G'$ are the groups with the maximum ranking score in $\mathcal{G}_k(u_i)$ and $\mathcal{G}_k(u'_i)$, respectively. There are three possible cases. (1) If $I(u_i, q.T) < I(u'_i, q.T)$, algorithm IOAIR first considers $\mathcal{G}_k(u'_i)$, and then $\mathcal{G}_k(u_i)$. According to Lemma 3.1, we have $\mathit{rank}_q(G') \ge \mathit{rank}^u_q(G) \ge \mathit{rank}_q(G)$. Thus, algorithm IOAIR must return the group $G'$ in $\mathcal{G}_k(u'_i)$, not $G$ in $\mathcal{G}_k(u_i)$. (2) If $I(u_i, q.T) = I(u'_i, q.T)$, we have $\mathcal{G}_k(u'_i) = \mathcal{G}_k(u_i)$. Algorithm IOAIR must return the group $G'$ since $\mathit{rank}_q(G) < \mathit{rank}_q(G')$. (3) If $I(u_i, q.T) > I(u'_i, q.T)$, algorithm IOAIR first considers $\mathcal{G}_k(u_i)$, and then $\mathcal{G}_k(u'_i)$. According to Lemma 3.1, we have $\mathit{rank}_q(G) \ge \mathit{rank}^u_q(G') \ge \mathit{rank}_q(G')$, which contradicts the assumption that $\mathit{rank}_q(G) < \mathit{rank}_q(G')$. Hence, the correctness of algorithm IOAIR is proved.

Algorithm 1 shows the pseudo code of the IOAIR algorithm. The candidate group

Gk is initialized as the k-sized user group with the minimum diameter (line 1). It

processes users in descending order of their interests for the query keywords (line 3) by

calling function GetNextUser that adopts the Threshold Algorithm [75]. For the current

obtained user ui, function IOAIRGetNextGroup constructs a group of size k containing

ui with the maximum ranking score (i.e., minimum diameter), denoted as Gk(ui), where

I(Gk(ui), q.T ) = I(ui, q.T ) (line 5). The constructed group Gk(ui) is assigned as the

candidate group if its ranking score is higher than that of the candidate group Gk (lines 6

and 7). If the ranking score of the candidate group is higher than the upper bound on the

ranking score of Gk(ui+1) that is the group of size k containing the next user ui+1 with

the maximum ranking score, the algorithm returns the candidate group as the result and

terminates (lines 8 and 9).
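The loop of IOAIR can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the thesis implementation: users are hypothetical `(x, y, interest)` tuples for a single query keyword, the group interest is taken to be the minimum member interest, the IR-tree and the Threshold Algorithm are replaced by a plain sort and brute-force enumeration, and the upper bound of Equation 3.2.6 (not reproduced in this section) is assumed to be the score of a group that has the next user's interest and zero diameter.

```python
from itertools import combinations
from math import dist


def diameter(group):
    """Largest pairwise Euclidean distance between group members."""
    return max((dist(a[:2], b[:2]) for a, b in combinations(group, 2)), default=0.0)


def rank(group, alpha, i_max, d_max):
    """Ranking score: weighted sum of normalized interest and compactness."""
    interest = min(u[2] for u in group)  # assumed: group interest = minimum member interest
    return alpha * interest / i_max + (1 - alpha) * (1 - diameter(group) / d_max)


def ioair_sketch(users, k, alpha, i_max, d_max):
    """Process users in descending interest order and stop at the upper bound."""
    order = sorted(users, key=lambda u: u[2], reverse=True)
    best, best_score = None, float("-inf")
    for i, ui in enumerate(order):
        # Groups whose last member in the interest order is ui draw their
        # other k-1 members from the users that precede ui.
        for rest in combinations(order[:i], k - 1):
            group = (ui,) + rest
            score = rank(group, alpha, i_max, d_max)
            if score > best_score:
                best, best_score = group, score
        if i + 1 < len(order):
            # Assumed upper bound for every later group: next interest, zero diameter.
            upper = alpha * order[i + 1][2] / i_max + (1 - alpha)
            if best_score > upper:
                break
    return best, best_score
```

Because the bound is valid for every group formed later in the interest order, stopping early returns the same score as exhaustive enumeration; the real algorithm obtains the inner step from IOAIRGetNextGroup rather than by enumeration.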

Algorithm 1 IOAIR(Integer k, Keywords T, InvertedFile invf, IRTree irtree)
 1: Result G_k ← the k-sized user group with the minimum diameter;
 2: D_c ← ∞;
 3: while u_i, u_{i+1} ← GetNextUser(T, invf) do
 4:     Update D_c according to Equation 3.2.7;
 5:     G_k(u_i) ← IOAIRGetNextGroup(irtree, u_i, k, D_c, T);
 6:     if rank_q(G_k(u_i)) > rank_q(G_k) then
 7:         G_k ← G_k(u_i);
 8:     if rank_q(G_k) > rank^u_q(G_k(u_{i+1})) then
 9:         Return G_k;
10: Return G_k;

In order to find the group $G_k(u_i)$ with the maximum ranking score in $\mathcal{G}_k(u_i)$ such that $I(G_k(u_i), q.T) = I(u_i, q.T)$, function IOAIRGetNextGroup (Algorithm 2) uses the IR-tree to retrieve the users who have higher interest values than $u_i$ and puts them in $G'_k$. Each entry in an IR-tree node stores an upper bound on the interests of the users contained in the subtree pointed to by the entry. The function can therefore prune every entry whose interest bound is not larger than the interest of user $u_i$, since no user in that subtree can have a larger interest than $u_i$ (line 17). Since the interest of group $G_k(u_i)$ is determined, constructing $G_k(u_i)$ with the maximum ranking score is equivalent to finding a group of size $k$ with the minimum diameter from $G'_k$ (line 14). We apply a backtracking method to enumerate all possible groups of size $k$; each group also needs to be checked to see whether it fully covers $T$.
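The traversal used by IOAIRGetNextGroup can be illustrated with a best-first search over a toy tree. The sketch below is an assumption-laden stand-in for an IR-tree: nodes are plain dictionaries with hypothetical fields `mbr` (bounding rectangle), `interest` (upper bound on the interests in the subtree) and `children`, while users are dictionaries with `point` and `interest`. It applies the interest test of line 17 and the distance test of line 18 of Algorithm 2, and omits keyword coverage and group construction.

```python
import heapq
from itertools import count
from math import dist, inf


def min_dist(point, mbr):
    """Smallest distance from a point to an axis-aligned rectangle."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - point[0], 0.0, point[0] - xmax)
    dy = max(ymin - point[1], 0.0, point[1] - ymax)
    return (dx * dx + dy * dy) ** 0.5


def users_by_distance(root, ui, dc=inf):
    """Yield (distance, user) for qualifying users in ascending distance to ui."""
    tie = count()  # tie-breaker so the heap never compares dict entries
    heap = [(0.0, next(tie), root)]
    while heap:
        d, _, entry = heapq.heappop(heap)
        if "point" in entry:  # the entry refers to a user
            yield d, entry
            continue
        for child in entry["children"]:
            if child["interest"] <= ui["interest"]:
                continue  # interest bound too low: prune the entry
            if "point" in child:
                d_child = dist(ui["point"], child["point"])
            else:
                d_child = min_dist(ui["point"], child["mbr"])
            if d_child < dc:  # diameter constraint
                heapq.heappush(heap, (d_child, next(tie), child))
```

Since the minimum distance to a rectangle never exceeds the distance to any user inside it, users leave the queue in ascending order of their distance to `ui`, which is the order the early-stop test relies on.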

Early Stop. Let $G'_k$ contain all the users with higher interests than $u_i$. It is possible to prune some users in $G'_k$ so that $G_k(u_i)$ can be found quickly. Function IOAIRGetNextGroup considers the users with higher interests than $u_i$ in ascending order of their distances to $u_i$. Once $k-1$ users have been obtained, a candidate group of size $k$ including $u_i$ is formed. If the diameter of the candidate group is not greater than the distance between $u_i$ and the newly retrieved user (line 7), the candidate group is the one with the maximum ranking score. Otherwise, the candidate group is updated by considering the newly retrieved user. The correctness is guaranteed by Theorem 3.4 (illustrated by Example 3.2).

Algorithm 2 IOAIRGetNextGroup(IRTree irtree, User u_i, Integer k, Double D_c, Keywords T)
 1: Queue ← NewPriorityQueue();
 2: Queue.Enqueue(irtree.root, 0);
 3: Add u_i to G'_k;
 4: while Queue is not empty do
 5:     Entry e ← Queue.Dequeue();
 6:     if e refers to a user then
 7:         if D(G_k) ≤ ||u_i e|| then
 8:             if D(G_k) < D_c then
 9:                 Return G_k;
10:             else
11:                 Return NULL;
12:         Add e to G'_k;
13:         if G'_k contains more than k users then
14:             G_k ← select the group of size k with the minimum diameter from G'_k;
15:     else
16:         for each entry e' in the node pointed to by e do
17:             if the interest of e' > the interest of u_i then
18:                 if ||u_i e'|| < D_c then
19:                     Queue.Enqueue(e', ||u_i e'||);
20: Return G_k;

Theorem 3.4. Let $S = \{u_1, u_2, \cdots, u_m, u_{m+1}, \cdots, u_n\}$ be a list of users with higher interests than $u_i$, sorted in ascending order of their distances to user $u_i$. Let $G_k(u_i)$ be the user group of size $k$ containing $u_i$ with the maximum ranking score calculated from $S' = \{u_1, u_2, \cdots, u_m\}$. If $\|u_i u_{m+1}\| \ge D(G_k(u_i))$, then $G_k(u_i)$ is the user group of size $k$ containing $u_i$ with the maximum ranking score calculated from $S$.

Proof. Suppose we can find a group $G'_k(u_i)$ of size $k$ containing $u_i$ from $S'' = \{u_1, u_2, \cdots, u_m, \cdots, u_{m+j}\}$, where $1 \le j \le n - m$, such that $\mathrm{rank}_q(G'_k(u_i)) > \mathrm{rank}_q(G_k(u_i))$. Then there exists some $u_{m+j} \in G'_k(u_i)$. Since $\forall u \in S''\,(I(u_i, q.T) \le I(u, q.T))$, we have $I(G'_k(u_i), q.T) = I(G_k(u_i), q.T)$ and derive $D(G'_k(u_i)) < D(G_k(u_i))$. Since $u_{m+j} \in G'_k(u_i)$, we have $D(G'_k(u_i)) \ge \|u_i u_{m+j}\|$. Because $S$ is sorted by distance to $u_i$, $\|u_i u_{m+j}\| \ge \|u_i u_{m+1}\| \ge D(G_k(u_i))$, so $D(G'_k(u_i)) \ge \|u_i u_{m+j}\| \ge D(G_k(u_i))$, which contradicts $D(G'_k(u_i)) < D(G_k(u_i))$ derived before. This completes the proof.

Example 3.2. We illustrate Theorem 3.4 in Figure 3.3. Let $S = \{u_1, u_2, u_3, u_4, u_5\}$ be a list of users with higher interests than $u_i$, sorted in ascending order of their distances to user $u_i$. Let $G_4(u_i) = \{u_i, u_1, u_2, u_3\}$ be the user group of size 4 containing $u_i$ with the maximum ranking score calculated from $S' = \{u_1, u_2, u_3\}$. The diameter is $D(G_4(u_i)) = \|u_1 u_2\|$. Next, we consider $u_4$ and have $\|u_i u_4\| < D(G_4(u_i))$. Hence, we obtain a new group $G_4(u_i) = \{u_i, u_2, u_3, u_4\}$ from $S'' = \{u_1, u_2, u_3, u_4\}$ with the maximum ranking score and diameter $D(G_4(u_i)) = \|u_i u_4\|$. We then consider $u_5$ and have $D(G_4(u_i)) < \|u_i u_5\|$. Theorem 3.4 guarantees that $G_4(u_i) = \{u_i, u_2, u_3, u_4\}$ is the user group of size 4 containing $u_i$ with the maximum ranking score calculated from $S$.

Figure 3.3: Example of Theorem 3.4
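The early-stop rule of Theorem 3.4 can be tried out with a small sketch. It assumes that all candidates already have a higher interest than $u_i$, represents users as hypothetical `(x, y)` points, and replaces the backtracking search with plain enumeration; the function name and return shape are illustrative only.

```python
from itertools import combinations
from math import dist, inf


def diameter(group):
    """Largest pairwise distance between the points of a group."""
    return max((dist(a, b) for a, b in combinations(group, 2)), default=0.0)


def early_stop_group(ui, candidates, k):
    """Return (group, diameter, number of candidates scanned); assumes k >= 2."""
    ordered = sorted(candidates, key=lambda u: dist(ui, u))
    seen, best, best_d = [], None, inf
    for u in ordered:
        # Theorem 3.4: once the next user is at least as far away as the
        # current best diameter, no later user can improve the group.
        if best is not None and dist(ui, u) >= best_d:
            break
        # Only groups that contain the newly scanned user can be better.
        for rest in combinations(seen, k - 2):
            group = (ui, u) + rest
            d = diameter(group)
            if d < best_d:
                best, best_d = group, d
        seen.append(u)
    return best, best_d, len(seen)
```

With `ui = (0, 0)`, candidates `(1, 0), (0, 1), (2, 2), (10, 10), (20, 0)` and `k = 3`, the scan stops after two candidates, because the third one is farther from `ui` than the diameter of the group already found.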

Diameter Constraint. When retrieving user group $G_k(u_i)$, it is not necessary to consider all the users whose interests are higher than that of $u_i$. We propose a diameter constraint $D_c$ that can be used to prune the search space (Theorem 3.5) so that fewer users are considered when selecting a group $G_k(u_i)$ with the maximum ranking score.

Lemma 3.2. $\mathrm{rank}_q(G_k(u_i)) > \mathrm{rank}_q(G_k(u_j)) \iff D(G_k(u_i)) < D_c$, where

$$D_c = D_{max}\left(1 - \frac{\mathrm{rank}_q(G_k(u_j)) - \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}}{1 - \alpha}\right). \qquad (3.2.7)$$

The proof can be easily derived based on Equation 3.1.3 and is thus omitted.
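For completeness, the omitted step can be written out as follows. The derivation is a sketch that assumes Equation 3.1.3 is the ranking function in the form used in the proof of Theorem 3.5, that $I(G_k(u_i), q.T) = I(u_i, q.T)$, and that $0 \le \alpha < 1$; the shorthand $r = \mathrm{rank}_q(G_k(u_j))$ is introduced here only for brevity.

```latex
\begin{align*}
\mathrm{rank}_q(G_k(u_i)) > r
&\iff \alpha\frac{I(u_i,q.T)}{I_{max}(q.T)}
      + (1-\alpha)\Bigl(1-\frac{D(G_k(u_i))}{D_{max}}\Bigr) > r \\
&\iff 1-\frac{D(G_k(u_i))}{D_{max}}
      > \frac{r-\alpha\frac{I(u_i,q.T)}{I_{max}(q.T)}}{1-\alpha} \\
&\iff D(G_k(u_i))
      < D_{max}\Bigl(1-\frac{r-\alpha\frac{I(u_i,q.T)}{I_{max}(q.T)}}{1-\alpha}\Bigr) = D_c .
\end{align*}
```

Dividing by $1-\alpha$ preserves the direction of the inequality only because $1-\alpha > 0$; for $\alpha = 1$ the diameter does not affect the score and the constraint is undefined.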

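Theorem 3.5. If $\mathrm{rank}_q(G_k(u_i)) > \mathrm{rank}_q(G_k(u_j))$ and $\|u_i u_m\| \ge D_c$, we have $u_m \notin G_k(u_i)$.

Proof. We prove it by contradiction. Suppose $u_m \in G_k(u_i)$. Since $\|u_i u_m\| \le D(G_k(u_i))$, we have

\begin{align*}
\mathrm{rank}_q(G_k(u_i)) &= \alpha\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D(G_k(u_i))}{D_{max}}\right) \\
&\le \alpha\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{\|u_i u_m\|}{D_{max}}\right) \\
&\le \alpha\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D_c}{D_{max}}\right) \\
&= \mathrm{rank}_q(G_k(u_j)).
\end{align*}

This contradicts the condition $\mathrm{rank}_q(G_k(u_i)) > \mathrm{rank}_q(G_k(u_j))$.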

Lemma 3.2 indicates that a group with a smaller interest can have a higher ranking score only if its diameter is small enough. Based on Lemma 3.2, Theorem 3.5 guarantees that the users whose distances to $u_i$ are no smaller than $D_c$ can be pruned, since it is impossible to construct a group that contains those users and has a higher ranking score than the candidate group (illustrated by Example 3.3). Hence, the search space is pruned (line 18 in Algorithm 2). Only a group with a higher ranking score than the candidate group is returned (line 8 in Algorithm 2). The value of $D_c$ is updated when a user group with a higher ranking score is found (line 7 in Algorithm 1).

To facilitate the example description, we set q.T = ‘coffee’; the value of each user’s interest in q.T is shown in Figure 3.2(b).

Example 3.3. Figure 3.2(a) shows the location layout of the users appearing in the IR-tree of Figure 3.2(b). Let $D_{max} = 100$, $\alpha = 0.5$, $I_{max} = 1.0$, $I(u_1, q.T) = 0.4$, and the current maximum ranking score be 0.6. Suppose the group currently being processed is $G_k(u_1)$. Based on Lemma 3.2, we obtain $D_c = 20$. Thus, the diameter of $G_k(u_1)$ should be less than 20. Figure 3.4 shows the distances between $u_1$ and the IR-tree nodes or its neighbor users. With the diameter constraint $D_c$, we do not need to consider tree nodes $\{R_3\}$ and users $\{u_6, u_7, u_{10}, u_{13}\}$ during the query processing.
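The value $D_c = 20$ in Example 3.3 can be checked numerically. The helper names below are illustrative, and the ranking function is assumed to have the form used in the proof of Theorem 3.5.

```python
def diameter_constraint(rank_bound, interest, alpha, i_max, d_max):
    """D_c of Equation 3.2.7; requires alpha < 1."""
    return d_max * (1 - (rank_bound - alpha * interest / i_max) / (1 - alpha))


def rank_score(interest, diam, alpha, i_max, d_max):
    """Ranking score of a group with the given interest and diameter."""
    return alpha * interest / i_max + (1 - alpha) * (1 - diam / d_max)


# Values of Example 3.3: current best score 0.6, I(u1, q.T) = 0.4.
dc = diameter_constraint(rank_bound=0.6, interest=0.4, alpha=0.5, i_max=1.0, d_max=100.0)
print(round(dc, 6))  # 20.0
```

A group around $u_1$ with diameter just below 20 scores slightly above 0.6 and one with diameter just above 20 scores slightly below, which is the equivalence stated in Lemma 3.2.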

Figure 3.4: Distance between u1 and its neighbors

3.2.4 Diameter Oriented Algorithm

The Diameter Oriented Algorithm (DOAIR) classifies groups in terms of group diameters. Let set $\mathcal{G}_k$ contain all possible user groups of size $k$. Let set $\mathcal{G}_k(u_i, \cdot)$ cover all the groups of size $k$ that take user $u_i$ as one end of the group diameter. Obviously, $\bigcup_{u_i \in U} \mathcal{G}_k(u_i, \cdot) = \mathcal{G}_k$. Note that $\mathcal{G}_k(u_i, \cdot)$ may be an empty set. Algorithm DOAIR follows the descending order of user interest and constructs the group $G_k(u_i, \cdot)$ with the maximum ranking score in $\mathcal{G}_k(u_i, \cdot)$. If the ranking score of the currently found group $G_k(u_i, \cdot)$ is higher than the upper bound on the ranking score of the next group $G_k(u_{i+1}, \cdot)$ (termination condition), the currently found group is returned as the result. The correctness of the termination condition is guaranteed by Lemma 3.3. The correctness of algorithm DOAIR is guaranteed by Theorem 3.6.

Lemma 3.3. Let $S = \{u_1, u_2, \cdots, u_n\}$ be a list of users sorted in descending order of their interests. If $\mathrm{rank}_q(G_k(u_i, \cdot)) > \mathrm{rank}^u_q(G_k(u_j, \cdot))$, where $\mathrm{rank}^u_q(G_k(u_j, \cdot)) = \mathrm{rank}^u_q(G_k(u_j))$ (Equation 3.2.6), we have $\mathrm{rank}_q(G_k(u_i, \cdot)) > \mathrm{rank}_q(G_k(u_m, \cdot))$, where $k \le i < j \le m \le n$.

Proof. For $j \le m \le n$, we have $I(u_j, q.T) \ge I(u_m, q.T)$. According to Equation 3.2.6, we derive $\mathrm{rank}^u_q(G_k(u_j, \cdot)) \ge \mathrm{rank}^u_q(G_k(u_m, \cdot))$. Hence, we get $\mathrm{rank}_q(G_k(u_i, \cdot)) > \mathrm{rank}^u_q(G_k(u_j, \cdot)) \ge \mathrm{rank}^u_q(G_k(u_m, \cdot)) \ge \mathrm{rank}_q(G_k(u_m, \cdot))$ based on Theorem 3.2.

Theorem 3.6. Algorithm DOAIR finds the correct answer to an SIG query.

Proof. We prove it by contradiction. Assume that, given an SIG query $q$, algorithm DOAIR returns $G$ as the result. Now suppose there exists $G'$ with the maximum ranking score such that $\mathrm{rank}_q(G) < \mathrm{rank}_q(G')$. Suppose $G$ and $G'$ are the groups with the maximum ranking score in $\mathcal{G}_k(u_i, \cdot)$ and $\mathcal{G}_k(u'_i, \cdot)$, respectively. There are three possible cases. (1) If $I(u_i, q.T) < I(u'_i, q.T)$, algorithm DOAIR first considers $\mathcal{G}_k(u'_i, \cdot)$, and then $\mathcal{G}_k(u_i, \cdot)$. According to Lemma 3.3, we have $\mathrm{rank}_q(G') \ge \mathrm{rank}^u_q(G) \ge \mathrm{rank}_q(G)$. Thus, algorithm DOAIR must return the group $G'$ in $\mathcal{G}_k(u'_i, \cdot)$, not $G$ in $\mathcal{G}_k(u_i, \cdot)$. (2) If $I(u_i, q.T) = I(u'_i, q.T)$, we have $\mathcal{G}_k(u'_i, \cdot) = \mathcal{G}_k(u_i, \cdot)$. Algorithm DOAIR must return the group $G'$ since $\mathrm{rank}_q(G) < \mathrm{rank}_q(G')$. (3) If $I(u_i, q.T) > I(u'_i, q.T)$, algorithm DOAIR first considers $\mathcal{G}_k(u_i, \cdot)$, and then $\mathcal{G}_k(u'_i, \cdot)$. According to Lemma 3.3, we have $\mathrm{rank}_q(G) \ge \mathrm{rank}^u_q(G') \ge \mathrm{rank}_q(G')$, which contradicts the assumption that $\mathrm{rank}_q(G) < \mathrm{rank}_q(G')$. Hence, the correctness of algorithm DOAIR is proved.

Algorithm 3 shows the pseudocode of the DOAIR algorithm. The candidate group $G_k$ is initialized as the k-sized user group with the minimum diameter (line 1). The algorithm processes users in descending order of their interests in the query keywords (line 3) by calling function GetNextUser, which adopts the Threshold Algorithm [75]. For the currently obtained user $u_i$, function DOAIRGetNextGroup constructs a group of size $k$ with the maximum ranking score, taking user $u_i$ as one end of the group diameter, denoted as $G_k(u_i, \cdot)$ (line 7). This group becomes the candidate group if its ranking score is higher than that of the current candidate group $G_k$ (lines 8 and 9). If the ranking score of the candidate group is higher than the upper bound of $G_k(u_{i+1}, \cdot)$, the algorithm returns the candidate group as the result and terminates (lines 10 and 11). Algorithm DOAIR is able to skip the construction of group $G_k(u_i, \cdot)$ if the distance $D_{u_i}$ between $u_i$ and its nearest neighbor is larger than $D_c$ and the candidate group interest is also higher than the interest of $u_i$ (lines 5 and 6), meaning that it is impossible to find a group taking $u_i$ as one end of the group diameter with a higher ranking score than the candidate group. Theorem 3.7 guarantees the correctness of this pruning step.

Algorithm 3 DOAIR(Integer k, Keywords T, InvertedFile invf, IRTree irtree)
 1: Result G_k ← the k-sized user group with the minimum diameter;
 2: D_c ← ∞;
 3: while u_i, u_{i+1} ← GetNextUser(T, invf) do
 4:     Update D_c according to Equation 3.2.7;
 5:     if D_{u_i} > D_c then
 6:         Continue;
 7:     G_k(u_i, ·) ← DOAIRGetNextGroup(irtree, u_i, k, D_c, T);
 8:     if rank_q(G_k(u_i, ·)) > rank_q(G_k) then
 9:         G_k ← G_k(u_i, ·);
10:     if rank_q(G_k) > rank^u_q(G_k(u_{i+1}, ·)) then
11:         Return G_k;
12: Return G_k;

In order to find the group $G_k(u_i, \cdot)$ with the maximum ranking score in $\mathcal{G}_k(u_i, \cdot)$, function DOAIRGetNextGroup (Algorithm 4) uses the IR-tree to retrieve the users in ascending order of their distances to $u_i$ (line 19). For an encountered user $e$, it tries to construct a group $G_k$ of size $k$ with diameter $u_i e$ (lines 9 and 10). To avoid enumerating all possible diameters $u_i e$ and to find group $G_k(u_i, \cdot)$ efficiently, an early stop condition (line 12) and two interest constraints (lines 14 and 18) are designed. The diameter constraint (Theorem 3.5) is also applied here (line 18).

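Theorem 3.7. Let $G_k$ be the candidate group and $D_{u_i}$ be the distance between $u_i$ and its nearest neighbor. If $D_{u_i} > D_c$, we have $\mathrm{rank}_q(G_k) > \mathrm{rank}_q(G_k(u_i, \cdot))$, where

$$D_c = D_{max}\left(1 - \frac{\mathrm{rank}_q(G_k) - \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}}{1 - \alpha}\right).$$

Proof. We derive $\mathrm{rank}_q(G_k) > \mathrm{rank}_q(G_k(u_i, \cdot))$ as follows:

\begin{align*}
\mathrm{rank}_q(G_k(u_i, \cdot)) &= \alpha\frac{I(G_k(u_i, \cdot), q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D(G_k(u_i, \cdot))}{D_{max}}\right) \\
&\le \alpha\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D_{u_i}}{D_{max}}\right) \\
&< \alpha\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D_c}{D_{max}}\right) \\
&= \mathrm{rank}_q(G_k).
\end{align*}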

Early Stop. Function DOAIRGetNextGroup considers the users in ascending order of their distances to $u_i$. Let $G_k$ be the currently found group with diameter $u_i e$. If the interest of $G_k$ equals the interest of $u_i$, group $G_k$ is the group with the maximum score in $\mathcal{G}_k(u_i, \cdot)$. The remaining users, which are farther from $u_i$ than $e$, do not need to be considered. The correctness is guaranteed by Theorem 3.8.

Algorithm 4 DOAIRGetNextGroup(IRTree irtree, User u_i, Integer k, Double D_c, Keywords T)
 1: Queue ← NewPriorityQueue();
 2: Queue.Enqueue(irtree.root, 0);
 3: I_c ← 0;
 4: Add u_i to G'_k;
 5: while Queue is not empty do
 6:     Entry e ← Queue.Dequeue();
 7:     if e refers to a user then
 8:         Add e to G'_k;
 9:         if G'_k contains more than k users then
10:             G_k ← GetCurrentResult(G'_k, u_i, e, T);
11:             if G_k is not empty then
12:                 if the interest of G_k = the interest of u_i then
13:                     Return G_k;
14:                 Update Queue and G'_k, delete the users whose interest ≤ the interest of G_k (Theorem 3.10).
15:     else
16:         for each entry e' in the node pointed to by e do
17:             Update I_c according to Theorem 3.9;
18:             if the interest of e' > I_c ∧ ||e' u_i|| < D_c then
19:                 Queue.Enqueue(e', ||u_i e'||);
20: Return G_k;

Theorem 3.8. Let $S = \{u_1, u_2, \cdots, u_m, u_{m+1}, \cdots, u_n\}$ be a list of users sorted in ascending order of their distances to user $u_i$. Let $G_k(u_i, u_m)$ be the user group with diameter $u_i u_m$ and $\mathrm{rank}_q(G_k(u_i, u_m)) > \mathrm{rank}_q(G_k(u_i, u_j))$ where $1 \le j < m$. If $I(G_k(u_i, u_m), q.T) = I(u_i, q.T)$, then $\mathrm{rank}_q(G_k(u_i, u_m)) > \mathrm{rank}_q(G_k(u_i, u_j))$ where $1 \le j \le n$, $j \ne m$.

Proof. Suppose we can find a group $G_k(u_i, u_j)$ with diameter $u_i u_j$, where $m < j \le n$, such that $\mathrm{rank}_q(G_k(u_i, u_m)) < \mathrm{rank}_q(G_k(u_i, u_j))$. Since $D(G_k(u_i, u_m)) = \|u_i u_m\| < \|u_i u_j\| = D(G_k(u_i, u_j))$ and $I(G_k(u_i, u_m), q.T) = I(u_i, q.T) \ge I(G_k(u_i, u_j), q.T)$, we can derive $\mathrm{rank}_q(G_k(u_i, u_m)) > \mathrm{rank}_q(G_k(u_i, u_j))$, which contradicts the assumption. This completes the proof.

Interest Constraint $I_c$. In function IOAIRGetNextGroup, a distance constraint $D_c$ is used to prune the search space when retrieving user group $G_k(u_i)$. Besides $D_c$, function DOAIRGetNextGroup uses an interest constraint $I_c$ to further prune the search space, so that selecting a group $G_k(u_i, \cdot)$ with the maximum ranking score is more efficient. Specifically, if the interest of a user $e$ is lower than $I_c$, the ranking score of the group with diameter $u_i e$ is lower than that of the currently found candidate group.

Lemma 3.4. Suppose $u_m$ and $u_n$ are the $m$th and $n$th nearest neighbors of $u_i$, where $m < n$. We have $\mathrm{rank}_q(G_k(u_i, u_m)) < \mathrm{rank}_q(G_k(u_i, u_n)) \iff I(G_k(u_i, u_n), q.T) > I_c$, where

$$I_c = \frac{I_{max}(q.T)}{\alpha}\left(\mathrm{rank}_q(G_k(u_i, u_m)) - (1-\alpha)\left(1 - \frac{\|u_i u_n\|}{D_{max}}\right)\right). \qquad (3.2.8)$$

The proof is trivial and thus omitted (it is easily derived based on Equation 3.1.3).

Theorem 3.9. Let $u_m$ and $u_n$ be the $m$th and $n$th nearest neighbors of $u_i$, where $m < n$. Let $G_k(u_i, u_m)$ be the currently found group with the maximum ranking score. If $I(u_n, q.T) < I_c$, we have $\mathrm{rank}_q(G_k(u_i, u_m)) > \mathrm{rank}_q(G_k(u_i, u_n))$.

Proof. We prove it by contradiction. Assume $\mathrm{rank}_q(G_k(u_i, u_m)) \le \mathrm{rank}_q(G_k(u_i, u_n))$. Then we have $I_c > I(u_n, q.T) \ge I(G_k(u_i, u_n), q.T)$, which contradicts Lemma 3.4.
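The interest constraint of Equation 3.2.8 is a one-line computation. The sketch below uses illustrative function names and made-up numbers (they are not taken from the thesis), and assumes $\alpha > 0$.

```python
def interest_constraint(rank_current, dist_ui_un, alpha, i_max, d_max):
    """I_c of Equation 3.2.8; requires alpha > 0."""
    return (i_max / alpha) * (rank_current - (1 - alpha) * (1 - dist_ui_un / d_max))


def can_prune(user_interest, ic):
    """Theorem 3.9: a diameter endpoint with interest below I_c cannot win."""
    return user_interest < ic


# Hypothetical values: current best score 0.6, candidate diameter 20.
ic = interest_constraint(rank_current=0.6, dist_ui_un=20.0, alpha=0.5, i_max=1.0, d_max=100.0)
```

Here a group of diameter 20 needs a group interest above about 0.4 to beat a score of 0.6, so a candidate endpoint with interest 0.3 can be skipped without constructing its group.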

Interest Constraint $I_G$. Let $G_k(u_i, u_m)$ be the currently found candidate group and $I_G = I(G_k(u_i, u_m), q.T)$. When constructing group $G_k(u_i, u_n)$ where $\|u_i u_m\| < \|u_i u_n\|$, the users whose interest is not higher than $I_G$ can be pruned. In other words, if group $G_k(u_i, u_n)$ is successfully constructed such that $\mathrm{rank}_q(G_k(u_i, u_m)) < \mathrm{rank}_q(G_k(u_i, u_n))$, the users whose interest is not higher than $I_G$ cannot belong to group $G_k(u_i, u_n)$. This constraint is used to prune the search space when constructing a group with a specific diameter. The correctness is guaranteed by Theorem 3.10.

Theorem 3.10. Let $u_m$ and $u_n$ be the $m$th and $n$th nearest neighbors of $u_i$, where $m < n$. If $\mathrm{rank}_q(G_k(u_i, u_m)) < \mathrm{rank}_q(G_k(u_i, u_n))$, then $\forall u_j\,(I(u_j, q.T) \le I_G \implies u_j \notin G_k(u_i, u_n))$.

Proof. Assume $\exists u_j\,(I(u_j, q.T) \le I_G \wedge u_j \in G_k(u_i, u_n))$. Since $D(G_k(u_i, u_m)) = \|u_i u_m\| < \|u_i u_n\| = D(G_k(u_i, u_n))$ and $I(G_k(u_i, u_n), q.T) \le I(u_j, q.T) \le I_G = I(G_k(u_i, u_m), q.T)$, we have $\mathrm{rank}_q(G_k(u_i, u_m)) \ge \mathrm{rank}_q(G_k(u_i, u_n))$, which results in a contradiction.

Given two users $u_i$ and $e$, function GetCurrentResult (Algorithm 5) is invoked by function DOAIRGetNextGroup (Algorithm 4) to construct the group $G_k(u_i, e)$ with the maximum ranking score. Based on Lemma 3.5, function GetCurrentResult first constructs $C(u_i e)$ to minimize the search space of $G_k(u_i, e)$ (line 1, illustrated by Example 3.4). If the size of $C(u_i e)$ is less than $k-2$ or $C(u_i e)$ cannot fully cover $T$, it returns NULL (lines 2–3), because it is impossible to form a group $G_k(u_i, e)$ with fewer than $k-2$ users besides $u_i$ and $e$. Otherwise, we use the line segment $u_i e$ to partition the search space $C(u_i e)$ into two sets $G_L$ and $G_R$ (lines 4–5). The users from $G_L$ and $G_R$ whose interests are no less than $I_{min}\{u_i, e\}$ (the minimum interest of $u_i$ and $e$) are put into $G_{LU}$ and $G_{RU}$, respectively (lines 6–7). If there are no fewer than $k-2$ users in $G_{LU}$ or $G_{RU}$, we randomly select $k-2$ users from whichever of the two sets is large enough such that their union with $\{u_i, e\}$ fully covers $T$, and return them together with $\{u_i, e\}$ as the result (lines 8–10, illustrated by Example 3.5). The correctness is guaranteed by Lemma 3.6. We then split the users in $C(u_i e)$ into two sets $G_{up}$ and $G_{down}$ according to $I_{min}\{u_i, e\}$ (lines 11–12): $G_{up}$ contains the users whose interest is no less than $I_{min}\{u_i, e\}$, and $G_{down}$ contains the users whose interest is less than $I_{min}\{u_i, e\}$. Afterwards, we iteratively construct $G_k(u_i, e)$ by adding the users in $G_{down}$ in decreasing order of interest (lines 13–20). In other words, we prefer to search for the group with high interest first, because a higher interest means a higher ranking score for group $G_k(u_i, e)$ when the diameter $u_i e$ is fixed. In each iteration, if there exist $k-2$ users in $G_{up}$ such that the distance between any pair of them is no more than $\|u_i e\|$ and their union with $\{u_i, e\}$ fully covers $T$ (found by enumerating all possible groups of size $k-2$), $G_k \cup \{u_i, e\}$ is returned as the result (lines 15–17). Otherwise, the user $p$ with the highest interest in $G_{down}$ is moved into $G_{up}$ (lines 18–19). This process is repeated until all users in $G_{down}$ have been checked.

Lemma 3.5. Let $C(u_i, u_i u_j)$ and $C(u_j, u_i u_j)$ be two circles centered at $u_i$ and $u_j$ with the same radius $\|u_i u_j\|$. The intersection of $C(u_i, u_i u_j)$ and $C(u_j, u_i u_j)$ is denoted by $C(u_i u_j)$. We have $\forall u \in G_k(u_i, u_j)\,(u \in C(u_i u_j))$.

The proof is obvious and thus omitted.

Example 3.4. Figure 3.5 shows the two circles $C(u_1, u_1 u_{11})$ and $C(u_{11}, u_1 u_{11})$ centered at $u_1$ and $u_{11}$ with radius $\|u_1 u_{11}\|$. The intersection $C(u_1 u_{11})$ covers 3 users (i.e., $u_3$, $u_4$ and $u_9$). Group $G_k(u_1, u_{11})$ can be constructed from the users inside $C(u_1 u_{11})$. In other words, the search space of $G_k(u_1, u_{11})$ is $C(u_1 u_{11})$.

Algorithm 5 GetCurrentResult(Group G'_k, User u_i, User e, Keywords T)
 1: C(u_i e) ← Users from {u | u ∈ G'_k ∧ ||u e|| ≤ ||u_i e||};
 2: if |C(u_i e)| < k − 2 or C(u_i e) cannot fully cover T then
 3:     Return NULL;
 4: G_L ← Users from C(u_i e) that are above the line segment u_i e;
 5: G_R ← Users from C(u_i e) that are below the line segment u_i e;
 6: G_LU ← Users from G_L whose interests are no less than I_min{u_i, e};
 7: G_RU ← Users from G_R whose interests are no less than I_min{u_i, e};
 8: if |G_LU| ≥ k − 2 or |G_RU| ≥ k − 2 then
 9:     G_k ← Select any k − 2 users from G_LU or G_RU such that their union with {u_i, e} fully covers T;
10:     Return G_k ∪ {u_i, e};
11: G_up ← G_LU ∪ G_RU;
12: G_down ← C(u_i e) − G_up;
13: Queue ← Sort the users in G_down according to their interest in descending order;
14: repeat
15:     G_k ← k − 2 users from G_up such that the distance between any pair of users is no more than ||u_i e|| and their union with {u_i, e} fully covers T;
16:     if G_k is not empty then
17:         Return G_k ∪ {u_i, e};
18:     User p ← Queue.Dequeue();
19:     G_up ← G_up ∪ {p};
20: until Queue is empty
21: Return NULL;

Lemma 3.6. If the number of users on one side $s$ of diameter $u_i u_j$ inside $C(u_i u_j)$ is no less than $k-2$ and their interest is no smaller than the minimum interest of $u_i$ and $u_j$, group $G_k(u_i, u_j)$ can be constructed by randomly selecting $k-2$ users from $s$ and adding $u_i$ and $u_j$.

Proof. Since the $k-2$ users are selected from $s$, the distance between any pair of users in $G_k(u_i, u_j)$ is no larger than the diameter $\|u_i u_j\|$. Hence, $D(G_k(u_i, u_j)) = \|u_i u_j\|$ and $I(G_k(u_i, u_j), q.T) = \min\{I(u_i, q.T), I(u_j, q.T)\}$. The ranking score of $G_k(u_i, u_j)$ is maximized.
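Lemmas 3.5 and 3.6 translate into two small geometric helpers. The sketch below treats users as hypothetical `(x, y)` points, ignores interests and keyword coverage, and uses the sign of a cross product to decide on which side of the segment a user lies; the function names are illustrative and points exactly on the line are arbitrarily assigned to the left side.

```python
from itertools import combinations
from math import dist


def lens(ui, e, users):
    """Users inside the intersection of the two circles of radius ||ui e|| (Lemma 3.5)."""
    r = dist(ui, e)
    return [u for u in users if u not in (ui, e) and dist(u, ui) <= r and dist(u, e) <= r]


def split_by_side(ui, e, users):
    """Partition users by the side of the directed segment ui -> e they fall on."""
    left, right = [], []
    for u in users:
        cross = (e[0] - ui[0]) * (u[1] - ui[1]) - (e[1] - ui[1]) * (u[0] - ui[0])
        (left if cross >= 0 else right).append(u)
    return left, right


def diameter(group):
    """Largest pairwise distance between the points of a group."""
    return max(dist(a, b) for a, b in combinations(group, 2))
```

For `ui = (0, 0)` and `e = (10, 0)`, the users `(5, 3)` and `(3, 2)` fall inside the lens on the same side, and the group formed with the two endpoints has diameter exactly 10, so no pairwise distance check is needed, as Lemma 3.6 states. Algorithm 5 only tests the circle around `e` on line 1, presumably because the users in $G'_k$ were retrieved no farther from `ui` than `e`.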

Example 3.5. Consider the 13 users of the IR-tree in Figure 3.5. Based on Lemma 3.5, the search space of $G_4(u_1, u_{11})$ is $C(u_1 u_{11})$, which contains 2 users $\{u_3, u_4\}$ whose interest is no less than 0.4 (the minimum interest of $u_1$ and $u_{11}$). Based on Lemma 3.6, since $u_3$ and $u_4$ are on the same side of the diameter $u_1 u_{11}$ and $2 \ge 4 - 2$, $\{u_3, u_4\} \cup \{u_1, u_{11}\}$ is returned as $G_4(u_1, u_{11})$ with the maximum ranking score.

Figure 3.5: Constructing G4(u1, u11)

3.3 Performance Evaluation

This section describes the experiments used to evaluate the algorithms proposed for the

processing of SIG queries (i.e., IOAIR and DOAIR). We also consider a baseline algo-

rithm that is similar to algorithm IOAIR, but using the traditional R-tree index without

diameter constraint. We introduce the datasets and queries used in Section 3.3.1 and the

experiment setup in Section 3.3.2. The experimental results are presented in Section 3.3.3.

3.3.1 Datasets and Queries

We collect data from two popular location-based social networks in China, i.e., Jiepang³ and Dianping⁴. Jiepang provides a check-in service with which visitors can check in at the tourist places they like. Dianping provides a check-in service with which users can share review comments on POIs, such as the restaurants they prefer. The properties of these two real datasets are shown in Table 3.3.

Table 3.3: Dataset Properties

                                        Jiepang      Dianping
Total # of users                        353,493      2,053,214
Total # of spatial objects              244,331      1,466,188
Total # of check-in actions             5,250,466    17,527,599
Total # of unique tags                  2,101        153,211
Average # of tags per spatial object    2            23
Average # of tags per user interest     1.3          37

We randomly generate two query sets on the two datasets. The query set on Jiepang contains 200 queries, while the query set on Dianping contains 500 queries. Each query contains several keywords and the specified group size. The keywords are randomly generated from the tag set of the dataset. The number of keywords varies from 1 to 5. The group size k takes values in {20, 40, 60, 80, 100}.

³ http://www.jiepang.com
⁴ http://www.dianping.com
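A query workload of this shape could be generated along the following lines. This is only a sketch: the function name, the dictionary layout and the fixed seed are assumptions, and drawing the keyword count and the group size independently for every query is a guess at the procedure, which the text does not spell out.

```python
import random


def make_queries(tags, n_queries, seed=0):
    """Generate random queries: 1 to 5 distinct keywords and a group size k."""
    rng = random.Random(seed)  # fixed seed so the workload is reproducible
    sizes = [20, 40, 60, 80, 100]
    queries = []
    for _ in range(n_queries):
        n_keywords = rng.randint(1, 5)
        queries.append({
            "keywords": rng.sample(tags, n_keywords),  # distinct tags from the tag set
            "k": rng.choice(sizes),
        })
    return queries
```

For instance, `make_queries(tag_list, 200)` would mirror the size of the Jiepang query set, where `tag_list` is a list holding the dataset's unique tags.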

3.3.2 Setup

The indexes used in this chapter, including the R-tree, the inverted file, and the IR-tree, are disk resident. The page size is set to 4KB. The fanouts of the R-tree and the IR-tree are both set to 100. All the algorithms are implemented in Java. The experiments run on an Intel Core 2 Quad 2.4 GHz processor with 4GB of DDR3 memory. The default values of k, α, and the number of query tags are 50, 0.5, and 1, respectively.

3.3.3 Experimental Results

We evaluate the performance of the three algorithms when varying the parameter k, the parameter α, the number of keywords, and the buffer size. We also test the scalability of the proposed algorithms on the two datasets. As in many other performance evaluations of query processing, we report the overall performance using the average elapsed time and the average I/O cost.

Varying group size k. In this experiment, we evaluate the performance of our proposed algorithms when varying the group size k. Figure 3.6 shows the average elapsed time and the simulated I/O cost on the Jiepang and Dianping datasets. The IOAIR and DOAIR algorithms outperform the baseline approach for all values of k in terms of both metrics, since the IR-tree is able to prune irrelevant leaf nodes whose interest is less than the current interest constraint as early as possible. Notably, algorithm DOAIR achieves much better performance than IOAIR. This is because Theorems 3.9 and 3.10 effectively prune a significant amount of the search space, and Lemmas 3.5 and 3.6 help reduce the distance computation time for DOAIR.

Figure 3.6: Varying k. (a) Varying k on Jiepang; (b) Varying k on Dianping. Each panel plots elapsed time (milliseconds) and page accesses on a logarithmic scale against k = 20, 40, 60, 80, 100 for Baseline, IOAIR, and DOAIR.

Varying α. Parameter α is used to balance the group interest and the group diameter. Users can adjust α to bias the query results towards interest or diameter. Figure 3.7 shows the performance of the three algorithms with different values of α. As discussed in Section 3.2, IOAIR is slightly biased towards finding the maximum interest group of size k with a higher group interest, while DOAIR favours searching for the maximum interest group with a smaller group diameter. When α is varied from 0.1 to 0.9, the average elapsed time and the average simulated I/O cost of DOAIR on both datasets increase, whereas those of IOAIR decrease. Owing to the advantages of DOAIR discussed above, DOAIR still performs better overall than IOAIR. However, when α is high and the group size k is small, IOAIR achieves better performance than DOAIR (see Figure 3.8). The reason is twofold. First, IOAIR gives priority to groups with a high group interest. Thus, with a large α value, IOAIR can find the final results quickly and terminate the query processing early. Second, as discussed above, the efficiency of DOAIR is largely attributed to its strong pruning ability, which reduces the distance computation cost (based on Lemmas 3.5 and 3.6). When k is small, the distance computation cost is not very high, which weakens the pruning effect of DOAIR. Combining these two effects, IOAIR outperforms DOAIR when α = 0.9 and k = 10.

Figure 3.7: Varying α. (a) Varying α on Jiepang; (b) Varying α on Dianping. Each panel plots elapsed time (milliseconds) and page accesses on a logarithmic scale against α = 0.1, 0.3, 0.5, 0.7, 0.9 for Baseline, IOAIR, and DOAIR.

Figure 3.8: Varying k on Dianping (α = 0.9). The plot shows elapsed time (milliseconds) on a logarithmic scale against k = 10, 20, 30 for Baseline, IOAIR, and DOAIR.

Varying the number of query tags. In this experiment, we evaluate the performance of our proposed algorithms by varying the number of query tags. In most cases, the number of query tags is small; thus we only consider the cases where the number varies from 1 to 5. In Figure 3.9, we can see that DOAIR again demonstrates the best performance in all cases tested. As no user interest information is integrated into the R-tree, the baseline algorithm cannot prune the irrelevant tree nodes whose interest does not satisfy the interest constraint; thus the baseline algorithm performs the worst in running time and I/O cost. As discussed earlier, DOAIR shows better performance than IOAIR due to its stronger pruning power and reduced distance computation cost.

Varying buffer size. To some extent, the buffer size affects the algorithm performance. The larger the buffer in memory, the more disk pages are buffered, and thus the lower the I/O cost incurred. In this experiment, we adopt the LRU (Least Recently Used) buffering strategy to cache the disk pages. Figure 3.10 shows that DOAIR outperforms the other two algorithms in all settings. As the buffer size increases, the average I/O cost decreases notably, as expected. The average elapsed time also decreases, but less significantly. This is because the most time-consuming part is to compute the SIG groups after the irrelevant tree nodes are pruned.
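The effect of the buffer size on simulated I/O can be reproduced with a toy LRU page buffer. The class below is an illustrative sketch, not the buffer manager used in the experiments: page identifiers are arbitrary hashable values, and every access that misses the buffer is counted as one disk read.

```python
from collections import OrderedDict


class LRUBuffer:
    """Fixed-capacity page buffer with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # least recently used page comes first
        self.disk_reads = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # hit: mark as most recently used
            return
        self.disk_reads += 1  # miss: the page has to be fetched from disk
        if self.capacity == 0:
            return
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[page_id] = True


def count_disk_reads(trace, capacity):
    """Replay a sequence of page accesses and return the number of disk reads."""
    buffer = LRUBuffer(capacity)
    for page_id in trace:
        buffer.access(page_id)
    return buffer.disk_reads
```

Replaying the access trace `[1, 2, 1, 3, 2]` costs 5 disk reads without a buffer, 4 with two buffered pages and 3 with three, which is the kind of monotone decrease the page-access curves exhibit.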

Figure 3.9: Varying the number of query tags. (a) Varying the number of query tags on Jiepang; (b) Varying the number of query tags on Dianping. Each panel plots elapsed time (milliseconds) and page accesses on a logarithmic scale against the number of keywords (1 to 5) for Baseline, IOAIR, and DOAIR.

Figure 3.10: Varying Buffer Size. (a) Varying buffer size on Jiepang; (b) Varying buffer size on Dianping. Each panel plots elapsed time (milliseconds) and page accesses against the buffer size for Baseline, IOAIR, and DOAIR.

Figure 3.11: Varying the Number of Users. (a) Varying the Number of Users on Jiepang (50k to 300k users); (b) Varying the Number of Users on Dianping (300k to 1800k users). Each panel plots elapsed time (milliseconds) and page accesses on a logarithmic scale against the number of users for Baseline, IOAIR, and DOAIR.

Varying the number of users. To test the scalability of our proposed algorithms, in this set of experiments we vary the number of users in the two testing datasets. As shown in Figure 3.11, our proposed algorithms DOAIR and IOAIR exhibit good scalability. As the number of users grows, the average elapsed time and the average simulated I/O cost of these algorithms on both datasets increase more slowly than those of the baseline algorithm, resulting in a larger performance improvement for larger datasets.

3.4 Summary

In this chapter, we have presented a new SIG query that considers both the users’ spatial locations and their common interest in query keywords. We have proposed a family of algorithms based on the IR-tree, namely IOAIR and DOAIR, for the efficient processing of SIG queries. IOAIR processes SIG queries in descending order of interest to search for the result group with the minimum diameter. IOAIR integrates the distance constraint into the query optimization to prune the search space. In contrast, DOAIR adopts a diameter-oriented strategy to process SIG queries, which takes into account the interest and diameter order simultaneously. Effective pruning techniques have been developed to prune irrelevant search space and accelerate the search. The experiments based on two real datasets demonstrate that the DOAIR algorithm achieves the best performance and outperforms the baseline algorithm by orders of magnitude.


Chapter 4

Geo-Social K-Cover Group Queries for

Collaborative Spatial Computing

The emergence of geo-social data has enriched the study of queries in location-based social networks. As mentioned in Chapter 1, the geo-social group query is one of the most important problems in collaborative spatial computing. In this chapter, we propose a new type of geo-social query, namely the geo-social k-cover group (GSKCG) query, which considers the users’ spatial containment and their social connections. The rest of this chapter is organized as follows. Section 4.1 formally formulates the problem and analyzes its complexity. Section 4.2 presents the KCGFinder algorithm along with a set of effective pruning techniques. Section 4.3 presents the Enhanced SaR-tree structure and introduces the integrated SaRBasedKCGFinder algorithm. Experimental results are provided in Section 4.4. Finally, we summarize this chapter in Section 4.5.

4.1 Problem Formulation

In this section, we give some preliminaries and provide the problem statement, followed

by an example to elaborate the problem defined. Table 4.1 summarizes the notations used

throughout this chapter.

A GSKCG query is defined over a location-based social network (LBSN) G = (V, E), where each vertex u ∈ V is a user and each edge e ∈ E denotes an acquaintance relation between the two users it connects. For any two users u, v ∈ V, there exists an edge (u, v) ∈ E if and only if u and v are familiar with each other. Moreover, each user u ∈ V has an associated region denoted by u.R (see footnote 1). Such an LBSN can be easily derived by combining the location and social data collected from real-life applications.

Table 4.1: Summary of notations

Notation          Definition
G = (V, E)        location-based social network (LBSN)
u, v              a user of G
u.R               an associated region of user u
Q = (k, P)        a GSKCG query Q; P is a set of query points, and k indicates the social constraint (k-core)
G[V′]             a subgraph of G that contains the vertices V′
SI                an intermediate solution where |SI| ≤ s
SU                the set of remaining users
$P_S$             the set of query points covered by users in S
$NB_S(v)$         the number of v’s neighbors in S
$V_p$             a set of users whose associated region covers query point p
$U_k$             a set of users that may appear in a k-core
$C_s$             a group of size s
$C^k$             a connected k-core component
$C_s^k$           a size-s group where $G[C_s^k]$ is a k-core and $C_s^k$ fully covers P
M                 the maximum group size of a GSKCG query
ListP             a sorted user list according to the increasing size of $V_p$, where p ∈ P
A(p)              the index of the last user u in ListP where p ∈ u.R
k(u)              a k-core with u inside
$CBR_{u,k}$       a rectangle that does not contain a k(u)
$iCBR_{u,k}$      an internal CBR of u that does not contain a k(u)
$eCBR_{u,k}$      an external CBR of u that does not contain a k(u)
MBP(P)            a minimum bounding rectangle that contains the query point set P

A GSKCG query aims to find a group of users with a desired social relationship. In this chapter, we quantify the desired social relationship within a user group in terms of the k-core [55], a widely used model for detecting community structures in a graph.

Definition 4.1. (k-core) For a graph G = (V, E), a connected subgraph G′ = (V′, E′) of G is a k-core if every vertex v ∈ V′ has degree at least k in G′.
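As a minimal illustration of Definition 4.1, the following Python sketch checks whether a set of users induces a k-core. The adjacency-dictionary representation, the function name `is_k_core`, and the toy edge list are assumptions made for this sketch; in particular, the edges are hypothetical and are not taken from Figure 4.1.

```python
from collections import deque

def is_k_core(adj, group, k):
    """Return True if the subgraph induced by `group` is connected and
    every member has at least k neighbours inside the group."""
    group = set(group)
    if not group:
        return False
    # Minimum-degree condition, counted inside the induced subgraph only.
    if any(len(adj[u] & group) < k for u in group):
        return False
    # Connectivity check: breadth-first search restricted to the group.
    start = next(iter(group))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u] & group:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == group

# Hypothetical toy graph (assumed edges, for illustration only).
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print(is_k_core(adj, {"u1", "u3", "u4"}, 2))  # triangle: every degree is 2
print(is_k_core(adj, {"u1", "u2", "u4"}, 2))  # u2 and u4 are not adjacent
```

In this toy graph the triangle {u1, u3, u4} passes the check for k = 2, whereas {u1, u2, u4} fails because u2 and u4 each have only one neighbor inside the group.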

We argue that k-core is a reasonable model to measure a user group’s social acquaintance level for two main reasons. First, the minimum degree constraint of k-core is an important measure of group cohesiveness in social science research [55] and has been widely adopted in the research of graph problems [15, 53, 71]. In our problem, k-core is effective and flexible to capture a user group’s acquaintance level in real-life LBSNs. Second, k-core decomposition has a linear time complexity, which makes it appealing in real-life applications. Indeed, it has been used as an important social constraint in practical applications [58]. Based on the k-core model, we formally define a GSKCG query below.

Footnote 1: For ease of exposition, we consider each user to have one associated region. Our solution can be easily extended to the case where a user has multiple associated regions, as discussed later in Section 4.3.3.

Definition 4.2. (GSKCG query) Given an LBSN G = (V, E), a Geo-Social k-Cover Group (GSKCG) query is defined as a 2-tuple Q = (k, P), where k is a non-negative integer indicating the social acquaintance constraint, and P = {p1, p2, · · · , pm} is a set of query points indicating the spatial coverage constraint. The query returns a set of users V′ ⊆ V such that:

1. $P \subseteq \bigcup_{u \in V'} u.R$,

2. the subgraph G[V′] of G is a k-core, and

3. the cardinality of G[V′] is minimum.

Note that we require the returned user group to have the minimum cardinality. This requirement is naturally derived from real-world demands. For example, in the motivating examples in Chapter 1, retrieving a minimum set of users normally leads to the minimum employment cost or makes it easier to reach a consensus. We make k an input parameter in order to provide a generic geo-social query service for different service requesters. A service requester who aims to find a single user covering all the tasks can set k = 0, which makes the GSKCG query consider only the spatial containment constraint and not the social constraint. In many other cases, setting k to a non-zero value provides much more flexibility for a requester. For example, a service requester can issue multiple GSKCG queries with different k values in parallel and then select a group that fits his/her business needs.

Example 4.1. Consider a simple LBSN G = (V, E) where the users’ acquaintance relations and associated regions are shown in Figure 4.1(a) and Figure 4.1(b), respectively. The GSKCG query Q = (k, P) with k = 2 and P = {p1, p2, p3, p4} returns the user group V′ = {u1, u3, u4} because: 1) the joint regions of users in V′ cover all the query points in P; 2) the subgraph G[V′] of G is a 2-core; and 3) the cardinality of G[V′] is minimum among all user groups that satisfy the first two conditions.

[Figure: panel (a) Social networks, showing users u1 to u6; panel (b) Associated regions, showing the regions of u1 to u6 and the query points p1 to p4]

Figure 4.1: An example of a location-based social network for a GSKCG query

As formally defined in Definition 4.2, a GSKCG query finds a set of users that satisfy

the given spatial and social constraints. For ease of presentation, we call a user group valid

if it satisfies both Conditions 1 and 2 in Definition 4.2. Next we analyze the complexity

of the GSKCG query problem.

Theorem 4.1. GSKCG query is NP-complete.

Proof. We establish the hardness by a reduction from a classical NP-complete problem,

namely the minimum set cover (MSC) problem. An instance of the MSC problem consists

of a universe U = {e1, e2, . . . , en} and a set of sets S = {S1, S2, . . . , Sm}, where Si ⊂ U .

The decision problem is to decide if we can find a subset S ′ of S such that all the elements

in U are fully covered by S ′ and the size of S ′ is minimum.

Given an instance of MSC, we construct an instance of a GSKCG query Q = (k, P) on a set of users. Each element $e_i$ in U corresponds to a spatial query point in P, each set $S_i$ corresponds to a user $u_i$’s associated region $u_i.R$, and the elements in $S_i$ correspond to the spatial points in $u_i.R$. We consider the restricted case of the GSKCG query where k = 0. It can be seen that there exists a solution to the MSC problem if and only if there exists a solution to Q (i.e., a minimum set of users such that all given query points are fully covered by their associated regions).

Suppose we have a polynomial-time algorithm A that returns the query answer $G' = \{u'_1, u'_2, \ldots, u'_m\}$ to a GSKCG query Q. If P is fully covered by the associated regions of G′, then $\{S'_1, S'_2, \ldots, S'_m\}$ fully covers U and its size m is minimum. This implies that a polynomial-time solution to the MSC problem is found, which contradicts the hardness of MSC unless P = NP. Therefore, unless P = NP, there does not exist a polynomial-time algorithm A for the GSKCG query problem.

In this chapter, we study how to efficiently process GSKCG queries. We aim for an

optimal solution that has short response time. This is mainly achieved by a set of effective

pruning strategies (see Section 4.2) and a novel index structure (see Section 4.3).

4.2 Algorithm Design

In this section, we present our KCGFinder algorithm and a set of pruning strategies for

answering GSKCG queries.

4.2.1 Basic Algorithm

To satisfy the minimum cardinality requirement of a GSKCG query, the general idea of

KCGFinder is to process the user groups in increasing order of group size and return the

current group as soon as it is valid.

Algorithm 6 gives the pseudocode of the KCGFinder algorithm. Before performing a search on the input LBSN G = (V, E), we first conduct two filtering operations: spatial filtering and social filtering. In spatial filtering, we use an R-tree to get the set S of users whose associated regions cover at least one query point p ∈ P (Line 1, Algorithm 6). In social filtering, we adopt the core decomposition algorithm [8] to identify the set $U_k$ of users in S that may appear in a k-core, and invoke a depth-first search (DFS) to find the set H of connected components of $G[U_k]$ that each fully cover P (Lines 2–3, Algorithm 6).

In Line 4 of Algorithm 6, we compute the maximum cardinality M of the components in H, which gives the upper bound of the size of the returned user group. By definition, the cardinality of a k-core is at least k + 1. Thus, we enumerate user groups in increasing order of size from k + 1 to M. Given a size s, for each component $C^k$ with size ≥ s, we invoke the GetOptimalGroup function (see Algorithm 7) to find a size-s user group $C_s^k$ whose joint regions fully cover P (for short, we say “$C_s^k$ covers P”) and which is a k-core. If $C_s^k$ is not empty, it is returned as the final optimal answer to the GSKCG query.

Algorithm 6 KCGFinder(Query points P, Integer k, LBSN G)
 1: S ← the set of users in G that each cover at least one point in P;
 2: $U_k$ ← the set of users belonging to S that may appear in a k-core;
 3: H ← all connected components of $G[U_k]$ that each fully cover P;
 4: M ← $\max_{C^k \in H} |C^k|$;
 5: for s from k + 1 to M do
 6:     for each $C^k$ in H do
 7:         if $|C^k| \ge s$ then
 8:             $C_s^k$ ← GetOptimalGroup($C^k$, k, s, P);
 9:             if $C_s^k \ne \emptyset$ then
10:                 Return $C_s^k$;
11: Return ∅;

Algorithm 7 GetOptimalGroup(Component G, Integer k, Integer s, Query points P)
 1: for each size-s user group $C_s$ of G do
 2:     if the number of edges of $G[C_s]$ ≥ k(k + 1)/2 then
 3:         if $G[C_s]$ is a k-core and $P \subseteq \bigcup_{u \in C_s} u.R$ then
 4:             Return $C_s$;
 5: Return ∅;

It can be observed that the main complexity of KCGFinder comes from the GetOpti-

malGroup function. Therefore, in the rest of this section, we focus on how to optimize

GetOptimalGroup via a set of pruning techniques. We give the general idea of GetOp-

timalGroup in Algorithm 7. GetOptimalGroup enumerates all size-s user groups and

checks whether they are valid. By the definition of k-core, we can prune out a user group

Cs if the number of edges in G[Cs] is < k(k + 1)/2 (Line 2, Algorithm 7).
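The flow of Algorithms 6 and 7 can be sketched in Python as follows. This is a simplified illustration under several assumptions: regions are abstracted as sets of covered query points instead of rectangles indexed by an R-tree, social filtering peels low-degree users inside S (one possible reading of Line 2), all function names are hypothetical, and the edges of the toy graph are assumed rather than taken from Figure 4.1. The coverage sets follow the memberships listed in Example 4.6.

```python
from itertools import combinations

def connected(adj, group):
    """True if the subgraph induced by the non-empty set `group` is connected."""
    stack, seen = [next(iter(group))], set()
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend((adj[u] & group) - seen)
    return seen == group

def peel(adj, users, k):
    """Social filtering: repeatedly drop users with fewer than k remaining neighbours."""
    alive = set(users)
    while True:
        weak = {u for u in alive if len(adj[u] & alive) < k}
        if not weak:
            return alive
        alive -= weak

def components(adj, users):
    """Connected components of the subgraph induced by `users`."""
    left, comps = set(users), []
    while left:
        stack, comp = [next(iter(left))], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend((adj[u] & left) - comp)
        left -= comp
        comps.append(comp)
    return comps

def covered(cover, group):
    """Union of the query points covered by the users in `group`."""
    return set().union(*(cover[u] for u in group))

def get_optimal_group(adj, cover, comp, k, s, P):
    """Algorithm 7 without pruning: test every size-s subset of the component."""
    for combo in combinations(sorted(comp), s):
        g = set(combo)
        edges = sum(len(adj[u] & g) for u in g) // 2
        if edges < k * (k + 1) // 2:
            continue  # too few edges for any k-core (Line 2)
        if (all(len(adj[u] & g) >= k for u in g) and connected(adj, g)
                and P <= covered(cover, g)):
            return g
    return set()

def kcg_finder(adj, cover, P, k):
    P = set(P)
    S = {u for u in adj if cover.get(u, set()) & P}                  # Line 1
    Uk = peel(adj, S, k)                                             # Line 2
    H = [c for c in components(adj, Uk) if P <= covered(cover, c)]   # Line 3
    M = max((len(c) for c in H), default=0)                          # Line 4
    for s in range(k + 1, M + 1):                                    # Lines 5-10
        for comp in H:
            if len(comp) >= s:
                group = get_optimal_group(adj, cover, comp, k, s, P)
                if group:
                    return group
    return set()                                                     # Line 11

# Hypothetical toy instance: assumed edges, coverage sets as in Example 4.6.
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
cover = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}

print(kcg_finder(adj, cover, {"p1", "p2", "p3", "p4"}, 2))
```

On this toy instance the sketch returns the group {u1, u3, u4} for k = 2, which agrees with the answer stated in Example 4.1. The exhaustive enumeration in `get_optimal_group` is exponential in s, which is the cost the pruning strategies in the rest of this section aim to reduce.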

For a systematic enumeration of all candidate user groups, we employ the branch and

bound algorithm [36]. In the branch and bound search process, we keep track of two user

sets SI and SU, which represent the intermediate solution set and the set of remaining

users, respectively. Initially, SI is empty, and SU is the set of all users in component G.

We iteratively add users from SU to SI to check whether the resultant group is valid. This

process can be organized into a tree structure, as illustrated in Figure 4.2, in which an

internal node represents an SI and a leaf node represents a size-s candidate group. In the

rest of this section, we explore a set of effective pruning strategies to speed up the branch

and bound search.
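The expand-and-backtrack enumeration described above can be sketched as a short recursive routine. This is an illustrative skeleton only: the names `branch_and_bound`, `is_valid`, `can_extend` and `trace` are hypothetical, the validity test is passed in as a callback, and the optional `can_extend` hook is where a pruning rule would be plugged in.

```python
def branch_and_bound(users, s, is_valid, can_extend=None, trace=None):
    """Depth-first search over the tree of intermediate solutions SI.

    users      -- candidate users in access order (the initial SU)
    s          -- target group size
    is_valid   -- callback deciding whether a size-s group is valid
    can_extend -- optional bound: return False to skip the subtree of SI
    trace      -- optional list that records every visited node
    """
    def search(SI, SU):
        if trace is not None:
            trace.append(tuple(SI))
        if len(SI) == s:                        # leaf: a size-s candidate group
            return list(SI) if is_valid(SI) else None
        if can_extend is not None and not can_extend(SI, SU):
            return None                         # bound: prune this subtree
        for i, u in enumerate(SU):
            found = search(SI + [u], SU[i + 1:])    # expanding
            if found is not None:
                return found
            # falling through to the next user is the backtracking step
        return None

    return search([], list(users))

# Toy run: only {u1, u3, u4} is treated as valid (a stand-in for the real test).
visited = []
answer = branch_and_bound(
    ["u1", "u2", "u3", "u4"], 3,
    is_valid=lambda group: set(group) == {"u1", "u3", "u4"},
    trace=visited,
)
print(answer)
print(visited)
```

With the users ordered u1, u2, u3, u4, the visited nodes are the empty set, u1, u1u2, u1u2u3, u1u2u4, u1u3 and u1u3u4, which are the node labels that appear in Figure 4.2.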

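[Figure: search tree rooted at NULL with nodes u1, u1u2, u1u3, u1u2u3, u1u2u4 and u1u3u4, annotated with expanding and backtracking steps; the leaf u1u3u4 is annotated with its social links and its spatial coverage u1.R: {p1}, u3.R: {p2, p3}, u4.R: {p4}, where P = {p1, p2, p3, p4}]

Figure 4.2: Branch and bound search tree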

4.2.2 Basic Pruning

We start with two basic pruning strategies, k-core (KC) based pruning and spatial query-

point coverage (SQPC) based pruning, based on the degree constraint in a k-core and the

spatial query point coverage constraint, respectively.

KC based pruning

By the definition of k-core, we know that the minimum degree of each vertex in a k-core

should be no less than k. Therefore, in the branch and bound search, if the minimum

degree constraint cannot be satisfied after adding any new users from SU to SI, the search

process should backtrack to the previous state of SI (that is, the parent node of the node

representing SI in the search tree). We give the critical condition under which the current

SI may form a valid group below.

Theorem 4.2. Let $u_{min} \in SI$ be the user with the minimum number of neighbors in SI. If SI is contained in any valid group of size s, then $|NB_{SI}(u_{min})| + s - |SI| \ge k$, where $|NB_{SI}(u_{min})|$ is the number of $u_{min}$’s neighbors in SI.

Proof. Since we can add only s − |SI| users from SU to SI, the degree of $u_{min}$ in any valid group of size s is at most $|NB_{SI}(u_{min})| + s - |SI|$ (attained when all added users are neighbors of $u_{min}$). By Definition 4.1, to form a valid group, the degree of $u_{min}$ in the group should be ≥ k. This establishes the theorem.

Theorem 4.2 implies that if the current SI cannot satisfy this condition, the entire

subtree rooted at the node representing SI can be skipped.


Example 4.2. Consider the LBSN in Figure 4.1. Let SI = {u2, u4}, SU = {u1, u3}, s = 3 and k = 2. Since u2 has the minimum number of neighbors in SI and $|NB_{SI}(u_2)| = 0$, we have $|NB_{SI}(u_2)| + s - |SI| = 0 + 3 - 2 = 1 < k$. The condition in Theorem 4.2 does not hold, and therefore we can stop searching the users in SU.
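The condition of Theorem 4.2 reduces to a one-line test on the current intermediate solution. The sketch below is illustrative: `kc_prune` is a hypothetical name, and the edges of the toy graph are assumed (they are chosen so that u2 and u4 are not adjacent, as Example 4.2 implies).

```python
def kc_prune(adj, SI, s, k):
    """Return True if the subtree rooted at SI can be skipped, i.e. the
    condition |NB_SI(u_min)| + s - |SI| >= k of Theorem 4.2 is violated."""
    SI = set(SI)
    if not SI:
        return False  # nothing to check for the empty intermediate solution
    # Smallest number of neighbours any member has inside SI.
    min_neighbours = min(len(adj[u] & SI) for u in SI)
    return min_neighbours + s - len(SI) < k

# Hypothetical toy graph (assumed edges, for illustration only).
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print(kc_prune(adj, {"u2", "u4"}, s=3, k=2))  # setting of Example 4.2
print(kc_prune(adj, {"u1", "u3"}, s=3, k=2))
```

For SI = {u2, u4} the test evaluates 0 + 3 − 2 = 1 < 2 and reports that the subtree can be skipped, whereas SI = {u1, u3} survives because 1 + 3 − 2 = 2 ≥ 2.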

SQPC based pruning

Any valid user group should cover all query points in P. If SU cannot fully cover the remaining query points in $P - P_{SI}$, where $P_{SI}$ is the set of points covered by SI, adding any user from SU to SI cannot form a valid group. In this case, the search process can safely prune the subtree rooted at SI without missing the optimal solution.

In some cases, even though SU can cover all query points in $P - P_{SI}$, the users of SU still cannot complete SI into a valid group. Theorem 4.3 is given to capture such cases.

Theorem 4.3. Let $u_{max} \in SU$ be the user whose region covers the most query points in $P - P_{SI}$. To form a valid group with size s, SU should satisfy:

\[
\frac{|P - P_{SI}|}{s - |SI|} \le |P_{u_{max}}| \tag{4.2.1}
\]

where $|P_{u_{max}}|$ is the number of query points in $P - P_{SI}$ covered by $u_{max}$.

Proof. $|P - P_{SI}|$ is the number of query points not covered by SI, and $s - |SI|$ is the number of users to be added from SU to SI. On average, each user to be added should cover at least $\frac{|P - P_{SI}|}{s - |SI|}$ query points. Therefore, the number of points in $P - P_{SI}$ covered by $u_{max}$ must be greater than or equal to the average.

Intuitively, given the number of query points, the size of the user group and SI , Equa-

tion 4.2.1 gives the lower bound of the maximum number of query points in P −PSI that

a user in SU should cover.

Example 4.3. Consider the LBSN in Figure 4.1. Suppose SI = {u2}, SU = {u1, u3, u4, u5, u6}, k = 2 and s = 3. We can compute $\frac{|P - P_{SI}|}{s - |SI|} = \frac{|\{p_1, p_3, p_4\}|}{3 - 1} = \frac{3}{2}$ and $|P_{u_{max}}| = 1$. Since Equation 4.2.1 does not hold, there is no need to search users in SU.
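Both SQPC checks (full coverage of the remaining points and Equation 4.2.1) can be sketched as follows. The function name `sqpc_prune` and the representation of regions as sets of covered query points are assumptions of this sketch; the coverage sets follow the memberships listed in Example 4.6.

```python
def sqpc_prune(cover, P, SI, SU, s):
    """Return True if the subtree rooted at SI can be skipped by SQPC based pruning."""
    uncovered = set(P) - set().union(*(cover[u] for u in SI))
    if not uncovered:
        return False                 # SI already covers P, nothing to prune on
    slots = s - len(SI)              # number of users that can still be added
    if slots <= 0:
        return True
    reachable = set().union(*(cover[u] for u in SU))
    if not uncovered <= reachable:
        return True                  # SU cannot cover the remaining points at all
    best = max(len(cover[u] & uncovered) for u in SU)   # |P_umax|
    # Equation 4.2.1, |uncovered| / slots <= best, tested without division.
    return len(uncovered) > best * slots

cover = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}
P = {"p1", "p2", "p3", "p4"}

# Setting of Example 4.3: three uncovered points, two free slots, best user covers one.
print(sqpc_prune(cover, P, SI=["u2"], SU=["u1", "u3", "u4", "u5", "u6"], s=3))
```

In the setting of Example 4.3 the test compares 3 uncovered points against 1 × 2 and reports that the subtree can be pruned. Starting from SI = {u1} instead, u3 covers two of the three remaining points, so Equation 4.2.1 holds and the search continues.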


4.2.3 Diameter Based Pruning

In this section, we propose pruning techniques based on the concept of social diameter.

We first give the definition of the diameter of a group $C_s^k$ with size s.

Definition 4.3. (Diameter) The diameter of a user group $C_s^k$ in an LBSN G is defined as the longest shortest path length between any two users in $G[C_s^k]$, denoted by $DIA(C_s^k)$.

Let $DIA_{ub}(C_s^k)$ denote the upper bound of $DIA(C_s^k)$. It is easy to derive that $DIA_{ub}(C_s^k) = s - k$. However, this bound is too loose when s is large. In this chapter, we make use of the stricter bound proposed in [55].

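Theorem 4.4. For a user group $C_s^k$,

\[
DIA_{ub}(C_s^k) =
\begin{cases}
1 & \text{if } s = k + 1 \\
2 & \text{if } k + 1 < s < 2k + 2 \\
3\left[\frac{s}{k+1}\right] + r(s, k) - 3 & \text{if } s \ge 2k + 2
\end{cases}
\tag{4.2.2}
\]

where

\[
r(s, k) =
\begin{cases}
0 & \text{if } \mathrm{mod}(s, k + 1) = 0 \\
1 & \text{if } \mathrm{mod}(s, k + 1) = 1 \\
2 & \text{if } \mathrm{mod}(s, k + 1) = 2
\end{cases}
\]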
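Equation 4.2.2 can be evaluated with a few lines of Python. The sketch rests on two assumptions that go beyond the text: the square bracket is read as integer division (floor), and remainders larger than 2, for which r(s, k) is not specified above, are capped at 2. The function name `dia_ub` is hypothetical.

```python
def dia_ub(s, k):
    """Upper bound on the diameter of a size-s k-core, following Equation 4.2.2."""
    if s < k + 1:
        raise ValueError("a k-core has at least k + 1 vertices")
    if s == k + 1:
        return 1
    if s < 2 * k + 2:
        return 2
    # Assumption: [s / (k + 1)] is the integer part of the quotient.
    quotient = s // (k + 1)
    # Assumption: r(s, k) is capped at 2 for remainders above 2.
    r = min(s % (k + 1), 2)
    return 3 * quotient + r - 3

print(dia_ub(3, 1))  # k + 1 < s < 2k + 2, the case used in Examples 4.4 and 4.5
print(dia_ub(4, 1))  # s >= 2k + 2
```

As a sanity check, a path with four vertices is a 1-core of diameter 3 and a path with five vertices is a 1-core of diameter 4, and the function returns exactly these values for (s, k) = (4, 1) and (5, 1).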

This diameter upper bound of $C_s^k$ introduces a way to measure whether two users can co-exist in $C_s^k$. Next we present two pruning techniques called social shortest path (SHP) based pruning and spatial-social shortest path (SOSP) based pruning.

SHP based pruning

The SHP based pruning is inspired by the observation that, if the shortest path length between two users exceeds $DIA_{ub}(C_s^k)$, they cannot appear simultaneously in $C_s^k$. It follows that a user v ∈ SU can be added into SI only when the shortest path length between v and every u ∈ SI satisfies the condition presented in Theorem 4.5.

Theorem 4.5. Let v be the user to be added into SI from SU, and Dist(SI, v) be the maximum shortest path length between the users in SI and v. Then v can be added into SI only if the following inequality is satisfied:

\[
Dist(SI, v) \le DIA_{ub}(C_s^k) \tag{4.2.3}
\]

Proof. By Definition 4.3, the shortest path length between any two users in a valid group $C_s^k$ should be ≤ $DIA(C_s^k)$. If v can be added into SI, then $Dist(SI, v) \le DIA(C_s^k)$ must be true. Therefore, $Dist(SI, v) \le DIA_{ub}(C_s^k)$ is also true.

Example 4.4. Consider the LBSN in Figure 4.1. Suppose k = 1 and s = 3. Let SI = {u4} and SU = {u5, u6}. By Theorem 4.4, we can compute, for any valid group $C_s^k$, $DIA_{ub}(C_s^k) = 2$. The shortest path lengths between u4 and u5, and between u4 and u6, are 3 and 4, respectively. Thus, by Theorem 4.5, no user in SU can be added into SI to form a valid group.
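The test of Theorem 4.5 can be sketched with a plain breadth-first search over the social graph. The function names are hypothetical, the BFS stands in for whatever distance index is actually used, and the edges of the toy graph are assumed; they are chosen so that the distances from u4 to u5 and u6 are 3 and 4, as stated in Example 4.4.

```python
from collections import deque

def bfs_dist(adj, src):
    """Shortest path lengths (in hops) from src to every reachable user."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shp_candidates(adj, SI, SU, bound):
    """Users of SU whose shortest path length to every user of SI is within
    the bound, i.e. users that satisfy Dist(SI, v) <= DIA_ub."""
    keep = []
    for v in SU:
        dist = bfs_dist(adj, v)
        if all(dist.get(u, float("inf")) <= bound for u in SI):
            keep.append(v)
    return keep

# Hypothetical toy graph (assumed edges, for illustration only).
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Setting of Example 4.4: SI = {u4}, SU = {u5, u6}, DIA_ub = 2.
print(shp_candidates(adj, ["u4"], ["u5", "u6"], bound=2))
```

With SI = {u4} and a bound of 2, neither u5 nor u6 survives, so the returned list is empty and the search can backtrack.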

SOSP based pruning

SHP based pruning can quickly verify whether there exists a user in SU to form a valid

group with the current SI . To further reduce the search space, we present SOSP based

pruning, which considers not only the shortest path length between two users but also the

users’ covered query points.

Intuitively, for any valid user group $C_s^k$, if a user u and all other users in the circle centered at u with diameter $DIA(C_s^k)$ cannot fully cover all query points in P, u cannot be a member of $C_s^k$. This implies that, in this case, for the given specific values of k and s, u can be removed from the search space without missing the optimal solution. This provides extra pruning capabilities on top of SHP based pruning. Let $NB_u^p$ be the user that has the minimum shortest path length to a user u in an LBSN and that covers a query point p ∈ P. We formally capture this intuition in Theorem 4.6.

Theorem 4.6. Let $Dist(u, NB_u^p)$ be the shortest path length between u and $NB_u^p$. For any valid group $C_s^k$, if $Dist(u, NB_u^p) > DIA_{ub}(C_s^k)$ for some p ∈ P with p ∉ u.R, then u cannot be a user of $C_s^k$.

Proof. Suppose v is a user of $C_s^k$ and there exists a query point p ∈ P satisfying $Dist(v, NB_v^p) > DIA_{ub}(C_s^k)$. Since $C_s^k$ is a valid group, by Theorem 4.5 we have $Dist(v, u) \le DIA(C_s^k)$ for every $u \in C_s^k$. Since p ∈ P must be covered by $C_s^k$, some user $w \in C_s^k$ covers p. As $NB_v^p$ is the user closest to v that covers p, we have $Dist(v, NB_v^p) \le Dist(v, w) \le DIA(C_s^k) \le DIA_{ub}(C_s^k)$, leading to a contradiction. This completes the proof.

[Figure: the sorted user list u1, u4, u2, u3, u5, u6 (positions 1 to 6), with arrows from the query points p1, p2, p3 and p4 to positions in the list]

Figure 4.3: Sorted user list ListP

Below we provide an example to illustrate how SOSP based pruning works.

Example 4.5. Consider constructing a valid group $C_s^k$ with k = 1 and s = 3 for the LBSN in Figure 4.1. The user set of this LBSN is {u1, u2, u3, u4, u5, u6}. From Theorem 4.4, we get $DIA_{ub}(C_s^k) = 2$. Since $NB_{u_5}^{p_4}$ and $NB_{u_6}^{p_4}$ are both u4, we have Dist(u5, u4) = 3 and Dist(u6, u4) = 4. From Theorem 4.6, we learn that both u5 and u6 should be removed from the search space of finding $C_s^k$. Now we get a smaller search space {u1, u2, u3, u4}.
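The filter implied by Theorem 4.6 can be sketched as follows. This is an illustration under assumptions: distances come from a plain breadth-first search, regions are abstracted as sets of covered query points (following Example 4.6), the function names are hypothetical, and the edges of the toy graph are assumed.

```python
from collections import deque

def bfs_dist(adj, src):
    """Shortest path lengths (in hops) from src to every reachable user."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sosp_filter(adj, cover, P, users, bound):
    """Keep the users u such that, for every query point p outside u.R, the
    closest user covering p lies within `bound` hops of u (Theorem 4.6)."""
    kept = set()
    for u in users:
        dist = bfs_dist(adj, u)

        def nearest(p):
            # Dist(u, NB_u^p): distance to the closest reachable user covering p.
            return min((d for v, d in dist.items() if p in cover.get(v, ())),
                       default=float("inf"))

        if all(p in cover[u] or nearest(p) <= bound for p in P):
            kept.add(u)
    return kept

# Hypothetical toy instance: assumed edges, coverage sets as in Example 4.6.
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
cover = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}

print(sorted(sosp_filter(adj, cover, {"p1", "p2", "p3", "p4"}, adj, bound=2)))
```

With a bound of 2, u5 and u6 are dropped because the only user covering p4 (u4) is 3 and 4 hops away from them, leaving the search space {u1, u2, u3, u4} reported in Example 4.5.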

For diameter based pruning techniques, we need to compute the shortest path length between any pair of users. However, it is impractical to calculate the length on the fly, because doing so would substantially increase the total running time of our algorithm. A possible method is to pre-compute all the lengths offline and then index them for online query processing. However, this approach needs $O(n^2)$ storage, where n is the number of users in the LBSN, which is not feasible when n is large. In this chapter, we adapt the tree-structured index constructed based on the concept of vertex cover (VC-index) [14], which can efficiently process distance queries between users with a small storage cost. We also employ a caching technique to accelerate shortest path length queries. Given two users, we first look up the shortest path length between them in the cache. If the length is cached, we read it directly. Otherwise, the length is calculated from the VC-index. For cache replacement, we adopt the least recently used (LRU) policy.
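The cache-then-index lookup can be sketched with Python's standard `functools.lru_cache`, which evicts the least recently used entry when the cache is full. Note the substitution: the VC-index is not reproduced here, and a plain breadth-first search stands in for the index lookup on a cache miss. The function names, the cache size and the toy edges are assumptions of this sketch.

```python
from collections import deque
from functools import lru_cache

def make_distance_lookup(adj, cache_size=1024):
    """Build a shortest-path-length lookup backed by an LRU cache."""

    def bfs_length(src, dst):
        # Stand-in for the index lookup: BFS with early exit.
        if src == dst:
            return 0
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    if v == dst:
                        return dist[v]
                    queue.append(v)
        return float("inf")  # dst is unreachable from src

    @lru_cache(maxsize=cache_size)
    def cached(a, b):
        return bfs_length(a, b)

    def lookup(u, v):
        # Distances are symmetric, so both orders share one cache entry.
        a, b = sorted((u, v))
        return cached(a, b)

    lookup.cache_info = cached.cache_info  # expose hit and miss counters
    return lookup

# Hypothetical toy graph (assumed edges, for illustration only).
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

dist = make_distance_lookup(adj, cache_size=2)
print(dist("u4", "u6"), dist("u6", "u4"))  # second call is answered from the cache
print(dist.cache_info())
```

Sorting the pair before the lookup is a design choice of the sketch: it halves the number of cache entries for an undirected graph, at the cost of requiring comparable user identifiers.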


4.2.4 Access Order Based Pruning

In this section, we propose the last pruning strategy, based on the observation that more search space can be pruned if the users are accessed in a certain order in the branch and bound search. Given a set of users V and a set of query points P, we place the users in V into several sets $V_P = \{V_{p_1}, V_{p_2}, \cdots, V_{p_{|P|}}\}$, where $V_{p_i}$ is the set of users whose associated region covers the point $p_i \in P$. Note that a user u may belong to multiple $V_{p_i}$, because u’s associated region may cover more than one query point. We first sort $V_P$ in increasing order of the size of $V_{p_i}$, and then sequentially access each $V_{p_i}$ and push all users in $V_{p_i}$ into a user list ListP. If a user has already been pushed into ListP, he/she is skipped in later operations. Thereafter, the search process adds users from SU to SI according to their indexes in ListP. We give an example of constructing ListP.

Example 4.6. Consider the users and query points in Figure 4.1(b). We can place the

users into four sets, Vp1 = {u1}, Vp2 = {u2, u3}, Vp3 = {u3, u5, u6}, Vp4 = {u4}. To

construct the sorted user list ListP , we first add u1 (in Vp1) to ListP , then u4, u2, u3 in

order. After that, since u3 has been added when processing Vp2 , he/she will be skipped

when accessing Vp3 . Finally, u5 and u6 are added. The constructed ListP is given in

Figure 4.3 (ignore the arrows for the moment).
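The construction of ListP in Example 4.6 can be sketched as follows. The function name `build_list_p` and the dictionary layout are hypothetical. Ties between sets of equal size are broken by the original order of the query points, which is an assumption; it is consistent with Example 4.6, where the set of p1 is processed before the set of p4.

```python
def build_list_p(V_P):
    """Build the sorted user list ListP.

    V_P maps each query point to the list of users whose region covers it.
    Sets are visited in increasing order of size (Python's sort is stable,
    so equal-sized sets keep their original order), and a user that is
    already in the list is skipped.
    """
    list_p, seen = [], set()
    for p in sorted(V_P, key=lambda q: len(V_P[q])):
        for u in V_P[p]:
            if u not in seen:
                seen.add(u)
                list_p.append(u)
    return list_p

# The four sets of Example 4.6.
V_P = {
    "p1": ["u1"],
    "p2": ["u2", "u3"],
    "p3": ["u3", "u5", "u6"],
    "p4": ["u4"],
}
print(build_list_p(V_P))
```

The resulting order is u1, u4, u2, u3, u5, u6, the same order as the list described in Example 4.6.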

Next we discuss how to make use of ListP to gain additional pruning capability. For a query point p, we define its access index in ListP as follows.

Definition 4.4. (Access index) The access index of a query point p ∈ P in a sorted user

list ListP , denoted by A(p), is the index of the last user whose associated region covers

p.

The access indexes are illustrated in Figure 4.3. Suppose p is a query point in P that has not been covered by SI. If the smallest index of the users in SU is greater than A(p), no remaining user can cover p, and the search process should backtrack to the parent node of SI. For a GSKCG query, we maintain the access index for each query point. Note that the access indexes and ListP need to be calculated only once per query and do not need to be updated. Therefore, they can be constructed efficiently.

[Figure: a tree with Root at the top, internal entries R5 and R6, leaf-level entries R1 to R4, and users u1 to u8; the entries and users are annotated with their CBRs]

Figure 4.4: A sample SaR-tree

Example 4.7. Continue with Example 4.6. We have A(p1) = 1, A(p2) = 4, A(p3) = 6 and A(p4) = 2. Suppose SI = {u1}, SU = {u2, u3, u5, u6}, and $P - P_{SI}$ = {p2, p3, p4}. According to the access order in ListP, u2 should be the first to be added to SI. We compare u2’s index in ListP, 3, with A(p2), A(p3) and A(p4). Since A(p4) = 2 < 3, no user in SU can be added to SI. The search process backtracks to SI’s previous state.
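The access indexes of Definition 4.4 and the backtracking test of Example 4.7 can be sketched as follows. The function names are hypothetical, indexes are 1-based as in the text, and regions are abstracted as sets of covered query points.

```python
def access_indexes(list_p, cover, P):
    """A(p): the 1-based index in ListP of the last user whose region covers p
    (0 if no user in the list covers p)."""
    return {
        p: max((i for i, u in enumerate(list_p, start=1) if p in cover[u]),
               default=0)
        for p in P
    }

def should_backtrack(list_p, A, SU, uncovered):
    """True if some uncovered query point can no longer be covered, i.e. the
    smallest index among the users of SU is greater than A(p)."""
    if not uncovered:
        return False
    if not SU:
        return True
    position = {u: i for i, u in enumerate(list_p, start=1)}
    smallest = min(position[u] for u in SU)
    return any(smallest > A[p] for p in uncovered)

list_p = ["u1", "u4", "u2", "u3", "u5", "u6"]          # from Example 4.6
cover = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}
A = access_indexes(list_p, cover, ["p1", "p2", "p3", "p4"])
print(A)

# Setting of Example 4.7: SI = {u1}, so p2, p3 and p4 are still uncovered.
print(should_backtrack(list_p, A, SU=["u2", "u3", "u5", "u6"],
                       uncovered={"p2", "p3", "p4"}))
```

The computed indexes are A(p1) = 1, A(p2) = 4, A(p3) = 6 and A(p4) = 2, as in Example 4.7, and the test signals a backtrack because the first remaining user (u2, index 3) already lies beyond A(p4) = 2.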

4.3 Hybrid Indexing

In this section, we design a novel index structure, the Enhanced Social-aware R-tree (SaR-

tree), to further accelerate query processing.

4.3.1 SaR-tree

The SaR-tree structure [74] is a variant of the R-tree that indexes both spatial locations and social relations. Figure 4.4 illustrates a simple SaR-tree. Different from a classical R-tree, each entry of an SaR-tree contains two major pieces of information: a set of core bounding rectangles (CBRs) (see Definition 4.5) that encodes the social information and a minimum bounding rectangle (MBR) that encodes the spatial information as in an R-tree. Intuitively, a CBR bounds the users by the social constraint while an MBR bounds the users by the spatial constraint, and therefore an SaR-tree gains the ability of both social-based and spatial-based pruning for GSKCG query processing.

[Figure: users u1 to u9 and three rectangles r1, r2 and r3]

Figure 4.5: Example of CBRs in an SaR-tree

Definition 4.5. (Core bounding rectangle) Consider a user u ∈ G. Given a minimum

degree constraint k, the core bounding rectangle CBRu,k is a rectangle that contains

u and inside which any user group with u (excluding the users on the bounding edges)

cannot be a k-core.

Note that, for given u and k, CBRu,k may not be unique. We illustrate the idea of

CBR in the following example.

Example 4.8. Consider the LBSN in Figure 4.5. Given k = 2, the rectangle r1 is a

CBRu2,2 because any user group inside r1 that contains u2 is not a 2-core. Similarly, r3

is another CBRu2,2 for u2. In contrast, r2 is not a CBRu2,2 because {u1, u2, u5} in r2

form a 2-core.

In addition to CBRs and an MBR, each entry in an SaR-tree also contains a core

number. A user u’s core number is the maximum k for which u belongs to a k-core,

denoted by cn(u). The core number of an entry e is defined as the maximum of the core

numbers of the users covered by e, denoted by cn(e).
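Core numbers can be computed by repeatedly peeling a vertex of minimum remaining degree. The sketch below uses a simple quadratic-time version of this peeling for readability; it is not the linear-time bucket-based algorithm, and the function name and the toy edges are assumptions made for illustration.

```python
def core_numbers(adj):
    """Return cn(u) for every user u: the largest k such that u belongs to a k-core."""
    degree = {u: len(adj[u]) for u in adj}
    remaining = set(adj)
    cn, k = {}, 0
    while remaining:
        # Peel a user of minimum remaining degree.
        u = min(remaining, key=degree.get)
        k = max(k, degree[u])      # the core level never decreases while peeling
        cn[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return cn

# Hypothetical toy graph (assumed edges, for illustration only).
EDGES = [("u1", "u3"), ("u1", "u4"), ("u3", "u4"),
         ("u1", "u2"), ("u2", "u3"), ("u2", "u5"), ("u5", "u6")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

cn = core_numbers(adj)
print(cn)

# Core number of an entry e that covers u4, u5 and u6: the maximum over its users.
print(max(cn[u] for u in ["u4", "u5", "u6"]))
```

In this toy graph, u5 and u6 have core number 1 and the remaining users have core number 2, so an entry covering u4, u5 and u6 would store cn(e) = 2.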

4.3.2 Enhanced SaR-tree

Unfortunately, the SaR-tree structure proposed in [74] cannot support GSKCG queries. The main reason is that the method of computing CBRs in [74] assumes that each user is associated with a spatial point, whereas in our problem each user has an associated region. This fact significantly complicates the problem and demands a new method to construct CBRs.

[Figure: panel (a) Social networks, showing users u and u1 to u7; panel (b) Associated regions, showing the regions of u and u1 to u7]

Figure 4.6: A sample LBSN for constructing CBR

We propose a novel index structure, known as the enhanced SaR-tree, to address this

problem. To construct an Enhanced SaR-tree over an LBSN, we first construct a standard

R-tree rtree and then compute the CBR for each entry in rtree. To compute the CBR of

an entry, we should know how to build a user’s CBR. The general idea of constructing a

user’s CBR includes two steps. First, as the users’ associated regions may intersect with

each other, we calculate the user’s internal CBR (see Definition 4.6). Second, given the

user’s internal CBR, we expand it to obtain the corresponding external CBR (see Defini-

tion 4.7), from which the user’s CBR will be selected. We give the formal definitions of

these two types of CBRs below. For ease of exposition, we denote “a k-core containing a

user u” by “k(u)”.

Definition 4.6. (Internal CBR) Given a k value, a user u’s internal CBR iCBRu,k is a

rectangle that is inside u.R and that does not contain a k(u).

Example 4.9. Consider the LBSN in Figure 4.6. Figure 4.7 shows some iCBRu,2 of user

u, marked by the shaded areas. Figure 4.7(a–b) and Figure 4.7(c–e) show iCBRu,2 of

user u in x-direction and y-direction, respectively. In Figure 4.7(a), the shaded area is an

iCBRu,2 of user u because: 1) it is inside u.R, and; 2) the users in this iCBRu,2 (i.e.,

u1, u2, and u) cannot form a 2-core containing u.

[Figure: five panels (a) to (e), each showing the regions of u and u1 to u5 together with the two sliding lines ℓ1 and ℓ2]

Figure 4.7: Constructing user u’s internal CBRs

Definition 4.7. (External CBR) Given a user u’s internal CBR $iCBR_{u,k}$, the corresponding external CBR $eCBR_{u,k}$ is defined as a rectangle that: 1) contains this $iCBR_{u,k}$; 2) is inside the MBR of u’s parent in rtree; and 3) does not contain a k(u).

Example 4.10. Continue with Example 4.9. Given a user u’s iCBRu,2 in Figure 4.7(a),

Figure 4.8 shows the corresponding eCBRu,2. The outermost rectangle marks the MBR

of u’s parent in the enhanced SaR-tree.

[Figure: the regions of u and u1 to u7]

Figure 4.8: Constructing a user u’s external CBRs

Algorithm 8 describes how to construct a CBR of a user u. We first use an R-tree to find the users whose associated regions overlap with u.R, and add them into a user set H (Line 1, Algorithm 8). We then construct a set of $iCBR_{u,k}$ of u from two directions (i.e., x-direction and y-direction). We put the left and right (or bottom and top) edges of the familiar regions of the users in H into the line set X (or Y), respectively, and sort the lines in X (or Y) in ascending order to facilitate the construction of internal CBRs (Line 4, Algorithm 8). Then we use the GetInternalCBRs function to generate the internal CBRs in both directions. Based on these internal CBRs, we invoke the GetExternalCBRs function to calculate the corresponding external CBRs. Finally, the external CBR with the maximum area is returned as u’s CBR. Next we elaborate on GetInternalCBRs and GetExternalCBRs.

Algorithm 8 GetUserCBR(User u, Integer k, LBSN G, Enhanced SaR-tree rtree)
 1: H ← the users whose familiar regions overlap with u.R;
 2: X ← left and right edges of the familiar region of each user in H;
 3: Y ← top and bottom edges of the familiar region of each user in H;
 4: Sort the elements in X and Y in ascending order;
 5: iCBR(X) ← GetInternalCBRs(u, k, X, H, G);
 6: iCBR(Y) ← GetInternalCBRs(u, k, Y, H, G);
 7: iCBRs ← iCBR(X) ∪ iCBR(Y);
 8: eCBRs ← GetExternalCBRs(u, k, iCBRs, G, rtree);
 9: Return the element of eCBRs with the maximum area;

Algorithm 9 GetInternalCBRs(User u, Integer k, Line set X, User set H, LBSN G)
 1: LB ← a line on the left edge of u.R;
 2: $\ell_1$ ← LB;
 3: $\ell_2$ ← LB;
 4: iCBRs ← ∅;
 5: while $\ell_1$ and $\ell_2$ do not exceed the right edge of u.R do
 6:     OperateBoth($\ell_1$, $\ell_2$, X, G);
 7:     OperateL2($\ell_2$, X, G);
 8:     iCBRs.add($\Lambda[\ell_1, \ell_2, u.R]$);
 9:     OperateL1($\ell_1$, X, G);
10: Return iCBRs;

The GetInternalCBRs function. The general idea of constructing $iCBR_{u,k}$ of a user u is to alternately slide two vertical (or horizontal) lines on u.R in x-direction (or y-direction). In the end, the area inside the intersection of these two lines and u.R will be u’s $iCBR_{u,k}$. Since the construction of $iCBR_{u,k}$ in y-direction is similar to that in x-direction, we only discuss the case for x-direction.

Given a user u, a value of k, a sorted line set X and a user set H, we primarily perform three kinds of operations on lines $\ell_1$ and $\ell_2$ (i.e., move $\ell_1$ and $\ell_2$ simultaneously, move $\ell_2$ alone, and move $\ell_1$ alone) to obtain $iCBR_{u,k}$ of u in x-direction. Initially, we place both $\ell_1$ and $\ell_2$ on the left edge of u.R (Lines 1–3, Algorithm 9), and then move $\ell_1$ and $\ell_2$ rightward using one of the following operations:

1. OperateBoth: When $\ell_1$ and $\ell_2$ overlap with each other, we move them rightward to the next line in X (but not exceeding the right edge of u.R) such that the users in H whose familiar regions are touched by $\ell_1$ and $\ell_2$, denoted by $H(\ell_1)$, do not form a k(u).

2. OperateL2: We move $\ell_2$ rightward to the next line in X (not exceeding the right edge of u.R) such that the users in the rectangle bounded by $\ell_1$, $\ell_2$ and u.R, denoted by $\Lambda[\ell_1, \ell_2, u.R]$, form a k(u) (see footnote 2). Now, $\Lambda[\ell_1, \ell_2, u.R]$ is an internal CBR of u.

3. OperateL1: We move $\ell_1$ rightward to the next line in X (not exceeding the right edge of u.R) such that the users in $\Lambda[\ell_1, \ell_2, u.R]$ do not form a k(u) (see footnote 3). Note that $\ell_1$ is always on the left-hand side of $\ell_2$.

We alternate these three types of operations until both $\ell_1$ and $\ell_2$ stop at the right edge of u.R. Finally, GetInternalCBRs returns all internal CBRs of u.

Example 4.11. Consider the LBSN in Figure 4.6. We illustrate how to compute iCBRu,2 of user u in x-direction in Figure 4.7. Initially, lines ℓ1 and ℓ2 are placed on the left edge of u.R. Since at this time H(ℓ1) = {u1, u2, u} and these users do not form a k(u) with k = 2, there is no need to move ℓ1 and ℓ2. We then move ℓ2 rightward to the left edge of u4, and now the users in the rectangle Λ[ℓ1, ℓ2, u.R], {u1, u2, u4, u}, form a k(u). So the current Λ[ℓ1, ℓ2, u.R] is an internal CBR of u. Next, we move ℓ1 rightward to the right edge of u1 because now the users in Λ[ℓ1, ℓ2, u.R], {u2, u4, u}, do not contain a k(u). We continue this process until both ℓ1 and ℓ2 reach the right edge of u.R.

The GetExternalCBRs function. Given the set of iCBRu,k returned by Algorithm 9, Algorithm 10 is designed for constructing user u’s eCBRu,k. The basic idea is to expand each iCBRu,k in iCBRs to obtain the corresponding eCBRu,k. We expand each edge of iCBRu,k outward until the users within the expanded rectangle form a k(u). Recall that, by Definition 4.7, u’s eCBRu,k is inside the MBR of u’s parent in the enhanced SaR-tree rtree. So we should stop expanding an edge once it reaches the boundary of the MBR.

Algorithm 10 GetExternalCBRs (User u, Integer k, Internal CBRs iCBRs, LBSN G, Enhanced SaR-tree rtree)
1: eCBRs ← ∅;
2: for each internal CBR iCBRu,k in iCBRs do
3:     eCBRu,k ← Expand each edge of iCBRu,k until a k(u) appears or the edge reaches the MBR boundary of u’s parent in rtree;
4:     eCBRs.add(eCBRu,k);
5: Return eCBRs;

² Once ℓ2 touches the left edge of a user’s familiar region, this user is in the rectangle.
³ Once ℓ1 touches the right edge of a user’s familiar region, this user is not in the rectangle any more.

Example 4.12. Continue with Example 4.11. Given iCBRu,2 of user u shown in Fig-

ure 4.7(a), we show an example of constructing the corresponding eCBRu,2 in Figure 4.8.

Assume that the outermost rectangle is the MBR of u’s parent. We sequentially move each

of the four edges of iCBRu,2 outward until getting the shadow area.

Finally, we discuss how to compute the CBR of each entry in the enhanced SaR-tree

by a bottom-up approach. A leaf entry’s CBR is the CBR of the user it represents. For

an internal entry e, let its child entries be e1, e2, · · · , em. Given the minimum degree

constraint k, e’s CBR CBRe can be computed by recursively applying the following

function on its child entries’ CBRs CBRei:

\[
CBR_{e_1^{i+1},k} =
\begin{cases}
CBR_{e_1^{i},k}, & \text{if } MBR_{e_{i+1}} \cap CBR_{e_1^{i},k} = \emptyset \\
CBR_{e_1^{i},k} \cap CBR_{e_{i+1},k}, & \text{otherwise}
\end{cases}
\qquad (4.3.4)
\]

where CBR_{e_j^i,k} denotes the CBR constructed from CBR_{e_j,k}, CBR_{e_{j+1},k}, · · · , CBR_{e_i,k}. Therefore, CBR_{e,k} = CBR_{e_1^m,k}. It is easy to verify that, by this construction, any user group within CBR_{e,k} cannot be a k-core, giving extra pruning capabilities.
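The bottom-up combination in Equation (4.3.4) can be sketched as a left-to-right fold over the child entries. This is only an illustrative sketch: the rectangle encoding `(x_min, y_min, x_max, y_max)`, the function names `intersect` and `fold_cbr`, and the choice to treat zero-area overlaps as empty are assumptions, not part of the thesis.

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # assumed encoding: (x_min, y_min, x_max, y_max)


def intersect(a: Optional[Rect], b: Optional[Rect]) -> Optional[Rect]:
    """Overlap of two axis-aligned rectangles, or None if it is empty.

    Assumption: rectangles that merely touch (zero area) count as empty.
    """
    if a is None or b is None:
        return None
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None


def fold_cbr(child_mbrs: Sequence[Rect],
             child_cbrs: Sequence[Optional[Rect]]) -> Optional[Rect]:
    """Apply Equation (4.3.4) over the child entries e_1, ..., e_m for a fixed k."""
    cbr = child_cbrs[0]  # base case: the CBR built from e_1 alone
    for mbr, next_cbr in zip(child_mbrs[1:], child_cbrs[1:]):
        if intersect(mbr, cbr) is not None:
            # The next child's MBR overlaps the running CBR: shrink it.
            cbr = intersect(cbr, next_cbr)
        # Otherwise the running CBR is kept unchanged.
    return cbr


mbrs = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
cbrs = [(0, 0, 8, 8), (4, 4, 12, 12), (30, 30, 35, 35)]
result = fold_cbr(mbrs, cbrs)
```

In this toy input the third child lies far from the running CBR, so it leaves the result untouched, which mirrors the first case of the equation.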

In practice, k usually does not have to be a large value. Setting k to 1, 2, or 3 normally

suffices for all ordinary requirements of social constraint. Thus, when k is small, we can

build indexes for each of the possible k values with reasonable space and time. Without loss of generality, we also discuss the case when k is large. Here we can select a set of k values to build the indexes by exploiting a property of k-cores, namely that k-core ⊆ (k−1)-core. With this property, we only need to build indexes for k = 2^0, 2^1, · · · , 2^{⌊log_2 cn(e)⌋}, where cn(e) is the maximum core number of the child entries rooted at entry e. Given a GSKCG query Q = (k′, P), we can make use of the CBRs of k = 2^{⌊log_2 k′⌋} (the largest indexed value not exceeding k′) for

GSKCG query processing. This method may incur false positive users (the users whose

core number is less than k′), which may in turn enlarge the searching space and increase

the computation cost in the later steps. However, it does not compromise the correctness

of the query results and normally can be done with reasonable space cost of the indexes

and efficiency of query processing.
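As a small illustration of the power-of-two scheme described above, the following sketch lists the k values that would be indexed for a given maximum core number and picks the indexed value used for a query with constraint k′. The function names are hypothetical and chosen only for this example.

```python
from typing import List


def indexed_k_values(max_core: int) -> List[int]:
    """Powers of two 2^0, 2^1, ..., 2^floor(log2(max_core))."""
    if max_core < 1:
        return []
    # bit_length() - 1 equals floor(log2(max_core)) for positive integers.
    return [1 << i for i in range(max_core.bit_length())]


def index_k_for_query(k_query: int) -> int:
    """Largest indexed k not exceeding the query's k', i.e. 2^floor(log2 k')."""
    if k_query < 1:
        raise ValueError("k' must be a positive integer")
    return 1 << (k_query.bit_length() - 1)


levels = indexed_k_values(52)   # 52 is the largest core number reported for the datasets
chosen = index_k_for_query(5)   # a query with k' = 5 falls back to the index built for k = 4
```

Because the chosen k never exceeds k′, the filter built for it can only keep too many users, never too few, which is why correctness is preserved.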

We also give a space complexity analysis for our proposed Enhanced SaR-tree. For an entry e or a user u, we only store the CBRs of e or u for the core numbers 2^0, 2^1, · · · , 2^{⌊log_2 cn(e)⌋}. Let M denote the maximum core number of the users in G, s be the fanout of our index, and n be the number of users in an LBSN G. The upper bound of the total number of CBRs (denoted by N_cbr) in an Enhanced SaR-tree can be computed as:

\[
N_{cbr} \le \frac{2n(\lfloor \log_2 M \rfloor + 1)}{s} + \sum_{u \in V} \lfloor \log_2 cn(u) \rfloor
\le n\left(\frac{2(\lfloor \log_2 M \rfloor + 1)}{s} + 1 + \left\lfloor \log_2 \frac{\sum_{u \in V} cn(u)}{n} \right\rfloor\right)
\qquad (4.3.5)
\]

Since M and (Σu∈V cn(u))/n are usually small in a social network, the space cost of CBRs is comparable to that of G. For the datasets used in our experiments, the maximum values of M and (Σu∈V cn(u))/n over both datasets are 52 and 3.8, respectively. We set the fanout s of our indexes to 100. Hence, the maximum number of CBRs of our Enhanced SaR-tree on both datasets is around 2.12n.
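The figure of roughly 2.12n can be checked by plugging the reported values into the right-hand side of the bound, read here as n · (2(⌊log2 M⌋ + 1)/s + 1 + ⌊log2 of the average core number⌋). The function name below is hypothetical; the inputs 52, 3.8 and 100 are the values quoted in the text.

```python
import math


def cbr_bound_per_user(max_core: int, avg_core: float, fanout: int) -> float:
    """Upper bound on the number of stored CBRs divided by the number of users n."""
    index_part = 2 * (math.floor(math.log2(max_core)) + 1) / fanout
    user_part = 1 + math.floor(math.log2(avg_core))
    return index_part + user_part


bound = cbr_bound_per_user(max_core=52, avg_core=3.8, fanout=100)
print(f"about {bound:.2f} CBRs per user")
```

With M = 52 the first term contributes 2 · 6 / 100 = 0.12, and with an average core number of 3.8 the second contributes 1 + 1 = 2.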

4.3.3 GSKCG Query Processing

In this section, we present our integrated algorithm SaRBasedKCGFinder. Generally, the

algorithm consists of two steps: 1) filter impossible users based on the enhanced SaR-tree;

2) feed the remaining users to KCGFinder. We give the details of SaRBasedKCGFinder

in Algorithm 11.

Algorithm 11 SaRBasedKCGFinder(Query points P, Integer k, Enhanced SaR-tree rtree, LBSN G)
1: MBR(P) ← The minimum rectangle containing all points in P;
2: Initialize H with the root of rtree;
3: while H has non-leaf entries do
4:     e ← The first non-leaf entry in H;
5:     for each child entry e′ of e do
6:         if MBR(P) ∩ e′.MBR ≠ ∅ and cn(e′) ≥ k and MBR(P) ⊄ CBRe′,k then
7:             H.push(e′);
8: VH ← The set of users represented by the entries in H;
9: Return KCGFinder(P, k, G[VH]);

We first calculate the minimum rectangle containing all query points P (i.e., the cov-

erage of P ), denoted by MBR(P ). We iteratively prune impossible users in the LBSN

G by traversing the enhanced SaR-tree rtree. Note that, for the same LBSN G, rtree

just needs to be constructed once and thereafter can be used for all GSKCG queries. At

each entry e of rtree, we compare MBR(P) with e’s MBR and CBR and check the core

number of e in order to prune out the users that cannot appear in the final result (Line 6,

Algorithm 11). Finally, we feed the subgraph of G that contains the users represented by

the entries in H to KCGFinder and return its output.
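The filtering step can be sketched as a tree traversal that applies the three tests of Line 6 to every child entry. The `Entry` class, its field names and the helper functions are hypothetical stand-ins for the enhanced SaR-tree, and the subset test is taken as non-strict containment; the sketch stops at the surviving users and leaves the call to KCGFinder out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # assumed encoding: (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


@dataclass
class Entry:
    """Hypothetical tree entry; the field names are illustrative only."""
    mbr: Rect
    core_number: int                                        # cn(e)
    cbrs: Dict[int, Rect] = field(default_factory=dict)     # k -> CBR of this entry
    children: List["Entry"] = field(default_factory=list)   # empty for a leaf
    user: Optional[str] = None                              # set for a leaf entry


def mbr_of(points: Sequence[Point]) -> Rect:
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contained_in(inner: Rect, outer: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def filter_users(root: Entry, points: Sequence[Point], k: int) -> List[str]:
    """Return the users that survive the spatial, core-number and CBR tests."""
    q = mbr_of(points)
    frontier, survivors = [root], []
    while frontier:
        e = frontier.pop()
        if not e.children:                  # leaf entry: keep its user
            if e.user is not None:
                survivors.append(e.user)
            continue
        for child in e.children:
            cbr = child.cbrs.get(k)
            pruned_by_cbr = cbr is not None and contained_in(q, cbr)
            if overlaps(q, child.mbr) and child.core_number >= k and not pruned_by_cbr:
                frontier.append(child)
    return sorted(survivors)


a = Entry(mbr=(0, 0, 3, 3), core_number=3, user="a")
b = Entry(mbr=(10, 10, 12, 12), core_number=3, user="b")            # no spatial overlap
c = Entry(mbr=(0, 0, 3, 3), core_number=1, user="c")                # core number too small
d = Entry(mbr=(0, 0, 5, 5), core_number=3, cbrs={2: (0, 0, 4, 4)}, user="d")  # query inside CBR
root = Entry(mbr=(0, 0, 12, 12), core_number=3, children=[a, b, c, d])
kept = filter_users(root, [(1, 1), (2, 2)], k=2)
```

Each of the three pruned leaves fails exactly one of the tests, so the example shows the conditions independently.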

It is easy to extend our algorithm to support the case where each user has multiple

associated regions. For each associated region of a user u, we index u and this associated

region once in the Enhanced SaR-tree. Thus, the number of associated regions of u equals the number of times u is indexed. In spatial filtering, if a user u appears more than once, we simply merge the occurrences. The following branch and bound search

process remains the same as the case where each user has exactly one associated region.


In this section, we experimentally study the performance of three algorithms. The first one

is the basic KCGFinder (referred to as Baseline) presented in Section 4.2.1. The second

one is KCGFinder coupled with the set of pruning techniques (Advanced). The third one

is SaRBasedKCGFinder (SaRBased).

Table 4.2: Dataset properties

                                    Brightkite    Gowalla
Total # of users                    58,228        196,591
Total # of friend relations         214,018       950,327
Medium region area (km2)            12.24         10.23
Diameter (longest shortest path)    16            14
Total # of check-ins                4,491,143     6,442,890
Maximum # of cores                  52            51

4.4.1 Datasets and Queries

We evaluate the proposed algorithms on two datasets collected from Brightkite and Gowalla,⁴ two real-world LBSNs. The properties of the two datasets are summarized in Table 4.2. Since these websites do not directly provide users’ regions, we use a density-based clustering method to form their regions from check-in locations. A user may have

several clustered associated regions. By default, we choose the region with the most

check-ins as a user’s associated region. The medium areas of the regions are 12.24 km2

on Brightkite and 10.23 km2 on Gowalla, respectively. More specifically, the cumulative

distribution of the region sizes on Brightkite is: 21.5% of regions ≤ 0.1 km2; 33.2%, ≤1

km2; 45.6%, ≤10 km2; 79.4%, ≤100 km2, and so on. For Gowalla, the distribution is:

19.1% of regions ≤ 0.1 km2; 28.2%, ≤1 km2; 50.7%, ≤10 km2; 80.4%, ≤100 km2, and

so on. In our experiments, we also show the results of the case where users have mul-

tiple associated regions. The testing GSKCG queries are randomly generated on these

two real-life datasets. The query set on Brightkite includes 100 queries, while the query

set on Gowalla includes 200 queries. Each query contains several query points that are

randomly selected from all users’ associated regions.

4.4.2 Setup

All the algorithms are implemented in the Java programming language. The CPU is an Intel Xeon X5650 processor (2.67 GHz) and the RAM is 8 GB of DDR3 memory. The number of query points and the value of k in a query both vary from 1

to 5. Unless explicitly specified, the default value of k and the default number of query

points in a query are both set to 3. The fanout of the Enhanced SaR-tree is 100. The storage cost and building time of the Enhanced SaR-tree on Brightkite and Gowalla are (96.3 MB, 10.23 mins) and (275.6 MB, 29.03 mins), respectively.

⁴ Publicly available at: http://snap.stanford.edu/.

[Figure 4.9 plot: two panels, "Varying # of k (Brightkite)" and "Varying # of k (Gowalla)"; x-axis k from 1 to 5, y-axis Running Time (ms) on a log scale; curves Baseline, Advanced and SaRBased.]

Figure 4.9: Running time vs. k value


We evaluate the performance of these three algorithms under different parameter settings.

As in many other performance evaluation schemes for query processing [41, 67, 68], we

report the overall performance in terms of the average query running time.

Effect of the value of k. In the first set of experiments, we evaluate the performance

of the algorithms under different k values. From Figure 4.9, we can observe that both

Advanced and SaRBased perform substantially better than Baseline. Note that the y-axis

is in log-scale. With the increase of k, SaRBased exhibits increasingly better performance

than Baseline and Advanced. This is because, the larger k is, the larger CBRs are, and

therefore the query points are more likely to be covered by larger CBRs, so that more

tree branches are pruned. When k = 1, the difference of query time among these three

algorithms is small. The reason is that k = 1 indicates a loose social constraint and small

CBRs, resulting in weak pruning capabilities.

Effect of the number of query points. In Figure 4.10, we examine the query perfor-

mance by varying the number of query points. In general, the query time increases when

the number of query points increases because more query points require more candidate

users to be added into the search space. Compared to the other two algorithms, SaRBased

is relatively less sensitive to the increase of the number of query points. Even when the

number is 5, the performance of SaRBased is still reasonably good.

Effect of the coverage of query points. In this set of experiments, we evaluate the performance by varying the coverage of query points. From Figure 4.11, we find that SaRBased still performs best under different coverages. When the coverage is small, the query points are more likely to fully fall into some CBRs of the enhanced SaR-tree, leading to a smaller search space. This explains why the running time is shorter when the coverage is smaller.

[Figure 4.10 plot: two panels, "Varying # of Query Points (Brightkite)" and "Varying # of Query Points (Gowalla)"; x-axis 1 to 5 query points, y-axis Running Time (ms) on a log scale.]

Figure 4.10: Running time vs. number of query points

[Figure 4.11 plot: two panels, "Varying Query Points Coverage (Brightkite)" and "Varying Query Points Coverage (Gowalla)"; x-axis coverage of 0.1 km2, 1 km2, 10 km2 and 100 km2, y-axis Running Time (ms).]

Figure 4.11: Running time vs. query point coverage

Effect of multiple associated regions. As a proof-of-concept, in Figure 4.12, we present

the performance of the algorithms when each user has on average 3 associated regions.

Although it takes longer to process the queries, the general trends of Figure 4.12 are

similar to those of a single associated region presented in Figures 4.9 and 4.10. When a

user is associated with more associated regions, the running time increases because the

number of users covering the query points increases, which implies a larger search space.

Due to space limitations, we only show the performance on Brightkite. We observe similar

results on Gowalla.

Pruning capabilities of different strategies. In Figure 4.13, we show the pruning capabilities of different pruning strategies, where BP, DBP, AO and SAR stand for basic pruning, diameter based pruning, access order based pruning and enhanced SaR-tree based pruning, respectively. We can see that all strategies can help reduce the running time. In particular, AO is most effective when the value of k or the number of query points is relatively large.

[Figure 4.12 plot: two panels with x-axis values 1 to 5 and y-axis Running Time (ms) on a log scale; panel titles not recoverable.]

Figure 4.12: Running time under multiple familiar regions

[Figure 4.13 plot: two panels with x-axis values 1 to 5 and y-axis Running Time (ms) on a log scale; curves BP, BP+DBP, BP+DBP+AO and BP+DBP+AO+SAR; panel titles not recoverable.]

Figure 4.13: Pruning capabilities of different schemes

Sizes of returned user groups. Figure 4.14 shows the average group size of the query results. We can see that as k or the number of query points increases, the group size also increases. The reason is that when k increases, the lower bound of the group size needed to form a k-core group increases. Meanwhile, if the number of query points increases, the group may need more users to cover the query points.

Effect of the size of LBSNs. Next, we show the scalability of the algorithms under various network sizes in Figure 4.15. We randomly extract several subsets of users with increasing sizes to test the algorithms’ scalability. As expected, the results demonstrate that SaRBased achieves the best efficiency in all cases. Even over relatively large networks (e.g., more than 180k users), it still responds quickly, demonstrating the potential for practical use.

Quality of query results. Finally, we compare the quality of results returned by GSKCG and the existing spatial task outsourcing in terms of the average group size and the average social cohesiveness (e.g., the average number of the familiar persons of each member in a group) of the returned group. Specifically, we consider the typical spatial task outsourcing (STO) problem that finds a minimum group to collaboratively cover a given number of spatial-point-related tasks. We set the parameters of the GSKCG query Q = (k, P) to be k = 2 and |P| = 5 and the parameters of the STO query Q = (P) (i.e., not considering the social constraint) to be |P| = 5. From the experimental results shown in Figure 4.16, we have the following interesting observation: the average size of the group returned by GSKCG queries is close to the size of the group returned by STO queries, whereas the average social cohesiveness of the group returned by GSKCG queries is much larger than that of the group returned by STO queries. Thus, we conclude that, compared with STO queries, the GSKCG query can find groups with much better social cohesiveness at the cost of a small increase in the group size. This is very meaningful for the real-life applications of collaborative spatial computing.

[Figure 4.14 plot: two panels, "Varying # of k" and "Varying # of query points"; x-axis 1 to 5, y-axis group size (2 to 10); curves Brightkite and Gowalla.]

Figure 4.14: Size of query results

[Figure 4.15 plot: two panels, "Varying Network Size (Brightkite)" with x-axis 10k to 50k users and "Varying Network Size (Gowalla)" with x-axis 30k to 180k users; y-axis Running Time (ms).]

Figure 4.15: Running time vs. network size

[Figure 4.16 plot: two panels, "Query types (Brightkite)" and "Query types (Gowalla)"; x-axis GSKCG and STO, y-axis Quality of result (0 to 6); bars group size and social cohesiveness.]

Figure 4.16: Quality comparison of the returned groups

4.5 Summary

In this chapter, we have introduced a new practical type of GSKCG queries that considers

both users’ associated spatial regions and their social acquaintance levels. A GSKCG

query aims to find a minimum user group that covers all query points and that is a k-

core. We have proposed an efficient algorithm SaRBasedKCGFinder to find the optimal

solution, whose success lies in a set of effective pruning strategies and a novel index

structure. Extensive experiments on two real-life datasets demonstrate the efficiency and

effectiveness of our solution.

Chapter 5

Towards Social-aware Ridesharing Group Query Services

As described in Chapter 1, ridesharing is a promising approach for reducing energy consumption and easing traffic congestion while satisfying people’s commuting needs. However, the main problem in current ridesharing systems is the trust issue, which keeps the acceptance level of ridesharing low. To tackle this problem, in this chapter, we propose a new kind of ridesharing query, namely the social-aware ridesharing group (SaRG) query, which is based on trip matching and social acquaintance. The rest of this chapter is organized as follows. The SaRG query problem is formulated in Section 5.1. We

propose the baseline algorithm and a set of pruning strategies for SaRG query process-

ing in Section 5.2. In Section 5.3, we present several incremental approaches to reduce

a large number of repeated computations. The SIR-tree index structure is presented in

Section 5.4. Experimental results are reported in Section 5.5, followed by the summary

of this chapter in Section 5.6.

5.1 Problem Formulation

In this section, we present some preliminaries and provide the problem statement, fol-

lowed by an example to illustrate the problem defined. Table 5.1 summarizes the notations

used throughout this chapter.

Table 5.1: Summary of notations

Notation                 Definition
G = (V, E)               a social network.
D                        the rider space in which each rider has a ride request.
G[V′]                    the subgraph of G containing only V′.
u, ui, v, vi             a user of G; u or ui represents a driver, v or vi represents a rider.
tpv = (o, d)             the rider v’s ridesharing trip, where o and d represent the origin and destination of v, respectively.
tpu                      a driver u’s ridesharing trip.
qu                       an SaRG query of the driver u.
D(tpv, tpu)              the travel cost of a rider v.
G^k_s(u)                 a ridesharing group containing a driver u and a set of s riders.
D(tpu, G^k_s(u))         the travel cost of a ridesharing group G^k_s(u).
SI                       an intermediate solution set where |SI| ≤ s.
SU                       the set of remaining riders.
Dlb(tpu, SI)             the travel cost lower bound of any valid ridesharing group derived from SI.
L^m_v                    a size-m rider list in which the seen riders are sorted in ascending order by their travel costs.
CS                       the sorted list of seen riders.
Dlb(v, CS)               the lower bound of the travel cost on the unseen rider v in D − CS.
Dlb(G^k_s(u), L^m_v)     the lower bound of the travel cost on the unseen ridesharing groups G^k_s(u) in L^m_v.
NB_SI(v)                 the set of v’s neighbors in SI.
Dia(G^k_s(u))            the diameter of G^k_s(u).
Dia_ub(G^k_s(u))         the upper bound of Dia(G^k_s(u)).
A(v)                     the access index of the last user of user v’s neighbor in L^m_v.
k_SI(v)                  the core number of user v in the subgraph G[SI].
cmax(e)                  the maximum core number of the users rooted at the entry e of the SIR-tree.
nb[e|x]                  the user set containing the users whose social distances to all users rooted at e are ≤ x.

As motivated by the social-aware ridesharing framework in Chapter 1, we define an

SaRG query over a set of riders D and a social network G=(V,E). Each rider v∈D

has a ridesharing trip request denoted by tpv=(o, d) where o and d represent the origin

and destination of v’s trip, respectively. For the social network G, each vertex v∈V

is a user (either a driver or a rider) and each edge e∈E denotes an acquainted relation

between two users it connects. Each driver u’s ride offer forms an SaRG query qu that

will be introduced later. Once the RSP receives an SaRG query qu from a driver u, it will

return u with the most suitable riders from D by considering trip matching and social

acquaintance. Before giving the formal definition of an SaRG query, we explain how to

measure trip matching between the riders and driver, and social acquaintance among the

members in a ridesharing group.

An SaRG query aims to find a ridesharing group with a desired level of social acquaintance. To model such social acquaintance, we assume the existence of a social network graph in which users are connected if they have acquaintance relationships (e.g., friends or colleagues). Such a network might be derived from call graphs based on telephone call detail records (CDRs) or online social networks such as Facebook and Twitter [16]. There are a number of social models that can be employed to measure the social cohesiveness of a ridesharing group, such as star (friend) (one central user has direct connections to all other users), star (friend of friend) (one central user has direct or through-a-friend connections to all other users), and k-core (see Definition 5.1, each user has direct connections with at least k other users).

Table 5.2: Survey results (216 participants)

Social Model               Acceptable for Ridesharing
Star (friend)              95.43%
Star (friend of friend)    71.23%
1-core                     92.24%

Definition 5.1. (k-core) Given a graph G=(V,E), a k-core is a connected subgraph SG=(SV, SE) (SV⊆V, SE⊆E) in which each vertex v∈SV has degree at least k.
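To make the definition concrete, the following sketch tests whether a set of users induces a k-core in an undirected graph stored as adjacency sets. The graph encoding and the function name are assumptions made for illustration; the check looks at the subgraph induced by the given users.

```python
from collections import deque
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]  # assumed encoding: undirected adjacency sets


def is_k_core(graph: Graph, members: Iterable[str], k: int) -> bool:
    """True if the subgraph induced by `members` is connected with minimum degree >= k."""
    group = set(members)
    if not group:
        return False
    # Degree condition, counting only neighbours inside the group.
    if any(len(graph[v] & group) < k for v in group):
        return False
    # Connectivity check by breadth-first search restricted to the group.
    start = next(iter(group))
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v] & group:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == group


# A triangle a-b-c with a pendant vertex d attached to c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
triangle_is_2_core = is_k_core(g, {"a", "b", "c"}, 2)
whole_graph_is_2_core = is_k_core(g, g.keys(), 2)
```

The pendant vertex d has only one neighbour, so adding it to the triangle breaks the 2-core condition while the whole graph is still a 1-core.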

To compare these social models, we have conducted an online survey with 216 volunteers to evaluate their acceptance levels for ridesharing (see Table 5.2 for the survey results).¹ In addition to users’ acceptance level, the feasibility of forming ridesharing

groups in real-life applications (i.e., whether the service provider can find ridesharing

groups for drivers which satisfy the social model being used) is an equally important fac-

tor in selecting an appropriate social model. For this reason, we have also examined the

potential groups under different social models for the users of New York City in two real

datasets (Brightkite and Gowalla [17]). Figure 5.1 gives the numbers of potential size-5

ridesharing groups. It can be observed that while the star (friend) model achieves a good

acceptance level, it is too demanding to form a good number of social groups. Combining

these two aspects, in this chapter we take k-core as the primary social model to address

the social-aware ridesharing problem.

We next explain how to measure trip matching of a ridesharing group. The primary

cost of a rider in Slugging is the travel cost between the rider’s origin, destination and

the driver’s origin, destination. Therefore, we define the travel costs of a rider and a

ridesharing group as follows.

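¹ http://www.sojump.com/.

[Figure 5.1 plot: horizontal bar chart of the number of potential groups (axis from 0 to 80,000) on Gowalla and Brightkite under the Star (Friend), Star (Friend of Friend) and 1-Core models.]

Figure 5.1: Numbers of potential social groups of size 5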

Definition 5.2. (Travel cost of a rider) Given the trip tpu of a driver u’s ride offer, the

travel cost of a rider v is defined as:

D(tpv, tpu) = ||tpv.o, tpu.o||+ ||tpv.d, tpu.d||, (5.1.1)

where ||·, ·|| denotes the Euclidean distance between two spatial points.

A ridesharing group consists of a driver u and a size-s set of riders, denoted by G^k_s(u), where s is the number of available seats. Note that the size of a ridesharing group is s+1.

Definition 5.3. (Travel cost of a ridesharing group) Given a driver u’s trip tpu, the travel cost of a ridesharing group G^k_s(u) is

\[
D(tp_u, G^k_s(u)) = \sum_{v \in G^k_s(u)} D(tp_v, tp_u). \qquad (5.1.2)
\]

We call a ridesharing group G^k_s(u) a k-core group if the subgraph G[G^k_s(u)] of the underlying social network G is a k-core. Now we are ready to define an SaRG query.

Definition 5.4. (SaRG query) Given a set of riders D and a social network G=(V,E), an SaRG query is defined as a quadruple qu=(u,k,s,tpu), where qu.u is the driver (query issuer), qu.k and qu.s are positive integers indicating the social acquaintance constraint in terms of k-core and the number of available seats for ridesharing, respectively, and qu.tpu is the driver u’s trip. The query returns the ridesharing group G^k_s(u) with the minimum travel cost among all k-core ridesharing groups of size s+1 in G. A ridesharing group is valid with respect to an SaRG query qu if it is a k-core group and its size is s+1.

The returned ridesharing group should have the smallest travel cost because naturally only riders whose origin and destination are close to those of the driver are willing to join the driver’s ridesharing. Below we give an example to illustrate an answer to an SaRG query.

[Figure 5.2 diagram: a social level showing users u, v1, v2 and v3, a spatial level showing the origins o, o1, o2, o3 and destinations d, d1, d2, d3, and a table of the riders’ distances to the driver’s origin and destination: v1 (1, 1), v2 (2, 1.5), v3 (2.5, 3).]

Figure 5.2: An example of an SaRG query

Example 5.1. Consider a social network G=(V ,E), a set of riders D={v1,v2,v3}, a

driver u, as shown in Figure 5.2. The travel cost of a rider vi (1≤i≤3) is listed in the

right table of Figure 5.2. The SaRG query qu=(u,k,s,tpu) with k=2, s=2, and tpu=(o,d)

returns Gks(u)={u,v1,v3} because {u,v1,v3} is the group with the minimum travel cost

D(tpu, {u, v1, v3})=(1+1)+(2.5+3)=7.5 among all size-3 2-core ridesharing groups (the

other group is {u,v2,v3}).
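The answer in Example 5.1 can be reproduced by exhaustive enumeration, which also makes Definitions 5.2 to 5.4 concrete. The acquaintance edges below are an assumption chosen only so that the valid 2-core groups match those named in the example (the figure itself is not reproduced here), and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Set, Tuple

# Assumed acquaintance edges, consistent with the two valid groups in Example 5.1.
GRAPH: Dict[str, Set[str]] = {
    "u": {"v1", "v2", "v3"},
    "v1": {"u", "v3"},
    "v2": {"u", "v3"},
    "v3": {"u", "v1", "v2"},
}
# Travel cost of each rider: origin distance plus destination distance (Eq. 5.1.1).
COST: Dict[str, float] = {"v1": 1 + 1, "v2": 2 + 1.5, "v3": 2.5 + 3}


def is_k_core(group: Set[str], k: int) -> bool:
    """Induced subgraph is connected and every member has >= k neighbours in it."""
    if any(len(GRAPH[v] & group) < k for v in group):
        return False
    seen, stack = set(), [next(iter(group))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(GRAPH[v] & group)
    return seen == group


def sarg_brute_force(driver: str, k: int, s: int) -> Optional[Tuple[Set[str], float]]:
    """Try every size-s rider set and keep the cheapest valid k-core group."""
    best = None
    for riders in combinations(sorted(COST), s):
        group = {driver, *riders}
        if is_k_core(group, k):
            cost = sum(COST[r] for r in riders)  # the driver contributes zero cost
            if best is None or cost < best[1]:
                best = (group, cost)
    return best


answer = sarg_brute_force("u", k=2, s=2)
```

Enumeration of this kind grows combinatorially with the number of riders, which is the motivation for the pruning and incremental techniques developed in the rest of the chapter.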

We establish the hardness of the SaRG query problem in the theorem below.

Theorem 5.1. The SaRG query problem is NP-hard.

Proof. We prove the hardness by a reduction from a classical NP-complete problem, namely the p-clique problem [33]. An instance of the p-clique problem consists of a graph G′=(V′,E′), where V′ and E′ are the vertex set and edge set of G′, respectively. The decision problem is to determine whether there exists a clique (i.e., a complete subgraph) of size p in G′.

Given an instance of p-clique, we construct an instance of the SaRG query qu=(u,k,s,tpu) on a set of users with G=G′, s=p−1, k=s, and make the travel cost between any two users in G be 1. If G′ contains a p-clique, there must exist a group of size p in G such that each member in this group has social connections with the other s members of this group, and the group has a minimum travel cost of s. We thus prove the necessary condition. On the other hand, if G of the SaRG problem contains a valid group of size p with k=s, G′ in the p-clique problem must contain a clique of size p, too. This gives the sufficient condition. Hence, the theorem is proved.

In this chapter, we tackle the problem of efficiently processing SaRG queries in prac-

tical settings.

5.2 Algorithm Design

In this section, we present an algorithm named RSGExplorer and a set of pruning strate-

gies to obtain the optimal answer to an SaRG query.

5.2.1 RSGExplorer Algorithm

The general idea of RSGExplorer is that, given an SaRG query qu=(u,k,s,tpu), we first retrieve the top-m (m≥s) riders in D with the minimum travel cost, and then invoke the branch and bound search to find the current optimal answer G^k_s(u) among these top-m riders. If the travel cost of G^k_s(u) is less than or equal to the travel cost lower bound of the unseen ridesharing groups, G^k_s(u) is returned as the final optimal answer. Otherwise, we continue to retrieve the top-(m+1) rider and reinvoke the branch and bound search to find the next optimal answer. The above process repeats until the final optimal answer is identified.

To retrieve the top-m riders with the minimum travel cost, we build two spatial RTree

indexes [28], rtreeo and rtreed, to index the origins and destinations of the riders in D,

respectively. By adopting typical kNN search in spatial databases [52] over the RTree in-

dex, we can easily visit the riders in increasing order of the distance between their origins

(destinations) and the driver’s origin (destination). Algorithm 12 shows the pseudo code

of RSGExplorer. In the beginning, we initialize a set of variables (Lines 1–7). The prior-

ity queues Qo and Qd are initialized as (rtreeo.root,0) and (rtreed.root,0), respectively.

The elements of Qo (Qd) are sorted in increasing order by the shortest distances between

their corresponding RTree entries and the driver’s origin (destination). m and Dlb(v,CS)

are respectively initialized to s and ∞, which are used for finding the top-m riders with the minimum travel cost. The optimal ridesharing group to return, G^k_s(u), is initialized to ∅. Note that the initial travel cost of an empty G^k_s(u), cost, is set to ∞. Two sorted rider lists CS and L^m_v, in which the riders are sorted in ascending order by their travel costs, are both initialized as ∅. In addition, two rider sets SI and SU are also set to ∅ for the branch and bound search over L^m_v in a later stage.

Algorithm 12 RSGExplorer(Driver u, Integer k, Integer s, Trip tpu, SocialNetwork G, RTree rtreeo, RTree rtreed)
1: Initialize priority queues Qo and Qd with entries (rtreeo.root, 0) and (rtreed.root, 0), respectively;
2: Integer m ← s;
3: Initialize the sorted rider lists CS and L^m_v as ∅;
4: Ridesharing group G^k_s(u) ← ∅;
5: Double cost ← ∞;
6: Double Dlb(v, CS) ← ∞;
7: Initialize the rider sets SI and SU as ∅;
8: while |CS| ≤ |D| do
9:     v′ ← GetNextRider(Qo, rtreeo); // tpv′.o is closest to tpu.o
10:    v′′ ← GetNextRider(Qd, rtreed); // tpv′′.d is closest to tpu.d
11:    Insert rider v′ (v′′) into CS unless v′ ∈ CS (v′′ ∈ CS);
12:    if |CS| ≥ m then
13:        Dlb(v, CS) ← ||tpv′.o, tpu.o|| + ||tpv′′.d, tpu.d||;
14:        Rider vm ← the m-th rider in the rider list CS;
15:        if D(tpvm, tpu) ≤ Dlb(v, CS) then
16:            Insert the first m riders of CS into L^m_v;
17:            SI ← {vm};
18:            SU ← L^m_v − {vm};
19:            G^k_s(u)′ ← GetOptimalGroup(u, k, s, tpu, SI, SU, G^k_s(u), G);
20:            if D(tpu, G^k_s(u)′) ≤ cost then
21:                G^k_s(u) ← G^k_s(u)′;
22:                cost ← D(tpu, G^k_s(u));
23:            if cost ≤ Dlb(G^k_s(u), L^m_v) then
24:                Return G^k_s(u);
25:            m ← m+1;
26: Return ∅;

After the initialization stage, we use the function GetNextRider(·) (the typical kNN search mentioned above) to find the next rider v′ (v′′) whose origin (destination) is closest to the driver’s origin (destination). We compute the travel cost of v′ (v′′) and insert v′ (v′′) into CS if v′ ∉ CS (v′′ ∉ CS) (Lines 9–11). Once the size of CS becomes ≥ m, we calculate the travel cost lower bound Dlb(v, CS) of the unseen riders according to Theorem 5.2.

Theorem 5.2. Let v′ and v′′ be the riders newly found by GetNextRider(·) in rtreeo and rtreed respectively, and CS be the seen sorted rider list. The travel cost lower bound of the unseen rider v ∉ CS is Dlb(v,CS) = ||tpv′.o, tpu.o|| + ||tpv′′.d, tpu.d||.

Proof. Since v′ and v′′ are the riders newly found by GetNextRider(·) in rtreeo and rtreed respectively, for any unseen rider v, we have ||tpv.o, tpu.o|| ≥ ||tpv′.o, tpu.o|| and ||tpv.d, tpu.d|| ≥ ||tpv′′.d, tpu.d||. Thus, the travel cost of v satisfies

\[
\begin{aligned}
D(tp_v, tp_u) &= ||tp_v.o, tp_u.o|| + ||tp_v.d, tp_u.d|| \\
&\ge ||tp_{v'}.o, tp_u.o|| + ||tp_{v''}.d, tp_u.d|| \\
&= D_{lb}(v, CS)
\end{aligned}
\]

Thus, this theorem is proved.
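The stopping rule based on this lower bound can be sketched without an R-tree by replacing the two incremental nearest-neighbour searches with two pre-sorted lists. This substitution, the dictionary inputs and the function name are assumptions for illustration; the thesis itself uses kNN search over rtreeo and rtreed.

```python
from typing import Dict, List, Tuple


def top_m_riders(origin_dist: Dict[str, float],
                 dest_dist: Dict[str, float],
                 m: int) -> List[Tuple[float, str]]:
    """Return the m riders with the smallest travel cost as (cost, rider) pairs.

    Riders are pulled in increasing order of origin distance and of destination
    distance; the search stops once the m-th best seen cost is <= D_lb(v, CS).
    """
    by_origin = sorted(origin_dist, key=origin_dist.get)
    by_dest = sorted(dest_dist, key=dest_dist.get)
    seen: Dict[str, float] = {}
    for v1, v2 in zip(by_origin, by_dest):
        for v in (v1, v2):
            seen[v] = origin_dist[v] + dest_dist[v]      # travel cost, Eq. (5.1.1)
        lower_bound = origin_dist[v1] + dest_dist[v2]    # bound on every unseen rider
        ranked = sorted((c, v) for v, c in seen.items())
        if len(ranked) >= m and ranked[m - 1][0] <= lower_bound:
            return ranked[:m]
    return sorted((c, v) for v, c in seen.items())[:m]


origin = {"a": 1.0, "b": 2.0, "c": 2.5, "d": 9.0}
dest = {"a": 1.0, "b": 1.5, "c": 3.0, "d": 0.5}
top2 = top_m_riders(origin, dest, 2)
```

On this input the search stops after three pulls from each list, because the second-best seen cost (3.5) no longer exceeds the bound 2.5 + 1.5 = 4 on any unseen rider.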

If the travel cost of the $m$-th rider $v_m$ in $CS$ is $\le D_{lb}(v, CS)$, the top-$m$ riders in D with the minimum travel cost are found (Lines 13–15). Afterwards, we insert the first $m$ riders in $CS$ into $L^m_v$ (Line 16). We add $v_m$ into $S_I$ and $L^m_v - v_m$ into $S_U$ to make sure that $v_m$ is in the newly found group, which avoids enumerating again the ridesharing groups that appeared in the previous iterations (Lines 17–18). The intuition is that all the ridesharing groups consisting of $u$ and $s$ riders from $L^m_v - v_m$ (i.e., $L^{m-1}_v$) have been checked in the previous branch and bound search over the search space $L^{m-1}_v$. Thus we only need to check the remaining ridesharing groups, which consist of $v_m$ and $s-1$ other riders from $L^m_v - v_m$.

We then invoke Algorithm 13 to find the current optimal answer $G^k_s(u)'$ in $L^m_v$ (Line 19). If the travel cost of $G^k_s(u)'$ is less than or equal to the travel cost lower bound of the unseen ridesharing groups, $G^k_s(u)'$ is returned as the final optimal answer (Lines 20–24). The correctness is guaranteed by Theorem 5.3.

Theorem 5.3. Let $v_i$ be the $i$-th rider in the current sorted rider list $L^m_v$. A lower bound of the travel cost of an unseen ridesharing group $G^k_s(u)$ in $L^m_v$ is

$$
D_{lb}(G^k_s(u), L^m_v) = D(tp_{v_m}, tp_u) + \sum_{j=1}^{s-1} D(tp_{v_j}, tp_u). \tag{5.2.3}
$$

Proof. For any unseen ridesharing group $G^k_s(u)$, there must exist a rider $v_t \in G^k_s(u)$ such that $D(tp_{v_t}, tp_u) > D(tp_{v_m}, tp_u)$. As $v_1, v_2, \ldots, v_{s-1}$ are the top-$(s-1)$ riders in $L^m_v$, we have $\sum_{v \in G^k_s(u) - v_t} D(tp_v, tp_u) > \sum_{j=1}^{s-1} D(tp_{v_j}, tp_u)$. Therefore, we have

$$
\begin{aligned}
D(tp_u, G^k_s(u)) &= \sum_{v \in G^k_s(u)} D(tp_v, tp_u) \\
&= D(tp_{v_t}, tp_u) + \sum_{v \in G^k_s(u) - v_t} D(tp_v, tp_u) \\
&> D(tp_{v_m}, tp_u) + \sum_{j=1}^{s-1} D(tp_{v_j}, tp_u) \\
&= D_{lb}(G^k_s(u), L^m_v).
\end{aligned}
$$

This proves the theorem.
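As a small illustration of Equation 5.2.3, the following sketch computes the group lower bound from the ascending list of travel costs in $L^m_v$. The function name is hypothetical; the sample costs are the first four values listed in Figure 5.4(b).

```python
def group_lower_bound(sorted_costs, s):
    """Equation 5.2.3: cost of the m-th rider plus the s-1 cheapest riders.

    sorted_costs holds D(tp_{v_1}, tp_u), ..., D(tp_{v_m}, tp_u) in ascending order.
    """
    if len(sorted_costs) < s:
        raise ValueError("need at least s riders in the list")
    return sorted_costs[-1] + sum(sorted_costs[:s - 1])

# m = 4 riders seen so far, group of s = 3 riders.
lb = group_lower_bound([2, 3.5, 4, 4], s=3)
```

Here the bound is the cost of the fourth rider (4) plus the two cheapest costs (2 and 3.5).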

Algorithm 13 shows the pseudo code of GetOptimalGroup which attempts to find the

most suitable ridesharing group from Lmv . For a systematic enumeration of all candidate

ridesharing groups, we employ the branch and bound search algorithm. In the branch

and bound search process, we keep track of two rider sets SI and SU, which represent the

intermediate solution set and the set of remaining riders, respectively. This process can

be organized into a tree structure, as illustrated in Figure 5.3, in which an internal node

represents an SI and a leaf node represents a size-s rider set. Given an internal node SI, we

can derive a lower travel cost bound Dlb(tpu, SI) of any valid ridesharing group derived

from SI and SU, which is guaranteed by Theorem 5.4.

Theorem 5.4. Let $v'$ be the rider with the maximum travel cost in $S_I$. The travel cost lower bound of any valid ridesharing group derived from $S_I$ and $S_U$ is

$$
D_{lb}(tp_u, S_I) = (s - |S_I|) \cdot D(tp_{v'}, tp_u) + \sum_{v \in S_I} D(tp_v, tp_u). \tag{5.2.4}
$$

Proof. Suppose $G^k_s(u)$ is an intermediate answer derived from $S_I$ and $S_U$. For any rider $v \in G^k_s(u) - S_I$, we have $D(tp_v, tp_u) > D(tp_{v'}, tp_u)$. Thus, $\sum_{v \in G^k_s(u) - S_I} D(tp_v, tp_u) > \sum_{v \in G^k_s(u) - S_I} D(tp_{v'}, tp_u) = (s - |S_I|) \cdot D(tp_{v'}, tp_u)$. We have:

$$
\begin{aligned}
D(tp_u, G^k_s(u)) &= \sum_{v \in G^k_s(u)} D(tp_v, tp_u) \\
&= \sum_{v \in G^k_s(u) - S_I} D(tp_v, tp_u) + \sum_{v \in S_I} D(tp_v, tp_u) \\
&\ge (s - |S_I|) \cdot D(tp_{v'}, tp_u) + \sum_{v \in S_I} D(tp_v, tp_u) \\
&= D_{lb}(tp_u, S_I).
\end{aligned}
$$

Thus, this theorem is proved.

[Figure 5.3: Branch and bound search tree. Nodes shown: NULL, v1, v1v2, v1v3, v1v2v3, v1v2v4, v1v3v4; the arrows are labeled "Expanding" and "Backtracking".]
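The node-level bound of Equation 5.2.4 and the pruning test it supports can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and returning 0 for an empty $S_I$ (where $v'$ is undefined) is an assumption of the sketch.

```python
def node_lower_bound(si_costs, s):
    """Equation 5.2.4 for an intermediate set S_I given its riders' travel costs."""
    if not si_costs:
        return 0.0  # assumption: no information yet, so the trivial bound
    return (s - len(si_costs)) * max(si_costs) + sum(si_costs)

def can_prune(best_cost, si_costs, s):
    """Stop expanding S_I when the best known answer is already no worse."""
    return best_cost <= node_lower_bound(si_costs, s)

# S_I = {v1, v2} with costs 2 and 3.5, group of s = 3 riders.
lb = node_lower_bound([2, 3.5], s=3)
```

With one slot left, the bound is 1 × 3.5 + (2 + 3.5); a best known answer of cost 9 or less would therefore end the search below this node.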

Based on this travel cost lower bound, the search below $S_I$ can be pruned as soon as the travel cost of the current optimal answer $G^k_s(u)$ is $\le D_{lb}(tp_u, S_I)$, since no ridesharing group derived from $S_I$ and $S_U$ can then be cheaper (Line 2). We iteratively move the rider with the minimum travel cost from $S_U$ to $S_I$ and check whether the resultant group is valid (Lines 4–8). A property of $k$-core is that if a vertex is not in the maximum $k$-core of $G$, it cannot be in any $k$-core subgraph of $G$. Thus, if the maximum $k$-core computed from $G[u \cup S_I' \cup S_U]$ cannot cover all riders in $u \cup S_I'$, no valid ridesharing group can be derived from $S_I'$ and $S_U$ (Lines 10–12). Otherwise, we recursively call GetOptimalGroup with the current values of the input arguments to find the optimal answer $G^k_s(u)'$. If $D(tp_u, G^k_s(u)') < D(tp_u, G^k_s(u))$, we update the current optimal answer $G^k_s(u)$ with $G^k_s(u)'$ (Lines 14–17).
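The maximum $k$-core test used in this step can be sketched with the standard peeling procedure, which repeatedly removes vertices of degree below $k$. The adjacency-dictionary representation, the function name, and the toy graph are assumptions of this sketch, not the thesis implementation.

```python
def max_k_core(adj, vertices, k):
    """Return the vertex set of the maximum k-core of the subgraph induced by `vertices`.

    adj maps each vertex to the set of its neighbors in the whole social network.
    """
    alive = set(vertices)
    deg = {v: len(adj[v] & alive) for v in alive}
    stack = [v for v in alive if deg[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already peeled via another entry
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
                if deg[w] < k:
                    stack.append(w)
    return alive

# Hypothetical network: a triangle u-a-b plus a pendant vertex d attached to u.
adj = {"u": {"a", "b", "d"}, "a": {"u", "b"}, "b": {"u", "a"}, "d": {"u"}}
core = max_k_core(adj, adj.keys(), k=2)
covers = {"u", "d"} <= core  # would S_I' = {d} survive together with u?
```

In this toy case `d` is peeled away, so a branch whose intermediate set contains `d` cannot lead to a valid 2-core group and can be abandoned.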

We establish the correctness of RSGExplorer below.

Theorem 5.5. RSGExplorer finds the correct answer to an SaRG query.

Proof. We prove it by contradiction. Assume that RSGExplorer returns $G^k_s(u)$ as the final optimal answer to $q_u = (u, k, s, tp_u)$. Now suppose there exists $G^k_s(u)'$ with the minimum travel cost such that $D(tp_u, G^k_s(u)') < D(tp_u, G^k_s(u))$, where $G^k_s(u)$ and $G^k_s(u)'$ are found from $L^m_v$ and $L^{m'}_v$ respectively. There are three possible cases: (1) If $m < m'$, which means that $L^m_v \subset L^{m'}_v$, RSGExplorer first finds $G^k_s(u)$, and then $G^k_s(u)'$. By Theorem 5.3, we have $D(tp_u, G^k_s(u)) \le D_{lb}(G^k_s(u), L^m_v) \le D(tp_u, G^k_s(u)')$, which contradicts the assumption that $D(tp_u, G^k_s(u)') < D(tp_u, G^k_s(u))$. (2) If $m = m'$, which means that $L^m_v = L^{m'}_v$, we have $D(tp_u, G^k_s(u)') = D(tp_u, G^k_s(u))$. Thus, RSGExplorer must return $G^k_s(u)$ or $G^k_s(u)'$ as the final answer. (3) If $m > m'$, which means that $L^{m'}_v \subset L^m_v$, RSGExplorer first finds $G^k_s(u)'$, and then $G^k_s(u)$. Again by Theorem 5.3, we have $D(tp_u, G^k_s(u)') \le D_{lb}(G^k_s(u), L^{m'}_v) \le D(tp_u, G^k_s(u))$. Thus, RSGExplorer must return $G^k_s(u)'$ in $L^{m'}_v$ as the final answer, not $G^k_s(u)$ in $L^m_v$. Hence, the correctness of RSGExplorer is proved.

Algorithm 13 GetOptimalGroup(Driver u, Integer k, Integer s, Trip tp_u, RiderSet S_I, RiderSet S_U, SaRG G^k_s(u), SocialNetwork G)
 1: while |S_I| + |S_U| ≥ s do
 2:     if D(tp_u, G^k_s(u)) ≤ D_lb(tp_u, S_I) then
 3:         Break;
 4:     Select the rider v with the minimum travel cost from S_U;
 5:     S_I′ ← S_I ∪ {v}, S_U ← S_U − {v};
 6:     if |u ∪ S_I′| = s+1 then
 7:         if G[u ∪ S_I′] is a k-core then
 8:             Return u ∪ S_I′;
 9:     else
10:         Compute the maximum k-core group S from G[u ∪ S_I′ ∪ S_U];
11:         if u ∪ S_I′ ⊈ S then
12:             Break;
13:         else
14:             S_U ← S − S_I′ − u;
15:             G^k_s(u)′ ← GetOptimalGroup(u, k, s, tp_u, S_I′, S_U, G^k_s(u), G);
16:             if D(tp_u, G^k_s(u)′) < D(tp_u, G^k_s(u)) then
17:                 G^k_s(u) ← G^k_s(u)′;
18: Return G^k_s(u);

As proved in Theorem 5.5, RSGExplorer correctly finds the optimal answer. However,

the enumerating process is time consuming. Thus, we develop several pruning strategies

to prune the search space in order to accelerate the search speed.

[Figure 5.4: An example of SaRG query. (a) A social network over the users u, v1, v2, v3, v4, v5, v6, v7. (b) The travel cost of rider vi; the listed values, in ascending order, are 2, 3.5, 4, 4, 4.5, 4.5, 7.]

5.2.2 Quota Available Strategy

By definition of $k$-core, we know that the degree of any vertex in the subgraph $G[G^k_s(u)]$ should be at least $k$. For the rider sets $S_I$ and $S_U$, let $s - |S_I|$ be the quota left in any valid ridesharing group containing $u \cup S_I$. If the minimum vertex degree of $G[u \cup S_I]$ plus this quota is still less than $k$, adding riders from $S_U$ into $S_I$ cannot form a valid ridesharing group. This intuition is formalized in Theorem 5.6.

Theorem 5.6. Let $NB_{u \cup S_I}(v)$ be the set of neighbors of $v$ in $u \cup S_I$. If $\min\{|NB_{u \cup S_I}(v)| \mid v \in u \cup S_I\} + s - |S_I| < k$, no valid ridesharing group can be derived from the current $S_I$ and $S_U$.

Proof. Let $v'$ be the user with the minimum number of neighbors in $u \cup S_I$. Since we can add only $s - |S_I|$ users from $S_U$ to $S_I$, the degree of $v'$ in any valid group with size $s$+1 is at most $|NB_{u \cup S_I}(v')| + (s + 1) - |u \cup S_I| = |NB_{u \cup S_I}(v')| + s - |S_I|$ (reached when all the added users are neighbors of $v'$). By Definition 5.1, to form a valid group, the degree of $v'$ in the group should be $\ge k$. This establishes the theorem.

Example 5.2. Consider an example of an SaRG query qu=(u,k,s,tpu) with k=3, s=3,

in Figure 5.4. The social relations of the users are shown in Figure 5.4(a). Suppose

the current SI = {v1, v2} and SU = {v3, v4, v5, v6, v7}, we can calculate that u has the

minimum number of neighbors among u ∪ SI = {u, v1, v2} leading to |NBu∪SI(u)| = 0.

Since |NBu∪SI(u)| + s − |SI| = 0 + 3 − 2 = 1 < k, according to Theorem 5.6, we

conclude that no valid ridesharing group can be derived from the current SI and SU.
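The quota test of Theorem 5.6 amounts to one pass over $u \cup S_I$. The sketch below uses a hypothetical adjacency dictionary; the edges are chosen only so that $u$ has no neighbor in $\{v_1, v_2\}$, as in Example 5.2, and are not read from Figure 5.4(a).

```python
def quota_prunable(adj, u, s_i, s, k):
    """Theorem 5.6: True if the minimum degree inside u ∪ S_I plus the
    remaining quota s - |S_I| cannot reach k."""
    members = {u} | set(s_i)
    min_deg = min(len(adj.get(v, set()) & members) for v in members)
    return min_deg + s - len(s_i) < k

# Hypothetical edges (assumption of this sketch).
adj = {
    "u": {"v4", "v6"},
    "v1": {"v2", "v3"},
    "v2": {"v1"},
}
pruned = quota_prunable(adj, "u", ["v1", "v2"], s=3, k=3)
```

With $k=3$ and $s=3$, the minimum degree is 0 and the quota is 1, so the test reports that the branch can be dropped, matching the arithmetic of Example 5.2.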

Table 5.3: Access indexes of users in Figure 5.4

A(u)   A(v1)  A(v2)  A(v3)  A(v4)  A(v5)  A(v6)  A(v7)
6      3      1      5      7      6      5      4

In Theorem 5.6, we only consider the quota constraint between the group size s+1

and the social constraint k. In some cases, even if SI satisfies Theorem 5.6, riders from

SU still cannot be added into SI to form a valid group, for example, when the riders in SI

do not have neighbors in SU. Hence, we design an access index (see Definition 5.5) for

efficiently detecting such cases (see Theorem 5.7).

Definition 5.5. (Access index) Let $idx(v)$ be the index of user $v \in L^m_v$. The index of $u$ is set to $idx(u) = 0$. The access index of a user $v \in u \cup L^m_v$ is $A(v) = \max\{idx(v') \mid v' \in NB_{u \cup L^m_v}(v)\}$.

Continue with the example of SaRG query in Figure 5.4. Lmv = {v1, v2, v3, v4, v5, v6, v7}

is the sorted user list in which the users are sorted by their travel costs in ascending order.

The access indexes of the users in u ∪ Lmv are given in Table 5.3.
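Definition 5.5 can be computed directly from the sorted rider list, as the following sketch shows. The toy graph is hypothetical (it is not the network of Figure 5.4), and assigning an access index of 0 to a user without neighbors is an assumption, since the definition leaves that case open.

```python
def access_indexes(adj, u, sorted_riders):
    """A(v) = max idx over the neighbors of v within u ∪ L, with idx(u) = 0."""
    idx = {u: 0}
    idx.update({v: i for i, v in enumerate(sorted_riders, start=1)})
    return {
        v: max((idx[w] for w in adj.get(v, ()) if w in idx), default=0)
        for v in idx
    }

# Hypothetical network over u and three riders sorted by travel cost.
adj = {
    "u": {"v2"},
    "v1": {"v2", "v3"},
    "v2": {"u", "v1"},
    "v3": {"v1"},
}
A = access_indexes(adj, "u", ["v1", "v2", "v3"])
```

A small value of `A[v]` means that all neighbors of `v` appear early in the sorted list, which is what Theorem 5.7 exploits.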

Based on the definition of access index, we have the following theorem.

Theorem 5.7. Let $k(u \cup S_I)$ be the maximum core number of the subgraph $G[u \cup S_I]$. If $\max\{idx(v) \mid v \in S_I\} \ge \min\{A(v) \mid v \in u \cup S_I\}$ and $k(u \cup S_I) < k$, no valid ridesharing group can be derived from the current $S_I$ and $S_U$.

Proof. The condition $k(u \cup S_I) < k$ means that a rider $v' \in S_U$ should be added into $S_I$ to increase the core number of $S_I \cup u$ to form a $k$-core. The condition $\max\{idx(v) \mid v \in S_I\} \ge \min\{A(v) \mid v \in u \cup S_I\}$ means that some user in $u \cup S_I$ has no neighbor in $S_U$, so no valid ridesharing group (i.e., $k$-core) can be formed from the current $S_I$ and $S_U$. This theorem is proved.

Theorem 5.7 tells that if u∪ SI is not a k-core group and there exists a user v ∈ u∪ SI

that has no neighbor in SU, no valid ridesharing group can be derived from current SI and

SU. Example 5.3 illustrates a case of Theorem 5.7.

Example 5.3. Consider a social network and a sorted rider list $L^m_v = \{v_1, v_2, v_3, v_4, v_5, v_6, v_7\}$ with the access indexes shown in Table 5.3. Consider an SaRG query $q_u = (u, k, s, tp_u)$ with $k=3$ and $s=3$, and the current $S_I = \{v_1, v_2, v_3\}$ and $S_U = \{v_4, v_5, v_6, v_7\}$. We have $\max\{idx(v) \mid v \in \{v_1, v_2, v_3\}\} = idx(v_3) = 3$ and $k(u, v_1, v_2, v_3) = 0$. From Table 5.3, we know $\min\{A(v) \mid v \in \{u, v_1, v_2, v_3\}\} = A(v_2) = 1$. Because $idx(v_3) \ge A(v_2)$ and $k(u, v_1, v_2, v_3) < k$, Theorem 5.7 implies that we cannot derive a valid ridesharing group from the current $S_I$ and $S_U$.

5.2.3 Group Diameter Strategy

In this section, we propose another pruning technique based on the concept of k-core

group diameter. For a given size-(s+1) k-core group, we first present the definition of the

group diameter, and then derive the diameter upper bound of a size-(s+1) k-core group.

Based on the diameter upper bound, we introduce our diameter based pruning method.

Definition 5.6. (Diameter) The diameter of a ridesharing group $G^k_s(u)$ in $G$ is defined as the longest social distance (i.e., the longest shortest path length) between any two users in $G[G^k_s(u)]$, denoted by $Dia(G^k_s(u))$.

Let $Dia_{ub}(G^k_s(u))$ denote the upper bound of $Dia(G^k_s(u))$. In this chapter, we make use of the $k$-core group diameter upper bound proposed in [55].

Theorem 5.8. For a valid ridesharing group $G^k_s(u)$,

$$
Dia_{ub}(G^k_s(u)) =
\begin{cases}
1 & \text{if } s = k \\
2 & \text{if } k < s < 2k + 1 \\
3\left[\frac{s+1}{k+1}\right] + r(s+1, k) - 3 & \text{if } s \ge 2k + 1
\end{cases}
\tag{5.2.5}
$$

where

$$
r(s+1, k) =
\begin{cases}
0 & \text{if } mod(s+1, k+1) = 0 \\
1 & \text{if } mod(s+1, k+1) = 1 \\
2 & \text{if } mod(s+1, k+1) = 2
\end{cases}
$$

This diameter upper bound of $G^k_s(u)$ indicates a way to measure whether two users can co-exist in $G^k_s(u)$. Next we present Lemma 5.1 to prune the search space by using the diameter upper bound of $G^k_s(u)$.
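Equation 5.2.5 can be evaluated with a few lines of Python. Two points in this sketch are assumptions rather than statements of the thesis: the bracket $[\cdot]$ is read as integer (floor) division, and remainders larger than 2, which the definition of $r$ does not list, are mapped to 2.

```python
def diameter_upper_bound(s, k):
    """Dia_ub of a size-(s+1) k-core group following Equation 5.2.5."""
    if s < k:
        raise ValueError("a size-(s+1) k-core needs s >= k")
    if s == k:
        return 1
    if s < 2 * k + 1:
        return 2
    quotient, remainder = divmod(s + 1, k + 1)  # assumption: [.] is a floor
    r = min(remainder, 2)                       # assumption for remainder > 2
    return 3 * quotient + r - 3

bounds = {(s, k): diameter_upper_bound(s, k) for (s, k) in [(2, 2), (3, 2), (5, 2), (6, 2)]}
```

For the query of Example 5.4 ($k=2$, $s=2$) the function returns 1, i.e., the group must be a clique.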

Lemma 5.1. Consider an SaRG query qu = (u, k, s, tpu). For each v ∈ SU , if the social

distance between v and any user v′ ∈ u ∪ SI is larger than Diaub(Gks(u)), v cannot be

added into SI to form a valid ridesharing group.

Example 5.4. Consider an SaRG query $q_u = (u, k, s, tp_u)$ with $k=2$ and $s=2$, and the users' social relations and travel costs shown in Figure 5.4. According to Theorem 5.8, the diameter upper bound for the query $q_u$ is 1. Assume that the current $S_I$ and $S_U$ are ∅ and $\{v_1, v_2, v_3, v_4, v_5, v_6, v_7\}$ respectively. As the driver $u$ must be a member of any valid ridesharing group, the riders in $S_U$ whose social distances to $u$ are larger than 1 should be filtered out from $S_U$. We then get $S_U = \{v_4, v_6\}$. Obviously, the diameter-based pruning method can substantially shrink the search space. We can quickly derive the optimal ridesharing group $G^k_s(u) = \{u, v_4, v_6\}$.
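The filtering step of Lemma 5.1 can be sketched with a breadth-first search from every current group member. Measuring the social distance in the whole network, the function names, and the toy graph are assumptions of this sketch.

```python
from collections import deque

def social_distances(adj, source):
    """Hop distances from `source` to every reachable user (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def filter_by_diameter(adj, members, candidates, dia_ub):
    """Keep the candidates within dia_ub hops of every member of u ∪ S_I."""
    keep = set(candidates)
    for m in members:
        dist = social_distances(adj, m)
        keep = {v for v in keep if dist.get(v, float("inf")) <= dia_ub}
    return keep

# Hypothetical network: u is adjacent to a and c, and b is two hops away via a.
adj = {"u": {"a", "c"}, "a": {"u", "b"}, "b": {"a"}, "c": {"u"}}
kept = filter_by_diameter(adj, {"u"}, {"a", "b", "c"}, dia_ub=1)
```

With a diameter bound of 1, only the direct friends of `u` remain as candidates.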

5.2.4 k-plex Based Strategy

In this section, we present a novel pruning technique based on the concept of $k$-plex [6]. The advantage of $k$-plex lies in its property that, if $G'$ is a $k$-plex, any subgraph of $G'$ is also a $k$-plex. In contrast, $k$-core does not share such a property. Fortunately, we can easily convert between a $k$-core and a $k$-plex in order to enjoy this property. To find an SaRG $G^k_s(u)$, we convert the problem to finding a $k'$-plex of size $s$+1, where $k' = s+1-k$. Thus, if we can identify a maximum $k'$-plex in $G'$ whose size is $\ge s+1$, there must exist a $k$-core with size $s$+1. Otherwise, $G'$ does not contain a $k$-core with size $s$+1 (i.e., no valid ridesharing group can be found from $G'$).

Definition 5.7. ($k$-plex) Given a graph $G = (V, E)$, a $k$-plex is a subgraph $SG = (SV, SE)$ ($SV \subseteq V$, $SE \subseteq E$) in which each vertex $v \in SV$ has at least degree $|SG| - k$.
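Definition 5.7 and its link to $k$-cores can be checked with the sketch below, which tests the degree condition on the induced subgraph. The helper names and the two toy graphs are hypothetical.

```python
def induced_degrees(adj, vertices):
    """Degree of each vertex inside the subgraph induced by `vertices`."""
    vs = set(vertices)
    return {v: len(adj.get(v, set()) & vs) for v in vs}

def is_k_plex(adj, vertices, k):
    """Every vertex has degree at least |vertices| - k."""
    deg = induced_degrees(adj, vertices)
    return all(d >= len(deg) - k for d in deg.values())

def is_k_core(adj, vertices, k):
    """Every vertex has degree at least k."""
    deg = induced_degrees(adj, vertices)
    return all(d >= k for d in deg.values())

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

For a group of fixed size $n$, being a $k$-core is the same as being an $(n-k)$-plex, which is the conversion used in this section with $n = s+1$.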

To estimate the maximum size of a $k$-plex, we adopt the approach presented in [47] and calculate the size upper bound $B_p(G)$ of a maximum $k$-plex in a graph $G$ as follows:

$$
B_p(G) = \min_{i=1,\cdots,p} \left\{ \frac{1}{i} B(C^i_1, \cdots, C^i_{m_i}) \right\}, \tag{5.2.6}
$$

and

$$
B(C^i_1, \cdots, C^i_{m_i}) = \sum_{j=1}^{m_i} \min\{2k - 2 + k \bmod 2,\; k + a_{i,j},\; \Delta(G[C^i_j]) + k,\; |C^i_j|\} \tag{5.2.7}
$$

where the plex parameter $k$ in Equations 5.2.6 and 5.2.7 is set to $s+1$ minus the social constraint of the query, $C^i_1, \cdots, C^i_{m_i}$ are co-$k$-plexes [18] in which each vertex of $V$ appears exactly $i$ times, $a_{i,j} = \max\{m : |\{v \in V \wedge deg_G(v) \ge m\}| \ge k + m\}$ for each $C^i_j$, $p$ is an integer parameter to limit the iterations of computing, and $deg_G(v)$ represents the vertex $v$'s degree in $G$.

Lemma 5.2. Given an SaRG query qu = (u, k, s, tpu), if the size upper bound of k-plex

Bp(u∪SI∪SU) (k=(s+1)-k) is less than s+1, no valid ridesharing group can be derived

from the current SI and SU.

The size upper bound of the maximum k-plex is effective in pruning the search space.

Example 5.5 illustrates a case of Lemma 5.2.

Example 5.5. Consider the users shown in Figure 5.4(a). Assume that the current SI={v1,

v2, v3} and SU={v4,v5,v6,v7}. Given an SaRG query qu={u,k,s,tpu} with k=4 and s=4,

we can calculate the size upper bound of a ((4+1)-4)-plex, which is 4. Since the requested

group size is 4+1=5 > 4, no valid ridesharing group can be derived from the current SI

and SU.

5.2.5 Query Processing

In Algorithm 14, we integrate the three aforementioned types of pruning strategies into GetOptimalGroup. We call this integrated algorithm GetOptimalGroupStar. The differences between GetOptimalGroupStar and GetOptimalGroup are in Lines 4–14 of Algorithm 14. If there is not enough quota in $S_I$ to form a $k$-core, the search process on the current $S_I$ and $S_U$ will stop and backtrack to the last stage of $S_I$ (Lines 4–6). Otherwise, we continue to verify whether $\max\{idx(v) \mid v \in S_I\} \ge \min\{A(v) \mid v \in u \cup S_I\}$ and $k(u \cup S_I) < k$. If yes, the search process on the current $S_I$ and $S_U$ will be stopped (Lines 7–8). The correctness is guaranteed by Theorem 5.6 and Theorem 5.7. Given the current $S_I \cup u$, only the users in $S_U$ whose social distances to all the users in $S_I$ are within the group diameter upper bound $Dia_{ub}(G^k_s(u))$ are kept in $S_U'$; the others are filtered out. This pruning strategy (justified by Lemma 5.1) can substantially shrink the search space and reduce the time cost (Lines 9–11). Afterwards, according to Lemma 5.2, we calculate the group size upper bound $B_p(u \cup S_I \cup S_U)$ of an $((s+1)-k)$-plex in $G[u \cup S_I \cup S_U]$. If $B_p(u \cup S_I \cup S_U) < s+1$, no valid $G^k_s(u)$ can be found from the current $S_I$ and $S_U$ (Lines 12–14).

Algorithm 14 GetOptimalGroupStar(Driver u, Integer k, Integer s, Trip tp_u, RiderSet S_I, RiderSet S_U, SaRG G^k_s(u), SocialNetwork G)
 1: while |S_I| + |S_U| ≥ s do
 2:     if D(tp_u, G^k_s(u)) ≤ D_lb(tp_u, S_I) then
 3:         Break;
 4:     Select the user v′ with minimum |NB_{S_I ∪ u}(v′)| from S_I ∪ u;
 5:     if min{|NB_{u ∪ S_I}(v)| | v ∈ u ∪ S_I} + s − |S_I| < k then
 6:         Break;
 7:     if max{idx(v) | v ∈ S_I} ≥ min{A(v) | v ∈ u ∪ S_I} and k(u ∪ S_I) < k then
 8:         Break;
 9:     for each rider v ∈ S_U do
10:         if all social distances from v to S_I ≤ Dia_ub(G^k_s(u)) then
11:             Add v into S_U′;
12:     Compute the group size upper bound B_p(G[u ∪ S_I ∪ S_U]) of (s+1−k)-plex in G[u ∪ S_I ∪ S_U];
13:     if B_p(G[u ∪ S_I ∪ S_U]) < s+1 then
14:         Break;
15:     Select the rider v with minimum travel cost from S_U′;
16:     S_I′ ← S_I ∪ {v}, S_U′ ← S_U′ − {v};
17:     if |u ∪ S_I′| = s+1 then
18:         if G[u ∪ S_I′] is a k-core then
19:             Return u ∪ S_I′;
20:     else
21:         Compute the maximum k-core S of the subgraph G[u ∪ S_I′ ∪ S_U];
22:         if u ∪ S_I′ ⊈ S then
23:             Break;
24:         else
25:             S_U ← S − S_I′ − u;
26:             G^k_s(u)′ ← GetOptimalGroupStar(u, k, s, tp_u, S_I′, S_U′, G^k_s(u), G);
27:             if D(tp_u, G^k_s(u)) > D(tp_u, G^k_s(u)′) then
28:                 G^k_s(u) ← G^k_s(u)′;
29: Return G^k_s(u);

5.3 Incremental Strategies

For the GetOptimalGroupStar algorithm, there is still room to further improve its performance. First, in each iteration, when a rider is added into or removed from $S_I$ and $S_U$, the core decomposition algorithm is invoked to recompute $k_{u \cup S_I' \cup S_U}(v)$ where $v \in u \cup S_I' \cup S_U$ (Line 21, Algorithm 14), which is a time-consuming operation. In fact, as illustrated later, there is no need to recompute all the $k_{u \cup S_I' \cup S_U}(v)$. Second, the conditions that some riders cannot co-exist in a valid ridesharing group, which were checked in the previous iterations, may still hold in the subsequent iterations. By properly reusing the previous useful information, it is possible to avoid many repeated computations. Thus, in this section, we design several incremental strategies (i.e., incremental computation of core numbers, social diameter-based bounding and neighbor-based bounding) to further reduce the running time.

5.3.1 Incremental Computation of Core Numbers

In GetOptimalGroupStar, each time a rider is added into or removed from SI and SU,

the core decomposition algorithm is invoked over the current SI and SU. It means that we

need to recompute the core numbers of all riders in the current SI and SU. Such operations

are conducted frequently during the search process, which increases the running time.

Example 5.6 illustrates such a case.

Example 5.6. Consider the users in Figure 5.4(a). Let G[V ] be a subgraph of G, which

has been decomposed in the previous iteration, where V ={u,v1,v2,v3,v4,v5,v6}. The core

numbers of these vertices are: kV (u)=2, kV (v1)=1, kV (v2)=1, kV (v3)=1, kV (v4)=2,

kV (v5)=2, kV (v6)=2. Here, kV (v) denotes the core number of v in G[V ]. When user

v7 is added into V , the core numbers of u, v1,v2,v3,v4,v5,v6 do not change. Only the newly

added user v7’s core number needs to be computed, which is 1.

Here, we make use of Theorem 5.9 from [53] to shrink the vertex space by indicating

which vertices’ core numbers may change. We also adopt the Traversal Algorithm in [53]

for the incremental core decomposition when a user is added into or removed from the


current search space.

Theorem 5.9. Given a graph G=(V,E), if an edge (u, v) is inserted (removed) and

kV (u)≤kV (v), then only the vertex w∈V , which has kV (w)=kV (u) and is reachable from

u via a path consisting of the vertices with core number equal to kV (u), may have its core

number incremented (decremented).

Example 5.7. Continue with Example 5.6. Before the user v7 is added, the core num-

bers of the vertices in V ={u,v1,v2,v3,v4,v5,v6,v7} are: kV (u)=2, kV (v1)=1, kV (v2)=1,

kV (v3)=1, kV (v4)=2, kV (v5)=2, kV (v6)=2, kV (v7)=0. After v7 is added, since we have

an edge (v4, v7) inserted and kV (v7)=0 ≤ kV (v4), according to Theorem 5.9, only the

vertices whose core number is 0 and which are reachable from v7 via a path consisting

of vertices with core number 0 may have their core number changed. Since no user in V

satisfies this condition, only kV (v7) needs to increase by 1, leading to kV (v7)=1. The time

cost is reduced by avoiding recomputing kV (vi) (1≤i≤6).
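The candidate set described by Theorem 5.9 can be collected with a simple traversal, as sketched below. The sketch only identifies which vertices may change; it does not update the core numbers, and it is not the Traversal Algorithm of [53]. The adjacency dictionary is hypothetical apart from the inserted edge $(v_4, v_7)$, and the core numbers are those of Example 5.7.

```python
def core_update_candidates(adj, core, a, b):
    """Vertices whose core number may change after edge (a, b) is inserted or removed.

    Starting from the endpoint with the smaller core number, follow only
    vertices that share that core number (Theorem 5.9).
    """
    root = a if core[a] <= core[b] else b
    target = core[root]
    seen = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen and core[w] == target:
                seen.add(w)
                stack.append(w)
    return seen

core = {"u": 2, "v1": 1, "v2": 1, "v3": 1, "v4": 2, "v5": 2, "v6": 2, "v7": 0}
# Hypothetical neighborhood around the inserted edge (v4, v7).
adj = {"v7": {"v4"}, "v4": {"v7", "u", "v6"}, "u": {"v4", "v6"}, "v6": {"u", "v4"}}
candidates = core_update_candidates(adj, core, "v4", "v7")
```

Only `v7` is returned, in line with Example 5.7, so the core numbers of the other users need not be recomputed.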

5.3.2 Social Diameter-based Bounding

The diameter-based pruning technique presented in Section 5.2.3 indicates that some users

cannot co-exist in a ridesharing group due to the group diameter upper bound. Suppose

users v′ and v′′ cannot co-exist in any Gks(u) found from the current search space S. If

several users are added into S to form a new search space S ′, any user group from S ′

containing v′ and v′′ still cannot satisfy the group diameter upper bound. Therefore, if

such combinations of users calculated in the previous iterations can be cached, we can

prune out all the user groups containing such combinations directly. When the cache size

is fixed, the $S_I$ with a smaller size has a higher priority to be cached. This is because a set of a smaller size usually appears at a higher level of the branch and bound search tree (see Figure 5.3), which allows more combinations that cannot co-exist in a valid ridesharing group to be pruned.

Example 5.8. Continue with the example in Figure 5.4(a). Given an SaRG query $q_u = (u, k, s, tp_u)$ with $k=2$ and $s=2$, the diameter upper bound of a valid ridesharing group $G^k_s(u)$ is 1 according to Theorem 5.8. Thus, any user set whose social diameter is more than 1 cannot be the final result. Before adding $v_7$ into the search space, we already know that $v_2$ and $v_3$ cannot co-exist in a valid group. Hence, we cache the combination of $v_2$ and $v_3$. In the next iteration, when $v_7$ is added into the search space, any user group containing $v_2$ and $v_3$ can be directly removed from the solution space.
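The caching idea can be sketched as a small set of conflicting pairs. This is a simplification under stated assumptions: it stores only pairs of users and has no size limit or priority rule, whereas the text above caches intermediate sets with a preference for smaller ones. The class name is hypothetical.

```python
from itertools import combinations

class ConflictCache:
    """Remembers pairs of users that cannot co-exist in a valid group."""

    def __init__(self):
        self._pairs = set()

    def add(self, a, b):
        self._pairs.add(frozenset((a, b)))

    def violates(self, group):
        """True if the group contains any cached conflicting pair."""
        return any(frozenset(p) in self._pairs for p in combinations(group, 2))

cache = ConflictCache()
cache.add("v2", "v3")  # found incompatible before v7 joined the search space
```

A candidate group is then rejected by a cache lookup before any distance computation is repeated.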

5.3.3 Neighbor-based Bounding

Consider an intermediate rider set SI and a remaining rider set SU for an SaRG query

qu=(u,k,s,tpu) with k=2 and s=3. Assume the current G[u∪SI] is a size-3 2-core. To

form a valid ridesharing group of size-(3+1), one more rider v∈SU should be selected

and added into SI. According to the previous strategy, v should be the rider in D with the

minimum travel cost. However, if v is not a neighbor of any member in u∪SI , adding v

into SI may not help form a valid 2-core ridesharing group. If the next added riders are

most similar to v, the time cost would increase.

To reduce such non-beneficial adding operations, one possible way is to quickly find

a travel cost lower bound of the optimal ridesharing group derived from current u ∪ SI

and its members’ social neighbors in SU to prune these non-beneficial operations. Here,

we design a greedy algorithm GreedyRSGSearch (Algorithm 15) to find such a travel cost

lower bound. The general idea is to greedily retrieve an ((s+1)-k)-plex containing u∪ SI

with size s+1. Then, the valid ridesharing group, which is composed of the users of the

found ((s+1)-k)-plex, is updated as the current optimal answer to prune the search space

in future iterations when its travel cost is the current lowest.

Algorithm 15 shows the pseudo code of GreedyRSGSearch. We first initialize rider $v$ as the rider in $S_U$ with the maximum travel cost, and the neighbor set NBs of users in $S_I$ as ∅ (Lines 1–2). Thereafter, we continue to add the rider $v'$ into $S_I$ until the size of $u \cup S_I$ is $\ge s+1$. To select the rider $v'$, we first add all the neighbors of the users in $u \cup S_I$ that belong to $S_U$ into NBs (Line 4), and then select the rider $v' \in$ NBs with the minimum travel cost that makes $G[v' \cup u \cup S_I]$ an $((s+1)-k)$-plex (Lines 5–8). If such a rider $v'$ exists, we replace $v$ by $v'$ and add it into $S_I$. We repeat the same process until the size of $u \cup S_I$ is $\ge s+1$. Otherwise, we return the empty set. If we find an $((s+1)-k)$-plex with size $s$+1, the current ridesharing group $u \cup S_I$ is returned, and the travel cost of the group $u \cup S_I$ provides the travel cost lower bound to prune the search space in the subsequent iterations. During the search process of GetOptimalGroupStar, GreedyRSGSearch is invoked to find a valid ridesharing group with a tight travel cost bound. If the returned group is empty, which means that no tight travel cost bound can be found, GetOptimalGroupStar continues using the previously found travel cost lower bound to prune the search space. Note that GreedyRSGSearch needs to be invoked only if the current $G[u \cup S_I]$ is an $((s+1)-k)$-plex. The reason is that $u \cup S_I$ has a better chance of forming a valid ridesharing group when it is an $((s+1)-k)$-plex. Otherwise, the time cost would increase substantially.

Algorithm 15 GreedyRSGSearch(Driver u, Integer k, Integer s, Trip q_u, RiderSet S_I, RiderSet S_U, SocialNetwork G)
 1: Rider v ← the rider in S_U with maximum travel cost;
 2: RiderSet NBs ← ∅;
 3: while |u ∪ S_I| < s+1 do
 4:     NBs ← the neighbors of the users in u ∪ S_I that belong to S_U;
 5:     for each user v′ ∈ NBs do
 6:         if G[v′ ∪ u ∪ S_I] is a ((s+1)−k)-plex then
 7:             if D(tp_v′, tp_u) ≤ D(tp_v, tp_u) then
 8:                 v ← v′;
 9:     if v is not the previous rider then
10:         Add v into S_I;
11:     else
12:         Return ∅
13: Return S_I;
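The greedy idea can be sketched in Python as follows. This is a variant of Algorithm 15 rather than a transcription: in each round it picks the cheapest feasible neighbor directly instead of tracking the previous rider. The graph contains only the three edges among $u$, $v_4$ and $v_6$ implied by Example 5.4; the costs 4 and 4.5 are those used in Example 5.9.

```python
def is_plex(adj, vertices, plex_k):
    """Every vertex has degree at least |vertices| - plex_k in the induced subgraph."""
    vs = set(vertices)
    return all(len(adj.get(v, set()) & vs) >= len(vs) - plex_k for v in vs)

def greedy_rsg_search(adj, cost, u, s_i, s_u, s, k):
    """Greedily grow u ∪ S_I into a size-(s+1) ((s+1)-k)-plex; empty set on failure."""
    s_i, s_u = set(s_i), set(s_u)
    plex_k = (s + 1) - k
    while len(s_i) + 1 < s + 1:
        members = {u} | s_i
        # Neighbors of current members that are still available in S_U.
        nbs = {w for m in members for w in adj.get(m, ()) if w in s_u}
        feasible = [w for w in nbs if is_plex(adj, members | {w}, plex_k)]
        if not feasible:
            return set()
        best = min(feasible, key=lambda w: (cost[w], w))  # cheapest, ties by name
        s_i.add(best)
        s_u.discard(best)
    return {u} | s_i

adj = {"u": {"v4", "v6"}, "v4": {"u", "v6"}, "v6": {"u", "v4"}}
cost = {"v4": 4, "v6": 4.5}
group = greedy_rsg_search(adj, cost, "u", set(), {"v4", "v6"}, s=2, k=2)
group_cost = sum(cost[v] for v in group if v != "u")
```

A size-$(s+1)$ group that is an $((s+1)-k)$-plex has minimum degree $k$, so the returned group is a valid $k$-core group and its cost can serve as the pruning bound.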

Example 5.9. Consider the users' social relations and their travel costs in Figure 5.4. Assume the current $S_I = \emptyset$ and $S_U = \{v_1, v_2, v_3, v_4, v_5, v_6, v_7\}$. Given an SaRG query $q_u = (u, k, s, tp_u)$ with $k=2$ and $s=2$, we call GreedyRSGSearch to search for a travel cost lower bound from $u \cup S_I = \{u\}$ and $G$ in Figure 5.4. We can get an intermediate answer $G^k_s(u) = \{u, v_4, v_6\}$ and calculate the group travel cost lower bound $D(tp_u, \{u, v_4, v_6\}) = D(tp_{v_4}, tp_u) + D(tp_{v_6}, tp_u) = 4 + 4.5 = 8.5$. Therefore, there is no need to attempt the groups whose travel cost is $\ge 8.5$. For example, when $u \cup S_I = \{u, v_3\}$, there is no need to move $v_5, v_6, v_7$ from $S_U$ to $S_I$; when $u \cup S_I = \{u, v_4\}$, there is no need to move $v_5, v_6, v_7$ from $S_U$ to $S_I$.

5.4 Hybrid Index

Index is a commonly used technique to optimize query performance. Recently, several

approaches have been developed for geo-social group queries by considering the users’

Euclidean distances and their social relations, e.g., SR-tree [68] and SaR-tree [39]. How-

ever, these indexes are not directly applicable to our problem. In this section, we first

propose an R-tree based index, namely Social-Info R-tree, which incorporates the social

information, and then integrate the proposed index into RSGExplorer.

5.4.1 SIR-tree

For efficiency, we propose a hybrid indexing structure, the Social-Info R-tree (SIR-tree), to support the simultaneous computation of the spatial distance and the social constraint. It is a tree-based structure which is able to prune the search space using the maximum core bound and the ridesharing group diameter bound. Each internal tree node $e$ stores the following social information: (i) the maximum core number $c_{max}(e)$ of the child nodes rooted at this node; (ii) the set $nb[e|x]$ containing the users whose social distances to all users rooted at $e$ are less than or equal to $x$. Since $nb[e|x] = \bigcup_{e' \in e} nb[e'|x]$ where $e'$ is a child node of $e$, the SIR-tree can be built in a bottom-up fashion. Figure 5.5 shows an example of SIR-tree. In RSGExplorer, an SIR-tree is used to find the next rider with the minimum travel cost. Its advantages are at least two-fold:

• By cmax(e), it can prune the users who cannot appear in the final k-core result set

as early as possible;

• By nb[e|x], it can prune the users whose social distances to the query issuer are

larger than Diaub(Gks(u)) as early as possible.

Based on the SIR-tree proposed above, Theorem 5.10 is given below to assist in prun-

ing the search space during query processing.

Theorem 5.10. Consider an SaRG query $q_u = (u, k, s, tp_u)$ and an internal node $e$ in the SIR-tree. If $c_{max}(e) < k$ or $u \notin nb[e|Dia_{ub}(G^k_s(u))]$, then any user rooted at node $e$ cannot be a member of the final optimal answer.

[Figure 5.5: An example of SIR-tree. Tree nodes R1–R7 (R7 over R5 and R6, which are over R1, R2, R3, R4) index the users v1, v2, v3, v5, v4, v6, v7, u; each node carries a SocialInfo entry (SocialInfo 1–7). SocialInfo 1: cmax(R1)=1, nb[R1|1]={v1,v2,v3}. SocialInfo 2: cmax(R2)=2, nb[R2|1]={v1,v3,v4,v5,v6}.]

Proof. It can be easily derived from Definition 5.1 and Lemma 5.1. We omit it here

because of the space limitation.

Example 5.10. Consider an SaRG query $q_u = (u, k, s, tp_u)$ with $k=2$ and $s=2$, the social network shown in Figure 5.4(a), and the social information of the nodes R1 and R2 in Figure 5.5. According to Theorem 5.8, we can calculate $Dia_{ub}(G^k_s(u)) = 1$. During the search process, R1 can be pruned due to the fact that $c_{max}(R1) < k$; R2 can be pruned because $u \notin nb[R2|Dia_{ub}(G^k_s(u))]$.
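The node-level test of Theorem 5.10 can be sketched with a minimal record holding only the social information of a node. The class and field names are hypothetical and the spatial part of the R-tree is omitted; the two sample nodes use the values given for R1 and R2 in Figure 5.5.

```python
from dataclasses import dataclass, field

@dataclass
class SIRNode:
    """Social information of one SIR-tree node (spatial entries omitted)."""
    name: str
    cmax: int                               # maximum core number below this node
    nb: dict = field(default_factory=dict)  # x -> set of users, i.e. nb[e|x]

def can_prune(node, u, k, dia_ub):
    """Theorem 5.10: skip the node if its cores are too small or u is not in nb[e|dia_ub]."""
    return node.cmax < k or u not in node.nb.get(dia_ub, set())

r1 = SIRNode("R1", cmax=1, nb={1: {"v1", "v2", "v3"}})
r2 = SIRNode("R2", cmax=2, nb={1: {"v1", "v3", "v4", "v5", "v6"}})
pruned = [n.name for n in (r1, r2) if can_prune(n, "u", k=2, dia_ub=1)]
```

Both nodes are skipped for the query of Example 5.10: R1 fails the core-number test and R2 fails the social-distance test.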

5.4.2 Query Processing

To process SaRG queries with an SIR-tree, we need to reconcile the method introduced

in Section 5.3 with a major modification of how to find the next rider with the lowest

travel cost. Since there is no social information recorded in the R-tree, the rider we get is

only spatially close to tpu. However, if a user’s core number is less than the query social

constraint k, or the social distance between a user and the query issuer is larger than the

valid group diameter upper bound, the user should be pruned from the search space as

early as possible. Otherwise, it will increase the computational cost in the subsequent branch and bound search. With the help of the SIR-tree, we can prune in advance the tree nodes whose users cannot appear in the final result, and thus shrink the search space of the branch and bound search for the optimal answer. As a result, the SIR-tree based algorithm achieves better query performance than the algorithms proposed in the previous sections.

Table 5.4: Dataset properties

                              Brightkite(America)  Brightkite(Europe)  Gowalla(America)  Gowalla(Europe)
Total # of users              12,363               4,385               18,983            26,912
Total # of friend relations   115,506              12,271              115,506           157,006
Total # of trips              12,363               4,385               18,983            26,912
Diameter (social diameter)    10                   11                  13                14
Maximum # of cores            34                   25                  36                39


In this section, we experimentally evaluate the performance of three algorithms. The first one is the basic RSGExplorer with the three pruning strategies presented in Section 5.2 (referred to as Baseline); the second one is Baseline with the incremental methods presented in Section 5.3 (referred to as Incremental); the last one is Incremental based on the SIR-tree presented in Section 5.4 (referred to as SIRBased).

5.5.1 Experimental Settings

We make use of four datasets extracted from Brightkite and Gowalla [17]: Brightkite (America), Brightkite (Europe), Gowalla (America), and Gowalla (Europe). The properties of the four datasets are summarized in Table 5.4.

Each query set on these four datasets includes 100 queries. Each query contains a query issuer u randomly selected from the corresponding user space, a group size s varying from 4 to 7, a social constraint k varying from 1 to 4, and a query trip tpu randomly selected from the users' trips. Unless explicitly specified, the default values of k and s in a query are 3 and 5, respectively.

All the algorithms are implemented in Java. The experiments are run on an Intel Xeon X5650 2.67 GHz processor with 8 GB of DDR3 memory. The fanouts of the R-tree and the SIR-tree are 100.


We evaluate the query processing performance of these three algorithms under different parameter settings. Following many other query processing performance evaluation methods, we report the overall query performance in terms of the average elapsed time.

[Figure 5.6: Running time vs. group size. Panels: (a) Brightkite(America), (b) Brightkite(Europe), (c) Gowalla(America), (d) Gowalla(Europe). x-axis: s from 4 to 7 (k=3); y-axis: running time (ms), log scale; series: Baseline, Incremental, SIRBased.]

Effect of s. In the first set of experiments, we evaluate the query performance under

different s values. From Figure 5.6, we can observe that both Incremental and SIRBased

perform better than Baseline. Note that the y-axis is in log-scale. Under different values of

s, SIRBased achieves the best performance. This conforms to our theoretical analysis: the

SIR-tree structure can efficiently prune many irrelevant users who cannot satisfy either the

social diameter or core number constraints as early as possible, leading to a much smaller

search space. Even when s is small, the SIRBased algorithm performs the best because a

small group size leads to a small diameter, which gives the SIR-tree good pruning

ability.

Effect of k. The parameter k is used by the query issuer to flexibly define the social con-

straint. In Figure 5.7, we examine the query performance by varying the social parameter

k. A larger k means that the returned group is more cohesive, that is, each member

should be familiar with more other members. We can observe that a larger k results

in better performance because it implies a smaller social diameter, which in turn allows

more users to be pruned from the search space. Compared to Baseline and Incremental,

SIRBased achieves consistently better query performance for different k values.
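The role of k can be illustrated with a small check of the social constraint on a candidate group. The sketch below assumes the constraint is read as "every member is familiar with at least k other members of the group" and that the social graph is given as a symmetric adjacency map; the function name `satisfies_acquaintance` and the toy graph are hypothetical and serve only as an illustration.

```python
def satisfies_acquaintance(group, friends, k):
    """True if every member of `group` has at least k friends inside `group`.

    `friends` maps each user to the set of users they are familiar with
    (assumed symmetric). The "at least k" reading is an assumption.
    """
    members = set(group)
    return all(
        len(friends.get(u, set()) & (members - {u})) >= k
        for u in members
    )


# Toy social graph: a triangle a-b-c plus a user d who only knows a.
friends = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(satisfies_acquaintance(["a", "b", "c"], friends, 2))       # triangle
print(satisfies_acquaintance(["a", "b", "c", "d"], friends, 2))  # d knows only a
print(satisfies_acquaintance(["a", "b", "c", "d"], friends, 1))
```

In this toy graph the four-user group is acceptable for k = 1 but not for k = 2, which shows how a larger k leaves fewer candidate groups to examine.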

[Figure: four panels, (a) Brightkite (America), (b) Brightkite (Europe), (c) Gowalla (America), (d) Gowalla (Europe), each plotting running time (ms, log scale) against k (s=5).]

Figure 5.7: Running time vs. k value

Effect of the number of riders. In this set of experiments, we show the performance of

the algorithms under various numbers of riders (i.e., the size of the rider space) in

Figure 5.8. We randomly extract several subsets of the rider space to evaluate the

algorithms' performance. As expected, the results show that SIRBased achieves the best

query efficiency in all cases. Compared to Incremental and SIRBased, Baseline is more

sensitive to the number of riders: its query processing time increases rapidly as the

number of riders grows.

[Figure: two panels plotting running time (ms) against the number of riders (k=3, s=5); panel (b) is Gowalla (Europe).]

Figure 5.8: Running time vs. the number of riders

[Figure: two panels plotting running time (ms) for IC, IC+DB, IC+DB+NB, and IC+DB+NB+SIR, the first against k (s=5) and the second, (b) Gowalla (Europe), against s (k=3).]

Figure 5.9: Pruning abilities of different schemes

[Figure: two panels plotting travel cost for Brightkite (America) and Gowalla (Europe), one against k (s=5) and one against s (k=3).]

Figure 5.10: Travel cost vs. k or s

Pruning capabilities of different strategies. In Figure 5.9, we show the query

performance of different pruning strategies. Here, we report the different strategies used

in Incremental and SIRBased, where IC, DB, NB and SIR stand for incremental computation

of core number, social diameter-based bounding, neighbor-based bounding and SIR-tree

based pruning, respectively. In general, all the strategies help to reduce the running time.

In particular, NB and SIR are more effective than the others when the k value is small or

when the s value is large. As explained in Section 5.3.3, neighbor-based bounding usually

helps to find a relatively tight group travel cost lower bound as early as possible, which

is beneficial for pruning the search space in later iterations.
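Pruning by core number rests on a standard graph fact: if every member of a group knows at least k other members, the group induces a subgraph of minimum degree at least k, so no user whose core number is below k can belong to it. The sketch below computes core numbers with the plain peeling procedure (repeatedly removing a vertex of minimum remaining degree). It is not the incremental computation (IC) evaluated above, and it uses a simple quadratic-time minimum scan for readability; the function name `core_numbers` and the toy graph are assumptions made for illustration.

```python
def core_numbers(adj):
    """Core number of every vertex of an undirected graph {v: set(neighbors)}.

    Plain peeling: repeatedly remove a vertex of minimum remaining degree.
    The core number of a vertex is the largest minimum degree seen up to
    the moment it is removed.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    current = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        current = max(current, degree[v])
        core[v] = current
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1  # v no longer counts as a neighbor of w
    return core


# Toy social graph: a triangle a-b-c plus a user d who only knows a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
core = core_numbers(adj)

# Users that can still appear in a group when each member needs k = 2 friends.
k = 2
candidates = {v for v, c in core.items() if c >= k}
print(core, candidates)
```

For k = 2 the user d is discarded before any group is enumerated, which is the kind of early reduction of the search space that core-number pruning provides.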

Travel costs of returned groups. Finally, we report the average group travel costs of

the query results in Figure 5.10. We can see that when k or s increases, the travel cost

also increases. The reason is that for a larger k value, it is more difficult to form a

group with tight social relations while staying close to the query issuer, so the travel

cost increases accordingly. On the other hand, when the group size increases, more users

are included in the returned group, making the average travel cost increase as per the

definition of travel cost. An interesting observation is that when the rider space is

larger, the travel cost of the returned group is smaller. This is because when the rider

space is larger, there are more candidate riders near the query issuer, giving more

opportunities to form a ridesharing group with a smaller travel cost.

5.6 Summary

In this chapter, we have introduced a new and practical type of query, the SaRG query,

which addresses the ridesharing problem with flexible social constraints. An SaRG query

aims to find a group of riders in which each rider's ridesharing route is close to the

query issuer and each rider is familiar with k other members. We proposed several

efficient algorithms to process SaRG queries. An extensive empirical study on real

datasets demonstrates that the proposed algorithms achieve desirable query performance.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

In this thesis, we have identified several real-life group queries motivated by the recent

emergence of geo-social data in location-based social networks. The contributions of this

thesis are summarized as follows:

• First, we proposed a new type of query, the SIG query, which finds a k-size maximum

interest group in location-based social networks, and proved that the SIG query

problem is NP-complete. Two efficient algorithms, IOAIR and DOAIR, based on the

IR-tree have been developed for processing SIG queries. Extensive empirical

evaluation on real datasets validated the efficiency of the proposed query

processing algorithms.

• Second, we formulated another type of query, the GSKCG query, which is useful in

many real-life applications. We formally proved that this problem is NP-complete.

We proposed the algorithm KCGFinder to answer GSKCG queries and improved its

performance with a set of effective pruning techniques from different perspectives.

We designed a novel index structure, the Enhanced Social-aware R-tree (SaR-tree),

to provide extra pruning capabilities on top of the pruning techniques developed for

KCGFinder. We also developed the algorithm SaRBasedKCGFinder, which integrates

KCGFinder and the Enhanced SaR-tree structure. Extensive experiments on real-life

datasets demonstrated that our proposed algorithm performs well under a wide range

of parameter settings.

• Finally, we developed a new type of query, the SaRG query, to accommodate the

real-life need of considering social comfort and trust in ridesharing. We proved that

the SaRG query is NP-hard. We proposed an efficient algorithm named RSGExplorer

and a set of efficient pruning techniques to answer SaRG queries. We also devised

several incremental strategies that reduce repeated computations to further speed up

query processing. We designed a novel index structure, the Social-Info R-tree

(SIR-tree), to further prune the search space and then proposed the SIRBased

algorithm to integrate the RSGExplorer algorithm and the SIR-tree structure.

Experimental results showed that our proposed algorithms achieve desirable performance.

6.2 Future Work

With the research findings obtained above, we plan to further extend our studies so as

to enrich the group query processing techniques. Below we list some open questions for

potential future research:

• First, we plan to extend spatial-aware interest group queries to a top-k SIG query

that finds the best k user groups in a single query. So far, we have not considered

the social relationships among users in these queries. We will incorporate social

relationships as an important criterion in group formation and develop novel query

processing techniques.

• Second, we plan to work on the following extensions for geo-social k-cover group

queries. The social graph used in Chapter 4 is unweighted; we intend to extend our

algorithm to support a weighted social graph. In some cases, an exact solution may

not be needed, so designing an efficient approximation algorithm with a tight

approximation bound is also part of our future work.

• Third, we plan to further investigate social-aware ridesharing group queries. We

will attempt to design a general framework of social-aware ridesharing that

accommodates various mainstream trip matching and social acquaintance options. We

also plan to integrate our proposed techniques into a real ridesharing system to

evaluate the practical effectiveness of our proposed SaRG query solutions.

Curriculum Vitae

Academic qualifications of the thesis author, Mr. Yafei LI:

• Received the degree of Bachelor of Engineering from Henan Normal University,

July 2006.

• Received the degree of Master of Engineering from Suzhou University, July 2009.

June 2015
