
Geoinformatica, DOI 10.1007/s10707-015-0237-7

Exploring cell tower data dumps for supervised learning-based point-of-interest prediction (industrial paper)

Ran Wang1 ·Chi-Yin Chow1 ·Yan Lyu1 ·Victor C. S. Lee1 ·Sarana Nutanong1 ·Yanhua Li2 ·Mingxuan Yuan3

Received: 3 December 2014 / Revised: 20 July 2015 / Accepted: 28 September 2015 © Springer Science+Business Media New York 2015

Abstract Exploring massive mobile data for location-based services has become one of the key challenges in mobile data mining. In this paper, we investigate the problem of finding a correlation between the collective behavior of mobile users and the distribution of points of interest (POIs) in a city. Specifically, we use large-scale cell tower data dumps collected from cell towers and POIs extracted from a popular social network service, Weibo. Our objective is to make use of the data from these two different types of sources to build a model for predicting the POI densities of different regions in the covered area. An application domain that may benefit from our research is a business recommendation application, where a prediction result can be used as a recommendation for opening a new store/branch. The



1 Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

2 Department of Computer Science, Worcester Polytechnic Institute (WPI), Worcester, USA

3 Huawei Noah’s Ark Lab, Shatin, Hong Kong


crux of our contribution is the method of representing the collective behavior of mobile users as a histogram of connection counts over a period of time in each region. This representation ultimately enables us to apply a supervised learning algorithm to our problem in order to train a POI prediction model using the POI data set as the ground truth. We studied 12 state-of-the-art classification and regression algorithms; experimental results demonstrate the feasibility and effectiveness of the proposed method.

Keywords Spatio-temporal data analysis · Classification · Regression · Cell tower data dumps · Point-of-interest prediction

1 Introduction

The ubiquity of mobile devices such as smartphones and tablet computers enables us to collect useful spatial and temporal data on a large scale and also opens up the possibility of extracting useful information from the data [10, 17, 33]. For example, a popular mapping service, Google Maps, makes use of real-time GPS records obtained from the users of Google Location Services to show the current traffic conditions of different road segments on the maps. Another example is a driving direction recommendation system called T-Drive [36], which makes use of trajectories collected from over 33,000 taxis over a period of three months to compute the fastest route for users.

In this paper, we focus on a specific type of mobile-user data known as cell tower data dumps, which contain connection records collected by 9,563 cell towers operated by China Mobile Limited¹ in Guangzhou, China, as illustrated in Fig. 1a. This data set was collected over a period of six days (from 4 September 2013 to 9 September 2013). For the purpose of this investigation, we focus on records produced by phone calls and SMSs. For each record, we use the connection time, and the identifier and location of the cell tower. We extracted 18,290 restaurants in Guangzhou from Weibo², a popular Chinese social network web site, as our point-of-interest (POI) data set, as depicted in Fig. 1b.

The main objective of our research is to make use of the cell phone and POI data sets to help predict the existence of a POI and the number of POIs in the vicinity of a cell tower. An application domain that may benefit from POI prediction is a business recommendation application, where a company is interested in generalizing the pattern of POIs of a particular type (e.g., a coffee shop) in order to identify areas that have great potential for supporting its business but have not been fully utilized yet. Our investigation is driven by the hypothesis that there is a correlation between the collective behavior of mobile users and the existence of a certain type of POI in a certain area.

The main challenge of this work is twofold: (1) Representation. To test our hypothesis, we should find a meaningful representation of collective mobile user behaviors by summarizing a large amount of data extracted from the cell tower data dumps. For example, the cell tower network in a city like Guangzhou generates user connection records on the scale of tens of gigabytes on a daily basis. (2) Application. To provide location-based services (LBS), we should find effective techniques to predict the existence of a certain type of POI and the number of POIs in a certain area. For example, if our framework predicts that a certain area should have restaurants, but that area does not have any restaurant, it has potential for a new restaurant.

¹ http://www.chinamobileltd.com

² http://weibo.com

Fig. 1 Geographical distribution of cell towers and restaurants in the Guangzhou city of China: (a) cell towers, (b) POI restaurants (axes: longitude, latitude)

To overcome the representation challenge, mobile user data can be summarized in two different ways. The first is to group the records by users and obtain an action list or a moving trajectory for each user; the other is to group them by cell towers and obtain the spatio-temporal features of the geographical areas. In this work, we adopt the second method for the following reasons.

– There is no exact location information for users. Each record shows the cell tower that a mobile device is connected to rather than the exact location of the device. In fact, even if the user stays in a fixed location, he or she may connect to different cell towers due to uncontrollable factors such as signal intensity and facility maintenance.

– The number of cell tower connections of different users has a small mean and a large standard deviation. That is to say, one user may make a number of connections in a day while another user may make none. As a result, the numbers of connections made by different users vary tremendously, and the average number is too small for the data to be treated as a trajectory data set.

The result from the cell-tower-based summarization method is a spatio-temporal data set with cell towers spanning the spatial dimensions. In the temporal dimension, each cell tower is associated with a histogram of connection counts where each histogram bin occupies a time period of one hour. In this way, the collective behavior of mobile users of an entire city is compactly represented as connection counts over a period of time from different cell towers.
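The cell-tower-based summarization can be illustrated with a minimal Python sketch. The record layout (a tower identifier plus a timestamp string) and the function name `hourly_histograms` are assumptions made for illustration only; the actual format of the data dumps is not specified here.

```python
from collections import defaultdict
from datetime import datetime

def hourly_histograms(records):
    """Map each tower id to a dict {(date, hour): connection count}."""
    hist = defaultdict(lambda: defaultdict(int))
    for tower_id, timestamp in records:
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        # One histogram bin per tower, per day, per hour of the day.
        hist[tower_id][(t.date().isoformat(), t.hour)] += 1
    return hist

# Hypothetical records: (tower identifier, connection time).
records = [
    ("T1", "2013-09-04 08:15:00"),
    ("T1", "2013-09-04 08:47:10"),
    ("T1", "2013-09-04 13:02:33"),
    ("T2", "2013-09-05 23:59:59"),
]
hist = hourly_histograms(records)
```

Each record contributes only a counter increment, so the raw dumps can be streamed once and discarded.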

For the application challenge, we aim to design a framework that builds a model relating the features of mobile user behaviors to LBS. In particular, we study how to employ state-of-the-art supervised learning algorithms to design (i) a classification model to predict the POI existence (namely naive Bayes [21], radial basis function (RBF) network [13], support vector machine (SVM) [28], decision trees (DT) [19], bagging [6], and AdaBoost [8]) and (ii) a regression model to predict the number of POIs (namely simple linear regression, linear regression [22], isotonic regression [2], pace regression [31], additive regression [24], and regression via discretization [27]).


In general, the contributions of our work can be summarized as follows.

– We formulate a generic representation method of summarizing cell tower data dumps for mobile user behaviors.

– We design a framework with classification and regression algorithms to build a model relating mobile users' behaviors to LBS for business recommendation applications.

– We conduct an extensive evaluation of our framework on real cell tower data dumps and a POI data set. Experimental results show that there is a strong correlation between the collective behavior of mobile users and the restaurant data set, and demonstrate the feasibility and effectiveness of the proposed framework.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to supervised machine learning and highlights related work. In Section 3, we describe how to predict the POI existence and the number of POIs based on the cell tower data dumps, and present the proposed framework. In Section 4, we present implementation details and analyze extensive experimental results to study the feasibility and effectiveness of the proposed framework. Finally, Section 5 concludes this paper.

2 Related work

Most existing work on mobile and spatio-temporal data focuses on recommender systems [1, 34, 38], urban planning [3], discovering functional regions [35], social networking services [40], etc. In particular, mobile phone call data and cellular network data are often used to discover useful information in various scenarios such as traffic anomalies [18], regions of different functions in a city [35], routine behavior patterns of people [16, 39], and important places [15]. Besides, they are also used for urban analysis [20] and urban planning, such as characterizing dense urban areas [29] and capturing city dynamics [3]. In general, the most commonly used techniques include collaborative filtering, density estimation, and image and signal processing. However, none of these works focuses on machine learning, especially supervised learning, which is also a potential tool to mine useful information and make accurate predictions on mobile phone call data or cellular network data for valuable location-based applications.

Supervised learning [5] refers to the problem of inferring a model from a set of labeled training samples, in order to achieve accurate predictions on unseen data. Given a training set $X$ with $N$ labeled samples, i.e., $X = \{(x_i, y_i)\}_{i=1}^{N}$, each sample is associated with a set of conditional attributes $x_i = \{x_{i1}, x_{i2}, \ldots, x_{iL}\}$ and a decision attribute $y_i$. The goal is to learn a function $f : x \to y$, such that given a new unlabeled sample $x = \{x_1, x_2, \ldots, x_L\}$, its desired output value can be predicted by $y = f(x)$. The learning task is classification if the decision attribute is discrete and regression if it is continuous. In order to solve a supervised learning problem, the solution has to perform the steps shown in Fig. 2. Each step may affect the final performance.

Currently, the most widely used classification models include the naive Bayes classifier (NBC) [21], support vector machines (SVMs) [28], decision trees (DTs) [19], and artificial neural networks (ANNs) [13], while the most widely used regression models include linear regression [22], pace regression [31], and isotonic regression [2]. According to the well-known no-free-lunch theorem [32], no algorithm can perform best on all problems. Thus, each model has unique advantages under certain conditions, and each one has its own restrictions that may affect the final performance.


Fig. 2 Structure of a supervised learning process

Supervised learning covers a wide range of application domains such as image processing [30], text classification [25], face recognition [7], and video indexing [37]. Besides, several learning techniques have been applied to mobile and spatio-temporal data in recent literature. In [11], a kernel-based SVM is used as a classifier in the detection of harmful algal blooms in the Gulf of Mexico based on mobile data. In [26], the random forest approach is used to classify the land usage in a city based on mobile phone activities. In [4], a density-based clustering algorithm is proposed for a wide range of spatio-temporal data. To the best of our knowledge, no one has applied supervised learning models to predict the POI existence or the number of POIs in a certain region of a city using cell tower data dumps.

3 Using supervised learning for POI predictions

In this section, we describe how to apply supervised learning models to predict the POI existence or the number of POIs in a region of a city based on the cell tower data dumps and Weibo POI data.

3.1 Pre-clustering of cell towers

As demonstrated in Fig. 1, the geographical distributions of the cell towers and POIs in Guangzhou are roughly consistent with each other. That is to say, if a given region has a larger number of cell towers, it is also likely to cover a larger number of POIs, and vice versa. Besides, the density of cell towers is also related to the user visiting rate. For example, the downtown is usually the most popular and busiest area in a city, so it records the highest user visiting rate and thus needs more cell towers. In comparison, very few people visit the suburbs in a day, so the density of cell towers is low in such areas. Given these basic observations, it is possible to predict the POI existence or the number of POIs in a region based on the user visiting rate, which is reflected by the number of connections established by cell towers in that region.

Given $N$ cell towers $T = \{T_1, T_2, \ldots, T_N\}$ with geographical location information, we denote $T_i = (t_{i1}, t_{i2})$, where $t_{i1}$ and $t_{i2}$ represent the longitude and latitude of $T_i$, $i = 1, 2, \ldots, N$, respectively. The intuitive scheme is to divide the city into $N$ regions $R = \{R_1, R_2, \ldots, R_N\}$, such that each region contains one cell tower. These regions can be defined by the Voronoi diagram [9], which treats each cell tower as a seed. Any point in a region is at least as close to the seed of that region as to the seed of any other region, i.e.,

$$\forall x \in R_i, \quad d(x, T_i) \le d(x, T_j),$$

where $i \in \{1, 2, \ldots, N\}$ and $j = 1, \ldots, i-1, i+1, \ldots, N$. An example of the Voronoi diagram with 10 seeds located in a unit square is given in Fig. 3.

Suppose there is a set of $M$ POIs $P = \{P_1, P_2, \ldots, P_M\}$ with geographical location information, where $P_i = (p_{i1}, p_{i2})$ and $p_{i1}$ and $p_{i2}$ represent the longitude and latitude of $P_i$, $i = 1, 2, \ldots, M$, respectively. For a given POI $P_i$, the region that covers it can be found by a nearest neighbor (NN) search among $T$. Finally, the number of POIs in each region is computed as the target that we aim to predict. However, in a real application, we have to consider the following two issues:

– The signal intensity of a cell tower is not stable, which leads to an unreliable relation between the POI density and the number of connections. For example, given two neighboring regions with similar POI densities, their user visiting rates are also supposed to be similar. However, the signal intensity of one cell tower may be much stronger than that of the other. Thus, when a user is at an intermediate location between them, the stronger tower will always make the connection for the user.

– Due to the unbalanced distribution of cell towers, the separated regions may be too small in the downtown and too large in the suburbs. As a result, the numbers of POIs covered by them may be balanced out and show no obvious difference.
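The nearest neighbor assignment of POIs to Voronoi regions described before the two issues can be sketched as follows. This is a minimal brute-force illustration, assuming that $d$ is the Euclidean distance on (longitude, latitude) pairs; the coordinates and function names are hypothetical, and a spatial index would be preferable for thousands of towers.

```python
import math

def nearest_tower(poi, towers):
    """Return the index of the tower closest to the POI (Euclidean distance)."""
    return min(range(len(towers)), key=lambda i: math.dist(poi, towers[i]))

def count_pois_per_region(pois, towers):
    """Count POIs per Voronoi region: a POI lies in the region of its nearest tower."""
    counts = [0] * len(towers)
    for p in pois:
        counts[nearest_tower(p, towers)] += 1
    return counts

# Hypothetical (longitude, latitude) coordinates.
towers = [(113.20, 23.10), (113.30, 23.10), (113.25, 23.20)]
pois = [(113.21, 23.11), (113.22, 23.09), (113.29, 23.12), (113.25, 23.19)]
poi_counts = count_pois_per_region(pois, towers)
```

The explicit Voronoi polygons are never built: membership in a region follows directly from the nearest-seed property.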

In order to overcome the above-mentioned problems, we conduct a pre-clustering process on the cell towers, such that cell towers with similar geographical locations are grouped into one cluster. Accordingly, their regions defined by the Voronoi diagram are merged, and the numbers of their covered POIs are summed up as the target that we aim to predict. We adopt k-means clustering [12], the most widely used clustering technique, which partitions $N$ observations (i.e., $T_1, T_2, \ldots, T_N$) into $k$ sets (i.e., $S = \{S_1, S_2, \ldots, S_k\}$) so as to minimize the within-cluster sum of squares:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{T_j \in S_i} \|T_j - \mu_i\|^2, \tag{1}$$

where

$$\mu_i = \frac{1}{k_i} \sum_{T_j \in S_i} T_j, \tag{2}$$

and $k_i$ is the number of cell towers in the $i$-th cluster.

Fig. 3 Voronoi diagram with 10 seeds in a unit square

In this work, $k$ is treated as a user-defined input that sets the evaluation unit. For instance, the user could choose a smaller $k$ to evaluate larger regions and a larger $k$ to evaluate smaller regions. In other words, there is no best or worst value of $k$; it is decided by the user's needs. Since it is impractical to try all possible values of $k$, we test several representative values, i.e., $\{250, 500, 1000, 2500, 5000\}$. Due to limited space, we only plot the clustering result for $k = 250$, as shown in Fig. 4.
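As an illustration of the pre-clustering step, the sketch below runs a plain Lloyd-style k-means on tower coordinates and then sums the per-tower POI counts of each cluster. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the random initialization, the iteration cap, the toy coordinates, and all function names are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each tower joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def merge_poi_counts(labels, poi_counts, k):
    """Sum the per-tower POI counts of all towers in the same cluster."""
    merged = [0] * k
    for label, n in zip(labels, poi_counts):
        merged[label] += n
    return merged

# Two well-separated hypothetical groups of towers with per-tower POI counts.
towers = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
poi_counts = [3, 0, 1, 7, 2, 0]
labels, centroids = kmeans(towers, k=2)
merged = merge_poi_counts(labels, poi_counts, k=2)
```

Merging the counts by cluster label is equivalent to merging the Voronoi regions of the towers in each cluster.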

3.2 Density distribution of the number of POIs

We use kernel density estimation to obtain the distribution characteristics of the number of POIs, in order to investigate whether the data is suitable for supervised learning models. Kernel density estimation is a generalized form of the histogram that gives a continuous distribution for a set of observations. Given $k$ cell tower clusters grouped by the k-means algorithm (i.e., $\{S_1, S_2, \ldots, S_k\}$), the region covered by $S_i$ is denoted by $R_i^*$, and the number of POIs located in $R_i^*$ is denoted by $n_i$. The kernel density estimate of the number of POIs is then

$$f_h(n) = \frac{1}{k} \sum_{i=1}^{k} K_h(n - n_i) = \frac{1}{kh} \sum_{i=1}^{k} K\left(\frac{n - n_i}{h}\right), \tag{3}$$

where $n$ is the argument of the density estimate, i.e., the number of POIs in a region, $K$ is the kernel function, and $h$ is the bandwidth. By applying the Gaussian kernel, i.e.,

$$K(n) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{n^2}{2}\right), \tag{4}$$

the estimator (3) becomes

$$f_h(n) = \frac{1}{kh} \sum_{i=1}^{k} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(n - n_i)^2}{2h^2}\right). \tag{5}$$

Fig. 4 Pre-clustering result of cell towers when k = 250 (axes: longitude, latitude)

According to [23], we select the optimal bandwidth as $h = (4\sigma^2 / 3k)^{1/5}$, where $\sigma$ is the standard deviation of $\{n_1, \ldots, n_k\}$. Finally, the density distribution of the number of POIs is derived as shown in Fig. 5.
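The Gaussian kernel density estimator of Eq. 5 can be sketched in a few lines of Python. The sample values below are hypothetical, and the bandwidth is passed in as a plain parameter `h` rather than derived from the data, which is a simplification made for this illustration.

```python
import math

def gaussian_kde(samples, h):
    """Return f_h(n) as in Eq. 5 for the observed POI counts and bandwidth h."""
    k = len(samples)
    norm = 1.0 / (k * h * math.sqrt(2.0 * math.pi))

    def f(n):
        return norm * sum(
            math.exp(-((n - ni) ** 2) / (2.0 * h * h)) for ni in samples)

    return f

# Hypothetical numbers of POIs per region.
poi_counts = [0, 0, 1, 2, 2, 3, 5, 8]
f = gaussian_kde(poi_counts, h=1.0)
density_at_two = f(2)
```

Because each kernel integrates to one and the sum is divided by $k$, the estimate itself integrates to one, which makes curves obtained for different numbers of clusters comparable.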

Fig. 5 Density distribution of the number of POIs (panels a to f; x-axis: number of POIs, y-axis: kernel density estimate)

Figure 5a gives the distribution of the original data without the pre-clustering process. It is easy to observe that many cell towers do not cover any POI, and that the density is approximately zero when the number of POIs is larger than 10. That is to say, the numbers of POIs covered by the cell towers show no obvious difference. In this case, it is hard to establish a supervised learning model for either classification or regression. However, the distribution becomes more reasonable with a pre-clustering process, as shown in Fig. 5b to f. As the number of clusters decreases, we have the following observations:

– The difference among clusters becomes more obvious, with a larger range of numbers of POIs per cluster.

– The distribution becomes smoother, with a smaller range of density.

– The percentage of clusters that do not cover any POI becomes much smaller.

Table 1 reports the statistics of the cell tower clusters. As the number of clusters decreases, the ratio of empty clusters becomes smaller and the average number of POIs per cluster becomes larger. Besides, we define the inter-cluster standard deviation (SD) as

$$\sigma_1 = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (n_i - \mu)^2}, \tag{6}$$

and the intra-cluster SD as

$$\sigma_2 = \frac{1}{k} \sum_{i=1}^{k} \sqrt{\frac{1}{k_i} \sum_{j \in R_i^*} (n_{ij} - \mu_i)^2}, \tag{7}$$

where $\mu = \frac{1}{k} \sum_{i=1}^{k} n_i$, $\mu_i = \frac{1}{k_i} \sum_{j \in R_i^*} n_{ij}$, and $k_i$ is the number of cell towers in $R_i^*$. Both (6) and (7) increase as the number of clusters decreases. However, the increase of (6) is much larger than that of (7), which demonstrates that the pre-clustering process enlarges the difference among clusters while retaining the similarity of cell towers in the same cluster.
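The two statistics in Eqs. 6 and 7 can be computed as in the following sketch. It assumes that $n_{ij}$ is the number of POIs covered by the $j$-th cell tower of cluster $i$, which is an interpretation rather than a definition given in the text; the toy data and function names are hypothetical.

```python
import math

def inter_cluster_sd(cluster_totals):
    """Eq. 6: SD of the per-cluster POI totals n_i around their mean."""
    k = len(cluster_totals)
    mu = sum(cluster_totals) / k
    return math.sqrt(sum((n - mu) ** 2 for n in cluster_totals) / k)

def intra_cluster_sd(clusters):
    """Eq. 7: average over clusters of the SD of per-tower POI counts."""
    total = 0.0
    for tower_counts in clusters:
        mu_i = sum(tower_counts) / len(tower_counts)
        total += math.sqrt(
            sum((n - mu_i) ** 2 for n in tower_counts) / len(tower_counts))
    return total / len(clusters)

# Hypothetical clusters: each inner list holds per-tower POI counts (assumed n_ij).
clusters = [[0, 2], [4, 4, 4]]
sigma1 = inter_cluster_sd([sum(c) for c in clusters])
sigma2 = intra_cluster_sd(clusters)
```

Under this reading, a cluster containing a single tower contributes zero to Eq. 7, which agrees with the intra-cluster SD of 0 reported in Table 1 for the unclustered case.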

3.3 Refine time resolution

We aim to use spatio-temporal data to perform POI predictions. In Sections 3.1 and 3.2, we have introduced how to make use of the spatial data. In this section, we further discuss how to make use of the temporal data.

Table 1 Cell tower cluster information

| No. clusters | No. clusters with zero POI | No. clusters with non-zero POI | Minimum No. POIs per cluster | Maximum No. POIs per cluster | Average No. POIs per cluster | Inter-cluster SD | Intra-cluster SD |
|---|---|---|---|---|---|---|---|
| 9,563 | 5,292 | 4,271 | 0 | 192 | 2 | 4.26 | 0 |
| 5,000 | 2,187 | 2,813 | 0 | 192 | 4 | 7.12 | 0.78 |
| 2,500 | 880 | 1,620 | 0 | 192 | 8 | 13.02 | 1.28 |
| 1,000 | 202 | 798 | 0 | 278 | 19 | 27.5 | 1.84 |
| 500 | 62 | 438 | 0 | 374 | 38 | 50.66 | 2.09 |
| 250 | 17 | 233 | 0 | 591 | 76 | 95.56 | 2.36 |

Basically, the time in a day can be divided into 24 slots of one hour each. Each slot defines a feature for the cell tower $T$. Each connection record indicates that a user has visited the region covered by $T$; thus the connection frequency distribution of a region could reflect the characteristics of its user visiting rate. Given a region, the distributions on different days would be expected to be similar. However, this does not hold in reality. Figure 6 shows the connection frequency distributions of a region on different days. We make the following observations:

– There is no unified pattern for the distributions on different days.

– The distribution on a weekday is more uniform than that on a weekend day, with Friday in between.

– There may be some missing values, which result in zero connections in a time slot.

The reason for the first observation is obvious. Since user activity is dynamic, it is hard to find a unified pattern for different days. The second observation is also easy to explain, since people have different living habits on weekdays and weekends. On weekdays, they have a regular schedule for work and rest, but on weekends, even the same person can take part in different activities. Besides, Friday is a transition between weekdays and weekends, so it exhibits mixed characteristics. The third observation is possibly caused by facility problems, such as poor signal intensity or periodic maintenance of the cell towers.

Furthermore, it is observed from Fig. 6 that the connection distribution on a weekday can be roughly divided into several intervals. Take Fig. 6a as an instance:

– The frequency is the lowest from 0:00 to 7:00, since this period is the sleeping time for most people.

– The frequency gradually increases from 7:00 to 9:00, and reaches a small peak between 9:00 and 12:00.

– From 12:00 to 14:00, the frequency decreases slightly, since this period is the siesta time for some people.

– From 14:00 to 18:00, the frequency stays at a high level and reaches another peak.

– Finally, the frequency decreases until midnight.

Fig. 6 Connection frequency of the cell towers in a region in different days

It is noteworthy that these rules are not strictly obeyed on all weekdays. However, they reflect some basic features of the data, which are consistent with people's daily life. Thus, we refine the time resolution of a weekday into seven new time slots as listed in Table 2, and compute the new features as the average number of connections during the slots. As for the weekend, the 24 time slots are retained. Finally, the feature vector of a given region $R_i$ is denoted as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iL})$, where $L = 7 \times 4 + 24 \times 2 = 76$ (i.e., four weekdays and two weekend days), with each dimension reflecting the user visiting rate in a specific time slot.
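The construction of the 76-dimensional feature vector can be sketched as follows. The slot boundaries follow Table 2; reading "average number of connections during the slots" as the mean of the hourly counts in each slot, and concatenating the days in chronological order, are assumptions made for this illustration, as are the function names.

```python
# Weekday slots from Table 2 as half-open hour ranges [start, end).
WEEKDAY_SLOTS = [(0, 7), (7, 9), (9, 12), (12, 14), (14, 18), (18, 21), (21, 24)]

def weekday_features(hourly):
    """Average the 24 hourly counts of a weekday within each of the 7 slots."""
    return [sum(hourly[a:b]) / (b - a) for a, b in WEEKDAY_SLOTS]

def region_features(days):
    """days: list of (is_weekend, hourly_counts) pairs, 24 counts per day."""
    features = []
    for is_weekend, hourly in days:
        # Weekend days keep all 24 hourly slots; weekdays are reduced to 7.
        features.extend(hourly if is_weekend else weekday_features(hourly))
    return features

# Hypothetical region with a flat profile: four weekdays, two weekend days.
flat_day = [10] * 24
days = [(False, flat_day)] * 4 + [(True, flat_day)] * 2
x = region_features(days)
```

With four weekdays and two weekend days the vector has $7 \times 4 + 24 \times 2 = 76$ entries, matching $L$ above.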

3.4 The proposed framework

Given a certain area in a city, a company wants to know whether any POI should exist there, or how many POIs should be there, for business planning. Thus, it is useful to address these problems from the viewpoints of both classification and regression.

By considering the issues discussed above, the POI prediction framework is sketched in Algorithm 1. The algorithm consists of three main steps.

Step 1: The Voronoi diagram step. In this step, the city is divided into a number of adjacent regions based on the Voronoi diagram by taking the cell towers as the seeds. Then, the number of POIs located in each region is found. (Lines 2 to 3)

Step 2: The clustering step. This step performs the k-means clustering algorithm on the cell towers, and the cell towers with similar geographical locations are grouped into the same cluster. Each cell tower cluster defines a region of the city, with a feature vector extracted from the cell tower data dumps (i.e., from the cell tower identifier and time of each connection). (Lines 5 to 15)

Step 3: The supervised learning step. Finally, the POI existence (treated as positive if there exists any POI and negative if no POI exists) or the number of POIs is taken as the output target of the region, and a learner $f$ is built from these labeled regions as a classification or regression model that will be used to predict the POI existence or the number of POIs in a region, respectively. (Lines 17 to 21)

Table 2 Time slots in a weekday

| Time slot | Duration | Activity |
|---|---|---|
| 0:00 to 7:00 | 7 hours | Sleeping hours |
| 7:00 to 9:00 | 2 hours | Morning rush hours |
| 9:00 to 12:00 | 3 hours | Morning working hours |
| 12:00 to 14:00 | 2 hours | Lunch hours |
| 14:00 to 18:00 | 4 hours | Afternoon working hours |
| 18:00 to 21:00 | 3 hours | Evening rush & dinner hours |
| 21:00 to 24:00 | 3 hours | Home hours |

Once the learner $f$ is trained on a set of given regions, it can be used in two ways. (1) Prediction of unknown regions: when there comes a new region without any POI information, we can extract its feature vector from the user connection records of the cell tower data dumps and predict the POI existence or the number of POIs by $f$; the company can then make business plans based on the classification and regression results. (2) Evaluation of existing regions: given a region, if the regression number of POIs is larger than or equal to the real one, there may be an adequate number of POIs; however, if the regression number is smaller than the actual one, it indicates a possibility to set up more POIs in the future.

4 Implementation and analysis

In this section, we first describe how to implement the classification and regression learning modes for our proposed framework, and then analyze extensive experimental results to study the feasibility and effectiveness of the proposed framework.

4.1 Implementation

We present here the implementation details of the classification and regression learning modes in Algorithm 1. Note that model selection is not the main concern of this work, so we simply adopt several widely used parameter settings for the learning algorithms.

Classification learning mode. The purpose of this experiment is to correctly identify whether there exists any POI in a given region of a city. We study six state-of-the-art algorithms for the classification learning model: naive Bayes classifier (NBC), radial basis function (RBF) network, SVM, decision tree, bagging, and AdaBoost. The first four algorithms are single-classifier methods. Among them, NBC is a probabilistic classifier based on Bayes' theorem [21]. We apply the Gaussian function to estimate the class probabilities, where the parameters $\mu$ and $\sigma^2$ are computed as the mean and variance of the training samples in the class, respectively. The RBF network [13] is an artificial neural network that adopts the radial basis function as the activation function. SVM [28] is a binary classification model based on statistical learning theory, which aims to generate an optimal separating hyperplane that maximizes the margin between the two classes. We apply the soft-margin SVM with the Gaussian RBF kernel, where the kernel parameter $\gamma$ and the penalty parameter $C$ for the slack variables are both set to 1. The decision tree (DT) [19] is a rule-based classifier, which builds a knowledge-based expert system by inductive inference from training samples. The induction of a DT is a recursive process that follows a top-down approach by repeatedly splitting the training set. We apply the standard C4.5 algorithm, which adopts the information gain ratio as the criterion to split nodes. The last two algorithms are aggregated methods based on ensemble learning. For bagging [6], the bag size is 100, the number of iterations is 10, and REPTree is employed as the base classifier. For AdaBoost [8], the weight threshold is set to 100, the number of iterations is 10, and a decision stump is used as the base classifier.
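To make the Gaussian estimation step of NBC concrete, the following is a minimal from-scratch sketch, assuming a population variance with a small floor to avoid division by zero. It is not the WEKA implementation used in the experiments; the toy data and function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, max(var, 1e-9)))  # assumed floor, avoids zero variance
        model[label] = (len(rows) / len(X), stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior under feature independence."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mu, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical two-feature regions: label 1 if the region has a POI, else 0.
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [8.0, 9.0], [8.5, 9.5], [7.8, 8.7]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, [8.2, 9.1])
```

Working with log probabilities avoids numerical underflow when the feature vector is long, as with the 76 features used here.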


Regression learning mode. The purpose of this experiment is to directly predict the number of POIs located in a given region of a city. We also study six state-of-the-art algorithms for the regression learning model: isotonic regression, linear regression, pace regression, simple linear regression, additive regression, and regression via discretization. Linear regression and simple linear regression are two statistical regression models [22]; the difference between them is that linear regression has one or more explanatory variables, while simple linear regression has only one. Isotonic regression [2] is a non-linear model based on numerical analysis, which can be formulated as a quadratic programming (QP) problem. Pace regression [31] is an aggregated method consisting of a group of estimators, which are either overall optimal or conditionally optimal. Additive regression [24] is a nonparametric model, which adopts a smooth function to fit the shape of the training data. Finally, regression via discretization [27] transforms a regression problem into a classification one, and obtains the prediction result by using a classification learning system.
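As a small worked illustration of the simplest of these models, the sketch below fits simple linear regression (one explanatory variable) by ordinary least squares. Which single feature would serve as the explanatory variable is not stated in the text, so the input values, like the function name, are hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept with one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: average connections in one time slot vs. number of POIs.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear(xs, ys)
predicted_pois = slope * 25.0 + intercept
```

Linear regression with several explanatory variables generalizes the same least-squares criterion to the full feature vector.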

4.2 Experiment settings

We first conduct the experiments on the original data without a pre-clustering process, then set the number of clusters to each value in {250, 500, 1000, 2500, 5000} and observe the trend of the results. In order to avoid random effects, we conduct 10-fold cross validation 10 times and report the average values. The experiments are conducted with the standard machine learning toolbox WEKA [14] on a computer with an Intel Core 2 Duo CPU and 4 GB of memory, running 32-bit Windows 7.
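The evaluation protocol (10 trials of 10-fold cross validation) can be sketched as an index generator. This is a plain, unstratified illustration written for this text; it is not WEKA's procedure, and the seed and function name are assumptions.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train, test) index lists for n_repeats rounds of n_folds-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repetition
        for fold in range(n_folds):
            test = indices[fold::n_folds]
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test

# For example, k = 250 clusters give 250 labeled regions.
splits = list(repeated_kfold(250))
```

Each repetition uses every region exactly once for testing, and the reported metric is the average over all 100 train/test splits.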

4.3 Classification results

In a binary classification problem, the result can be summarized in a confusion matrix as shown in Table 3.

We adopt three metrics to evaluate the performance, i.e., testing accuracy, precision, and recall, which are respectively defined as:

$$\text{Testing Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{8}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{9}$$

and

$$\text{Recall} = \frac{TP}{TP + FN}. \tag{10}$$

Basically, testing accuracy gives the overall rate of correctly classified testing samples, precision gives the correct rate within the set that has been classified as positive, and recall gives the correct rate within the set of real positives.
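Eqs. 8 to 10 translate directly into code. The sketch below uses hypothetical confusion-matrix counts and returns 0.0 for a metric whose denominator is zero, which is a convention chosen here rather than something specified in the text.

```python
def classification_metrics(tp, fp, fn, tn):
    """Testing accuracy, precision, and recall as defined in Eqs. 8 to 10."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for 0/0
    return accuracy, precision, recall

# Hypothetical counts for 100 test regions.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```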

The average values of the 10×10 results (10 trials of 10-fold cross validation) for the three evaluation metrics with different numbers of clusters are shown in Fig. 7. It can be seen that NBC obtains the highest precision, but its testing accuracy and recall are lower than those of the others. This is probably because NBC is more sensitive to the class imbalance problem, i.e., it tends to correctly classify the positive samples but wrongly classify the negative samples. Besides, bagging and boosting show the most stable performance among the six algorithms. The reason is straightforward: the ensemble mechanism makes the final decision by combining the classification results of multiple classifiers, which usually outperforms a single classifier. Finally, the three statistical algorithms, i.e., RBF network, SVM, and DT, show similar performance. They obtain lower accuracy and precision than bagging and boosting, but show the highest recall, which demonstrates that they are less sensitive to the class imbalance problem.

It is observed from Fig. 7 that the changing trend of recall is not clear. From Eqs. 9 and 10, we can see that the difference between precision and recall lies in the denominator, i.e., precision is affected by the negative samples that have been wrongly classified as positive, and recall is affected by the positive samples that have been wrongly classified as negative. As shown in Table 1, when a pre-clustering process is performed, the data set becomes imbalanced. Especially when k is small, the number of positive samples is much larger than the number of negative samples. In this case, the learning process becomes biased towards predicting positive for almost all the samples by default and frequently misclassifies the negative samples. In other words, False Positive results (a negative sample classified as positive) occur easily, whereas False Negative results (a positive sample classified as negative) occur only irregularly. This explains why the changing trend of recall is not clear.

Table 3 Confusion matrix of classification result

True Positive (TP)    False Positive (FP)
False Negative (FN)   True Negative (TN)

Fig. 7 Comparative results of different classification models (panels: accuracy, precision, and recall versus the number of clusters, 9563, 5000, 2500, 1000, 500, and 250; legend: Naive Bayes, RBF Network, SVM, Decision Tree, Bagging, Adaboost)
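The effect of class imbalance on precision and recall can be illustrated with an extreme case: a model that predicts "positive" for every region. The sketch below uses hypothetical class counts (they are not the values of Table 1) and a helper name chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical imbalanced data set: 225 positive and 25 negative regions.
y_true = [1] * 225 + [0] * 25
# A fully biased model that predicts "positive" for every region.
y_pred = [1] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # 225 / 250 = 0.9
recall = tp / (tp + fn)     # 225 / 225 = 1.0
print(tp, fp, fn, tn, precision, recall)
```

In this extreme case every error is a False Positive, so recall saturates at 1.0 while precision simply equals the share of positive samples. This is consistent with the observation that recall carries little trend information when the data set is strongly imbalanced.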

Furthermore, it can be seen that both precision and recall are mainly decided by the True Positive rate, which represents the number of positive samples that have been correctly classified. Obviously, the smaller k is, the more imbalanced the data set will be. In this case, all the adopted learning algorithms become biased towards the positive class, which leads to a high True Positive rate. However, when k = 5,000, the data set is relatively balanced. In this case, the learning algorithms are no longer biased towards any class; some of them may achieve a higher True Positive rate and others a lower one. As a result, there is an irregularity among different methods when k = 5,000.

Fig. 8 Efficiency of different classification models (panels: (a) training seconds and (b) testing seconds versus the number of clusters, 9563, 5000, 2500, 1000, 500, and 250)

Fig. 9 Comparative results of different regression models (panels: root mean squared error, relative absolute error (%), and root relative squared error (%) versus the number of clusters, 9563, 5000, 2500, 1000, 500, and 250; legend: Isotonic Regression, Linear Regression, Pace Regression, Simple Linear Regression, Additive Regression, Regression Via Discretization)

We put more focus on the testing accuracy and precision, both of which show a clear increasing trend as the number of clusters decreases. In fact, without a pre-clustering process (shown as 9,563 clusters in Fig. 7), both the testing accuracy and precision are between 0.5 and 0.6, which is only slightly better than a simple random guess. When the number of clusters becomes smaller, both of them improve markedly. That is to say, the prediction is more effective when the city is divided into larger regions. This observation is easy to explain: in a small region, the relation between the number of connections of cell towers and the user visiting rate may be unreliable due to uncontrollable factors such as poor signal intensity and periodic maintenance. However, when the cell towers are grouped geographically, the negative effects of such problems can be alleviated. For instance, if a cell tower has some missing values, the prediction on it may be inaccurate. However, if it is grouped into a cluster with other cell towers, the effect of such missing values can be alleviated to a certain extent. The larger the cluster is, the better the prediction result will be. Besides, the adopted learning algorithms have different advantages regarding different evaluation metrics. For instance, the two ensemble-based learning algorithms give much higher testing accuracy than NBC, but fail to outperform it regarding precision. Thus, it is important to choose an appropriate method based on the requirement and purpose of the problem.
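The way grouping can dampen the effect of missing values may be sketched as follows. The aggregation rule used here (a per-slot mean over the towers that reported a value) is an assumption made for illustration only, and the tower histograms are invented; the sketch does not claim to reproduce the feature construction used in the experiments.

```python
def cluster_histogram(tower_histograms):
    """Average per-slot connection counts over the towers of one cluster.

    Missing values are encoded as None and skipped (an assumed convention).
    """
    n_slots = len(tower_histograms[0])
    merged = []
    for slot in range(n_slots):
        observed = [h[slot] for h in tower_histograms if h[slot] is not None]
        merged.append(sum(observed) / len(observed) if observed else None)
    return merged


# Three hypothetical towers in one cluster, four time slots each.
towers = [
    [12, None, 30, 8],
    [10, 22, None, 6],
    [14, 20, 28, None],
]
print(cluster_histogram(towers))  # [12.0, 21.0, 29.0, 7.0]
```

Each tower on its own has a gap, but the cluster-level histogram is complete, since every slot is covered by at least one neighbouring tower.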

The training time and testing time of the classification algorithms are given in Fig. 8a and b, respectively. We can see that SVM is the most time-consuming algorithm for this problem, while the execution times of all the other algorithms are in an acceptable range.

4.4 Regression results

In a regression problem, the most commonly used evaluation metric is the root mean squared error (RMSE), which is defined as:

$$\text{RMSE} = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(y_i - \hat{y}_i\right)^2}, \tag{11}$$

where $y_i$ is the target number of POIs located in the region covered by the i-th cell tower cluster, $\hat{y}_i$ is the number of POIs predicted by the regression model, and k is the number of clusters.
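For concreteness, the RMSE of Eq. 11 can be computed as in the following sketch. The function name and the POI counts are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over k clusters (Eq. 11)."""
    k = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / k)


# Hypothetical POI counts for four clusters and their predictions.
y_true = [3, 0, 5, 2]
y_pred = [2, 1, 5, 4]
print(rmse(y_true, y_pred))  # sqrt(6 / 4), about 1.2247
```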


Fig. 10 Efficiency of different regression models (panels: (a) training seconds and (b) testing seconds versus the number of clusters, 9563, 5000, 2500, 1000, 500, and 250)

However, as depicted in Table 1, the ranges of the prediction targets differ considerably with different numbers of clusters, which makes the RMSE values incomparable across settings. We therefore adopt two further metrics, i.e., relative absolute error (RAE) and root relative squared error (RRSE), which are respectively defined as:

$$\text{RAE} = \frac{\sum_{i=1}^{k}\left|y_i - \hat{y}_i\right|}{\sum_{i=1}^{k}\left|y_i - \bar{y}\right|} \times 100\,\%, \tag{12}$$

and

$$\text{RRSE} = \sqrt{\frac{\sum_{i=1}^{k}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{k}\left(y_i - \bar{y}\right)^2}} \times 100\,\%, \tag{13}$$

where

$$\bar{y} = \frac{1}{k}\sum_{i=1}^{k} y_i. \tag{14}$$

Basically, the RAE takes the total absolute error and normalizes it by dividing it by the total absolute error of the simple predictor, and the RRSE takes the total squared error and normalizes it by dividing it by the total squared error of the simple predictor, where the simple predictor always outputs the mean of the targets. For both the RAE and RRSE, smaller values are better, and 100 % represents the baseline of just predicting the mean. Thus, values below 100 % are considered effective for predicting the number of POIs in a region.
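The two relative metrics and their 100 % baseline can be checked with a short sketch. The function names and the POI counts are hypothetical; the code assumes the targets are not all identical, so that the denominators are non-zero.

```python
import math


def rae(y_true, y_pred):
    """Relative absolute error in percent (Eq. 12)."""
    y_bar = sum(y_true) / len(y_true)
    num = sum(abs(y - p) for y, p in zip(y_true, y_pred))
    den = sum(abs(y - y_bar) for y in y_true)
    return num / den * 100.0


def rrse(y_true, y_pred):
    """Root relative squared error in percent (Eq. 13)."""
    y_bar = sum(y_true) / len(y_true)
    num = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    den = sum((y - y_bar) ** 2 for y in y_true)
    return math.sqrt(num / den) * 100.0


# Hypothetical POI counts for four clusters and their predictions.
y_true = [3, 0, 5, 2]
y_pred = [2, 1, 5, 4]
print(rae(y_true, y_pred), rrse(y_true, y_pred))  # about 66.67 and 67.94

# Predicting the mean for every cluster gives the 100 % baseline.
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)
print(rae(y_true, mean_pred), rrse(y_true, mean_pred))
```

Because both metrics divide by the error of the mean predictor on the same targets, they remain comparable when the range of POI counts changes with the number of clusters.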

The average values of the 10×10 results (10-fold cross validation 10 times) regarding the RMSE, RAE, and RRSE with different numbers of clusters are depicted in Fig. 9. It can be seen that most of the selected algorithms, i.e., isotonic regression, linear regression, pace regression, simple linear regression, and additive regression, obtain very similar performances with regard to the three evaluation metrics. This is because all of these algorithms try to construct the regression curve by directly utilizing the input samples. In contrast, regression via discretization transforms the original samples into intervals, which may lose some important information and thus perform worse than the others.


Table 4 The RAE, RRSE, training time, and testing time for regression results when k = 250

Method           RAE (%)        RRSE (%)       Train (s)   Test (s)
Isotonic         68.05±12.90    79.72±14.56    0.0175      0.0000
Linear           67.96±12.16    78.63±16.91    0.0223      0.0002
Pace             69.17±15.18    86.12±24.66    0.0089      0.0002
Simple linear    68.26±12.54    78.48±13.43    0.0005      0.0000
Additive         73.40±15.43    86.57±21.24    0.0238      0.0000
Discretization   81.39±20.72    99.72±23.06    0.0341      0.0125

As the number of clusters decreases, the RMSE increases quickly, while the RAE and RRSE decrease gradually. We put more focus on the RAE and RRSE. Without a pre-clustering process (shown as 9,563 clusters in Fig. 9), both the RAE and RRSE are around 100 %, i.e., the models perform no better than just predicting the mean. However, when the city is divided into fewer regions with a smaller number of clusters, the error is reduced in most cases. In fact, except for regression via discretization, all the algorithms exhibit a very clear decreasing trend, which demonstrates the effectiveness of the prediction.

The training time and testing time of the regression algorithms are given in Fig. 10a and b, respectively. Basically, the training time of the six algorithms gradually decreases as the number of clusters gets smaller. As for the testing time, all the methods except regression via discretization are very fast. Finally, the RAE, RRSE, training time, and testing time for k = 250 are listed in Table 4. When the city is divided into 250 regions, linear regression and simple linear regression give the lowest RAE and RRSE, respectively. Besides, both the training time and testing time are below 0.1 second.

4.5 Summary

From both the classification and regression results, we can conclude that when the city is divided into fewer regions, the predictions of POI existence and of the number of POIs are more accurate. As mentioned above, one major reason is that the relation between the number of connections of cell towers and the user visiting rate in a small region is unreliable due to poor signal intensity and periodic maintenance. These negative effects can be alleviated by defining larger regions. It is hard to tell which learning method is the best, since different methods exhibit different advantages with regard to different evaluation metrics. This leaves room to improve the performance by designing adaptive algorithms, which could be one of our future research directions.

5 Conclusions

In this paper, we proposed a supervised learning-based framework for predicting the existence of POIs and the number of POIs in a given region using the spatio-temporal features extracted from cell tower call dumps in Guangzhou, China and the information of a set of restaurants collected from the Chinese social network Weibo. The Voronoi diagram is adopted to divide the city of Guangzhou geographically into small, contiguous regions. Then, a k-means clustering process is performed on the cell towers to merge small regions into larger ones. The connection frequencies of cell towers are taken as the features of a region, and a classification or regression model is used to predict the POI existence or the number of POIs in a given region, respectively. We have studied 12 state-of-the-art classification and regression algorithms. Experimental results show the feasibility and effectiveness of the proposed framework.

We consider two related research problems as our future work: the problem of determining the value of k and the choice of time resolution. One possible solution to the first problem is to design a metric for k based on some objective factors. For example, if the metric is based on travel time, we need to find the total length of the roads in the city (length), the average driving speed of vehicles (speed), and how much time the user is willing to spend (time); thus, a driving distance speed × time can be taken as the total length of the roads in a cluster, and k can be determined as length/(speed × time). For the second problem, the main idea is to separate two consecutive hours if they exhibit an obvious change of connection frequency (i.e., a new time slot should begin). In the future, we will collect data in other cities to verify the effectiveness of this method.
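Both ideas can be sketched in a few lines. The code below is only an illustration of the proposals as stated: the function names, the input values, the rounding of k, and the relative-change threshold are all hypothetical choices that the text does not specify.

```python
def clusters_from_travel_budget(total_road_length_km, avg_speed_kmh, time_budget_h):
    """k = length / (speed x time), rounded to an integer of at least 1."""
    per_cluster_length_km = avg_speed_kmh * time_budget_h
    return max(1, round(total_road_length_km / per_cluster_length_km))


def segment_hours(hourly_counts, rel_change=0.5):
    """Group consecutive hours into time slots.

    A new slot starts when the connection count changes by more than
    rel_change relative to the previous hour (threshold is an assumption).
    """
    slots = [[0]]
    for h in range(1, len(hourly_counts)):
        prev, cur = hourly_counts[h - 1], hourly_counts[h]
        change = abs(cur - prev) / max(prev, 1)
        if change > rel_change:
            slots.append([h])      # obvious change: begin a new time slot
        else:
            slots[-1].append(h)    # similar load: extend the current slot
    return slots


# Hypothetical city: 6,000 km of roads, 30 km/h average speed, 0.5 h budget.
print(clusters_from_travel_budget(6000, 30, 0.5))  # 400

# Hypothetical hourly connection counts for one region.
print(segment_hours([10, 11, 30, 32, 31, 12]))  # [[0, 1], [2, 3, 4], [5]]
```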

Acknowledgments R. Wang and C.-Y. Chow were partially supported by a research grant (CityU Project No. 9231131). S. Nutanong was partially supported by a CityU research grant (CityU Project No. 7200387). This work was also supported by the National Natural Science Foundation of China under Grant 61402460.

References

1. Bao J, Zheng Y, Mokbel MF (2012) Location-based and preference-aware recommendation using sparse geo-social networking data. In: ACM SIGSPATIAL

2. Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD (1972) Statistical inference under order restrictions: The theory and application of isotonic regression. Wiley, New York

3. Becker RA, Caceres R, Hanson K, Loh JM, Urbanek S, Varshavsky A, Volinsky C (2011) A tale of one city: Using cellular network data for urban planning. IEEE Pervasive Computing 10(4):18–26

4. Birant D, Kut A (2007) ST-DBSCAN: An algorithm for clustering spatial–temporal data. DKE 60(1):208–221

5. Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer, New York

6. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

7. Chen XM, Liu WQ, Lai JH, Li Z, Lu C (2012) Face recognition via local preserving average neighborhood margin maximization and extreme learning machine. Soft Comput 16(9):1515–1523

8. Collins M, Schapire RE, Singer Y (2002) Logistic regression, adaboost and bregman distances. Mach Learn 48(1-3):253–285

9. Ghosh S, Lee K, Moorthy S (1995) Multiple scale analysis of heterogeneous elastic structures using homogenization theory and voronoi cell finite element method. IJSS 32(1):27–62

10. Goh JY, Taniar D (2004) Mobile data mining by location dependencies. In: IDEAL

11. Gokaraju B, Durbha SS, King RL, Younan NH (2011) A machine learning based spatio-temporal data mining approach for detection of harmful algal blooms in the Gulf of Mexico. IEEE J-STARS 4(3):710–720

12. Hartigan JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc: Ser C: Appl Stat 28(1):100–108

13. Haykin S (1994) Neural networks: A comprehensive foundation. Prentice Hall PTR

14. Holmes G, Donkin A, Witten IH (1994) WEKA: A machine learning workbench. In: ANZIIS

15. Isaacman S, Becker R, Caceres R, Kobourov S, Martonosi M, Rowland J, Varshavsky A (2011) Identifying important places in people’s lives from cellular network data. In: Pervasive Computing

16. Kanasugi H, Sekimoto Y, Kurokawa M, Watanabe T, Muramatsu S, Shibasaki R (2013) Spatiotemporal route estimation consistent with human mobility using cellular network data. In: IEEE PerCom

17. Miller HJ, Han J (2009) Geographic data mining and knowledge discovery. CRC Press

18. Pan B, Zheng Y, Wilkie D, Shahabi C (2013) Crowd sensing of traffic anomalies based on human mobility and social media. In: ACM SIGSPATIAL

19. Quinlan JR (1996) Improved use of continuous attributes in C4.5. JAIR 4:77–90

20. Ratti C, Williams S, Frenchman D, Pulselli RM (2006) Mobile landscapes: using location data from cell phones for urban analysis. Environ Plan B: Planning and Design 33(5):727

21. Rish I (2001) An empirical study of the naive bayes classifier. In: IJCAI

22. Seber GAF, Lee AJ (2012) Linear regression analysis, volume 936. John Wiley & Sons

23. Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density estimation. JRSS, Series B 53(3):683–690

24. Stone CJ (1985) Additive regression and other nonparametric models. Ann Stat:689–705

25. Tong S, Koller D (2002) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66

26. Toole JL, Ulm M, Gonzalez MC, Bauer D (2012) Inferring land use from mobile phone activity. In: ACM UrbComp

27. Torgo L, Gama J (1996) Regression by classification. In: Advances in Artificial Intelligence, pp 51–60

28. Vapnik V (2000) The nature of statistical learning theory. Springer

29. Vieira MR, Frias-Martinez V, Oliver N, Frias-Martinez E (2010) Characterizing dense urban areas from mobile phone-call data: Discovery and social dynamics. In: IEEE SocialCom

30. Wang L, Huang YP, Luo XY, Wang Z, Luo SW (2011) Image deblurring with filters learned by extreme learning machine. Neurocomputing 74(16):2464–2474

31. Wang Y, Witten IH (1999) Pace regression. Technical Report 99/12, Department of Computer Science, The University of Waikato

32. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE TEVC 1(1):67–82

33. Yavas G, Katsaros D, Ulusoy O, Manolopoulos Y (2005) A data mining approach for location prediction in mobile environments. DKE 54(2):121–146

34. Ye M, Yin P, Lee W-C, Lee D-L (2011) Exploiting geographical influence for collaborative point-of-interest recommendation. In: ACM SIGSPATIAL

35. Yuan J, Zheng Y, Xie X (2012) Discovering regions of different functions in a city using human mobility and pois. In: ACM SIGKDD

36. Yuan J, Zheng Y, Xie X, Sun G (2013) T-drive: Enhancing driving directions with taxi drivers’ intelligence. IEEE TKDE 25(1):220–232

37. Zha Z, Wang M, Zheng Y, Yang Y, Hong R, Chua T (2012) Interactive video indexing with statistical active learning. IEEE TMM 14(1):17–27

38. Zhang J-D, Chow C-Y (2013) iGSLR: Personalized geo-social location recommendation: A kernel density estimation approach. In: ACM SIGSPATIAL

39. Zheng J, Liu S, Ni LM (2013) Effective routine behavior pattern discovery from sparse mobile phone data via collaborative filtering. In: IEEE PerCom

40. Zheng Y, Chen Y, Xie X, Ma WY (2009) Geolife2.0: A location-based social networking service. In: IEEE MDM

Ran Wang received the B.Sc. degree in computer science from the College of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2009, and the Ph.D. degree from the City University of Hong Kong, Hong Kong, in 2014. She is currently a Post-Doctoral Senior Research Associate with the Department of Computer Science, the City University of Hong Kong. Her current research interests include pattern recognition, machine learning, fuzzy sets and fuzzy logic, and their related applications.

Chi-Yin Chow received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities in 2008 and 2010, respectively. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. His research interests include spatio-temporal data management and analysis, GIS, mobile computing, and location-based services. He is the co-founder and co-organizer of ACM SIGSPATIAL MobiGIS 2012, 2013, and 2014.

Yan Lyu received the M.S. degree in pattern recognition and intelligent systems from University of Science and Technology of China, China, in 2013. She is currently working toward the Ph.D. degree in the Department of Computer Science, City University of Hong Kong. Her research interests include data mining, machine learning, and location-based services.

Victor C. S. Lee received the Ph.D. degree in computer science from the City University of Hong Kong in 1997. He is an Assistant Professor with the Department of Computer Science, City University of Hong Kong. His research interests include data management in mobile computing systems, real-time databases, and performance evaluation. Dr. Lee is a member of the ACM and IEEE Computer Society. From 2006 to 2007, he was the Chairman of the Computer Chapter, IEEE Hong Kong Section.

Sarana Nutanong is an Assistant Professor in the Department of Computer Science at City University of Hong Kong. He received his PhD from the University of Melbourne. Before joining CityU in January 2014, he was a Postdoctoral Research Associate at University of Maryland Institute for Advanced Computer Studies between 2010 and 2012 and held a research faculty position at the Johns Hopkins University from 2012 to 2013. His research interests include scientific data management, data-intensive computing, spatial-temporal query processing, and large-scale machine learning. More specifically, his research is aimed at providing large-scale, high-throughput support for computational scientific exploration applications.

Yanhua Li is a researcher with HUAWEI Noah’s Ark LAB, Hong Kong. He obtained two PhD degrees, in computer science from University of Minnesota, Twin Cities in 2013, and in electrical engineering from Beijing University of Posts and Telecommunications in 2009. His broad research interests are in analyzing, understanding, and making sense of big data generated from various complex networks in many contexts. His specific interests include large-scale network data sampling, measurement, and performance analysis, and spatio-temporal data analytics. He has held visiting positions in Bell Labs in New Jersey, Microsoft Research Asia, and HUAWEI research labs of America. He served on the TPC of INFOCOM 2015 and ICDCS 2014 and 2015, and he is the co-chair of SIMPLEX 2015.

Mingxuan Yuan is currently a Researcher at Huawei Noah’s Ark Lab, Hong Kong. Before that, he served as a PostDoc fellow in the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology. His research interests include big telecom (spatiotemporal) data storage/mining, telecom data mining, and data privacy.
