a day of your days: estimating individual daily journeys ... · through daily journeys, as we...

A Day of Your Days: Estimating Individual Daily JourneysUsing Mobile Data to Understand Urban Flow

Eduardo Graells-GarridoTelefónica I+DSantiago, Chile

Diego Saez-TrumperEURECAT

Barcelona, Spain

ABSTRACTNowadays, travel surveys provide rich information about urban mo-bility and commuting patterns. But, at the same time, they havedrawbacks: they are static pictures of a dynamic phenomena, areexpensive to make, and take prolonged periods of time to finish.However, the availability of mobile usage data (Call Detail Records)makes the study of urban mobility possible at levels not knownbefore. This has been done in the past with good results–mobiledata makes possible to find and understand aggregated mobility pat-terns. In this paper, we propose to analyze mobile data at individuallevel by estimating daily journeys, and use those journeys to buildOrigin-Destiny matrices to understand urban flow. We evaluate thisapproach with large anonymized CDRs from Santiago, Chile, andfind that our method has a high correlation (ρ = 0.89) with thecurrent travel survey, and that it captures external anomalies in dailytravel patterns, making our method suitable for inclusion into urbancomputing applications.

1. INTRODUCTIONTravel surveys provide information about urban mobility and com-

muting patterns, mainly through Origin-Destiny matrices derivedfrom them. These matrices allow urban planners and policy makersto understand travel patterns in urban mobility. Such surveys areusually collected once per decade, due to their expensiveness (bothin time and monetary costs). Moreover, they are limited in someways: travel surveys are static pictures of a dynamic phenomenaand, due to its sample size, they are limited to big areas (eitheradministrative or designed).

In this paper, we propose to use mobile data used for billing,which indicates a subset of the antennas a mobile device has con-nected through the day, as well as the corresponding timestamps. Weuse these digital footprints to build extended travel diaries. Traveldiaries are the basic elements of travel surveys, but we extend themthrough daily journeys, as we detect not only trips, but also other“non-trip” activities. From these daily journeys we build Origin-Destiny (OD) matrices, at a fraction of the expenses needed to buildOD matrices from travel surveys.

Under review–do not distribute. Contact the authors before citing.

Our main contribution is a method to detect these disaggregateddaily journeys using Call Detail Records (CDRs). Our approachis based on graphical timelines and computational geometry al-gorithms, which are applied having in mind transport-based rulesregarding trip duration and distance. We use an anonymized CDRdataset from one of the largest telecommunications company inChile, with a market share of 38.18% as of June 2015. Chile, beingone of the developing countries with highest mobile phone pen-etration, is a good candidate for analyzing urban mobility usingCDRs–for instance, there are 132 mobile subscriptions per 100 in-habitants.1 Particularly, we focus on Santiago, its capital and mostpopulated city.

To evaluate our results, we compare our predicted urban flow (inthe form of an OD matrix) with the last travel survey for Santiago,performed during 2012–2013. In terms of OD pairs, we obtain a veryhigh correlation (ρ = 0.89), indicating that our method recognizesthe urban flow on the city. We apply our methods at differentdays, and find that, in addition, we detect how urban flows changein the presence of unexpected conditions. This, jointly with thedisaggregated nature of our method, has potential for applicationsin urban computing [13], discussed at the end of this paper.

2. RELATED WORKThe estimation of OD matrices, and thus, urban mobility patterns,

is not new. The most basic way of estimating those patterns isby travel surveys, but also other methods have been developed. Acommon method is based on traffic counts [2], but the massive avail-ability of other kinds of information has allowed to estimate suchmatrices in other ways, for instance, by using smart-card passivedata [9] and, as in this paper, mobile data [1, 4, 5, 7, 11].

The work on mobile datasets has included both theory and prac-tice. From a theoretical point of view, the analysis of how predictablehumans are [5, 11]. A practical work has been the prediction of tran-sient OD matrices by analyzing transitions of connections betweencell antennas. These transitions are the basis of many methods,which employ techniques from optimization and temporal associa-tion rules [4], to transport-based rules to accept or discard transitions[1, 7]. Our approach is also based on antenna transitions. However,the main difference with previous work is that we reconstruct indi-vidual daily journeys, an extended version of the travel diaries usedto build travel surveys. These journeys are built using a geometricapproach based on graphical timelines [12], which are time-spacediagrams displaying timelines according to time and distance cov-ered. On these timelines we estimate the “turning points” of thedaily journey, which serve to differentiate daily activities into tripsand non-trips. These timetables are regularly used in transport to1http://www.subtel.gob.cl/estudios-y-estadisticas/telefonia/

arX

iv:1

602.

0900

0v1

[cs

.SI]

29

Feb

2016

Figure 1: Graphical timetable of a train schedule, by E.J. Marey,1885.

display and analyze schedules, as well as phenomena associatedwith vehicle behavior. For instance, Figure 1 shows a train scheduledesigned in 1885 by Étienne-Jules Marey. On it, passengers can seearrival and departure times, as well as the duration of stops and thevelocity of trains. Transport planners can study vehicle behavior–since all vehicles share the same origin, if we were visualizing theirtrue trajectories instead of the scheduled ones, vehicle bunching canbe recognized immediately [8].

3. METHODSOur methods consider mobile user traces estimated from anonymized

CDR data. CDRs are comprised of logged data extracted fromcell-phone antennas, and are used to bill customers. They containthe following events: calls, SMS (text messages), and data events(triggered every 15 Megabytes or every 15 minutes of an activeconnection). The following features are common between all events,and thus are considered for analysis: the anonymized user ID, theantenna ID, and time of the day. The antenna ID is used to determinea likely position for the user at the corresponding time of the day.Note that this position is assumed to be the same for all phonesconnected to the same antenna, i. e., we do not perform triangulationbased on signal strength with nearby antennas.

Problem Definition and Proposal. The problem we propose tosolve is the estimation of a diary of activities for a day, using CDRsfrom mobile data. A daily journey J for user u is defined as follows:

Ju = {(Ai, (tiO, tiD), (piO, piD)}.

Where Ai is an activity, pi (and ti) are the positions (and times)associated to the start/origin and end/destination of Ai. Activitiescan be of types trip, non-trip, and unknown.

To solve this problem we propose a two-step algorithm: first,we define the candidate turning points of a day, where turningpoint is a moment in the day, at a specific position, where the userstarted to perform an activity (and, by definition, ends performing aprevious activity). The second step is to detect activities based onthe candidate turning points.

Graphical Timetables and Turning Points. For all users, we buildthe following vector:

~u = [(t0, p0), (t1, p1), . . . , (tn, pn)],

Where each element in ~u corresponds to an event in the CDR eventsof u in a day, with the corresponding timestamp t and the antennaposition p. These vectors can be transformed into graphical timeta-bles (see Figure 1), where the x-axis is the elapsed time duringthe day, and the y-axis is the traveled (accumulated) distance from

Figure 2: Illustration of the Ramer-Douglas-Peucker algorithm.Source: Wikipedia Commons.

the starting point. The different groups of segments present in thetimetable define the activities in Ju.

However, due to the potential amount of CDR events, severalcontiguous segments could be related to the same activity. To tacklethis, we propose to simplify the timetable using the Ramer-Douglas-Pecker line simplification algorithm [3]. This algorithm, depictedon Figure 2, starts with the extremes of the trajectory, and iterativelyadds vertices according to their distance to the candidate simplifiedline. The vertex with greater distance (or error) is added, and theprocess is repeated until the error is less than a given tolerance. Thisresults in a simplified vector ~us, which contains candidate “turningpoints” of daily activity. Those candidate points define the potentialactivity segments that will end defining Ju.

Activity Classification. The second step is to identify which con-tiguous segments in the line defined by the points of ~us can bemerged into a single activity Ai. The most direct approach is tomerge a series of horizontal (or nearly horizontal) segments. How-ever, a continuously moving user (and thus, with non-horizontalsegments) is also feasible. Take the example of a taxi driver, whoseprimary activity during the day is working as a driver. Althoughthese activities are valid, they are not relevant for an OD matrix,as they are not trips in the individual sense. Thus, we define thefollowing classifications: unknown, non-trips, and trips.

Unknown are segments where the total distance covered is greaterthan 100 kilometers. In those cases we cannot distinguish betweentrips and unknown situations. For instance, the mobile number isassociated to a vehicle (e. g., a taxi) or another vehicle/device. Whilethese are indeed displacements in time and space, they do not fallon the daily journey concept we study in this paper. Another case iswhen users switch to WiFi networks, and thus disappear from theevent log for a long period of time. For instance, users who switch tothe wireless network of their work place might not be detected there,and the next logged event could appear after working hours. Onthose cases, we cannot identify trips reliably. To avoid this scenario,in some cases the displacement between events has been limited tospecific time windows (e. g., 10 minutes and 1 hour [7]).

Non-trips are segments that are not unknown. The criteria to as-sign this type to an activity is based on the covered distance and time.On the one hand, the antenna density of origin/destiny locationsof the activity is considered to determine a minimum distance thatcannot be attributed to signalling changes (this is discussed furtherin the next section). On the other hand, some activities involvedisplacements (e. g., working/studying in a big campus), but thespeed of movement is much slower than when performing a trip.Thus, if there is a distance displacement, but the time is greater than180 minutes, we still consider a non-trip activity.

Figure 3: The 752 zones from the OD Survey zoning. Colorsindicate the 35 municipalities under consideration.

Finally, trips are segments that are not unknown and are not non-trip, but, in addition to the 180 minute rule, we enforce a minimumduration of 15 minutes for a segment, due to the granularity of theCDR data. Any other activity that does not fall into the rules definedabove is considered is considered unknown.

Activity Merging. Having assigned an activity to each segmentbuilt from ~us, we procede to merge contiguous segments that havethe same activity. Two or more segments are merged into an activityAi by considering the first time and position in the segment as origin,and the last time and position in the segment as destination. Wealso consider an additional case: when two trip activities surround anon-trip activity, and the duration of the latter is lesser or equal than15 minutes, its activity is changed to trip. This scenario correspondto situations when users in public transport make a connection, orusers in vehicles face congested traffic. In this way, after mergingall activities, the daily journey Ju is built.

4. CONTEXT AND DATASETWe work with a dataset from Santiago, the capital of Chile. San-

tiago is a city with almost 8 million inhabitants, and it has an inte-grated public transport system named Transantiago. The Metropoli-tan Area of Santiago is composed of 35 independent administrativeunits named municipalities. We work with this set of municipalities,depicted on Figure 3.

Santiago 2012 Travel (OD) Survey. The Santiago 2012 travelsurvey (ODS hereafter) was performed during 2012–2013.2 It tookalmost one year to finish, and contains 96,013 trips (from 40,889users). The information of trips is obtained through the travel diariesfulfilled by the surveyed persons. The ODS, used to define publicpolicy related to public and private transport in the city, as well asgeneral urban mobility, is performed every 10 years due to its costsand its difficulty.

The survey considers other municipalities outside the area, aswell as cities in other regions, due to the characteristics of the surveyprocedure. Additionally, the survey also defines a zoning of the city,with 752 zones within the considered municipalities. Each zone

2http://www.sectra.gob.cl/biblioteca/detalle1.asp?mfn=3253.

Figure 4: Distributions of OD Survey 2012 trip data from Santiago,Chile. The matrix has been L2 normalized on rows (origins).

intends to control for land-use and population density. Even thougheach trip in the survey is associated to a zone and a municipality,the survey is only representative at municipal level. At its currentgranularity, the data available is not enough to calculate reliablemobility patterns at zone level.

In this paper, without losing generality, we focus on the 51,819trips performed on working days, from 22,541 users, performed onthe inner 35 municipalities of the Metropolitan Area. We aggregatedtrips according to municipality into an OD matrix, shown in Figure4. One can see that the most common trips are intra-municipalities,but there are still municipalities that tend to receive more trips thanothers. This is because most of the commercial and working land-use is on the municipalities of Santiago, Providencia, Las Condesand Vitacura. Note that the municipality of Santiago is at the centerof the Santiago Metropolitan Area. In the rest of this paper, whenwe mention Santiago we refer to the Metropolitan Area.

In terms of trip variables, Figure 5 shows the distributions oftrip start time, trip duration, and approximated trip distance (i. e.,euclidean distance). One can see that trip start time follows anexpected pattern of two high peaks (one in the morning and one inthe afternoon), with a third smaller peak at lunch time. With respectto trip duration, the mean duration is 41 minutes. Note that theself-reported nature of the survey is evident, due to the several peakspresent on the distribution in 15 minute periods (e. g., 30 and 45minutes). Finally, with respect to travel distance, the mean euclideandistance is 6.05 kilometers.

Mobility Data. We analyze mobile data using CDRs from one ofthe largest telecommunications company in Chile. The CDR datacontains events for all Mondays and Tuesdays of June 2015.In the 35 municipalities under consideration, the company has12,936 antennas, with 98% of the zones having at least one an-tenna. Figure 6 displays the antenna territorial density. Note thatthe antenna distribution is not homogeneous on the city, nor at anylevel. For instance, antenna distribution is correlated with the ODS,considering aggregated destinations at municipal level (ρ = 0.91,p < 0.001).

To avoid artifacts in the trip detection introduced by connectivitychanges due to signal strength (i. e., mobile phones looking for thebest signal) and signal balancing (i. e., mobile antennas pointing

Figure 5: Distributions of EOD Survey 2012 trip data.

devices to connect to other antennas due to saturation), we estimateda distance matrix Dzi for all antennas in each zone. Then, wedefined that the minimum distance for an activity to be considered atrip between zones zo and zd is:

dmin = max{Q(Dzo), Q(Dzd)}.

Where Q is the quantile function (by manual experimentation wehave found that 0.8 is a good value). The mean value of Q(Dz) is732 meters (minimum of 45 m., and maximum of 14.1 km.).

While we do not disclose the exact number of users in each daydue to confidentiality and commercial issues, Figure 7 displays thedistributions of event frequency and hourly entropy for all days inthe dataset. In the left chart, each dot is a minute in a specific day.The frequency encodes the fraction of events that the dot containsper day. One can see that the distribution of frequency of eventscan be approximated by a cubic linear regression, with a higherfrequency of events in the afternoon.

The right chart of Figure 7 displays the distribution of user entropywith respect to hours of the day, for each day. This entropy is definedas the Shannon entropy of user u:

Hu = −∑

pi,u ln pi,u.

Where pi,u is the probability that user u has a CDR event in theith hour of the day. The purpose of estimating this entropy is tohave a measure of diversity with respect to time for each user. Thus,we discard users who do not have enough diversity to be able toestimate their journeys, or that have too much diversity to be normalusers (e. g., they could be SIM cards associated to machines). Wediscarded users in the first quartile (Hu < 0.4) and in the last decile(Hu > 0.9).

We apply our methods to this dataset on the following section. Tobe able to compare with the ODS, we employ a similar sample sizeof N = 100, 000 randomly selected mobile users (before filteringby entropy).

Figure 6: Zoning of Santiago from the OD Survey. Colors indicateantenna density in each zone.

Figure 7: Distributions of CDR event frequency (left) and entropywith respect to hours of the day (right).

5. DAILY JOURNEYS AND OD MATRIXWe applied our method to generate daily journeys Ju to the CDR

dataset. Figure 8 shows the result of randomly selected samplesfrom the dataset. Each colored line is a different user, and thegrey line underneath each colored line is the original timetablebefore simplification. The remaining “turning points” are renderedin purple, with a bigger size for easy identification on the image.Those points define the activity segments which we classify asunknown, non-trip and trip according to the definitions providedearlier.

After estimating the daily journeys from the graphical timetables,we discarded users without trip and non-trip activities. Table 1 showsthe final number of users considered (after filtering by entropy, andafter filtering without valid trips/non-trips). From these activities,we built a transient OD matrix for each day, using the initial and finalpositions of each trip, which were assigned to their correspondingmunicipalities. Since we estimated matrices for many days, weaveraged the number of trips for each OD pair of origin/destinymunicipalities (mo,md). Figure 9 shows the resulting matrix.

We also estimated the trip variables analyzed before for the ODS.Figure 10 shows the kernel density estimations of each distributionfor each day. One can see that, overall, the distributions are mostlysimilar for all days, with the following exceptions: start time dis-tribution is different on June 29th, and trip duration distributionis different on June 8th and 9th. On June 9th there was a strikein the Santiago public transport system, which explains partly theincreased trip time in comparison to the other days. Additionally, onJune 29th the semi-final of the latin-american soccer championshipCopa América, where Chile was a contender, was played at 8pm.One can see that the number of trips in the afternoon was muchhigher than in the morning, making the morning peak to shrink in

Figure 8: Sample graphical timelines of mobile user traces with ourmethod applied, including the RDP line simplification algorithm.

Date N # Trips Mean Time (m) Mean Distance (km)

2015-06-01 37,037 56,148 78.38 7.682015-06-02 37,622 57,764 77.78 7.742015-06-08 34,272 50,017 83.86 7.802015-06-09 34,611 50,778 83.03 7.772015-06-15 36,274 53,642 79.45 7.682015-06-16 36,804 56,046 78.16 7.682015-06-22 34,288 49,434 80.05 7.782015-06-23 36,239 54,983 77.62 7.712015-06-29 22,770 31,963 74.78 7.502015-06-30 35,181 52,597 78.94 7.66

Table 1: Number of users and trips for each day analyzed, as wellas mean trip duration and mean euclidean distance.

the density estimation. Moreover, the distribution highlights thatthe afternoon peak was earlier than usual, and that there was a nightpeak after the soccer match.

Table 1 shows the number of users (and their trips), the meantrip duration, and the mean euclidean distance of trips for eachday. Mean times vary within 74.78 and 83.86 minutes, and meandistances vary within 7.5 and 7.8 kilometers. One can see thatboth distance and time are over-estimated in comparison to theODS (mean time 41 minutes; and mean distance 6.05 kilometers).The differences in time could appear due to the latency in antennachanges and to CDR event granularity (which happens every 15minutes in the case of data connections). The differences in distancecould be explained by the approximation of each individual positionto the corresponding antennas. However, even though the meansare different, the distributions have similar shapes to those from theODS, as displayed on Figure 10.

Comparisons with ODS. A key question is how much differentour results are with respect to the ODS. First, we estimated theSpearman rank-correlation between our results and the ODS atmunicipal level, obtaining ρ = 0.89 (p < 0.001). The correlationis very high, which means that our averaged matrix reflects the flowof people in the city very well. To compare to what extent ourresult is good, we refer to a 2013 OD matrix estimated from thepublic transport smart-card data [9].3 This matrix has a correlationof ρ = 0.3 (p < 0.001) with the ODS, a result that indicates thatthe ODS captures a diversity of trip modes, not only public transport.We also tested what happened if we did not consider the matrixdiagonal (i. e., without intra-municipal trips), and we observed thatwe mantain our correlation, while the public transport OD increased(ρ = 0.32). This makes sense–short trips are less likely to usepublic transport (e. g., if the destiny is at walking distance).

3http://www.dtpm.cl/index.php/2013-04-29-20-33-57/matrices-de-viaje

Figure 9: Distributions of CDR trip data from Santiago, Chile. Thematrix has been L2 normalized on rows (origins).

Finally, we visually compare whether the distributions of starttime, trip duration and trip distance are similar to those of theODS. To do, Figure 12 shows the Cumulative Density Functionsof these variables. We observe that, in terms of start time andeuclidean distance, the CDFs are very similar, although in terms oftrip duration there is a noticeable difference, as mentioned earlier.

6. DISCUSSION AND CONCLUSIONSIn this paper we predicted OD matrices from daily journeys built

using mobile data. We did so by proposing a geometric approachbased on graphical timelines, and found that, using a sample frommobile data connectivity, we were able to reconstruct the OD pairsin a city, at municipal level, with a very high correlation. Moreover,we were able to estimate start time, trip time and trip distance, ina way that resembled the ODS results, with exception of trip time,which needs calibration to account for the delay in antenna changewith respect to the moment in which each trip started.Implications. On the one hand, our algorithm is very simple andcan work with streaming data if what matters is the number oftrips. Consider the day in which a soccer match was played, andpeak hours shifted. Reportedly, the transport authorities did notaccount for this shift, and instead they only created ad-hoc routes inpublic transport.4. Thus, our results could support urban computingapplications [13] which need almost real-time transport data. Onthe other hand, our method complements the ODS in importantways. The ODS is performed every 10 years, but, as with any staticpicture, it does not capture rich context-dependant dynamics of thecity. Conversely, our approach is more dense, as it can be appliedeven at daily scale to observe differences that the ODS does not.In this paper we have shown that, even working with a sampleof anonymous data from mobile data, it is possible to reconstructpart of the ODS, as well as finding diverging days from the typicalpatterns, at a fraction of its costs. This can be used to measure, forinstance, what are the effects in urban flows caused by transportmeasures like road space rationing.Limitations and Future Work. Our geometric algorithm, witharguably reasonable transport-based constraints, does not consider4http://www.mtt.gob.cl/copaamerica/santiago

Figure 10: Distributions of CDR trip data for each day.

information that could be useful in determining the turning pointsof the day (e. g., land-use or census information). Additionally, thedistance and time estimation need to be corrected using scalingfactors. Then, in addition to address our limitations, the main lineof research for future work will be the characterization of non-tripactivities. One way of performing such characterization is throughthe analysis of land-use derived from mobile data analysis [6, 10].Acknowledgements. We thank Oscar Peredo, José García andPablo García for valuable discussion.

References[1] Lauren Alexander, Shan Jiang, Mikel Murga, and Marta C

González. “Origin–destination trips by purpose and time ofday inferred from mobile phone data”. In: TransportationResearch Part C: Emerging Technologies (2015).

[2] Ennio Cascetta. “Estimation of trip matrices from trafficcounts and survey data: a generalized least squares estima-tor”. In: Transportation Research Part B: Methodological18.4 (1984), pp. 289–299.

[3] David H Douglas and Thomas K Peucker. “Algorithms forthe reduction of the number of points required to representa digitized line or its caricature”. In: Cartographica: TheInternational Journal for Geographic Information and Geo-visualization 10.2 (1973), pp. 112–122.

[4] Vanessa Frias-Martinez, Cristina Soguero, and Enrique Frias-Martinez. “Estimation of urban commuting patterns usingcellphone network data”. In: Proceedings of the ACM SIGKDDInternational Workshop on Urban Computing. ACM. 2012,pp. 9–16.

[5] Marta C Gonzalez, Cesar A Hidalgo, and Albert-LaszloBarabasi. “Understanding individual human mobility pat-terns”. In: Nature 453.7196 (2008), pp. 779–782.

[6] Eduardo Graells-Garrido and José García. “Visual Explo-ration of Urban Dynamics Using Mobile Data”. In: Proceed-ings of Ubiquitous Computing and Ambient Intelligence. Sens-ing, Processing, and Using Environmental Information: 9thInternational Conference, UCAmI 2015. Springer Interna-tional Publishing. 2015, pp. 480–491.

Figure 11: Comparison of OD matrices: OD Survey, our method,and public transport [9].

Figure 12: Comparison between OD Survey and CDR-based data.

[7] Md Shahadat Iqbal, Charisma F Choudhury, Pu Wang, andMarta C González. “Development of origin–destination ma-trices using mobile phone call data”. In: Transportation Re-search Part C: Emerging Technologies 40 (2014), pp. 63–74.

[8] Luís Moreira-Matias, Carlos Ferreira, João Gama, João Mendes-Moreira, and Jorge Freire de Sousa. “Bus bunching detectionby mining sequences of headway deviations”. In: Advances inData Mining. Applications and Theoretical Aspects. Springer,2012, pp. 77–91.

[9] Marcela A Munizaga and Carolina Palma. “Estimation of adisaggregate multimodal public transport Origin–Destinationmatrix from passive smartcard data from Santiago, Chile”. In:Transportation Research Part C: Emerging Technologies 24(2012), pp. 9–18.

[10] Kentaro Nishi, Kota Tsubouchi, and Masamichi Shimosaka.“Extracting land-use patterns using location data from smart-phones”. In: Proceedings of the First International Confer-ence on IoT in Urban Space. ICST (Institute for ComputerSciences, Social-Informatics and Telecommunications Engi-neering). 2014, pp. 38–43.

[11] Chaoming Song, Zehui Qu, Nicholas Blumm, and Albert-László Barabási. “Limits of predictability in human mobility”.In: Science 327.5968 (2010), pp. 1018–1021.

[12] Edward R Tufte. Envisioning information. Graphics Press,1993.

[13] Yu Zheng, Licia Capra, Ouri Wolfson, and Hai Yang. “Urbancomputing: concepts, methodologies, and applications”. In:ACM Transactions on Intelligent Systems and Technology(TIST) 5.3 (2014), p. 38.

a day of your days: estimating individual daily journeys ... · through daily journeys, as we...

Documents