GENERATING MAP-BASED ROUTES FROM

GPS TRAJECTORIES AND THEIR COMPACT

REPRESENTATION

Research Thesis

In Partial Fulfillment of the Requirements for the

Degree of Master of Science in Computer Science

Ranit Gotsman

Submitted to the Senate of

the Technion – Israel Institute of Technology

Tevet 5773 Haifa January 2013

Technion - Computer Science Department - M.Sc. Thesis MSC-2013-06 - 2013

The Research Thesis Was Done Under the Supervision of Prof. Yaron Kanza

in the Faculty of Computer Science at the Technion

The Generous Financial Help of the Technion Is Gratefully Appreciated


Table of Contents

Abstract

CHAPTER 1 - Introduction

CHAPTER 2 - Related Work

CHAPTER 3 - The Map-Matching Algorithm

3.1 Algorithm Details

3.2 Experimental Results

3.3 Discussion

CHAPTER 4 - Compact Representation of Routes

4.1 Greedy Path Coding

4.2 Shortest Path Coding

4.3 Experimental Results

4.4 Discussion

CHAPTER 5 - Conclusion and Future Work

Bibliography


Table of Figures

Figure 1: A trajectory of noisy GPS readings and its associated polyline on the background of a

digital map.

Figure 2: Each of the GPS trajectory points is “snapped” to the closest map edge, leading to an

incorrect map-match.

Figure 3: Comparison of the algorithm of Huabei and Wolfson to our map-matching algorithm.

Figure 4: The associated trellis used in the algorithm of Hummel and the associated trellis used in

the algorithm of Newson & Krumm.

Figure 5: GPS trajectory simplification using the Douglas-Peucker (DP) algorithm.

Figure 6: Extraction of map edges relevant to a given GPS trajectory edge.

Figure 7: Values used in the computation of the weight of the edge ((ti,e), (x,y)) in the trellis graph.

Figure 8: Our map-matching algorithm with its associated trellis.

Figure 9: Possible choices of starting edges for a single GPS trajectory start point.

Figure 10: Map-matching example.

Figure 11: Map-matching example.

Figure 12: Map-matching example.

Figure 13: Sparse GPS trajectory with map-matched route.

Figure 14: The SPT problem and solution.

Figure 15: Greedy paths.

Figure 16: Simple greedy path code vs. optimal greedy path code.

Figure 17: Illustration of proof of optimal shortest path coding theorem.

Figure 18: Three different codings of a given path.

Figure 19: Three different codings of a given path.

Figure 20: Coding ratios of the three algorithms.

Figure 21: Coding running times of the three algorithms.


Abstract

Digital maps are now an integral part of our lives, on computers, tablets and smartphone hardware

platforms and they are integrated into many software applications, static and mobile, from location-

based services to navigational aids. GPS receivers, capable of measuring location instantaneously

and accurately, are also ubiquitous in most mobile platforms, enabling applications relating and

combining GPS-measured locations and routes with digital maps.

This thesis deals with two practical problems related to GPS trajectories and digital maps. The first

is the classical problem of map-matching, namely matching a given (possibly noisy or sparse) GPS

trajectory to the sequence of roads traversed by the GPS receiver (typically in a navigation system)

in reality. We provide a novel solution to this problem, an extension of an existing method based

on Hidden Markov Models (HMM). Our algorithm also works well in scenarios where the GPS

measurements are very sparse and noisy, a capability lacking in the existing HMM approaches. We

show the connection between our approach and the seemingly unrelated problem of the Shortest

Path Tour (SPT) on graphs.

The second problem we deal with is the compact coding of routes on digital maps. A route is a

sequence of vertices in the embedded planar graph representing a map, typically describing a route

taken by a mobile vehicle. An emerging problem is how to efficiently code these data sets in a

world where millions of these routes are generated each day, and all have to be stored and/or

transmitted for future processing in large databases. We provide two methods to code digital routes. The

first method represents the given route as a sequence of so-called greedy paths, where a greedy

path between vertex s and vertex t is one where the Euclidean distance to t is minimized as each

edge of the path is traversed. We provide two algorithms to generate a greedy path code for a route

containing n vertices. The first algorithm is fast, O(n), and the second one is slower, O(n^2), but

optimal, meaning that it generates the shortest possible greedy path code. Decoding a greedy path

code can be done in O(n) time. The second method codes a route as a sequence of (classical) shortest

paths. We provide a simple algorithm to generate a shortest path code in O(k n^2 log n) time, where

k is the length of the (output) code, and prove that this code is optimal. Decoding a shortest path

code also requires O(k n^2 log n) time. Experimentally, we observed that shortest path codes are much

more compact than greedy path codes, justifying the larger time complexity.


CHAPTER 1

Introduction

Digital maps are now an integral part of our lives, on computers, tablets and smartphone hardware

platforms and they are integrated into many software applications, static and mobile, from location-

based services to navigational aids. The basic sensor which determines our position on a digital

map is the GPS receiver, which is capable, under quite mild conditions, of measuring the

coordinates of the device in a global geographical coordinate system.

The GPS receiver provides two-dimensional coordinates of the unique point where the receiver

happens to be. This very low-level information is useful, but frequently a sequence of GPS

measurements over time (a trajectory) provides much more information about the movement of the

user holding the device. When the user is constrained to move on mapped roads, it is important to

be able to match the GPS readings to a digital map and specify the roads the user moved on. Since

GPS readings are typically noisy because of receiver electronics, view blockage, or signals reflecting

off multiple objects before reaching the GPS receiver, this is also a good way to denoise the GPS readings,

as the map and the readings are independent information sources. See Fig. 1 for an illustration of typical GPS readings on the

background of a digital map.

Determining which road(s) a device is moving along based on its GPS trajectory is commonly

called map-matching. While the problem becomes quite easy if the GPS readings are very accurate

and obtained at a high frequency, it becomes more challenging when noise is present and the

readings are obtained at a low frequency. For example, for the input in Fig. 1, it is quite difficult to tell

which road(s) the user is moving on based on just these few GPS readings.

Since the advent of the consumer-level GPS over 8 years ago, and especially with its widespread

proliferation on smartphones over the last few years, it has become extremely important to be able

to match a GPS trajectory to a given digital map, which is represented as a planar directed graph

embedded in the plane. Vehicle tracking on a digital map is essential “raw” material for a broad

range of applications such as traffic management, control, assessment and prediction, routing, and

navigation. To be reliable, hence useful, the data has to be related to the underlying road network

by means of map-matching algorithms. Chapter 2 describes some of the classical map-matching

algorithms and their shortcomings. In Chapter 3 of this thesis we describe a novel map-matching

algorithm which overcomes some of these shortcomings, especially for the scenario of sparse GPS

trajectories. This scenario arises when the GPS is activated at low frequency, which is desirable

because the GPS system is notoriously power-consuming, easily emptying a typical cellphone

battery within less than an hour if not used sparingly. It also arises when a dense GPS trajectory is

diluted before the map-matching operation, which is a common preprocessing step in many

systems, significantly reducing the time and space complexity of subsequent processing.


Matching a GPS trajectory to a map results in a route which is a path (sequence of adjacent vertices)

in the map. Long trajectories will naturally result in long routes, and many such trajectories, e.g.

those collected over a long period of time, will require significant storage space. Compounded with

the fact that modern online database systems accumulate enormous quantities of these datasets to

support a variety of data-mining applications, it is therefore important to be able to represent these

routes in as compact a manner as possible. This can be thought of as a coding operation, for which

the corresponding decoding operation, when applied to the “code”, recovers the original route. In

Chapter 4 we describe three route coding algorithms, with a tradeoff between coding efficiency and

encoding/decoding time complexity. All algorithms are based on the principle of representing a

route as a subsequence of its vertices, such that the sub-routes between any two successive vertices

in this subsequence are uniquely recoverable, given the map. The original route is the concatenation

of these subroutes. For example, a route may be represented as a sequence of shortest paths in the

map. The coding algorithm is required to find the endpoints of this sequence of subroutes, and the

decoder to compute shortest paths between these endpoints. In Chapter 5 we conclude and describe

some possible future work.

Figure 1: A trajectory of noisy GPS readings (green points) and its associated (blue) polyline on the background

of a digital map. Note how difficult it is to decide which roads the points are on without looking at the entire

trajectory. Black arrows show road matching options for the GPS points.


CHAPTER 2

Related Work

Map-matching has been studied for more than a decade, and the algorithms have evolved from very

simple to quite sophisticated. A complete review of all the existing algorithms would be impossible

in a reasonable amount of space, thus we refer the interested reader to the comprehensive surveys

of (White, Bernstein, & Kornhauser, 2000), (Quddus M. A., Ochieng, Zhao, & Noland, 2003) and

(Quddus, Ochieng, & Noland, 2007), and we mention here just those closely related to our work.

The first algorithms for map-matching were based on geometric proximity alone. Given a sequence

of GPS readings X = (x1, …, xn), where each xk is a two-dimensional coordinate, and a road map

(network) represented as a (planar) graph, each of the points xk is “snapped” (i.e. projected) to the

closest map edge. See Fig. 2, which shows the result of this naïve algorithm applied to the GPS

trajectory of Fig. 1. Since each point is snapped independently of the others, the results could be

quite inconsistent and confusing. An obvious improvement to this naïve approach is to take

advantage of the global map topology (e.g. (Greenfeld, 2002)), i.e. the fact that the user must travel along

roads in a continuous manner, and cannot “bounce” between roads. The first such topological map-

matching algorithms operated in the regime where GPS readings were few and far between, thus

had to assume a user model, namely that the user had certain well-defined behavior patterns, e.g.

that a user would move along the shortest path between two given points.
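The naïve snapping approach described above can be illustrated with a minimal sketch. It assumes planar coordinates and map edges given as straight segments; the function names are hypothetical and do not come from any of the cited works.

```python
import math

def project_point_to_segment(p, a, b):
    """Return the point on segment ab closest to p, and its distance from p."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Parameter of the orthogonal projection, clamped to the segment.
        t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
    qx, qy = ax + t * dx, ay + t * dy
    return (qx, qy), math.hypot(px - qx, py - qy)

def snap_points(points, edges):
    """Snap each GPS point, independently of the others, to its closest edge.

    Returns a list of (edge_index, snapped_point) pairs.
    """
    result = []
    for p in points:
        best_index, best_point, best_dist = None, None, math.inf
        for i, (a, b) in enumerate(edges):
            q, d = project_point_to_segment(p, a, b)
            if d < best_dist:
                best_index, best_point, best_dist = i, q, d
        result.append((best_index, best_point))
    return result
```

With two parallel roads and a noisy trajectory between them, consecutive points may be assigned to alternating roads, which is exactly the inconsistency that topological methods try to avoid.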

Figure 2: Each of the (green) GPS trajectory points is “snapped” (orange points) to the closest map edge, leading

to an incorrect map-match.


A variation on this theme is the algorithm of (Huabei & Wolfson, 2004), who assigned a weight to

each map edge based on geometric properties of the edge such as its distance from the polyline

defined by X and the shape of the polyline, and then solved a shortest path problem on the map

using these new weights. The distance of a map edge to the polyline may be based on Euclidean

distance, difference in orientation, edge directionality, and a variety of other factors. Unfortunately,

this algorithm may err by preferring “short-cuts” in the map, as illustrated in Fig. 3. This is primarily

due to the fact that the algorithm first “colors” the map based on the given GPS trajectory, then

“forgets” the trajectory and continues to operate on the colored map alone (e.g. with a shortest path

algorithm).

Figure 3: (Left) Shortcut in map-matched route preferred by the (Huabei & Wolfson, 2004) algorithm. (Right)

Desired result, as produced by our map-matching algorithm (to be described later). Blue polyline with green

points is the GPS trajectory. Red polyline is map-matched route.

More sophisticated map-matching algorithms are based on a Hidden Markov Model (HMM)

probabilistic approach (Hummel, 2006). Treating a GPS trajectory of edges T = (t1, t2, …, tn) as a

sequence of empirical observations (i.e. measurements), they attempt to compute the most likely

sequence of map edges traversed given that sequence of observations.

The key observation in the HMM approach is that the algorithm must work simultaneously on the

two inputs: the map and the GPS trajectory, hence operates in a state space consisting of states

which are pairs of entities, one from the map and one from the GPS trajectory. Thus solving the

HMM involves building a trellis, which is a replication of the map n times (one per GPS

trajectory point). Each replica is a layer of the trellis, containing all map vertices. Most

implementations work with the graph edges instead of the graph vertices, so the trellis layers consist of a

node per graph edge. Similarly, each trellis layer may represent either a GPS trajectory point or

edge. We prefer to use GPS edges. Thus, in this layered trellis graph, each trellis node represents a


pair: an edge from the GPS trajectory and an edge from the map, and each trellis edge represents a

connection between two map edges relevant to that edge of the trajectory. A trellis node (ti,ej) is

connected to a trellis node (ti+1,ek) iff the two map edges ej and ek are relevant (i.e. sufficiently close)

to the GPS trajectory edges ti and ti+1 and connected one to the other. Note that trellis edges exist

only between two adjacent layers of the trellis. Each trellis node (ti,ej) has an emission probability

that estimates the correlation between the GPS measurement ti and the edge ej based on (Euclidean)

distance between them. The trellis edge connecting node (ti,ej) to node (ti+1,ek) has a transition

probability that estimates the “distance” between the two map edges ej and ek. There are no edges

between trellis nodes within the same layer. See Fig. 4 for an example of a graph and GPS

trajectory, and the corresponding trellis. The HMM algorithm attempts to find a path of trellis edges from

the first layer to the last. This sequence of edges represents the map-matched route. In essence, the

original HMM algorithm (Hummel, 2006) proceeds monotonically along the temporal axis

described by T, namely, along the horizontal dimension of the trellis, essentially traversing the map

edges while traversing the trajectory, following the shortest weighted path through the trellis. The

weight of a path is derived from the emission and transition probabilities of the vertices and edges

along that path. The fact that there are no edges within layers allows efficient computation of this

shortest path using the Viterbi dynamic programming algorithm (Viterbi, 1967). The result is a list

of map edges, which is the map-matched route.

The original HMM algorithm was designed primarily for the scenario of dense (but perhaps noisy)

GPS trajectories. By “dense”, we mean that, on the average, there are many GPS points per map

edge. This means that the horizontal dimension of the trellis will be much larger than the vertical

dimension, and there will be many edges in the shortest path computed through the trellis which

will “march” along the same map edge. This precludes the opposite scenario – that of sparse GPS

trajectories. In this case, the trellis has a very small horizontal dimension, and many map edges

should be traversed for a single trajectory edge. Since there are no edges within a trellis layer, this

is not supported well, and the shortest path through the trellis is meaningless.

One of the more recent variants of the HMM algorithm for map-matching (Newson & Krumm, 2009)

attempts to modify the algorithm to deal also with the case of sparse GPS trajectories. For each

trajectory point/edge, all the map edges in its vicinity – those that are not further away than some

radius R (typically R=200m in real life applications) – are considered. An edge is added between

two adjacent layers of the trellis corresponding to explicit shortest paths computed between any

pair of map edges in adjacent vicinities. This way there are still no edges within trellis layers, but

it is possible to “jump” between layers, each layer corresponding to a GPS trajectory point, even if

these points are quite far apart. See Fig. 4 (left) for an example of a simple map and a GPS trajectory

consisting of four readings, thus three edges. Fig. 4 (center) shows the associated trellis used in the

algorithm of (Hummel, 2006). Each of the three layers consists of a replica of the 12 graph edges.

The blue edges between layers correspond to relationships between the three edge vicinities,

essentially representing adjacent edges in the input graph. For example, the blue edge between (A,e1)

and (B,e3) corresponds to the path (e1,e3). Nodes are color-coded according to vicinities, and

correspond to graph edges in the vicinity associated with that layer. Each trellis node is weighted using

an emission probability and each trellis edge is weighted using a transition probability. No shortest

path exists between e1 and e12, so the algorithm will not generate a correct result. Fig. 4 (right)

shows the associated trellis used in the algorithm of (Newson & Krumm, 2009), which contains all


that was in the previous trellis, and additional red edges. As before, the blue edges in the trellis

represent adjacent edges in the input graph. The additional red edges in the trellis represent non-

trivial shortest paths in the input graph. For example, the red edge between (A,e1) and (B,e4)

corresponds to the shortest path (e1,e2,e4) between e1 and e4 in the input graph. The bold (blue and

red) path is the shortest path between e1 and e12 through the trellis, corresponding to bold red path

in the input graph, which is the resulting map-match of the GPS trajectory.

While this modified HMM algorithm is now capable of map-matching sparse trajectories, the main

problem is that it requires the computation of many shortest paths on the map related to many of

the trajectory edges in order to construct the trellis in the first place. This can be very

time-consuming.

Figure 4: (Left) Sparse (blue) GPS trajectory with four readings (thus three edges) on background of a simple

graph containing 12 edges. The vicinity of each GPS trajectory edge is color-coded. (Middle) Trellis used in the

original HMM map-matching algorithm of (Hummel, 2006). Each of the three layers consists of a replica of the 12

graph edges. Edges between layers correspond to relationships between the three edge vicinities. Nodes are color-

coded according to vicinities, and correspond to graph edges in the vicinity associated with that layer. Each trellis

node is weighted using an emission probability and each trellis edge is weighted using a transition probability. No

shortest path exists between e1 and e12, so the algorithm will not generate a correct result. (Right) Trellis used in

HMM map-matching algorithm of (Newson & Krumm, 2009). The red edges represent non-trivial shortest paths

in the input graph. Bold (blue and red) path is the shortest path between e1 and e12 through the trellis,

corresponding to the bold red path in the input graph, which is the resulting map-match of the GPS trajectory.


CHAPTER 3

The Map-Matching Algorithm

We now describe our map-matching algorithm, also based on a trellis graph, which deals correctly

and naturally with sparse GPS trajectories. In contrast to the HMM algorithm of (Newson &

Krumm, 2009), it does not require constructing all the explicit shortest paths between map edges.

The key idea behind our algorithm is to allow the map and the GPS trajectory to play completely

symmetric roles. The algorithm advances along the trajectory T and map edges in parallel, allowing

each to advance at the correct speed, slowing down if necessary by staying put at a specific

trajectory edge or map edge. This is ultimately formulated as a shortest path problem on the same type

of trellis graph used by all HMM algorithms, whose nodes are pairs of edges – one from the GPS

trajectory and one from the map. An edge exists between two trellis nodes, (i, j) and (k, l) (i and k

are indices of GPS trajectory edges and j and l are indices of map edges) iff either k = i and l is a

neighboring edge of j on the map (an edge within a layer), or k is the successor of i in the trajectory and

l is j itself or a neighboring edge of j (an edge between layers); see Algorithm 3.1. The main difference between

our trellis and the standard HMM trellis is that ours contains edges within layers. The weight of

each trellis edge is a combination of the directionality of the edges and the Euclidean distance between

them. Note that the trellis graph is very sparse. A solution to the map-matching problem is the

minimum of the shortest paths between (t1, ei), where edge ei is an edge within a radius r (we found

that r = 20m gives good results) of the edge t1 and (tn, ej), where edge ej is an edge within radius r

of the edge tn. If there are no edges within this radius r, then r will be doubled, and so on, until there

is some minimal number (typically 5) of edges to consider (both for the starting edges and for the

ending edges).
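The selection of starting (or ending) edges by repeatedly doubling the radius can be sketched as follows. This is only one reading of the description above: it assumes the radius keeps doubling until at least the minimal number of edges is found (or all edges are included), and that a `distance` function between a trajectory edge and a map edge is supplied by the caller. Both the function name and its signature are hypothetical.

```python
def select_candidate_edges(traj_edge, map_edges, distance, r=20.0, min_count=5):
    """Return (indices of map edges within the final radius, final radius).

    `distance(traj_edge, map_edge)` is an assumed caller-supplied function
    returning a finite, non-negative distance in the same units as r.
    """
    if r <= 0:
        raise ValueError("initial radius must be positive")
    # Never ask for more edges than the map contains, so the loop terminates.
    target = min(min_count, len(map_edges))
    while True:
        found = [k for k, e in enumerate(map_edges)
                 if distance(traj_edge, e) <= r]
        if len(found) >= target:
            return found, r
        r *= 2.0  # too few candidates: double the radius and retry
```

The same routine would be called once for t1 and once for tn, and the map-match is then the cheapest of the shortest paths between the two candidate sets.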

3.1 Algorithm Details

Our algorithm proceeds in four stages: in Stage 1 (Trajectory Dilution, described in Section 3.1.1)

we optionally preprocess the GPS trajectory to remove redundancy. This is especially effective for

our algorithm, since a sparse GPS trajectory does not significantly harm the quality of our results,

but significantly reduces its run time. In Stage 2 (Extraction of Relevant Data, described in Section

3.1.2), we extract the relevant portion of the data from the map. In Stage 3 (Construction of the

Trellis Graph, described in Section 3.1.3), we create a trellis graph which represents the relationship

between the given GPS trajectory and the relevant part of the map. Finally, in Stage 4 (Computing

the Map Match, described in Section 3.1.4) we compute the map-matched route as a shortest path

through the trellis graph.


3.1.1 Trajectory Dilution

Since the complexity of our algorithm is dependent on the number of points recorded in the GPS

trajectory, a first step is to dilute (sometimes also called to “simplify”) the trajectory by removing

redundant points. A redundant point is a point that is “almost” on the line connecting the points

before and after it, since it does not add much new information about the location of the vehicle.

Since our algorithm is not very sensitive to differences in the density of the GPS trajectory vs. that

of the map (as opposed to the previous HMM algorithms), dilution is quite safe.

Given a trajectory of points X = (x1, x2, …, xn), removal of redundant points can be done using the

Douglas-Peucker (DP) (Douglas & Peucker, 1973) polyline-simplification algorithm which has

O(n^2) worst-case time complexity. The DP algorithm is controlled by a single parameter ε, the distance a point

is allowed to deviate from a straight line. The algorithm discards most of the points and marks just

those to be kept. The algorithm proceeds recursively as follows: Initially it starts with the pair of

indices (1, n), representing the sequence of all the points x1, x2, …, xn of the trajectory. It

automatically marks the indices 1 and n to be kept. It then finds the index i of the point xi that is furthest

from the line segment between x1 and xn. If the point is closer than ε to that line segment, then all

points with indices 2, …, n-1 may be discarded without the diluted trajectory being further than ε

from the line segment, and the recursion terminates. If the point is further than ε, then index i is

marked to be kept. The algorithm then calls itself twice recursively, first with the pair (1,i) and then

with the pair (i,n). When the procedure is complete, an output trajectory is generated consisting of

all (and only) those points whose indices have been marked as kept.
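The dilution procedure described above can be sketched in a few lines. This is a minimal illustration rather than the thesis implementation: it assumes planar coordinates, uses 0-based indices, and replaces the recursion with an explicit stack to avoid deep call chains on long trajectories. The function names are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(points, eps):
    """Return the points kept by Douglas-Peucker simplification with tolerance eps."""
    n = len(points)
    if n < 3:
        return list(points)
    keep = [False] * n
    keep[0] = keep[n - 1] = True       # the endpoints are always kept
    stack = [(0, n - 1)]               # pending index pairs (lo, hi)
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue                   # no interior points to examine
        # Find the interior point furthest from the segment (lo, hi).
        far_idx, far_dist = lo, -1.0
        for i in range(lo + 1, hi):
            d = point_segment_distance(points[i], points[lo], points[hi])
            if d > far_dist:
                far_idx, far_dist = i, d
        if far_dist > eps:
            keep[far_idx] = True       # too far: keep it and split
            stack.append((lo, far_idx))
            stack.append((far_idx, hi))
        # Otherwise all interior points are within eps and are discarded.
    return [p for p, k in zip(points, keep) if k]
```

For example, with eps = 0.5 the small wiggles at the start of a polyline are dropped while a sharp turn is preserved.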

Simplifying a trajectory could typically reduce the number of points from 1,000 in an extremely

dense trajectory to a mere 30 points while preserving the geometric integrity of the trajectory. See

Fig. 5 for some examples.

Figure 5: GPS trajectory simplification using the Douglas-Peucker (DP) algorithm. The input trajectory is the blue

polyline with green points and the simplified trajectory is the magenta polyline with black points. The leftmost

example corresponds to an extremely dense input. The orange line is the scale of the input GPS trajectory.

The DP simplification algorithm also helps in removing redundant trajectory points which

accumulate while a vehicle stops in a traffic jam or just at a traffic light. These points contain no

additional information and just introduce noise because of GPS inaccuracy. See the third (rightmost)

example in Fig. 5.


3.1.2 Extraction of Relevant Data

Since a digital map is typically an enormous database, we would like to extract from it only the

relevant portion, before any processing is done. We extract from the map only the edges that

correspond to the region where the GPS trajectory’s edges are located, namely those that intersect a bounding

buffer of offset R (typically R=200m as in (Newson & Krumm, 2009)) from some trajectory edge

– the trajectory edge vicinity.

The extracted map edges are the only edges the map-matching algorithm needs to know about.

These edges are identified using a standard (pre-computed) grid-based spatial index which lists the

edges of the map intersecting each spatial grid cell. For each trajectory edge, we compute the cells

of the grid which intersect the bounding box of the edge (inflated by one grid cell in each direction),

retrieve the indices within those cells from the database, and from the map edges indexed therein

we select those within radius R of the trajectory edge. See example in Fig. 6.
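A grid-based lookup of this kind might be sketched as below. Several details are assumptions made for illustration: edges are registered in every cell overlapped by their bounding box (a conservative superset of the cells they actually intersect), the edge-to-edge `distance` function is supplied by the caller, and inflating the query box by a single cell is only guaranteed to capture every edge within R when the cell size is at least R. All names are hypothetical.

```python
import math
from collections import defaultdict

def cell_range(lo, hi, cell):
    """Inclusive range of grid cell indices covering the interval [lo, hi]."""
    return range(math.floor(lo / cell), math.floor(hi / cell) + 1)

def build_grid_index(map_edges, cell):
    """Map each grid cell (ix, iy) to the indices of edges whose bounding box overlaps it."""
    index = defaultdict(set)
    for k, ((ax, ay), (bx, by)) in enumerate(map_edges):
        for ix in cell_range(min(ax, bx), max(ax, bx), cell):
            for iy in cell_range(min(ay, by), max(ay, by), cell):
                index[(ix, iy)].add(k)
    return index

def edges_in_vicinity(traj_edge, map_edges, index, cell, radius, distance):
    """Return sorted indices of map edges within `radius` of `traj_edge`."""
    (ax, ay), (bx, by) = traj_edge
    candidates = set()
    # Bounding box of the trajectory edge, inflated by one cell per direction.
    for ix in cell_range(min(ax, bx) - cell, max(ax, bx) + cell, cell):
        for iy in cell_range(min(ay, by) - cell, max(ay, by) + cell, cell):
            candidates |= index.get((ix, iy), set())
    # Exact filtering of the candidates with the caller-supplied distance.
    return sorted(k for k in candidates
                  if distance(traj_edge, map_edges[k]) <= radius)
```

The index is built once per map, so each query touches only a handful of cells instead of scanning the entire edge database.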

Figure 6: Extraction of map edges relevant to a given GPS trajectory edge (bold blue). Black squares on map

represent the grid cells. Relevant cells are bounded by the dashed black line. Relevant map edges indexed within

the cells are those within the dashed orange region (200 meter radius from trajectory edge).


3.1.3 Construction of the Trellis Graph

Given a map M with m edges (the relevant edges are extracted as described in Section 3.1.2) and a

GPS trajectory of edges T = (t1, t2, …, tn), we build a trellis graph G, with O(nm) nodes. As

mentioned before, each node is a pair of edges, one (t) from T, and one from the edges in the vicinity

of t in M. As we will see, G is very sparse since every node is connected to very few other nodes.

G has the same trellis structure as the graph used by the standard HMM algorithms, namely, can be

viewed as n “layers” of the edges of the map M. Trellis edges within a layer correspond to

neighboring edges (i.e. two edges where the tail vertex of the first edge coincides with the head vertex

of the second edge) within a single vicinity in the map, and edges between layers correspond to

graph edges connecting between the vicinities of trajectory edges. Thus movement within each

layer corresponds to movement within the map at a given trajectory edge, and movement

between layers corresponds to movement along the trajectory. Algorithm 3.1 describes this

construction in detail.

Algorithm 3.1 (Trellis Graph Construction)

Input: GPS trajectory T = (t1, t2, …, tn),

Map edge adjacency table Neighbors

Output: Trellis graph G

for i = 1 to n
    J = the set of relevant map edges in the vicinity of ti
        (obtained following the extraction of relevant data in Section 3.1.2)
    for each edge e ∈ J
        for each edge x ∈ { ti, ti+1 }
            if x == ti
                N = Neighbors(e)
            else
                N = { e } ∪ Neighbors(e)
            for each edge y ∈ N
                add edge ((ti,e), (x,y)) to G
                weight edge with:
                    ( ( d1+d2 ) * ( tLen1 + tLen2 + mLen1 + mLen2 ) ) / ( dir1 * dir2 )
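Algorithm 3.1 can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the thesis implementation: trajectory edges are identified by 0-based indices, the last trajectory edge has no successor layer, target nodes are restricted to map edges in the vicinity of the target trajectory edge, and the weight is delegated to a caller-supplied `weight` function. The names `build_trellis`, `vicinity`, `neighbors` and `weight` are hypothetical.

```python
def build_trellis(n, vicinity, neighbors, weight):
    """Build the trellis as an adjacency dict.

    n            -- number of trajectory edges, indexed 0 .. n-1
    vicinity[i]  -- collection of map edge ids relevant to trajectory edge i
    neighbors[e] -- list of map edge ids that directly follow map edge e
    weight(i, e, k, y) -- assumed cost of trellis edge ((i, e), (k, y))

    Returns {node: [(successor_node, weight), ...]} with nodes (i, e).
    """
    graph = {}
    for i in range(n):
        for e in vicinity[i]:
            out = graph.setdefault((i, e), [])
            # Stay in layer i, or advance to layer i + 1 if it exists.
            layers = [i] if i == n - 1 else [i, i + 1]
            for k in layers:
                if k == i:
                    # Within a layer: must move to a neighboring map edge.
                    targets = list(neighbors.get(e, []))
                else:
                    # Between layers: may stay on e or move to a neighbor.
                    targets = [e] + list(neighbors.get(e, []))
                for y in targets:
                    if y in vicinity[k]:  # assumption: keep only relevant edges
                        out.append(((k, y), weight(i, e, k, y)))
    return graph
```

Because each node links only to the few neighbors of its map edge in at most two layers, the resulting graph is sparse, which matches the observation made in Section 3.1.3.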

The relative direction between the map edges and the trajectory edges provides important

information for the matching process. The relative direction is computed as follows: ⟨ti, e⟩ / (‖ti‖ ‖e‖).

This will give a value in the range [-1,1], where -1 means that the two edges point in

completely opposite directions and 1 means that the two edges point in exactly the same direction. The quantity dir1 is

the direction of edge ti relative to edge e plus 1.1 (so the number is always positive), and dir2 is the

direction of edge x relative to edge y plus 1.1 (so the number is always positive). These two

quantities are multiplied, since we want both to have a direct effect on each other; if one pair is going in

Technion - Computer Science Department - M.Sc. Thesis MSC-2013-06 - 2013

12

the opposite direction then it should override the correspondence of the other pair, and if both are

going in the right direction, then it amplifies these weights. The product features in the denominator

to express the inverse relationship between the two quantities and the resulting edge weight.

d1 is the minimum of the two following distances: the distance between ti’s starting point and e and

the distance between e’s starting point and ti. d2 is defined similarly as the minimum between the

following two distances: the distance between x’s starting point and y and the distance between y’s

starting point and x.

Figure 7: Values used in the computation of the weight of the edge ((ti,e), (x,y)) in the trellis graph.

d1, d2, tLen1, tLen2, mLen1, and mLen2 measure the distances between all the edges, as illustrated in

Fig. 7. The dominant weight is the distance between the map edge and the trajectory edge since if

this distance is large then there is less chance that the true route passed through that edge. Using

these weights allows the algorithm to take into account how much we have traveled along the edges

and how far the map edges and the trajectory edges are from each other.
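The weight computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: edges are 2-D segments given as ((x0, y0), (x1, y1)) with non-zero length, all function names are hypothetical, and the lengths tLen1, tLen2, mLen1 and mLen2 are passed in as plain numbers because their precise definition is given only graphically in Fig. 7.

```python
import math


def relative_direction(u, v):
    """Cosine of the angle between 2-D vectors u and v, a value in [-1, 1]."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


def direction_term(t, e):
    """Relative direction of segments t and e, shifted by 1.1 to stay positive."""
    tv = (t[1][0] - t[0][0], t[1][1] - t[0][1])
    ev = (e[1][0] - e[0][0], e[1][1] - e[0][1])
    return relative_direction(tv, ev) + 1.1


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Parameter of the orthogonal projection, clamped to the segment.
    s = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    s = max(0.0, min(1.0, s))
    return math.hypot(p[0] - (a[0] + s * dx), p[1] - (a[1] + s * dy))


def edge_distance(t, e):
    """Minimum of: t's starting point to e, and e's starting point to t."""
    return min(point_segment_distance(t[0], e[0], e[1]),
               point_segment_distance(e[0], t[0], t[1]))


def trellis_edge_weight(d1, d2, t_len1, t_len2, m_len1, m_len2, dir1, dir2):
    """Weight formula of Algorithm 3.1."""
    return ((d1 + d2) * (t_len1 + t_len2 + m_len1 + m_len2)) / (dir1 * dir2)
```

With these definitions, two pairs of opposite-pointing edges give a denominator of 0.1 * 0.1 = 0.01, whereas two aligned pairs give 2.1 * 2.1 = 4.41, so a wrong-way match is penalized by a factor of several hundred.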

We have now constructed the trellis graph G with all the relevant information about the map and trajectory. Next we select a small set of candidate source edges and a small set of candidate target edges on the map. These are all the map edges that fall within a small radius r of the first and last points of the trajectory, respectively. See Fig. 9.

Fig. 8 shows the trellis graph constructed by Algorithm 3.1 on the input map graph and GPS trajectory of Fig. 4.

Figure 8: Our map-matching algorithm. (Left) The same input (map and GPS trajectory) as in Fig. 4. (Right)

Trellis graph constructed by our algorithm from the map and trajectory. In a real implementation, the uncolored

nodes (which participate in no edges) do not actually appear in the trellis. Note edges between trellis nodes within

each layer. Bold blue path is the shortest path between e1 and e12 through the trellis, corresponding to bold red

path in the input graph, which is the resulting map-match of the GPS trajectory.

3.1.4 Computing the Map-Match

The last step of the algorithm is to find the weighted shortest path in G from a pair (t1, e), where e ∈ { all choices of starting edges } (illustrated by dashed lines in Fig. 9), to a pair (tn, e'), where e' ∈ { all choices of ending edges }. The resulting path P consists of pairs (t, e''), where t ∈ T and e'' is an edge of the map. The map-matched route of the GPS trajectory is the ordered sequence of map edges of P after deleting consecutive duplicates. For example, in Fig. 8, P (the bold blue path) is ((A, e1), (B, e3), (B, e10), (C, e11), (C, e12)), and the corresponding map-matched route (the bold red path) is (e1, e3, e10, e11, e12).

The algorithm fails if no shortest path can be found. This usually means that either the map is not

connected in the region we are working on, or that we did not extract enough map edges to support

such a path during the extraction of relevant data (Section 3.1.2). In this case, we may run the

algorithm again on larger trajectory edge vicinities.
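The final step can be sketched as a multi-source shortest path search followed by duplicate removal. This is a minimal sketch assuming the trellis is stored as an adjacency dictionary {node: [(neighbor, weight), ...]} with nodes of the form (trajectory edge, map edge); the function names are hypothetical, and a binary-heap Dijkstra is used here as one reasonable choice of shortest path algorithm. Returning `None` corresponds to the failure case described above.

```python
import heapq


def shortest_trellis_path(trellis, sources, targets):
    """Dijkstra from any source node to the nearest target node.

    Returns the list of trellis nodes on the path, or None if no path exists.
    """
    targets = set(targets)
    dist = {s: 0.0 for s in sources}
    pred = {s: None for s in sources}
    # The running counter breaks ties so that nodes are never compared.
    heap = [(0.0, k, s) for k, s in enumerate(sources)]
    heapq.heapify(heap)
    counter = len(heap)
    done = set()
    while heap:
        d, _, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = pred[u]
            return path[::-1]
        for v, w in trellis.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, counter, v))
                counter += 1
    return None


def map_matched_route(trellis_path):
    """Ordered map edges of the path with consecutive duplicates removed."""
    route = []
    for _t, e in trellis_path:
        if not route or route[-1] != e:
            route.append(e)
    return route
```

Applied to the path P of the Fig. 8 example, `map_matched_route` returns the map edges e1, e3, e10, e11, e12 in that order.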

Figure 9: Possible choices of starting edges (dashed lines) for a single GPS trajectory start point.

3.2 Experimental Results

We implemented our map-matching algorithm in an interactive browser-based system, using the Google Maps JavaScript API (GoogleMaps) and the OpenStreetMap digital database (OpenStreetMap). The system was scripted in JavaScript on the client side and used JSP/Servlets on the server side. The algorithms were implemented in MATLAB and compiled to run independently on the server, invoked by JSP/Servlet calls. The machine we used had an Intel i7 CPU and 8 GB of RAM.

To test our algorithms, we used the GPS trajectories in the 2012 GIS Cup dataset and the GPS trajectory dataset used in (Newson & Krumm, 2009), recorded in the Seattle area. These trajectories consist of GPS recordings at a frequency of 1 Hz through urban and rural areas (highways, small streets and intersections), which translates to a recording every 5-20 meters, depending on the vehicle velocity. These are considered dense recordings. The noise level was 10 m. A typical GPS trajectory contained 500 points.

We also used a number of GPS trajectories recorded by a smartphone application while driving in Haifa. These trajectories were recorded such that at least 10 seconds and at least 10 meters elapsed between two successive recordings. These are quite sparse recordings. Here too the noise level was 10 m.

To provide a controlled environment for some of the experiments, our system also allowed manual

user input of “synthetic” GPS trajectories (by interactively clicking a sequence of points on a map).

Our experiments were designed to show that the map-matching algorithm is accurate and insensi-

tive to sparse trajectories, including those after dilution using the DP algorithm.

We start off by showing how our map-matching algorithm deals with the input of Fig. 1. Fig. 10

shows that the correct loop of map edges is matched to the entire given trajectory despite the am-

biguity in the individual trajectory points.

Figure 10: Map-matching. The GPS trajectory is shown as the green points connected by the blue polyline. The orange points connected by the red polyline form the corresponding map-matched route computed by our algorithm.

Fig. 3 (right) shows that our algorithm is not tempted to take shortcuts through a map when the

GPS trajectory indicates that this is not the case.

A slightly more complicated input is shown in Fig. 11, where the fact that a car exits the I-5 highway onto a service road and then leaves the highway completely is not discovered until after the turn, due to the inaccuracy of the GPS readings on the service road.

Fig. 12 shows that despite a significant dilution of the dense input GPS trajectory (taken from the Seattle set) by a factor of 20, using the DP algorithm with its distance parameter set to 20 m, the map-matching algorithm was able to perfectly recover the correct route on the map.

Fig. 13 shows that even if the GPS trajectory is very sparse (average edge length of 150 m) and quite noisy (noise level of 50 m) in the first place, to the point that it is not clear where the GPS points came from at different portions of the trajectory, the map-matching algorithm produces a complete and very plausible route. This “synthetic” trajectory was generated manually.

The core map-matching algorithm requires less than 30 seconds to compute the matching for the

data sets we tested. We do not quote concrete measured running times since our unoptimized

MATLAB implementation (at the backend of our Web-based front end) incurred overheads due to

I/O and data structure access in MATLAB.

Figure 11: (Top Left) Original GPS trajectory (recorded by (Newson & Krumm, 2009)) was diluted by a factor of

55 to the (Bottom) sparse trajectory. (Top Right) Zoom into the red rectangular region. Resulting map-match is

superimposed on top right and bottom images. The continuation of the GPS trajectory in the north-east direction

guides the map-match on the noisy southern part of the trajectory to the service road off the I-5 highway.

Figure 12: (Top) Dense input GPS trajectory (recorded by (Newson & Krumm, 2009)). (Bottom) Diluted GPS

trajectory (factor of 20) and map-matched result on the sparse trajectory.

Figure 13: “Synthetic” sparse (average edge length of 150 m) and noisy (noise level of 50 m) GPS trajectory in blue with map-matched route in red. Looking at the points only locally makes the matching ambiguous.

3.3 Discussion

It is worth mentioning the interesting connection between our map-matching algorithm and the solution to the Shortest Path Tour (SPT) problem: finding a shortest path through a graph such that the path is constrained to pass through a given sequence of subsets of vertices (Festa, 2009). Although neither of the two problems can be reduced to the other, both can be solved using a similar trellis construction. The SPT problem may be formalized as follows: given a sequence of disjoint vertex subsets T1, …, Tn of G, find a shortest path that is constrained to pass through some vertex in each of the sets T1, …, Tn, in the given order. The weight assigned to an edge is the Euclidean length of that edge. Solving the SPT problem involves building a trellis, which is a replication of the graph n times. Each replica is a layer of the trellis, containing all graph vertices and edges, except for the edges connecting vertices in Ti with vertices in Ti+1, which connect these two layers of the trellis. The trellis edges are weighted by the Euclidean lengths of the corresponding edges in the map. The final step is to find the shortest path from the first layer to the last layer. Fig. 14 (left) shows an example input graph, and Fig. 14 (right) shows the corresponding SPT trellis and the resulting SPT solution.

The SPT algorithm works with “hard” constraints, namely that the computed path is forced to pass

through very specific disjoint sets of map vertices (or edges), and may pass through other vertices

of the graph not in the constraint sets in order to achieve its objective. In contrast, in our map-

matching algorithm, the analogous constraint sets (which we call vicinities) are derived from the

GPS trajectory T = (t1, t2, …, tn), thus, by definition, there will be some overlap between consecutive

constraint sets (as is evident in Fig. 8). Consequently, no graph edges outside the vicinities will be

needed to form the solution, thus there is no need to include them in the trellis, which is therefore

quite sparse.

Figure 14: The SPT problem and solution. (Left) A graph with three vertex subsets (A, B, C) that the SPT is

constrained to pass through in that order. (Right) Corresponding trellis and (bold blue) shortest path through

trellis. This shortest path corresponds to the (bold red) SPT in the input graph.

CHAPTER 4

Compact Representation of Routes

Once a route is generated based on a GPS trajectory, it may be represented as a path in a planar

graph, namely, a sequence of vertices in the graph, implying edges between every two consecutive

vertices, which translates to a sequence of vertex IDs. In a typical map/graph, each vertex ID can

consume up to 32 bits. Thus storing (or transmitting) long paths could be quite costly. In applica-

tions which involve building large databases of user paths, these costs could be prohibitive.

Our contribution is two novel ways to compactly represent a path in a planar graph, and efficient

algorithms to compute these compact representations. Both consist of representing the path as a

subsequence of vertices from which the path can be uniquely reconstructed as a sequence of well-

defined paths between each two consecutive vertices. For example, given a path, we seek to de-

compose it into the smallest possible sequence of shortest paths. Then, given the subsequence of

vertices and the graph, the route may be recovered by generating a shortest path between each two

consecutive vertices in the “code”.

4.1 Greedy Path Coding

Our first method of representing a path in a graph is as a sequence of consecutive greedy paths.

The following definition of a greedy path is used in the routing literature (Bose & Morin, 2004):

Definition: A path P = (i1, i2, …, im) is a greedy path from vertex i1 to vertex im in the planar graph

G = {V, X, E} (X is the geometry of the vertices) iff the sequence of Euclidean distances (in the

plane) || X(i1) - X(im) ||, || X(i2) - X(im) ||, …, || X(im) - X(im) || is monotonically decreasing.

Intuitively, a greedy path between vertex v and vertex u is one where each vertex w along the path

is closer to u than pred(w) (the predecessor of w).

We call this definition a greedy path in the weak sense, and add another condition to define a greedy

path in the strong sense:

Definition: A path P = (i1, i2, …, im) is a greedy path in the strong sense from vertex i1 to vertex im in the planar graph G = {V, X, E} (X is the geometry of the vertices) iff the sequence of Euclidean distances (in the plane) || X(i1) - X(im) ||, || X(i2) - X(im) ||, …, || X(im) - X(im) || is monotonically decreasing and, for all 1 ≤ k < m, $i_{k+1} = \arg\min_{j \in \mathrm{neighbors}(i_k)} \| X(j) - X(i_m) \|$.

The extra condition implies that not only is each vertex w along the path closer to u than pred(w),

but is the closest to u among all neighbors of pred(w). A greedy path in the strong sense can be

viewed as the discrete equivalent of a gradient descent path from v to u when considering the Eu-

clidean distance function from u. The motivation for this extra condition is that under mild condi-

tions on the graph, the greedy path in the strong sense will be unique, as opposed to the greedy path

in the weak sense, which is typically not unique. As we will see later, uniqueness is important for

the path coding application.

Note that a greedy path (in the weak sense, and certainly in the strong sense) between two given

vertices in a planar graph is not always guaranteed to exist, even if the graph is connected. This can

happen, for example, if a greedy walk from v to u gets stuck at a vertex w from which no neighbors

are closer to u than w. This is the equivalent of getting stuck at a local minimum when performing

gradient descent in the continuous case. For some specific planar graphs, the situation is better, for

example, it is known (Bose & Morin, 2004) that a greedy path in the weak sense exists between

any two vertices of a Delaunay triangulation. Such greedy paths are used extensively for routing in

embedded networks, where messages are greedily forwarded towards their destination. Fig. 15

shows some examples of greedy paths in the weak and strong senses in a planar graph. Sometimes

both exist and sometimes only a greedy path in the weak sense. From this point onwards, we will

use just the term greedy path to mean greedy in the strong sense.

Figure 15: Greedy paths. (Left) The green path is a greedy path in the weak sense between A and B1, and the

orange path is the greedy path in the strong sense. (Right) A (green) greedy path in the weak sense exists between

A and B2, but no greedy path in the strong sense exists. This is evident from the fact that a greedy walk proceeds

along the orange path and reaches a dead end (i.e. a local minimum of the Euclidean distance function from B2).

It is easy to decide whether a given path is a greedy path by simply checking the definition. It is

not too difficult either to compute a greedy path (if it exists) between vertex i1 and vertex im using

the following greedy algorithm. Start from vertex i1, when at ik, choose as ik+1 the neighbor of ik

which is the closest to the final destination im and also closer than ik to im (if the latter condition is

not satisfied, then the algorithm is stuck at a local minimum and fails). Then continue in the same

manner from ik+1.
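The greedy walk just described can be sketched in a few lines of Python. This is a minimal sketch with hypothetical names: `neighbors` maps each vertex to a list of adjacent vertices and `X` maps each vertex to its 2-D coordinates. Ties between equally close neighbors are broken by list order here, which the text rules out by assuming uniqueness.

```python
import math


def greedy_path(neighbors, X, source, target):
    """Strong-sense greedy walk from source to target.

    Returns the list of visited vertices, or None if the walk gets stuck
    at a local minimum of the distance to the target.
    """
    def dist(v):
        return math.dist(X[v], X[target])

    path = [source]
    current = source
    while current != target:
        if not neighbors[current]:
            return None
        best = min(neighbors[current], key=dist)  # closest neighbor to target
        if dist(best) >= dist(current):
            return None  # no neighbor is closer than the current vertex
        path.append(best)
        current = best
    return path
```

The loop always terminates because the distance to the target strictly decreases at every step and the graph is finite.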

Given a path P = (i1, i2, …, im), a greedy path code of P is a subsequence Q = (j1, j2, …, jk) of P

such that i1 = j1, im = jk, and P is identical to the concatenation of the greedy paths between jt and

jt+1 for 1 ≤ t < k, namely if jt = ir and jt+1 = is then the subpath (ir , ..., is) of P is a greedy path. An

optimal greedy path code of P is a shortest possible Q (as measured by k). The objective is to

produce a code such that greedy paths indeed exist between the code vertices. These greedy paths

will be unique because of the extra (strongness) condition.

We now describe two algorithms to compute a greedy path code of a path in a graph. The first is

the simplest possible, running in linear time, but not necessarily generating an optimal greedy path

code. The second algorithm is less efficient, but optimal. Note that in the worst case, the greedy

path code of a path is the path itself.

Both algorithms take advantage of the fact that greedy paths have the suffix property, namely, any

suffix of a greedy path is also a greedy path, which is a trivial consequence of the definition of a

greedy path. It also means that given a graph G and a target vertex t, the uniqueness of the greedy

paths implies that all greedy paths from all other vertices of G to t (if they exist) form a greedy tree

rooted at t (after reversing the direction of the edges). This tree does not span the entire vertex set

of G, rather only those vertices from which a greedy path to t exists. This subset of the vertices of

G is called the “basin of attraction” of t.

Given a greedy path code of a path (i1, i2, …, im), it may be decoded in time complexity O(m) by

simply computing the greedy paths in the graph between each two consecutive vertices in the code.

The uniqueness of the greedy path guarantees that the decoding operation is correct, i.e. indeed

recovers the original path. The linear complexity assumes that all vertices have a bounded valence,

thus computing the correct neighbor of a vertex in a greedy path requires O(1) time.
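Decoding can be sketched as a concatenation of greedy walks between consecutive code vertices. This is an illustrative sketch with hypothetical names (`greedy_walk`, `decode_greedy_code`); the six-vertex example graph is invented for this illustration and consists of a single path with a bend, and it does not reproduce any figure from the text.

```python
import math


def greedy_walk(neighbors, X, source, target):
    """Strong-sense greedy walk; returns the vertex list or None if stuck."""
    def dist(v):
        return math.dist(X[v], X[target])

    path, current = [source], source
    while current != target:
        if not neighbors[current]:
            return None
        best = min(neighbors[current], key=dist)
        if dist(best) >= dist(current):
            return None
        path.append(best)
        current = best
    return path


def decode_greedy_code(code, neighbors, X):
    """Rebuild the full path from a greedy path code."""
    path = [code[0]]
    for a, b in zip(code, code[1:]):
        segment = greedy_walk(neighbors, X, a, b)
        if segment is None:
            raise ValueError(f"no greedy path between {a} and {b}")
        path.extend(segment[1:])  # skip the vertex shared with the previous segment
    return path


# Invented example: the graph is a single path 0-1-2-3-4-5.
X = {0: (0, 0), 1: (1, 2), 2: (1, 0), 3: (3, 0.5), 4: (3.5, 2), 5: (4, 3)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
full_path = decode_greedy_code([0, 3, 5], neighbors, X)
```

In this example the three-vertex code (0, 3, 5) decodes to the whole path, whereas the two-vertex sequence (0, 5) is not a valid code because the walk from 0 towards 5 gets stuck at vertex 1.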

4.1.1 Simple Greedy Path Coding Algorithm

The simple greedy path coding algorithm proceeds by starting from im, checking backwards if the

path is greedy. A codeword (an index of a vertex in the graph) is generated when the path ceases

to be a greedy path, and the procedure repeats from there.

Algorithm 4.1.1 (Simple Greedy Path Coding)

Input: Path P = (i_1, i_2, …, i_m) in the planar graph G = {V, X, E}
Output: Greedy path code of the path P

1 C = [i_m]
2 t = m; s = m-1
3 while t > 1
4     while s > 1 and i_s == argmin_{j ∈ neighbors(i_(s-1))} ||X(j) - X(i_t)||
              and ||X(i_s) - X(i_t)|| < ||X(i_(s-1)) - X(i_t)||   // check (strong) greedy path condition
5         s = s-1
6     insert i_s at the beginning of C
7     t = s
8 return C

The suffix property of greedy paths allows the greediness check in Line 4 to examine only the current s at each step, rather than the entire subpath between s and t.

This algorithm has O(m) time complexity, where m is the number of vertices in the input path. The linear complexity assumes that all vertices have a bounded valence, so that checking the greediness of an edge in the path requires O(1) time. Unfortunately, this algorithm is not guaranteed to find the shortest possible greedy path code. See Fig. 16 for an example, where a path of 6 vertices (which is also the entire graph G) is coded into 5 points using the simple greedy path coding algorithm, whereas the optimal algorithm described next produces a greedy path code of 3 points.
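The simple coder can be sketched in Python as follows. This is a sketch under stated assumptions, with hypothetical names: indices are 0-based, vertex positions are assumed distinct (so that the last edge of every segment is trivially greedy and each pass starts at s = t - 1), and a step is accepted only if the next vertex is strictly the closest neighbor, so that ties never count as greedy. The six-vertex example is invented for illustration and does not reproduce Fig. 16.

```python
import math


def is_greedy_step(neighbors, X, u, v, target):
    """True if v is the unique strong-greedy successor of u towards target."""
    def d(w):
        return math.dist(X[w], X[target])

    return d(v) < d(u) and all(d(v) < d(w) for w in neighbors[u] if w != v)


def simple_greedy_code(path, neighbors, X):
    """Backward scan that emits a code vertex whenever greediness breaks."""
    code = [path[-1]]
    t = len(path) - 1            # index of the current segment target
    while t > 0:
        s = t - 1                # the edge (path[t-1], path[t]) is always accepted
        while s > 0 and is_greedy_step(neighbors, X, path[s - 1], path[s], path[t]):
            s -= 1
        code.insert(0, path[s])
        t = s
    return code


# Invented example: the graph is a single path 0-1-2-3-4-5.
X = {0: (0, 0), 1: (1, 2), 2: (1, 0), 3: (3, 0.5), 4: (3.5, 2), 5: (4, 3)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
code = simple_greedy_code([0, 1, 2, 3, 4, 5], neighbors, X)
```

On this input the backward scan yields a four-vertex code, although a three-vertex code (0, 3, 5) exists, which illustrates why the simple algorithm is not optimal.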

4.1.2 Optimal Greedy Path Coding Algorithm

The next greedy path coding algorithm that we describe is more sophisticated. It computes the

optimal greedy path code, namely that with a minimal number of points. It is similar to the Imai-

Iri algorithm (Imai & Iri, 1986) for simplifying a polyline, proceeding by building a graph on the

input points where an edge (v,u) represents the existence of a greedy path between v and u. Com-

puting a shortest path in this graph between the first and last vertices generates a greedy path code

with the minimal number of vertices.

Algorithm 4.1.2 (Optimal Greedy Path Coding Algorithm)

Input: Path P = (i_1, i_2, …, i_m) in the planar graph G = {V, X, E}
Output: Optimal greedy path code of P

1 Create a graph, R, with m nodes and no edges
2 for t = 2 to m
3     s = t
4     while s > 1 and i_s == argmin_{j ∈ neighbors(i_(s-1))} ||X(j) - X(i_t)||
              and ||X(i_s) - X(i_t)|| < ||X(i_(s-1)) - X(i_t)||
5         add the edge (s, t) to R
6         s = s-1
7     add the edge (s, t) to R
8 Find the shortest path, S, from node 1 to node m in R
9 return S

The time complexity of this algorithm is $O(m^2)$, since the outer “for” loop (on t) iterates m-1 times, and the inner “while” loop can add up to t edges, resulting in a graph R containing m vertices and $O(m^2)$ edges. Thus the shortest path computation in Line 8 also requires $O(m^2)$ time when using Dijkstra’s algorithm with Fibonacci heaps (Fredman & Tarjan, 1984).
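A compact Python sketch of the optimal coder is given below. It deviates from the pseudocode in two labeled ways: the inner scan starts at s = t - 1, which avoids adding the self-loop (t, t) to R, and since every edge of R has unit weight and points forward, the shortest path is found with a simple forward dynamic program instead of Dijkstra's algorithm. All names are hypothetical, indices are 0-based, and the example graph is invented for illustration.

```python
import math


def is_greedy_step(neighbors, X, u, v, target):
    """True if v is the unique strong-greedy successor of u towards target."""
    def d(w):
        return math.dist(X[w], X[target])

    return d(v) < d(u) and all(d(v) < d(w) for w in neighbors[u] if w != v)


def optimal_greedy_code(path, neighbors, X):
    """Greedy path code with the fewest vertices."""
    m = len(path)
    hops = [0] + [math.inf] * (m - 1)   # fewest R-edges from node 0 to each node
    parent = [None] * m
    for t in range(1, m):
        s = t - 1                        # a single edge is trivially greedy
        while s > 0 and is_greedy_step(neighbors, X, path[s - 1], path[s], path[t]):
            s -= 1
        # R contains exactly the edges (s, t), (s+1, t), ..., (t-1, t).
        for r in range(s, t):
            if hops[r] + 1 < hops[t]:
                hops[t] = hops[r] + 1
                parent[t] = r
    code, t = [], m - 1
    while t is not None:                 # walk the parent pointers back to node 0
        code.append(path[t])
        t = parent[t]
    return code[::-1]


# Invented example: the graph is a single path 0-1-2-3-4-5.
X = {0: (0, 0), 1: (1, 2), 2: (1, 0), 3: (3, 0.5), 4: (3.5, 2), 5: (4, 3)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
code = optimal_greedy_code([0, 1, 2, 3, 4, 5], neighbors, X)
```

Ties between equally short codes are broken here by preferring the smallest predecessor index; Section 4.1.3 describes the weight perturbation used to make the choice canonical.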

Figure 16: Greedy path coding in a graph G consisting of a single path. (Top) Simple greedy path code (5 points).

Path in orange points and code in purple points. (Bottom Left) Graph R of Algorithm 4.1.2. Shortest path between

vertex 1 and vertex 6 marked in purple. (Bottom Right) Optimal greedy path code (3 points).

4.1.3 Uniqueness

The optimal greedy path coder relies on finding a shortest path in the graph R (Line 8 of Algorithm 4.1.2). In order to guarantee a unique coding (e.g. in order to determine whether two paths are identical based only on their codes), this shortest path in R must be unique, i.e. independent of the shortest path algorithm (e.g. Dijkstra, Bellman-Ford) used by the encoder. Since a priori there is no reason for the shortest path to be unique, we achieve this by slightly modifying the content of the graph R, as described by (Mehlhorn, 2009).

While building the graph R, all edges are created with the same unit weight. This may cause mul-

tiple shortest paths to be present in R. To avoid this, we add to each edge weight a small value

which will allow distinguishing between the different edges, yet without compromising the true

shortest path.

The simplest way to achieve this is to perturb the edge weights. Define $\varepsilon = m^{-2}$, where m is the number of points in the path P, and for every edge (ir, is), add a weight of $\varepsilon_{rs} = \varepsilon (s-r)^2$ to its original unit weight. Thus, the weight of edge (ir, is) will be $w_{rs} = 1 + m^{-2}(s-r)^2$.
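The following short derivation, added here as a sanity check rather than taken from the source, shows why this perturbation cannot change the number of edges in the shortest path. It assumes a path in R from node 1 to node m with k edges whose index differences are $x_1, \dots, x_k$ (notation introduced for this illustration), so that $\sum_i x_i = m - 1$.

```latex
\[
\sum_{i=1}^{k} \varepsilon \, x_i^2
  \;=\; m^{-2} \sum_{i=1}^{k} x_i^2
  \;\le\; m^{-2} \Bigl( \sum_{i=1}^{k} x_i \Bigr)^{2}
  \;=\; \frac{(m-1)^2}{m^2}
  \;<\; 1 .
\]
```

The total perturbed weight of a k-edge path therefore lies strictly between k and k+1, so a path with fewer edges always weighs less than a path with more edges, and the perturbation only discriminates among paths with the same number of edges.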

Using these perturbed weights has the effect of separating multiple shortest paths with the same number of edges (i.e. greedy path codes of the same length). Among all such codes, it prefers those whose greedy path segments (i.e. the greedy subpaths into which the code partitions the input path) have approximately the same number of edges. This is because all candidate codes have the same number k of greedy path segments, which together cover the same total number of edges (as in the input path). Denoting by $x_i$ the number of edges in the i-th greedy path segment, minimizing the sum of the squares $\sum_{i=1}^{k} x_i^2$ prefers a uniform distribution of the $x_i$’s, as the following lemma formalizes:

Lemma: The solution to $\min \sum_{i=1}^{k} x_i^2$ subject to $\sum_{i=1}^{k} x_i = m$ (m is a positive constant) is $x_i = m/k$ (i = 1, …, k).

Proof: Straightforward using Lagrange multipliers.
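The Lagrange multiplier argument can be spelled out as follows. This is the standard derivation written out for convenience; the Lagrangian $L$ and the multiplier $\lambda$ are notation introduced here, not taken from the source.

```latex
\begin{align*}
L(x_1, \dots, x_k, \lambda)
  &= \sum_{i=1}^{k} x_i^2 - \lambda \Bigl( \sum_{i=1}^{k} x_i - m \Bigr), \\
\frac{\partial L}{\partial x_i} = 2 x_i - \lambda
  &= 0 \;\Longrightarrow\; x_i = \frac{\lambda}{2} \quad (i = 1, \dots, k), \\
\sum_{i=1}^{k} x_i = \frac{k \lambda}{2}
  &= m \;\Longrightarrow\; \lambda = \frac{2m}{k}, \qquad x_i = \frac{m}{k} .
\end{align*}
```

Since the objective is strictly convex and the constraint is linear, this stationary point is the unique minimum, with minimal value $m^2/k$.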

A different way to perturb the edge weights is to add a large enough variety of small enough pseudo-

random values to the edge weights. (Mehlhorn, 2009) proves that in this case, the probability of

uniqueness of the shortest path is very high.

4.2 Shortest Path Coding

Greedy path coding seeks to find the subsequence of points of P that segments P into a number of

subpaths, which are greedy paths between consecutive points of the subsequence. Greedy path cod-

ing is relatively simple and decoding is extremely fast. It relies on the extrinsic geometry (i.e. co-

ordinates of the embedding) of the graph. However, more compact codes are possible. In this sec-

tion we explore shortest path coding, i.e. representing P as the subsequence of points of P which

segments P into a number of subpaths which are shortest paths between consecutive points of the

subsequence. As we will see, these codes will be more difficult to compute and decoding them will

be slower, but they will be more compact.

Define the length of a path to be the sum of the Euclidean lengths of the edges in the path. A shortest

path between vertex i and vertex j is the path between the two vertices whose length is the shortest

possible. This path can be computed using Dijkstra’s algorithm and its many variants (Dijkstra,

1959) (Bellman, 1958). As such, it relies only on the intrinsic geometry (edge lengths) of the graph.

In contrast with the greedy path coding algorithms, shortest path coding requires considering a

larger portion of the graph than just the given path P and its neighboring edges – an entire bounding

box of the path. Since the algorithm relies on computation of shortest paths between vertices we

need a much broader view of the region.

4.2.1 Optimal Shortest Path Coding Algorithm

Shortest paths have the subpath property, namely, any subpath between vertex u and vertex v within

a shortest path is necessarily also a shortest path between u and v. In particular, this implies the

prefix property and the suffix property, that any prefix or suffix of a shortest path is a shortest path.

The prefix property implies the well-known fact that given a graph G and a source vertex s all

shortest paths from s to all other vertices form a spanning tree of G rooted at s. Using the suffix

property, it is possible to prove that the following simple (i.e. greedy in the algorithmic sense)

shortest path coding algorithm is in fact optimal. The algorithm is essentially identical to the simple

greedy path coding algorithm, except that it proceeds in the forward direction, as opposed to the

reverse direction. It checks incrementally whether subpaths of the input path are shortest paths,

taking advantage of the suffix property to save computations.

Algorithm 4.2.1 (Optimal Shortest Path Coding Algorithm)

Input: Path P = (i_1, i_2, …, i_m) in the planar graph G = {V, X, E}
Output: Optimal shortest path code of P

1 C = [i_1]
2 s = 1
3 while s < m
4     compute the shortest path tree, SP, rooted at i_s whose leaves are all i_t, s < t ≤ m
5     t = s+1
6     while t ≤ m and SP{t}(end-1) == i_(t-1)   // the subpath (i_s, …, i_t) of P is the shortest path
                                                 // between i_s and i_t
7         t = t+1
8     append i_(t-1) to C
9     s = t-1
10 return C

In Line 6, SP{t} denotes the shortest path between i_s and i_t, and SP{t}(end-1) is the vertex preceding i_t in this shortest path. We assume that all path lengths are distinct real numbers. This is needed to guarantee that the shortest path tree computed in Line 4 is unique, so that the decoder is able to recover the original path from the code.
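Algorithm 4.2.1 can be sketched in Python as follows. This is a minimal sketch with hypothetical names, assuming 0-based indices, integer vertex identifiers, an undirected graph given as adjacency lists, Euclidean edge lengths, and a path that visits each vertex at most once. The shortest path tree is computed over the whole graph with a binary-heap Dijkstra, without the bounding-box restriction discussed in the text. The explicit error guards against an input edge that is not itself a shortest path, a case in which the pseudocode would not advance.

```python
import heapq
import math


def shortest_path_tree(neighbors, X, root):
    """Dijkstra from root; returns the predecessor of every reachable vertex."""
    dist = {root: 0.0}
    pred = {root: None}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in neighbors[u]:
            nd = d + math.dist(X[u], X[v])
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    return pred


def shortest_path_code(path, neighbors, X):
    """Forward scan that extends each segment while it stays a shortest path."""
    m = len(path)
    code = [path[0]]
    s = 0
    while s < m - 1:
        pred = shortest_path_tree(neighbors, X, path[s])
        t = s + 1
        while t < m and pred.get(path[t]) == path[t - 1]:
            t += 1
        if t == s + 1:
            raise ValueError("an edge of the input path is not a shortest path")
        code.append(path[t - 1])
        s = t - 1
    return code


# Invented example: a quadrilateral with vertices 0-1-2-3 and edge 3-0.
X = {0: (0, 0), 1: (2, 0), 2: (2, 1), 3: (0, 2)}
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
code = shortest_path_code([0, 1, 2, 3], neighbors, X)
```

In this example the subpath 0-1-2 is the shortest path from 0 to 2, but the shortest path from 0 to 3 is the direct edge, so vertex 2 must appear in the code.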

The optimality of Algorithm 4.2.1 follows from the following theorem:

Theorem: Any shortest path code C’ of a path P in graph G will have length greater or equal to the

length of C - the output of Algorithm 4.2.1.

Proof: (See Fig. 17) Let C = [i1,…, ik] be the output of Algorithm 4.2.1 and C’ = [j1, …, jr] be the

output of any other shortest path coding algorithm. It suffices to prove that each of the k-1 segments

[is, …, is+1) contains at least one element of C’ for all 1 ≤ s ≤ k-1, since then k ≤ r.

Note that the claim holds trivially for the first segment (s=1) since i1 = j1. So assume 1 < s < k. Now

assume by way of contradiction that the segment [is, …, is+1) does not contain any element of C’.

Let jp be the largest element of C’ such that jp < is and jp+1 the next element of C’ (in the “worst

case” p=1). By the assumption, jp+1≥ is+1. Now, by definition, [jp, …, jp+1] is a shortest path, so the

suffix property implies that [is, …, jp+1] is also a shortest path, in contradiction to the fact that [is,

…, is+1] is the longest possible shortest path starting at is.

Figure 17: Illustration of proof of optimal shortest path coding theorem. All points are the given path. Purple

points are the optimal coding C. Cyan points are some of the code C’. The path [is, …, is+1) does not contain any

element of C’. If the path [jp, …, jp+1] is a shortest path, then the suffix property implies that [is, …, jp+1] is also a

shortest path.

Note that this proof does not hold for the simple greedy path coding algorithm (Algorithm 4.1.1), because that algorithm does not guarantee the final contradiction, namely that [is, …, is+1] is the longest possible greedy path starting at is, since the algorithm operates in reverse.

The complexity of Algorithm 4.2.1 is $O(k(n + n \log n + m))$, where n is the number of edges/nodes in the effective graph M (the path bounding box) and k is the number of points in the code. In general, n is $O(m^2)$, since this is the relationship between the number of edges in a one-dimensional path and the number of edges in a two-dimensional region whose boundary length is O(m), giving a complexity of $O(k m^2 \log m)$.

The time complexity of decoding is also $O(k m^2 \log m)$, namely that of computing the shortest path between each consecutive pair in the code (there are k-1 pairs).

4.3 Experimental Results

We ran experiments on routes map-matched (using our algorithm as described in Chapter 3) from the GPS trajectories in the 2012 GIS Cup dataset, the GPS trajectory dataset used in (Newson & Krumm, 2009), and GPS trajectories recorded in Haifa. Thus all experiments were run on paths in real maps. Our objective was to compare the coding efficiency achievable by the three algorithms described in this chapter. As with the map-matching algorithm, the coding algorithms were implemented in MATLAB and called by our interactive Web-based front end.

Figures 18 and 19 compare the different types of codes. Both figures show the simple greedy path

code, the optimal greedy path code, and the optimal shortest path code. In general, the difference

between the two greedy path codes is relatively small, but the shortest path code is typically much

more compact.

Figure 18: Three different codings (purple points) of a given path (orange points). (Top Left) Simple greedy path

code (19 points out of 214 original points). (Top Right) Optimal greedy path code (15 points). (Bottom) Optimal

shortest path code (5 points).

Figure 19: Three different codings (purple points) of a given path (orange points). (Left) Simple greedy path code

(7 points out of 77 original points). (Middle) Optimal greedy path code (6 points). (Right) Optimal shortest path

code (4 points).

We computed statistics on a set of 33 routes map-matched (using our algorithm) from the GPS trajectories in the 2012 GIS Cup dataset and the GPS trajectory dataset used in (Newson & Krumm, 2009) to determine the average coding ratio and running time of the various algorithms. A typical route/path contains approximately 125 vertices. The results are shown in Fig. 20 and Fig. 21. As is evident there, the simple greedy path coding algorithm reduces the number of vertices to 7.3% of the original on average, and the optimal greedy path coding algorithm reduces it slightly more, to 7.1%. The shortest path coding algorithm reduces it to 4.5% on average.
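As a small worked example, the coding ratio can be taken to be the number of code points divided by the number of vertices in the original path. This definition is an assumption, chosen to be consistent with the "points out of original points" counts in the captions of Figures 18 and 19. The helper name `coding_ratio` is hypothetical.

```python
def coding_ratio(code_points, path_points):
    """Fraction of the original path vertices that the code retains."""
    return code_points / path_points

# (original points, {code type: code points}) as listed in the two figure captions.
EXAMPLES = {
    "Figure 18": (214, {"simple greedy": 19, "optimal greedy": 15, "shortest path": 5}),
    "Figure 19": (77, {"simple greedy": 7, "optimal greedy": 6, "shortest path": 4}),
}

for figure, (path_points, codes) in EXAMPLES.items():
    for name, code_points in codes.items():
        print(f"{figure}, {name}: {coding_ratio(code_points, path_points):.1%}")
```

For these two paths the ratios come out at 8.9%, 7.0% and 2.3% (Figure 18) and 9.1%, 7.8% and 5.2% (Figure 19), so individual paths can sit noticeably above or below the dataset averages.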

Figure 20: Coding ratios of the three algorithms. Each data point represents a path in the dataset.

(Figure 20 plot: the vertical axis is the coding ratio; the legend lists simple greedy coding, optimal greedy coding, shortest path coding, and averages, labeled 7.3%, 7.1% and 4.5% respectively.)

Figure 21: Coding running times of the three algorithms. Each data point represents a path in the dataset.

4.4 Discussion

In terms of time complexity, it is most important that the decoder be extremely fast, since decoding is done many times (essentially every time a route is queried from a database) and in real time, as opposed to encoding, which usually happens only once and is typically done in an offline process anyway. Decoding of the greedy path codes takes O(m) time and decoding of the more compact shortest path code takes O(km² log m) time (k is the length of the code).
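To illustrate why greedy path decoding is cheap, here is a hedged Python sketch. It assumes the greedy step rule is "move to the neighbor whose Euclidean distance to the target is smallest, and require that distance to strictly decrease", and that no ties occur. The names `greedy_path` and `decode_greedy_path_code`, the coordinate and adjacency dictionaries, and the toy map are all made up for the example.

```python
import math

def greedy_path(coords, adj, s, t):
    """Walk from s to t, always stepping to the neighbor closest to t."""
    path = [s]
    current = s
    while current != t:
        nxt = min(adj[current], key=lambda v: math.dist(coords[v], coords[t]))
        if math.dist(coords[nxt], coords[t]) >= math.dist(coords[current], coords[t]):
            raise ValueError(f"no greedy path: stuck at {current!r}")
        path.append(nxt)
        current = nxt
    return path

def decode_greedy_path_code(coords, adj, code):
    """Expand a code by joining the greedy paths between consecutive code points."""
    route = [code[0]]
    for a, b in zip(code, code[1:]):
        route.extend(greedy_path(coords, adj, a, b)[1:])  # drop the repeated start
    return route

# Hypothetical toy map: a loop of road with one shortcut between vertices 1 and 5.
COORDS = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 1), 4: (2, 2), 5: (1, 2), 6: (0, 2)}
ADJ = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6, 1], 6: [5]}

print(decode_greedy_path_code(COORDS, ADJ, [0, 3, 6]))  # the full route 0..6
print(greedy_path(COORDS, ADJ, 0, 4))                    # takes the shortcut via 5
```

Each decoded vertex costs only a scan of its neighbors, so with bounded vertex degree the total work is linear in the length of the decoded route. No priority queue or global search is needed, in contrast to shortest path decoding.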

In some applications it is important to code a path online (as it is being generated). This would

seem to be impossible for the two greedy path coding algorithms, since they operate in reverse.

Nonetheless, it is possible to modify these algorithms to run in forward order, paying a penalty in

time complexity. In contrast, the optimal shortest path encoding algorithm may be run online with

a lag of just one path vertex, i.e. it is possible to decide whether a path vertex is part of the shortest

path code only after the next route vertex has been seen. There will also be a runtime penalty to

implement this in practice.
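One plausible reading of the one-vertex lag is sketched below in Python. It is an assumption-laden illustration rather than the thesis algorithm: the current segment is extended for as long as its length equals the shortest distance from the last emitted code point, and the previous vertex is emitted as soon as the newly arrived vertex breaks that equality. The names `shortest_distance` and `encode_online` and the toy graph are hypothetical. The sketch further assumes unique shortest paths and that every route edge is itself the shortest path between its endpoints.

```python
import heapq
import math

def shortest_distance(graph, source, target):
    """Dijkstra distance from source to target (math.inf if unreachable)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def encode_online(graph, vertex_stream):
    """Yield code points while the route is still arriving, one vertex late."""
    stream = iter(vertex_stream)
    try:
        anchor = prev = next(stream)
    except StopIteration:
        return
    yield anchor  # the first route vertex always starts the code
    length = 0.0  # length of the route segment from anchor to prev
    for v in stream:
        length += graph[prev][v]
        if not math.isclose(length, shortest_distance(graph, anchor, v)):
            # Seeing v shows that the segment had to end at prev.
            yield prev
            anchor = prev
            length = graph[prev][v]
            if not math.isclose(length, shortest_distance(graph, anchor, v)):
                raise ValueError("route edge is not a shortest path; cannot be coded")
        prev = v
    if prev != anchor:
        yield prev  # the last route vertex always ends the code

# Hypothetical undirected toy map: vertex -> {neighbor: edge length}.
TOY_GRAPH = {
    "A": {"B": 1, "C": 3},
    "B": {"A": 1, "C": 1, "D": 5},
    "C": {"A": 3, "B": 1, "D": 1, "E": 1},
    "D": {"B": 5, "C": 1, "E": 1},
    "E": {"C": 1, "D": 1},
}

print(list(encode_online(TOY_GRAPH, ["A", "B", "C", "D", "E"])))
```

The runtime penalty is visible here: every arriving vertex triggers a fresh shortest path search from the current anchor, rather than reusing one search per segment.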

(Figure 21 plot: the vertical axis is the running time in seconds, with ticks from 0 to 1.1; the legend lists simple greedy coding, optimal greedy coding, shortest path coding, and averages; the average labels read 0.004, 0.052 and 0.120.)

CHAPTER 5

Conclusion and Future Work

This thesis dealt with two practical problems related to GPS trajectories and digital maps. The first

is the classical problem of map-matching, namely matching a given (possibly noisy or sparse) GPS

trajectory to the sequence of roads traversed by the GPS receiver (typically in a navigation system)

in reality. We provided a novel solution to this problem that also works well when the GPS measurements are very sparse and noisy, a capability lacking in existing approaches. We demonstrated the effectiveness of the algorithm on real-world data sets which were significantly diluted using the Douglas-Peucker (DP) polyline simplification algorithm.

As with most map-matching algorithms, the strength of our algorithm is derived from its global optimization approach. This is possible if the entire GPS trajectory is available and the map-matching is performed as an offline process. Frequently, map-matching must be computed online, as the trajectory is being generated (e.g. while the vehicle is driving). This requires a local optimization approach. As future work, it remains to be investigated whether our algorithm can be generalized to an online version.

The second problem we dealt with is the compact coding of routes on digital maps. A route is a sequence of vertices in the embedded planar graph representing a map, typically describing a route taken by a mobile vehicle. An emerging problem is how to code these data sets efficiently in a world where millions of such routes are generated each day, all of which have to be stored and/or transmitted for future processing in large databases. We provided two methods to code digital routes.

The first method represents the given route as a sequence of so-called greedy paths, where a greedy path between vertex s and vertex t is one where the Euclidean distance to t is minimized as each edge of the path is traversed. We provided two algorithms to generate a greedy path code for a route containing n vertices. The first algorithm is fast, running in O(n) time. The second is slower, running in O(n²) time, but optimal, meaning that it generates the shortest possible greedy path code. Decoding a greedy path code can be done in O(n) time. The second method codes a route as a sequence of (classical) shortest paths. We provided a simple algorithm to generate a shortest path code in O(kn² log n) time, where k is the length of the (output) code, and proved that this code is optimal. Decoding a shortest path code also requires O(kn² log n) time. Experimentally, when applying our algorithms to real-world datasets, we observed that shortest path codes are much more compact than greedy path codes, justifying the larger time complexity.
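A reverse-scan greedy encoder in the spirit of the fast O(n) algorithm might look like the Python sketch below. This is a reconstruction under stated assumptions, not the thesis code: it relies on the property that every suffix of a greedy path is itself greedy, it assumes the greedy step is "move to the neighbor strictly closest to the target" with no ties, and the names `greedy_step` and `encode_simple_greedy` and the toy map are hypothetical.

```python
import math

def greedy_step(coords, adj, u, t):
    """Neighbor of u closest to t, or None if no neighbor is strictly closer than u."""
    best = min(adj[u], key=lambda v: math.dist(coords[v], coords[t]))
    if math.dist(coords[best], coords[t]) < math.dist(coords[u], coords[t]):
        return best
    return None

def encode_simple_greedy(coords, adj, route):
    """Scan the route from its end to its start, opening a new code point
    whenever the greedy step toward the current target leaves the route."""
    target = route[-1]
    code = [target]
    for i in range(len(route) - 2, -1, -1):
        if greedy_step(coords, adj, route[i], target) != route[i + 1]:
            # route[i..target] is not greedy, so route[i + 1] becomes a code
            # point; the single edge route[i] -> route[i + 1] is trivially greedy.
            target = route[i + 1]
            code.append(target)
    if len(route) > 1:
        code.append(route[0])
    return code[::-1]

# Hypothetical toy map: a loop of road with one shortcut between vertices 1 and 5.
COORDS = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 1), 4: (2, 2), 5: (1, 2), 6: (0, 2)}
ADJ = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6, 1], 6: [5]}

print(encode_simple_greedy(COORDS, ADJ, [0, 1, 2, 3, 4, 5, 6]))
```

Each route vertex is examined once and only its neighbors are inspected, so with bounded vertex degree the scan is linear in n. Because the scan commits to the first feasible target it finds, it does not necessarily produce the shortest possible code; that is the trade-off the optimal O(n²) algorithm removes.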

Compact coding of routes on a map, coupled with a very fast decoding algorithm, is important for storage and transmission of this type of data from large (online) databases, especially as these databases become more and more widespread in the connected mobile world. An important related question is when it is possible to perform computations on routes in their coded form, i.e. without explicitly decoding them. For example, is it possible to intersect two routes by intersecting their greedy path or shortest path codes without decoding the two routes first? Similarly, is it possible to determine the proximity of a given map vertex to a coded route without decoding the route? These questions remain as future work.


Bibliography

Bellman, R. (1958). On a routing problem. Quarterly of Applied Mathematics, 16(1), 87 - 90.

Bose, P., & Morin, P. (2004). Online routing in triangulations. SIAM Journal on Computing, 33, 937 - 951.

Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik,

1(1), 269 - 271.

Douglas, D. H., & Peucker, T. K. (1973). Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2), 112 - 122.

Festa, P. (2009). The shortest path tour problem: problem definition, modeling and optimization.

In Proceedings of INOC.

Fredman, M. L., & Tarjan, R. E. (1984). Fibonacci heaps and their uses in improved network

optimization algorithms. 25th Annual Symposium on Foundations of Computer Science

(IEEE), (pp. 338 – 346).

GoogleMaps. https://developers.google.com/maps/documentation/javascript.

Greenfeld, J. S. (2002). Matching GPS observations to locations on a digital map. Proc. of the 81st

Annual Meeting of the Transportation Research Board.

Huabei, Y., & Wolfson, O. (2004). A weight-based map matching method in moving objects

databases. Scientific and Statistical Database Management, 2004. Proceedings. 16th

International Conference, 437 - 438.

Hummel, B. (2006). Map matching for vehicle guidance, in dynamic and mobile GIS: Investigating

space and time. (J. Drummond, Ed.) Florida.

Imai, H., & Iri, M. (1986). Computational-geometric methods for polygonal approximations of a

curve. Computer Vision, Graphics, and Image Processing, 36(1), 31 - 41.

Mehlhorn, K. (2009). Selected Topics in Algorithms Course Notes - Unique Shortest Paths.

Newson, P., & Krumm, J. (2009). Hidden Markov map matching through noise and sparseness.

Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in

Geographic Information Systems, 336 - 343.

OpenStreetMap. http://wiki.openstreetmap.org/wiki/Downloading_data.


Quddus, M. A., Ochieng, W. Y., & Noland, R. B. (2007). Current map-matching algorithms for

transport applications: State-of-the art and future research directions. Transportation

Research Part C: Emerging Technologies, 312 - 328.

Quddus, M. A., Ochieng, W., Zhao, L., & Noland, R. B. (2003). A general map matching algorithm

for transport telematics applications. GPS Solutions, 7(3), 157 - 167.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260 - 269.

White, C. E., Bernstein, D., & Kornhauser, A. L. (2000). Some map matching algorithms for

personal navigation assistants. Transportation Research Part C: Emerging Technologies,

8, 91 - 108.


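Generating Map-Based Routes from GPS Trajectories and Their Compact Representation

Research Thesis

In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

Ranit Gotsman

Submitted to the Senate of the Technion - Israel Institute of Technology

Tevet 5773 Haifa January 2013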


The research was carried out under the supervision of Prof. Yaron Kanza in the Faculty of Computer Science.

I thank the Technion for its generous financial support of my studies.


Abstract

Digital maps are today an integral part of our lives: in cars, computers, tablets and smartphones. These maps are integrated into many applications, both stationary and mobile, ranging from location-based services to navigation aids. GPS receivers, with their ability to measure position immediately and accurately, are found on mobile platforms and allow applications to combine GPS measurements and trajectories with digital maps.

ומפות דיגיטליות. GPSתזה זו עוסקת בשתי בעיות מעשיות הקשורות למסלולי

The first problem is the classical problem of map matching, namely matching a GPS trajectory, which is a sequence of points in the plane recorded by the GPS receiver (and possibly noisy or sparse), to the sequence of roads along which the receiver moved (typically inside a navigation system in a car). The entire road network is given as a digital map.

Classical algorithms for the map-matching problem relied on geometric proximity alone. Given a GPS trajectory X = (x1, …, xn), where each xk is a two-dimensional coordinate, each point xk is projected onto the nearest edge of the map. Since each projection is independent of the others, the result may be inconsistent and confusing. A natural improvement, made later, is to exploit the topology of the map, namely the fact that a car travels along a route composed of a sequence of roads and cannot "jump" between different roads.

We provide a novel solution to the map-matching problem that exploits the topology of the map and is similar in its mode of operation to earlier algorithms based on Hidden Markov Models (HMM). Our algorithm works well also in scenarios where the GPS measurements are very sparse and/or noisy, scenarios in which the existing HMM approaches fail. Very sparse GPS trajectories arise when the GPS receiver is operated at a low sampling rate in order to save on its power consumption, which is so high that operating the receiver at the regular rate may drain the battery of a typical cell phone within an hour. A similar scenario arises when GPS trajectories are thinned aggressively in order to reduce the computational cost of processing the trajectory (which is of course proportional to the number of points in the trajectory).

We use a digital map given as a graph of vertices and the edges between them. A typical map may describe an entire city and contain on the order of hundreds of thousands of vertices and edges. The GPS trajectory is given as a sequence of points with two-dimensional coordinates. A typical GPS trajectory may contain thousands of points. Before solving the map-matching problem, the trajectory should be thinned as much as possible (to remove noise and to avoid redundant computation). This is done with the Douglas-Peucker algorithm for polyline simplification, which removes from the sequence those points that lie close to the straight line between other points that are not removed. Next, only the data relevant to the given GPS trajectory is retrieved from the map (the edges close to the straight segments connecting consecutive points of the recorded trajectory), so that we work only with the relevant part of the map. For this purpose we use a spatial data structure in the form of a grid, where each grid cell holds the indices of the edges that intersect the cell.

The most important step of the algorithm is the construction of a trellis graph whose vertices are pairs consisting of an edge of the GPS trajectory and an edge of the digital map graph. The trellis graph can be thought of as a layered graph in which each layer is a copy of the relevant portion of the map for a particular edge of the GPS trajectory. The trellis graph is very sparse (it has many vertices but very few edges) and expresses the relationship between the map edges (roads) and the edges of the GPS trajectory. It also expresses simultaneous movement on the map and on the trajectory: one advances along the trajectory, along the map, or along both. The last step of the algorithm is finding a shortest path in the trellis graph, which represents the trajectory matched to the digital map. We implemented this algorithm in an interactive browser-based system on top of the Google Maps API and the OpenStreetMap digital map system. In experiments that we conducted with the system on inputs from popular databases of GPS trajectories, we obtained very good and natural matches to the map.

In addition, we show the connection between our matching algorithm and the solution to the Shortest Path Tour (SPT) problem in graphs. This is the problem of finding a shortest path in a graph that is constrained to pass, in order, through a number of given subsets of vertices.

The second problem we dealt with is the compact coding of routes given on digital maps. A route is a sequence of vertices connected by edges in the embedded planar graph representing a map, typically describing a route taken by a mobile vehicle (such as the output of the algorithm that matches a GPS trajectory to a digital map). The need to transmit and store the huge quantities of routes recorded around the world every day raises the problem of how to code these routes efficiently. We propose to code the given sequence of points by a subsequence of it, such that the original route can be reconstructed efficiently and uniquely from this subsequence.

We provide two methods for coding routes. The first method codes a route as a sequence of "greedy paths" and the second as a sequence of shortest paths. A greedy path between vertex s and vertex t is a path along which the Euclidean distance to t decreases and is minimized as one advances along the path.

For the first method, we provide two algorithms that compute a greedy path code for a given route. The first algorithm relies on the suffix property of a greedy path, namely that every suffix of a greedy path is itself greedy. Accordingly, the algorithm scans the given route from its end to its start and codes it in a single fast pass with time complexity O(n), where n is the number of points in the input route. The second algorithm finds an optimal code, that is, the shortest possible greedy path code for the given route. This algorithm builds an auxiliary graph whose vertices are the points of the input route and whose edges represent the fact that the path between those points is greedy. The optimal greedy path code is found by computing the shortest path from the first point to the last point in this auxiliary graph. This algorithm is slower than the previous one, with time complexity O(n²). The advantage of greedy path coding is the fast decoding time of the code, which is O(n).

The second method codes a route as a sequence of shortest paths in the graph. We provide a simple algorithm for generating a shortest path code and prove that it is optimal. The shortest path coding algorithm produces more compact codes than the greedy path coding algorithm, but it has a larger time complexity of O(kn² log n), where n is the number of points in the (input) route and k is the number of points in the (output) code. The time complexity of decoding is also O(kn² log n).

We implemented the coding algorithms on the interactive system mentioned above. The algorithms yield particularly impressive results, which indicate high redundancy in typical routes. Experiments that we conducted on about 33 routes taken from popular databases showed a compression ratio of between 13 and 14 for the fast greedy path coding algorithm, about 14 for the optimal greedy path coding algorithm, and about 22 for the shortest path coding algorithm.

As future work on the map-matching algorithm, the proposed method could be extended to an online scenario, in which the matched route evolves as the GPS points are received. This differs from the scenario we handled, in which the entire trajectory is recorded in advance and processed globally at a later stage.

Future work on the coding algorithms includes developing algorithms that perform operations on routes in their compressed form. Examples are computing the intersection of two routes given only their codes, without decoding them, or retrieving the routes in a large database that are close to a given point without decoding the codes of the routes in the database.
