smarter outlier detection and deeper understanding of large-scale taxi trip records: a case study of...

Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC

Jianting Zhang

Department of Computer Science

The City College of New York


Outline

•Introduction

•Background and Related Work

•Method and Discussions

•Experiments and Results

•Summary

Introduction


Taxi trip records

•~300 million trips in about two years

•~170 million trips (300 million passengers) in 2009

•About 1/5 of subway ridership and 1/3 of bus ridership in NYC

•The dataset is not perfect...

•13,000 Medallion taxi cabs

•License priced at $600,000 in 2007

•Car services and taxi services are separate

•Only taxis with a Medallion license may pick up street hails (the rule may be changing outside Manhattan...)

Introduction

• Medallion#, Shift#, Trip#

• Trip_Pickup_DateTime, Trip_Dropoff_DateTime

• Trip_Pickup_Location, Trip_Dropoff_Location

• Start_Lon, Start_Lat, End_Lon, End_Lat

• Payment_Type, Surcharge, Total_Amt, Rate_Code

• Passenger_Count, Fare_Amt, Tolls_Amt, Tip_Amt, Trip_Time

• Trip_Distance

• vendor_name, date_loaded, store_and_forward

• time_between_service, distance_between_service

• Start_Zip_Code, End_Zip_Code

• start_x, start_y, end_x, end_y (local projection)

Scrambled on purpose due to privacy concerns

Introduction

• In addition:

– Some of the data fields are empty

– Pickup and drop-off locations can be in the Hudson River

– The recorded trip distance/duration can be unreasonable

– ...

Outlier detection is needed for data cleaning

Handling 170 million trips becomes easier with the help of U2SOD-DB

Background and Related Work

•Existing approaches for outlier detection for urban computing

•Thresholding: e.g. 200m < dist < 30km

•Flagging values that fall in unusual ranges of a distribution

•Spatial analysis: within a region or a land use type

•Matching trajectory with road segments – treat unmatched ones as outliers

•Some techniques require complete GPS traces while we only have O-D locations

•Large-scale shortest path computation has not been used for outlier detection
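The thresholding approach listed above can be illustrated with a minimal sketch. The function name and the sample distances are hypothetical; only the 200 m and 30 km bounds come from the slide.

```python
def is_distance_outlier(dist_m, low_m=200.0, high_m=30_000.0):
    """Return True when a trip distance in meters falls outside (low_m, high_m)."""
    # The slide's rule is 200m < dist < 30km, so both bounds are exclusive.
    return not (low_m < dist_m < high_m)


# Made-up distances in meters, for illustration only.
trips_m = [150.0, 2_500.0, 45_000.0]
flags = [is_distance_outlier(d) for d in trips_m]
print(flags)
```

A fixed threshold like this is cheap to apply but ignores the street network, which is the gap the shortest-path comparison is meant to fill.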

Background and Related Work

• Shortest path computation

– Dijkstra and A*

– New generation algorithms

– Contraction Hierarchy (CH) based

•Open source implementations of CH:

•MoNav

•OSRM

•Much faster than ArcGIS NA module
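For reference, the classic baseline named above (Dijkstra) can be sketched in a few lines. This is not the Contraction Hierarchy technique and not the MoNav or OSRM API; the adjacency-list format and the toy graph are assumptions made for illustration.

```python
import heapq


def dijkstra(graph, source, target):
    """Shortest distance from source to target, or None if unreachable.

    graph: dict mapping node -> list of (neighbor, non-negative weight).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter route to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


# Hypothetical toy street graph with edge lengths in miles.
graph = {
    "A": [("B", 1.0), ("C", 5.0)],
    "B": [("A", 1.0), ("C", 2.0)],
    "C": [("A", 5.0), ("B", 2.0)],
    "D": [],
}
print(dijkstra(graph, "A", "C"))
```

CH-based engines answer the same query after a preprocessing step that adds shortcut edges, which is what makes very large batches of queries practical.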

Background and Related Work

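• Network Centrality (Brandes, 2008)

Node based:

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

Edge based:

$$C_B(e) = \sum_{s \neq t \in V} \frac{\sigma_{st}(e)}{\sigma_{st}}$$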

•Can be easily derived once shortest paths are computed

•Mapping node/edge betweenness centrality can reveal the connection strengths among different parts of a city
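A minimal sketch of how such measures might be accumulated from already-computed paths is shown below. It assumes one shortest path per origin-destination pair and weights each path by its trip count; both choices are assumptions for illustration, and the function name and sample paths are hypothetical.

```python
from collections import Counter


def accumulate_centrality(paths):
    """Accumulate node and edge usage counts from shortest paths.

    paths: iterable of (node_sequence, trip_count) tuples.
    """
    node_counts, edge_counts = Counter(), Counter()
    for nodes, count in paths:
        # Interior nodes only, mirroring s != v != t in the node-based sum.
        for v in nodes[1:-1]:
            node_counts[v] += count
        # Each consecutive node pair is one directed edge on the path.
        for u, v in zip(nodes, nodes[1:]):
            edge_counts[(u, v)] += count
    return node_counts, edge_counts


# Made-up paths: (sequence of node ids, number of trips sharing that path).
paths = [(["a", "b", "c"], 3), (["b", "c", "d"], 2)]
node_counts, edge_counts = accumulate_centrality(paths)
print(node_counts, edge_counts)
```

With a single path per pair, the ratio of paths through a node or edge to all shortest paths for that pair is either 0 or 1, so the sums reduce to the counts accumulated here.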

Method and Discussions

Workflow (flowchart):

• Input: raw taxi trip data

• Match pickup/drop-off point locations to street segments within distance D0

• Successful? If not: Type I outlier (spatial analysis)

• Assign pickup/drop-off nodes by picking the closer ones

• Compute shortest path

• CD > D1 AND CD > W*RD? If so: Type II outlier (network analysis)

• Aggregate unique (sid, tid) pairs

• Update centrality measurements

CD: computed shortest distance; RD: recorded trip distance
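The two decision points in the workflow above can be sketched as a single classification function. The function name, argument layout and return labels are hypothetical. The default parameter values are the ones reported in the experiments (D0 = 200 feet, D1 = 3 miles, W = 2). The snapping and shortest-path steps are assumed to have been done elsewhere.

```python
def classify_trip(snap_pickup_ft, snap_dropoff_ft, cd_miles, rd_miles,
                  d0_ft=200.0, d1_miles=3.0, w=2.0):
    """Classify one trip as 'type_I', 'type_II' or 'ok'.

    snap_*_ft: distance from the point to its nearest street segment in feet,
               or None when no segment was found (assumed convention).
    cd_miles:  computed shortest distance (CD).
    rd_miles:  recorded trip distance (RD).
    """
    # Type I: pickup or drop-off cannot be matched to a segment within D0.
    for snap in (snap_pickup_ft, snap_dropoff_ft):
        if snap is None or snap > d0_ft:
            return "type_I"
    # Type II: CD > D1 AND CD > W * RD.
    if cd_miles > d1_miles and cd_miles > w * rd_miles:
        return "type_II"
    return "ok"


print(classify_trip(50.0, 80.0, 4.0, 1.5))
```

Requiring both conditions means short trips are never flagged as Type II, however far the recorded distance departs from the computed one.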

Method and Discussions

• The approach is approximate in nature

– Taxi drivers do not always follow the shortest path

– Especially for short trips and in heavily congested areas

– But we only care about aggregated centrality measurements, and the errors have a chance to cancel each other out

• Increasing D0 will reduce the number of Type I outliers, but locations might be matched to the wrong segments

• Reducing D1 and/or W will increase the number of Type II outliers but may generate false positives
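The sensitivity to D1 and W can be seen in a small sketch that counts Type II outliers under two parameter settings. The (CD, RD) pairs are made up for illustration and the helper name is hypothetical.

```python
def count_type_ii(pairs, d1, w):
    """Count (CD, RD) pairs that satisfy CD > D1 AND CD > W * RD."""
    return sum(1 for cd, rd in pairs if cd > d1 and cd > w * rd)


# Made-up (computed distance, recorded distance) pairs in miles.
pairs = [(4.0, 1.5), (2.5, 1.0), (6.0, 3.5), (1.0, 0.9)]

strict = count_type_ii(pairs, d1=3.0, w=2.0)
loose = count_type_ii(pairs, d1=2.0, w=1.5)
print(strict, loose)
```

Lowering either parameter can only add trips to the flagged set, so the extra detections need checking for false positives.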

Experiments and Results

[Figure: Overall distributions of trip distance, time, speed and fare. Four histograms with trip count on the y-axis: Count-Distance Distribution (Trip Distance, miles, 0 to 20), Count-Time Distribution (Trip Time, minutes, 0 to over 50), Count-Speed Distribution (Speed, MPH, 0 to 50) and Count-Fare Distribution (Fare, $, 0 to 50).]


Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map

•D0=200 feet, D1=3 miles, W=2

•166 million trips, 25 million unique

•~2.5 million (1.5%) Type I outliers

•~18,000 Type II outliers

•Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2.26 GHz)
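The reported figures can be cross-checked with simple arithmetic. The throughput line rests on an assumption not stated on the slide: that "25 million unique" refers to distinct origin-destination pairs, each needing one shortest-path query.

```python
total_trips = 166_000_000
type_i = 2_500_000
type_ii = 18_000
unique_pairs = 25_000_000   # assumed to be distinct origin-destination pairs
seconds = 5_952

type_i_share = type_i / total_trips      # fraction of trips flagged as Type I
type_ii_share = type_ii / total_trips    # fraction of trips flagged as Type II
hours = seconds / 3600                   # runtime in hours
paths_per_second = unique_pairs / seconds  # only meaningful under the assumption above

print(f"Type I share:  {type_i_share:.2%}")
print(f"Type II share: {type_ii_share:.3%}")
print(f"Runtime:       {hours:.2f} hours")
print(f"Throughput:    {paths_per_second:.0f} paths/s")
```

The Type I share matches the 1.5% on the slide, and 5,952 seconds is about 1.65 hours, consistent with "less than 2 hours".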


Examples of Detected Type II Outliers

Experiments and Results

Mapping Betweenness Centralities (All hours)


[Figure: Mapping Betweenness Centralities (bi-hourly). Panels for 00H, 02H, 04H, 06H, 08H, 10H, 12H, 14H, 16H, 18H, 20H and 22H, with a shared legend.]

Summary

• Large-scale taxi trip records are error-prone due to a combination of device, human and information system induced errors, so outlier detection and data cleaning are important preprocessing steps.

• Our approach detects outliers that cannot be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (network analysis).

• The work is preliminary; a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information).

• It would be interesting to map how betweenness changes under different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and to explore connection strengths among NYC regions.

Q&A

