smarter outlier detection and deeper understanding of large-scale taxi trip records: a case study of...
TRANSCRIPT
Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip
Records: A Case Study of NYC
Jianting Zhang
Department of Computer Science
The City College of New York
Outline
•Introduction
•Background and Related Work
•Method and Discussions
•Experiments and Results
•Summary
Introduction
3
Taxi trip records•~300 million trips in about two years•~170 million trips (300 million passengers) in 2009•1/5 of that of subway riders and 1/3 of that of bus riders in NYC •The dataset is not perfect...
•13,000 Medallion taxi cabs
•License priced at $600, 000 in 2007
•Car services and taxi services are separate
•Only taxis with Medallion license are for hail (the rule could be under changing outside Manhattan...)
Introduction
Medallion#Shift#Trip#
Trip_Pickup_DateTimeTrip_Dropoff_DateTime
Trip_Pickup_LocationTrip_Dropoff_Location
Start_LonStart_LatEnd_LonEnd_Lat
Payment_TypeSurchargeTotal_AmtRate_Code
Passenger_CountFare_AmtTolls_AmtTip_AmtTrip_Time
Trip_Distance
vendor_namedate_loadedstore_and_forward
time_between_servicedistance_between_service
Start_Zip_CodeEnd_Zip_Code
start_xstart_yend_xend_y (local projection)
1
2
3
4
5
6
78 9
1110
Meshed up on purpose due to privacy concerns
Introduction• In addition:
– Some of the data fields are empty– Pickup and drop-off locations can
be in Hudson River – The recorded trip distance/duration
can be unreasonable– ...
Outlier detections for data cleaning are needed
Mission can be easier to handle 170 million trips with the help of U2SOD-DB
Background and Related Work
•Existing approaches for outlier detection for urban computing
•Thresholding: e.g. 200m < dist < 30km
•Locating in unusual ranges of distributions
•Spatial analysis: within a region or a land use type
•Matching trajectory with road segments – treat unmatched ones as outliers
•Some techniques require complete GPS traces while we only have O-D locations
•Large-scale Shortest path computing has not been used for outlier detection
Background and Related Work• Shortest path computation
– Dijkstra and A* – New generation algorithms– Contraction Hierarchy (CH) based
•Open source implementations of CH:
•MoNav
•OSRM
•Much faster than ArcGIS NA module
Background and Related Work
• Network Centrality (Brandes, 2008)
Vtvs st
stB
vvC
)(
)(Node based
Edge based
Vtvs st
stB
eeC
)(
)(
•Can be easily derived after shortest paths are computed
•Mapping node/edge between centrality can reveal the connection strengths among different parts of cities
Method and DiscussionsRaw Taxi trip data
Match pickup/drop-off point locations to street
segments within Distance D0
CD >D1 AND CD>W*RD?
Assign pickup/drop-off nodes by picking closer ones
Type I outlier (spatial analysis)
Compute shortest path
CD: Compute shortest distanceRD: Recorded trip distance
Type II outlier (network analysis)
Successful?
Aggregate unique (sid,tid) pairs
Update centrality measurements
Method and Discussions
• The approach is approximate in nature– Taxi drivers do not always follow shortest path– Especially for short trips and heavily congested areas – But we only care about aggregated centrality
measurements and the errors have a chance to be cancelled out by each other
• Increasing D0 will reduce # of type I outliners, but the locations might be mismatched with segments
• Reducing D1 and/or W will increase # of type II outliers but may generate false positives.
Experiments and Results
Count-Distance Distribution
0
5000000
10000000
15000000
20000000
<= 0
.0
( 0
.8,
1.0]
( 1
.8,
2.0]
( 2
.8,
3.0]
( 3
.8,
4.0]
( 4
.8,
5.0]
( 5
.8,
6.0]
( 6
.8,
7.0]
( 7
.8,
8.0]
( 8
.8,
9.0]
( 9
.8, 1
0.0]
( 10
.8, 1
1.0]
( 11
.8, 1
2.0]
( 12
.8, 1
3.0]
( 13
.8, 1
4.0]
( 14
.8, 1
5.0]
( 15
.8, 1
6.0]
( 16
.8, 1
7.0]
( 17
.8, 1
8.0]
( 18
.8, 1
9.0]
( 19
.8, 2
0.0]
Trip Distance (mile)
Cou
nt
Count-Time Distribution
0
5000000
10000000
15000000
20000000
<=
0
.0
( 2
.0,
3.0
]
( 5
.0,
6.0
]
( 8
.0,
9.0
]
( 1
1.0
, 1
2.0
]
( 1
4.0
, 1
5.0
]
( 1
7.0
, 1
8.0
]
( 2
0.0
, 2
1.0
]
( 2
3.0
, 2
4.0
]
( 2
6.0
, 2
7.0
]
( 2
9.0
, 3
0.0
]
( 3
2.0
, 3
3.0
]
( 3
5.0
, 3
6.0
]
( 3
8.0
, 3
9.0
]
( 4
1.0
, 4
2.0
]
( 4
4.0
, 4
5.0
]
( 4
7.0
, 4
8.0
]
> 5
0.0
TripTime (Minute)
Co
un
t
Count-Speed Distribution
0
5000000
10000000
15000000
20000000
<= 0
.0( 1
.0, 2
.0]( 3
.0, 4
.0]( 5
.0, 6
.0]( 7
.0, 8
.0]( 9
.0, 10
.0]( 1
1.0, 1
2.0]
( 13.0
, 14.0
]( 1
5.0, 1
6.0]
( 17.0
, 18.0
]( 1
9.0, 2
0.0]
( 21.0
, 22.0
]( 2
3.0, 2
4.0]
( 25.0
, 26.0
]( 2
7.0, 2
8.0]
( 29.0
, 30.0
]( 3
1.0, 3
2.0]
( 33.0
, 34.0
]( 3
5.0, 3
6.0]
( 37.0
, 38.0
]( 3
9.0, 4
0.0]
( 41.0
, 42.0
]( 4
3.0, 4
4.0]
( 45.0
, 46.0
]( 4
7.0, 4
8.0]
( 49.0
, 50.0
]
Speed (MPH)
Coun
t
Count-Fare Distribution
05000000
1000000015000000200000002500000030000000
<= 0
.0(
1.0,
2.0
](
3.0,
4.0
](
5.0,
6.0
](
7.0,
8.0
](
9.0,
10.
0]( 1
1.0,
12.
0]( 1
3.0,
14.
0]( 1
5.0,
16.
0]( 1
7.0,
18.
0]( 1
9.0,
20.
0]( 2
1.0,
22.
0]( 2
3.0,
24.
0]( 2
5.0,
26.
0]( 2
7.0,
28.
0]( 2
9.0,
30.
0]( 3
1.0,
32.
0]( 3
3.0,
34.
0]( 3
5.0,
36.
0]( 3
7.0,
38.
0]( 3
9.0,
40.
0]( 4
1.0,
42.
0]( 4
3.0,
44.
0]( 4
5.0,
46.
0]( 4
7.0,
48.
0]( 4
9.0,
50.
0]
Fare ($)
Coun
t
Over all distributions of trip distance, time, speed and fare
Experiments and Results
Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map
•D0=200 feet, D1=3 miles, W=2
•166 million trips, 25 million unique
•~2.5 millions (1.5%) type I outliers
•~ 18,000 type II outliers
•Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2.26 GHZ)
Experiments and Results
Examples of Detected Type II Outliers
Experiments and ResultsMapping Betweenness Centralities (All hours)
Experiments and Results
00H 02H 04H 06H 08H 10H
12H 14H 16H18H
20H 22H
Legend:
Mapping Betweenness Centralities (bi-hourly)
Summary
• Large-scale taxi trip records are error-prone due to a combination of device, human and information system induced errors – outlier detection and data cleaning are important preprocess steps.
• Our approach detects outliers that can not be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (network analysis)
• The work is preliminary - a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information)
• It would be interesting to generate dynamics of betweenness maps at different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and explore connection strengths among NYC regions.