
Clustering Uber Pickups using Apache Spark’s MLlib

Vincent Trost & Kevin Baik

April 28, 2017

Introduction

With the rise of technology in society, disruptive innovation is all around us. Taxis used to be a staple of life in the city, but the development of the smartphone changed all of that. Along came a ride-sharing app called “Uber” that allows users to hail a ride from their exact location using only their phone. Not only that, but the car one gets picked up in can be an ordinary car instead of a bright yellow taxi. Drivers register their own cars and can drive for Uber whenever they want simply by flipping a switch in the app, and all payment is handled digitally. The scientists, Kevin and Vincent, were interested in running a k-means clustering analysis using Spark’s MLlib to identify the densest pickup locations in a given area (for this study, New York City and its surroundings) while also utilizing the parallelization methods provided by the Apache Spark framework.

These data are freely available on GitHub via fivethirtyeight.com. FiveThirtyEight submitted a Freedom of Information Law (FOIL) request to the NYC Taxi & Limousine Commission and obtained these data on July 20, 2015 [1]. The commission was required by law to comply, and after obtaining the data, FiveThirtyEight cleaned it and published it on GitHub for the public.

The Data

As mentioned, the data were obtained from fivethirtyeight’s GitHub page. The data cover the months of April 2014 through August 2014 and contain pickup locations in and around New York City. The files contained five columns: ID, DateTime, Latitude, Longitude, and Base.

Figure 1: Snapshot of the data

These are all relatively self-explanatory except for the Base variable. In New York City, some taxi and limousine companies that own their own fleets will rent cars out to people to drive for Uber. This lets the companies earn money on their assets around the clock and lets drivers work for Uber without using their own cars. The Base variable indicates which base the car performing that pickup came from. For the purposes of this project, it was not very relevant.

Overall, the data were small in size, only around 48 MB. This made exploring scalability measures challenging, but that will be covered later in the report. Though small in size, the data still represented just over one million unique pickups (1,048,575 to be exact).

Performing K-Means Clustering

The first objective in performing K-Means clustering was to clean the data. To do that, the scientists used the following command:

val parsedData = rdd.map { line =>
  Vectors.dense(line.split(",").slice(3, 5).map(_.toDouble))
}.cache()

More will be explained later on why cache() was chosen. This command split each line of the .csv file on commas, kept only the latitude and longitude, and mapped each value to type Double. That step is important because the MLlib k-means implementation only accepts numeric vectors. The next step is to train the K-Means model.
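Before training, for context, here is a minimal sketch of how the rdd above might have been produced; the application name, input path, and header handling are assumptions, since the report does not show this setup:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical setup; the file name and header drop are assumptions.
val conf = new SparkConf().setAppName("UberKMeans")
val sc = new SparkContext(conf)
val raw = sc.textFile("uber-raw-data-2014.csv")
val header = raw.first()
val rdd = raw.filter(_ != header) // remove the column-name row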

The most important parameter in the K-Means algorithm is the choice of K, the number of clusters, and the scientists wanted to optimize it for their use case. One way to evaluate a choice of K is by computing the “cost”: the sum of squared distances of each pickup from its cluster center. The lower the cost, the better the clusters fit the data. Of course, a high enough K would converge to fitting the data perfectly, since every point could sit at its own center. There is no perfect way to choose K, but one widely used heuristic is the “Elbow Method”: choose K at the point beyond which increasing it yields a negligible return.
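As a rough illustration of how such a cost curve can be produced with MLlib (the candidate values of K and the iteration count below are illustrative, not the exact sweep the scientists ran):

import org.apache.spark.mllib.clustering.KMeans

// Train a model for each candidate K and record its cost
// (the within-set sum of squared errors).
val candidateKs = Seq(10, 50, 100, 200, 500, 1000)
val costs = candidateKs.map { k =>
  val model = KMeans.train(parsedData, k, 20) // 20 iterations, illustrative
  (k, model.computeCost(parsedData))
}
costs.foreach { case (k, cost) => println(s"K = $k, cost = $cost") }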

Figure 2: K-Means Cost as K Increases


The scientists examined this visualization and decided that K = 200 was a good choice, because increasing from 200 to 1000 clusters provided no significant additional reduction in cost.

Scalability Tests

In order to evaluate how efficient the program is, different configurations of spark-submit commands must be tested. Along with that, different code implementations utilizing commands like cache() and persist() with different storage levels must be evaluated as well. The following tables summarize those tests. The K-Means run-time is clocked before and after training the model, as shown below:

val iterationCount = 200 // held constant
val clusterCount = 200   // held constant
val start = System.nanoTime
val model = KMeans.train(parsedData, clusterCount, iterationCount)
val end = System.nanoTime
println("KMeans Run-Time: " + (end - start) / 1e9 + "s") // nanoseconds to seconds

The RDD storage method used in the code for these tests is cache(). Each parameter in the spark-submit configuration was tested by holding two constant and varying one at a time. An average was taken across three trials to help account for variability in cluster usage affecting the run-time.

Varying the number of cores within the executors

spark-submit --master yarn-client
  --driver-memory 2g        // always constant
  --executor-memory 2g      // constant
  --num-executors 4         // constant
  --executor-cores (2, 8, 16)

Number of Cores   Trial 1   Trial 2   Trial 3   Average
2                 13.9541   10.2777   16.5499   13.5940
8                 13.9311   13.3010   15.8827   14.3716
16                11.5307   13.8614   16.1953   13.8625

Here we observe only a small difference between the tests. Since the program is processing a relatively small dataset, a lower number of cores likely fits this use case best.

Varying the number of executors used

spark-submit --master yarn-client
  --driver-memory 2g        // always constant
  --executor-memory 2g      // constant
  --num-executors (3, 15, 30)
  --executor-cores 2        // constant

Number of Executors   Trial 1   Trial 2   Trial 3   Average
3                     11.1275   15.8867    9.4692   12.1611
15                    14.7817   13.8307   15.0752   14.5625
30                    15.3249   15.8669   23.6597   18.2838

Here we observe that the fewer the executors, the better. This is most likely because an excessive number of executors on such a small dataset creates overhead in the aggregation stage.


Varying executor memory

spark-submit --master yarn-client
  --driver-memory 2g        // always constant
  --executor-memory (512MB, 2g, 8g)
  --num-executors 4         // constant
  --executor-cores 2        // constant

Executor Memory   Trial 1   Trial 2   Trial 3   Average
512MB             15.2041   18.2490   17.5473   17.0001
2g                11.3612   12.9340   12.8551   12.3834
8g                 9.2242   10.8015   14.0420   11.3559

Here we observe that the higher the memory, the better, though the gain is small given such a steep increase in memory (from 2g to 8g). It makes sense that beyond a certain amount of memory for such a small dataset, performance would plateau. Nonetheless, 8g ran the fastest.

Based on these findings, we know that the number of executor cores did not make much of a difference, but the number of executors and the executor memory did. We can now try an ensemble command for our use case: since 3 executors performed best and 8g of executor memory performed best, together they should make a fast combination.

spark-submit --master yarn-client
  --driver-memory 2g        // always constant
  --executor-memory 8g
  --num-executors 3
  --executor-cores 2

Trial 1   Trial 2   Trial 3   Average
11.8125   12.9690   10.8062   11.8626

The ensemble of best performers just missed out-performing the configuration with one more executor (8g of memory and 4 executors), which is interesting because the earlier trials suggested that fewer executors perform better; with more trials this might not hold true. Nonetheless, the performance times were very close, and the ensemble's variability was much lower.

In-Code Scalability Measures

As noted, cache() was the RDD storage method used in the code. Implementations of persist() with different storage levels were tested as well. They were run using the “ensemble” spark-submit configuration because, even though the 8g-memory, 4-executor configuration was ultimately the fastest, its variability was much higher, with trials ranging from roughly 9 to 14 seconds, which was concerning.

import org.apache.spark.storage.StorageLevel

parsedData.persist(StorageLevel.MEMORY_ONLY)

Trial 1   Trial 2   Trial 3   Average
10.0722   14.6886   11.8092   12.1900

parsedData.persist(StorageLevel.MEMORY_AND_DISK)

Trial 1   Trial 2   Trial 3   Average
18.1217   13.9961   16.7736   16.2971


parsedData.persist(StorageLevel.MEMORY_AND_DISK_SER)

Trial 1   Trial 2   Trial 3   Average
16.0902   12.5814   25.9199   18.1972

Here we see MEMORY_ONLY performing the best, followed by MEMORY_AND_DISK, and then MEMORY_AND_DISK_SER. The serialized format comes in last because, although it is more space-efficient, it is more CPU-intensive to read. Comparing the persist() methods to cache():

parsedData.cache()

Trial 1   Trial 2   Trial 3   Average
12.3200   10.3350   14.0219   12.2256

Not surprisingly, cache() performs very similarly to persist(StorageLevel.MEMORY_ONLY); for RDDs, cache() is simply shorthand for persisting at the MEMORY_ONLY level.

Visualizations

The Scala program [3] output the cluster centers and cluster sizes to text files, which were then taken off the cluster and loaded into R for visualization. Using an R package called leaflet, an interactive map was made to visualize the cluster centers and their sizes [2]. The color scale goes from white to blue, with white marking the least dense cluster and dark blue the most dense.
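The report does not show the output step itself, but a minimal sketch might look like the following; the file names and CSV-style format are assumptions:

import java.io.PrintWriter

// Write each cluster center (id, latitude, longitude) to a local text file.
val out = new PrintWriter("clusterCenters.txt")
model.clusterCenters.zipWithIndex.foreach { case (center, id) =>
  out.println(s"$id,${center(0)},${center(1)}")
}
out.close()

// Cluster sizes: assign each pickup to its nearest center and count.
val sizes = parsedData.map(p => (model.predict(p), 1L)).reduceByKey(_ + _)
sizes.saveAsTextFile("clusterSizes")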

Figure 3: Interactive version: https://vjtrost88.github.io/uberViz.html


Exploration

The scientists initially hoped their cluster centers would correspond to much more specific locations on the map. This was not the case: they found that the centers corresponded to popular pickup areas instead. Another idea the scientists entertained was subsetting the data by time, to see if pickup locations at night differed from the overall pattern, but this proved infeasible, as the resulting subsets of data were too small. (A hypothetical sketch of such a filter follows.)
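As an illustration only, such a subset might have been built along these lines; the column position and timestamp layout are assumptions based on the column listing above, not code from the original program:

// Hypothetical night-time filter: keep pickups between 10pm and 4am.
// Assumes DateTime is the second column, formatted like "4/30/2014 23:22:00".
val nightData = rdd.filter { line =>
  val hour = line.split(",")(1).split(" ")(1).split(":")(0).toInt
  hour >= 22 || hour < 4
}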

Top 15 most dense clusters

Cluster Rank   Cluster ID   Cluster Size   Latitude      Longitude
1              2            29255          41.06940278   -73.84038958
2              70           26224          40.69540156   -73.81338677
3              174          22890          40.7431948    -73.98634259
4              93           22433          41.2932       -74.09022
5              54           22301          40.64510801   -73.97185006
6              95           20533          40.74824038   -73.93974121
7              122          20483          40.72990924   -73.98668317
8              190          20181          40.7449109    -73.95434998
9              47           19897          40.59828986   -73.97944906
10             68           19372          40.67323883   -73.97517343
11             114          19340          41.0281712    -73.6224416
12             13           19140          40.69084395   -73.98746458
13             141          19117          40.76327588   -73.97673154
14             157          19097          40.82725843   -74.08289625
15             32           18669          40.77653512   -73.95471698

Anyone who wishes can plot the latitude/longitude coordinates to see what is around each cluster center. The centers of most interest to the scientists were the ones in Manhattan, since there was suspicion that they might correspond to popular destinations. It was found that they correspond to popular areas above all else. Ahead are the top three cluster centers in Manhattan.

Figure 4: The third most dense cluster center - First in Manhattan - near Madison Square Park

All the orange dots on the map are bars or restaurants. This was a common theme among the clusters in Manhattan.


Figure 5: The seventh most dense cluster - Second in Manhattan - at 2nd Ave and 10th Street

Figure 6: The thirteenth most dense cluster - Third in Manhattan - at 56th and 6th


We still see a high number of orange dots, but also more hotels. These findings can suggest popular places to eat and drink to anyone who might be interested. They can also help Uber drivers know where to linger in order to maximize their probability of landing a ride. There are plenty of other hypotheses that could be posed to explain these results; the scientists, however, were pleased above all that the clusters corresponded to areas with similar characteristics.

Another interesting discovery was that Uber was notably popular in Long Island City. The 6th and 8th most dense clusters encompass the entire neighborhood, representing over forty thousand pickups in this small area. It might be a good place to start driving for Uber!

Figure 7: The sixth and eighth most dense clusters - Long Island City

Conclusion

The findings for this project were interesting, but the scalability measures were the focus of the class. By testing the algorithm with different spark-submit configurations, in conjunction with showing that cache() was the most effective at cutting down run-time, the scientists were able to pinpoint their optimal configuration. With more data, many things could be improved upon: the scalability measures could be further refined, the scientists would have more opportunity to implement techniques such as mapPartitions() to further parallelize the work, and more specific locations could be pinpointed with a higher number of cluster centers. As well, DateTime could be added as an extra dimension on which to cluster, adding more complexity to the algorithm. With more data, the idea of subsetting to spot trends during the night-time would also be more feasible. The scientists could have randomly generated additional data using Bayesian statistics, and with enough time they could even have submitted another FOIL request to the NYC Taxi and Limousine Commission. But time was of the essence, and the potential this project has is exciting. The scientists are considering requesting data from Uber on other cities, to use this algorithm to visualize similar results in different places. Though the dataset was small, plenty of knowledge was gained about how Spark parallelizes its tasks.

References

[1] FiveThirtyEight, uber-tlc-foil-response, (2014), GitHub repository. https://github.com/fivethirtyeight/uber-tlc-foil-response

[2] Vincent Trost, uberViz, (2017). https://vjtrost88.github.io/uberViz.html

[3] Kevin Baik, DS410, (2017), GitHub repository. https://github.com/Konnoke/DS410
