Friday, December 19, 2014
Anyone who’s tried to put a large number of data points on a map knows how difficult it is to work with big geolocation data. That's why I want to share with you how I used Cloud Dataflow to spatially aggregate every single pick-up and drop-off location of NYC taxi trips in 2013, with the objective of painting the whole picture on a map. For background info, Google Cloud Dataflow is now in alpha stage and can help you gain insight into large geolocation datasets. You can try experimenting with it by applying for the alpha program or learn more with yesterday's update.
When I first sat down to think through this data visualization, I knew I needed to create a thematic map. So I built a simple pipeline that geofenced all 340 million pick-up and drop-off locations against 342 polygons, which resulted from converting the NYC neighbourhood tabulation areas into single-part polygons. You can find the processed data in this public BigQuery table. (In order to access BigQuery you need to have at least one project listed in your Google Developers Console. After creating a project you can access the table by following this link.)
|Thematic map showing the distribution of taxi pick-up locations in NYC in 2013. Midtown South is New Yorkers’ favourite area to get a cab with almost 28 million trips starting there, which is roughly 1 trip per second. You can find an interactive map here.|
This open data, released by the NYC Taxi & Limo Commission, has been the foundation for some beautiful visualizations. By utilizing the power of Google Cloud Platform's tools, I’ve been able to spatially aggregate the data using Cloud Dataflow, and then do ad hoc querying on the results using BigQuery, to gain fast and comprehensive insight into this immense dataset.
With the Google Cloud Dataflow SDK, which parallelizes the data transformations across multiple Cloud Platform instances, I was able to build, test and run the whole processing pipeline in a couple of days. The actual processing, distributed across five workers, took slightly less than two hours.
The pipeline’s architecture is extremely simple. Since Cloud Dataflow offers a BigQuery reader and writer, most of the heavy lifting is already taken care of. The only thing I had to provide was a geofencing function that could be parallelised across multiple instances. For a detailed description of how to do complex geofencing using open source libraries see this post on the Google Developers Blog.
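As a rough illustration of what such a geofencing function does, here is a minimal point-in-polygon sketch in plain Python using the ray casting technique. It is not the code used in the pipeline, which relied on the Cloud Dataflow SDK and open source geometry libraries. The names `point_in_ring`, `geofence` and `FENCES`, as well as the toy coordinates, are assumptions made up for this example.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (x, y), e.g. (longitude, latitude)
Ring = List[Point]            # polygon outline as a list of vertices


def point_in_ring(point: Point, ring: Ring) -> bool:
    """Ray casting: cast a ray to the right of the point and count edge crossings."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # an odd number of crossings means "inside"
    return inside


def geofence(point: Point, fences: Dict[str, Ring]) -> Optional[str]:
    """Return the name of the first fence containing the point, or None."""
    for name, ring in fences.items():
        if point_in_ring(point, ring):
            return name
    return None


# Hypothetical toy fences, not real NYC neighbourhood outlines.
FENCES: Dict[str, Ring] = {
    "zone_a": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "zone_b": [(3.0, 0.0), (5.0, 0.0), (5.0, 2.0), (3.0, 2.0)],
}

print(geofence((1.0, 1.0), FENCES))   # zone_a
print(geofence((9.0, 9.0), FENCES))   # None
```

Because each call depends only on the point and a read-only set of fences, a function like this can be applied to every record independently, which is what makes it easy to spread across workers.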
When executing the pipeline, Cloud Dataflow automatically optimizes your data-centric pipeline code by collapsing multiple logical passes into a single execution pass and deploys the result to multiple Google Compute Engine instances. At the time of deploying the pipeline you can read in files from Google Cloud Storage that contain data you need for your transformations, e.g., shapefiles or GeoJSON formats. Alternatively you can call an external API to load in the geofences you want to test against.
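To make the geofence loading step more concrete, here is a small sketch that reads a GeoJSON FeatureCollection and splits multi-part geometries into single-part polygons. It uses only Python's standard `json` module and reads from an in-memory string rather than Google Cloud Storage. The function name `single_part_polygons`, the `name` property key and the sample data are assumptions for illustration only.

```python
import json
from typing import Iterator, List, Tuple

# Hypothetical sample: one MultiPolygon with two parts and one plain Polygon.
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "area_1"},
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [
                    [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
                    [[[2, 2], [3, 2], [3, 3], [2, 3], [2, 2]]],
                ],
            },
        },
        {
            "type": "Feature",
            "properties": {"name": "area_2"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[5, 5], [6, 5], [6, 6], [5, 6], [5, 5]]],
            },
        },
    ],
})


def single_part_polygons(geojson_text: str) -> Iterator[Tuple[str, List[List[float]]]]:
    """Yield (label, exterior_ring) for every polygon part in a FeatureCollection."""
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        name = feature["properties"].get("name")   # property key is an assumption
        geometry = feature["geometry"]
        if geometry["type"] == "Polygon":
            parts = [geometry["coordinates"]]      # one polygon: a list of rings
        elif geometry["type"] == "MultiPolygon":
            parts = geometry["coordinates"]        # several polygons, each a list of rings
        else:
            continue                               # skip points, lines, etc.
        for index, rings in enumerate(parts):
            # rings[0] is the exterior ring; holes are ignored in this sketch.
            yield f"{name}#{index}", rings[0]


for label, ring in single_part_polygons(SAMPLE):
    print(label, len(ring), "vertices")
```

Splitting multi-part areas this way gives more polygons than source areas, which fits with testing against single-part polygons rather than the original tabulation areas.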
I utilized an API I built on App Engine which exposes a list of geofences stored in Datastore. Using the Java Topology Suite I created a spatial index maintained in a class variable in the memory of each instance for fast querying access.
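The idea of building the index once and keeping it in a class variable can be sketched as follows. Note that this swaps in a simple uniform grid over bounding boxes instead of the spatial index from the Java Topology Suite, and uses plain Python instead of Java. `GeofenceIndex`, `load_fences` and the cell size are hypothetical names and values chosen for this example.

```python
from collections import defaultdict
from math import floor
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float]
Ring = List[Point]


class GeofenceIndex:
    """Uniform grid over fence bounding boxes (a simple stand-in for a real spatial index)."""

    # Class variable: one shared index per process, built on first use.
    _instance: Optional["GeofenceIndex"] = None

    def __init__(self, fences: Dict[str, Ring], cell_size: float = 0.01) -> None:
        self.cell_size = cell_size
        self.cells: Dict[Tuple[int, int], List[str]] = defaultdict(list)
        for name, ring in fences.items():
            xs = [x for x, _ in ring]
            ys = [y for _, y in ring]
            # Register the fence in every grid cell its bounding box touches.
            for cx in range(self._cell(min(xs)), self._cell(max(xs)) + 1):
                for cy in range(self._cell(min(ys)), self._cell(max(ys)) + 1):
                    self.cells[(cx, cy)].append(name)

    def _cell(self, value: float) -> int:
        return floor(value / self.cell_size)

    def candidates(self, point: Point) -> List[str]:
        """Fences whose bounding box shares a grid cell with the point."""
        x, y = point
        return self.cells.get((self._cell(x), self._cell(y)), [])

    @classmethod
    def get(cls, load_fences: Callable[[], Dict[str, Ring]]) -> "GeofenceIndex":
        """Build the index on the first call, then reuse it (not thread-safe as written)."""
        if cls._instance is None:
            cls._instance = cls(load_fences())
        return cls._instance
```

The candidates returned by such an index are only a shortlist: an exact point-in-polygon test still has to run on each of them. The benefit is that the expensive exact test is applied to a handful of polygons per point instead of all of them, and the fence list is fetched only once per instance.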
Distributed across five workers, Cloud Dataflow was able to process an average of 25,000 records per second, each record having two locations, ploughing through more than 170 million table rows in just under two hours. The number of workers can be flexibly assigned at the time of deployment. The more workers you use, the more records can be processed in parallel and the faster your pipeline executes.
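The figures above are easy to sanity-check with a few lines of arithmetic. The sketch below uses the rounded numbers from the text. The `estimated_hours` helper assumes throughput scales linearly with the number of workers, which is an idealisation for illustration and not a measured result.

```python
ROWS = 170_000_000            # "more than 170 million table rows"
RECORDS_PER_SECOND = 25_000   # average throughput across all five workers
WORKERS = 5
LOCATIONS_PER_ROW = 2         # one pick-up and one drop-off per record

seconds = ROWS / RECORDS_PER_SECOND
hours = seconds / 3600
locations = ROWS * LOCATIONS_PER_ROW
per_worker = RECORDS_PER_SECOND / WORKERS


def estimated_hours(workers: int) -> float:
    """Idealised runtime, assuming throughput scales linearly with worker count."""
    return ROWS / (per_worker * workers) / 3600


print(f"{seconds:.0f} seconds, about {hours:.2f} hours")   # 6800 seconds, about 1.89 hours
print(f"{locations:,} locations geofenced")                # 340,000,000 locations geofenced
print(f"{per_worker:.0f} records per second per worker")   # 5000 records per second per worker
print(f"{estimated_hours(10):.2f} hours with 10 workers (idealised)")
```

The result of roughly 1.89 hours agrees with "just under two hours", and two locations per row gives the 340 million geofenced locations mentioned earlier.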
|The interactive Cloud Dataflow graph of your pipeline, which helps you monitor and debug the pipeline from the Google Developers Console in your browser.|
Unsurprisingly, the most expensive trips on average start from JFK airport, with an average fare of $46 and an average tip of 20.7%*. Okay, this is probably not a secret, but did you know that, even though the average fare from LGA airport is $15 less, roughly 800,000 more trips start from LGA? And with 22.2%*, passengers from LGA airport actually tip best.
*As cash tips aren’t reported, only 52% of trips have a tip noted, so the values regarding tips could be inaccurate.
Most of the taxi trips start in Midtown South (28 million) with an average fare of $11. Carnegie Hill in the Upper East Side comes fourth with 12 million pick-ups; however, these trips are fairly short. Journeys that start there mostly stay in the Upper East Side and therefore only generate an average fare of $9.80. Here's an interactive map visualizing where people went, what they paid on average and how they tipped, along with some other visualizations of how people tip depending on where they start.
The processed data is publicly available in this BigQuery table. You can find some interesting queries to run against this data in this gist.
Though NYC taxi cab journeys may not seem to amount to much, they actually conceal a ton of information, which Google Cloud Dataflow, as a powerful big data tool, helped reveal by making big data processing easy and affordable. Maybe I'll try London's black cabs next.
- Posted by Thorsten Schaeff, Sales Engineer Intern