Each day, Chicago’s utilitarian design enables an incredible flow of people. Perhaps nothing bears this out better than the millions of bikeshare rides taken over several years. In this article, I analyze more than 17 million bikeshare rides and animate the most representative day of Chicago bikeshare. Skip ahead for the pretty pictures.
To play with this data yourself, install my divvy-data package from PyPI.
Chicago Biking
Bicycles are the best way to see a city, and yes, I am including Chicago’s sometimes brutal winter months. Even as temperatures approached -50°F last week, people were still riding our bikeshare.
Our blissfully flat terrain, lakefront paths, and bike infrastructure investments earned us Bicycling Magazine’s title of most bike-friendly city in the country (6th in 2018). This past December, Chicago completed $12 million and $60 million projects to secure 18 continuous miles of dedicated lakefront bike path through the heart of the city.
Data Sourcing and Python Package
With mounting excitement for Chicago’s cyclist future, I devoted my past week to a thorough shakedown of our bikeshare’s public datasets. You can find the code for this analysis on GitHub, including a notebook that steps through my process.
My first step was writing a Python package to load the data neatly. Here’s the result, pulling in ~2 GB of data:
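```python
import pandas as pd
import divvy

# Load rides and stations for 2013 through 2018
rides, stations = divvy.historical_data.get_data(
    year=[str(year) for year in range(2013, 2019)],
    rides=True,
    stations=True,
)

# Cache the results locally (the data/ directory must already exist)
rides.to_pickle('data/rides.pkl')
stations.to_pickle('data/stations.pkl')
```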
Divvy data come in zip files grouped by quarter. Columns and date formats are not standardized across files, so I manually wrote them into the loading process.
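As a rough sketch of what that kind of standardization can look like, the snippet below maps differing column names onto one schema and parses each file with its own date format. The column names, formats, and the `standardize` helper are illustrative assumptions, not the actual contents of the package.

```python
import pandas as pd

# Hypothetical aliases: names that vary between files, mapped to one schema.
COLUMN_ALIASES = {
    "starttime": "start_time",
    "stoptime": "end_time",
}

def standardize(df, date_format):
    """Rename known aliases and parse timestamps with a per-file format."""
    out = df.rename(columns=COLUMN_ALIASES)
    for col in ("start_time", "end_time"):
        out[col] = pd.to_datetime(out[col], format=date_format)
    return out

# Two made-up quarterly files with different column names and date formats.
older = pd.DataFrame({"starttime": ["06/27/2013 12:11"],
                      "stoptime": ["06/27/2013 12:16"]})
newer = pd.DataFrame({"start_time": ["2018-01-01 00:12:00"],
                      "end_time": ["2018-01-01 00:17:23"]})

rides = pd.concat(
    [standardize(older, "%m/%d/%Y %H:%M"),
     standardize(newer, "%Y-%m-%d %H:%M:%S")],
    ignore_index=True,
)
print(rides.dtypes)
```

Keeping the per-file format explicit avoids silently misreading day-first versus month-first dates.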
Most challenging, the ride tables don’t include geographic coordinates for ride origin and destination. That information is tied to stations, which are contained in their own files. Divvy also didn’t provide station information at all for 2018, so I integrated data from their live JSON station feed to get the most recent locations.
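Attaching coordinates to rides amounts to joining the station table twice, once for the origin and once for the destination. Here is a minimal sketch with made-up rows; the column names (`from_station_id`, `to_station_id`) and the `attach_coords` helper are assumptions for illustration.

```python
import pandas as pd

# Made-up sample data; real tables are far larger.
rides = pd.DataFrame({
    "trip_id": [1, 2],
    "from_station_id": [2, 5],
    "to_station_id": [5, 2],
})
stations = pd.DataFrame({
    "id": [2, 5],
    "latitude": [41.8764, 41.8740],
    "longitude": [-87.6203, -87.6277],
})

def attach_coords(rides, stations, side):
    """Left-join station coordinates for one side ('from' or 'to') of each ride."""
    coords = stations.rename(columns={
        "id": f"{side}_station_id",
        "latitude": f"{side}_latitude",
        "longitude": f"{side}_longitude",
    })
    return rides.merge(coords, on=f"{side}_station_id", how="left")

located = attach_coords(attach_coords(rides, stations, "from"), stations, "to")
print(located)
```

A left join keeps rides whose station is missing from the table, so those gaps show up as NaN coordinates instead of silently disappearing.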
I also discovered stations had been physically moved while keeping the same row-level ID. This greatly reduces the certainty of geographic analysis, as I can only be certain of a station’s location at the end of the quarter for which the data were published. I calculated the geographic distance of each movement over time and dropped movements below a ~50 meter precision level.
A representative station:
id | latitude | longitude | online_date | source |
---|---|---|---|---|
2 | 41.872293 | -87.624091 | NaT | 2015 |
2 | 41.881060 | -87.619486 | 2013-06-10 10:43:46 | 2017_Q1Q2 |
2 | 41.876393 | -87.620328 | 2013-06-10 10:43:00 | 2017_Q3Q4 |
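To get a feel for the size of these movements, the sketch below computes the haversine distance between consecutive positions of the station in the table and compares each to a 50 meter threshold. The function and variable names are my own for illustration, and the Earth radius is the usual mean value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
THRESHOLD_M = 50            # movements below this are treated as noise

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Positions of station 2, taken from the table above.
positions = [
    ("2015", 41.872293, -87.624091),
    ("2017_Q1Q2", 41.881060, -87.619486),
    ("2017_Q3Q4", 41.876393, -87.620328),
]

moves = []
for (src_a, lat_a, lon_a), (src_b, lat_b, lon_b) in zip(positions, positions[1:]):
    dist = haversine_m(lat_a, lon_a, lat_b, lon_b)
    moves.append(dist)
    label = "moved" if dist >= THRESHOLD_M else "within noise"
    print(f"{src_a} -> {src_b}: {dist:.0f} m ({label})")
```

Both movements for this station come out well above the threshold, on the order of hundreds of meters.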
All duplicate stations and their movements:
Ultimately I decided to average the remaining station coordinates. I didn’t expect the difference to have a meaningful impact on my analysis.
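Averaging coordinates per station is a one-line group-by in pandas. This sketch uses the three rows for station 2 shown earlier; it is only an illustration of the idea, not the exact code from the analysis.

```python
import pandas as pd

# The three recorded positions of station 2.
stations = pd.DataFrame({
    "id": [2, 2, 2],
    "latitude": [41.872293, 41.881060, 41.876393],
    "longitude": [-87.624091, -87.619486, -87.620328],
    "source": ["2015", "2017_Q1Q2", "2017_Q3Q4"],
})

# One row per station ID, holding the mean of its recorded coordinates.
averaged = stations.groupby("id", as_index=False)[["latitude", "longitude"]].mean()
print(averaged)
```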
Descriptives
With cleaned data, the beauty of the information could finally shine. There are countless lines of inquiry to follow in this data set, and I suspect this won’t be my final analysis. Since June 2013 there have been 17.5 million trips, totaling 5.3 million hours and covering between 21.7 (haversine) and 27.1 (taxicab) million miles. By the taxicab estimate, that’s 56 trips to the moon and back!
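The two distance figures bracket the true route length: haversine is the straight-line distance between stations, while taxicab assumes the rider follows the street grid. Below is a minimal sketch of both. The taxicab definition (one north-south leg plus one east-west leg) and the sample coordinates are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

def taxicab_mi(lat1, lon1, lat2, lon2):
    """Assumed grid distance: a north-south leg followed by an east-west leg."""
    north_south = haversine_mi(lat1, lon1, lat2, lon1)
    east_west = haversine_mi(lat2, lon1, lat2, lon2)
    return north_south + east_west

# Illustrative pair of points a bit over a mile apart in downtown Chicago.
start = (41.8764, -87.6203)
end = (41.8900, -87.6350)

lower = haversine_mi(*start, *end)
upper = taxicab_mi(*start, *end)
print(f"{lower:.2f} to {upper:.2f} miles")
```

On a grid aligned to the compass, as Chicago's largely is, the taxicab figure is never more than about 1.41 times the haversine figure.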
The typical ride covered 1.0 to 1.2 miles in 11.7 minutes. The longest ride covered 22.9 to 26.2 miles in 2.9 hours.
The typical bike covered 4.3 thousand miles over 2.8 thousand rides, lasting a total of 820 hours.
The farthest-ridden bike covered between 6.5 and 8.1 thousand miles. Here are all the routes it covered.
Time Series Analysis
Ultimately I set out to characterize and visualize Chicago’s bikeshare usage. Due to Chicago’s layout, I suspected this line of analysis would return fascinating results.
To investigate, I took all 2,017 days of data, cut them into 30-minute segments, and calculated statistics for each station within those time frames across all days. In particular, I was interested in each station’s net arrivals and departures. For each station, I summed those numbers within each segment and divided by the number of days the station had been active.
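One way to sketch that aggregation in pandas: assign each ride's start and end to one of 48 half-hour slots, count departures and arrivals per station and slot, then divide the net by active days. The column names, the sample rides, and the `active_days` values are made up for illustration, and "net" is taken here to mean arrivals minus departures.

```python
import pandas as pd

rides = pd.DataFrame({
    "start_time": pd.to_datetime(["2018-07-02 08:05", "2018-07-02 08:20", "2018-07-03 08:10"]),
    "end_time": pd.to_datetime(["2018-07-02 08:25", "2018-07-02 08:40", "2018-07-03 08:29"]),
    "from_station_id": [2, 2, 5],
    "to_station_id": [5, 5, 2],
})

def half_hour_slot(ts):
    """Map timestamps to slots 0..47 (slot 0 is 00:00-00:29, slot 47 is 23:30-23:59)."""
    return ts.dt.hour * 2 + ts.dt.minute // 30

departures = (
    rides.assign(slot=half_hour_slot(rides["start_time"]))
    .groupby(["from_station_id", "slot"]).size()
    .rename("departures").rename_axis(["station_id", "slot"])
)
arrivals = (
    rides.assign(slot=half_hour_slot(rides["end_time"]))
    .groupby(["to_station_id", "slot"]).size()
    .rename("arrivals").rename_axis(["station_id", "slot"])
)

# Outer-join the two counts; a missing count means zero rides in that slot.
flow = pd.concat([departures, arrivals], axis=1).fillna(0)
flow["net"] = flow["arrivals"] - flow["departures"]

# Hypothetical number of days each station was active.
active_days = pd.Series({2: 2, 5: 2})
days = flow.index.get_level_values("station_id").map(active_days).to_numpy()
flow["net_per_day"] = flow["net"] / days
print(flow)
```

A negative `net_per_day` marks a station that sends out more bikes than it receives in that slot, and a positive value marks one that collects them.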
This gave me 48 ‘snapshots’ of Chicago bikeshare activity, each representing a typical 30-minute period of the day. To animate the data more smoothly, I interpolated up to 720 frames, so that each frame represents 2 minutes of the day.
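Going from 48 snapshots to 720 frames can be sketched with linear interpolation that wraps around midnight, so the last frames blend back into the first snapshot. Linear interpolation and the random placeholder values are assumptions here; the article does not say which interpolation method was used.

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60

# Placeholder data: 48 half-hour snapshots for 3 hypothetical stations.
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(48, 3))

snapshot_minutes = np.arange(48) * 30  # 0, 30, ..., 1410
frame_minutes = np.arange(720) * 2     # 0, 2, ..., 1438

# Interpolate each station's series; `period` makes the day wrap around.
frames = np.column_stack([
    np.interp(frame_minutes, snapshot_minutes, snapshots[:, s], period=MINUTES_PER_DAY)
    for s in range(snapshots.shape[1])
])
print(frames.shape)
```

Every 15th frame reproduces an original snapshot exactly, and the frames in between are weighted blends of their two neighbors.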
Here’s the result. Best viewed at 4K resolution, and right-click to loop. Each dot represents a bike station. Circle size corresponds to station usage. Orange stations are radiating bikes. Blue stations are collecting bikes. Gray stations are neutral.
Finally, here are two interactive frames from the animation. Click or tap the circles for more detail. Let me know what you discover. (full screen here)
8:00am - 8:30am | 5:00pm - 5:30pm |
---|---|