The hitchhiker’s guide to GTFS with python

Don’t panic, it’s just public transit data

Guilherme M. Iablonovski
Analytics Vidhya


In the beginning the GTFS standard was created by Google. This has made a lot of people very angry and has been widely regarded as a bad move.

Since its release in 2006, the General Transit Feed Specification (GTFS) has revolutionized the way we work with public transit data. It defines a common format for public transportation schedules and associated geographic information. A GTFS feed is composed of a series of text files collected in a ZIP file. Each file models a particular aspect of transit information: stops, routes, trips, and other schedule data.
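To make that structure concrete, here is a minimal sketch that uses only the Python standard library to list the tables in a feed and read one of them. The helper names and the tiny in-memory feed (with made-up stop and route values) are assumptions for illustration, not part of any GTFS library.

```python
import csv
import io
import zipfile


def list_feed_tables(feed_zip):
    """Return the names of the .txt tables inside a GTFS zip archive."""
    with zipfile.ZipFile(feed_zip) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".txt"))


def read_table(feed_zip, table_name):
    """Read one GTFS table (e.g. 'stops.txt') into a list of dicts."""
    with zipfile.ZipFile(feed_zip) as archive:
        with archive.open(table_name) as raw:
            # utf-8-sig drops the byte-order mark that some feeds start with.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


def make_demo_feed():
    """Build a tiny made-up feed in memory so the sketch runs on its own."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "stops.txt",
            "stop_id,stop_name,stop_lat,stop_lon\nS1,Demo Stop,-12.98,-38.51\n",
        )
        archive.writestr(
            "routes.txt",
            "route_id,route_short_name,route_type\nL1,Line 1,1\n",
        )
    return buffer


demo = make_demo_feed()
print(list_feed_tables(demo))
print(read_table(demo, "stops.txt")[0]["stop_name"])
```

With a real feed you would pass the path of the zip file instead of the in-memory buffer.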

The mandatory and optional .txt files that compose a GTFS feed.

GTFS feeds let public transit agencies publish their transit data and developers write applications that consume that data in an interoperable way. Many of these feeds are openly available at the Open Mobility Data website.

But let's face it: if you've ever worked with a GTFS dataset, you know there will always be a missing field or a skipped stop that makes the feed hard to put to actual use.

For that very reason, there are many modules in various programming languages available for us to correct errors, fill in gaps and generate missing files for that broken GTFS dataset from your local transit authority.

I recently took on the task of doing just that for many transit feeds and thought I would share with the world what I learned. So whether you're trying to get routes in OpenTripPlanner, build your own GTFS or just run public transportation analysis, here are some Python scripts to enhance your datasets.

In this article we'll cover some of the most common issues you'll find in GTFS feeds, such as:

  1. Validating and looking for errors in GTFS feeds
  2. Creating a shapes.txt file from scratch
  3. Interpolating blank stop times
  4. Merging two GTFS feeds together
  5. Adding colors to lines in the routes.txt file
  6. Building and modifying the calendar.txt file
  7. Visually checking results in a map

Finding out what’s going on in GTFS feeds

There are a handful of tools out there that are pretty great at identifying what's up with a GTFS feed. Two of the most popular are WRI Cities' GTFS Manager and Google's Feed Validator. Their interfaces are great, but changes made through them are hard to document and replicate.

So we'll go with a Python package called gtfs-kit. This package has all the bells and whistles you'll need to make quick fixes to your feed, but most importantly, it has a .validate() method. Plus you can use it in a Jupyter Notebook, making it that much easier to replicate and share results with others.

Here's a quick example of how to get started. We'll be using a GTFS feed from the city of Salvador, in Brazil. It's not quite Betelgeuse, but it's not too far if you consider the size of the Universe. The steps are simple: import the gtfs-kit module, declare the path to the zipped GTFS feed, read it with the .read_feed() function and then validate it.
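Those steps could look roughly like the sketch below. It assumes your installed gtfs-kit version provides both `read_feed()` and `Feed.validate()`; the wrapper function and the file path in the usage comment are hypothetical.

```python
from pathlib import Path


def load_and_validate(feed_path, dist_units="km"):
    """Read a zipped GTFS feed with gtfs-kit and return its validation report."""
    # Imported inside the function so this file loads even without gtfs-kit.
    import gtfs_kit as gk

    feed = gk.read_feed(Path(feed_path), dist_units=dist_units)
    # Assumption: this gtfs-kit version still ships Feed.validate().
    return feed.validate()


# Usage, with a hypothetical path to your own feed:
# report = load_and_validate("data/gtfs_salvador.zip")
# print(report)
```

Keeping this in a function makes it easy to run the same check on several feeds from one notebook.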

The output should look something like this, telling you where to look for errors or inconsistencies.

One of the most common errors this function will warn us about is the absence of the shapes.txt file. From the diagram specifying each file in a GTFS feed, you’ll notice that this is one of the few files that has spatial information, but also an optional one. Well, not to gtfs-kit, it isn’t. This tool won’t work without the shapes file, so we will resort to the mothership of all shapefiles, Esri itself.

The regular early morning yell of horror was the sound of Arthur Dent waking up and suddenly realising he would have to buy an ArcGIS Desktop license worth several thousand dollars.

Arcpy, the universe and everything

The ArcGIS platform, the supernova-sized player in the geospatial software business, has taken some serious hits in the last few years. Platforms like Mapbox and Carto are growing fast, and QGIS is more popular every day. But Esri has its ways of getting us back to ArcGIS. This time, they managed to include a very useful toolbox exclusively for dealing with GTFS feeds in the ArcGIS Pro Conversion Toolbox.

One of these tools is Generate Shapes from GTFS. This tool relies heavily on ArcGIS Network Datasets to connect stops from a trip along a given road network.

To use it, you'll need a valid GTFS feed that has no shapes.txt file, plus spatial data for a road network. The easiest way to get the network is to make an OSM extract with the HOTOSM Export Tool and then create and build a Network Dataset from scratch. Once your geodatabase is ready, you can run the tool against it.

Make sure you leave the travel mode “1 – Subway” unchecked so the tool will calculate shapes for this mode as straight lines, and not as if subway cars ran on the surface.
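For modes that don't follow the street network, a straight-line shape is simply the stops of a trip joined in sequence. The sketch below shows that idea in plain Python. It is not the ArcGIS tool's implementation, and the function name, the `shape_` prefix and the sample coordinates are assumptions made for illustration.

```python
def straight_line_shapes(stops, stop_times, trip_id):
    """Build shapes.txt rows joining the stops of one trip with straight lines."""
    coords = {s["stop_id"]: (s["stop_lat"], s["stop_lon"]) for s in stops}
    # Keep only this trip's visits, ordered along the trip.
    visits = sorted(
        (st for st in stop_times if st["trip_id"] == trip_id),
        key=lambda st: int(st["stop_sequence"]),
    )
    rows = []
    for sequence, visit in enumerate(visits, start=1):
        lat, lon = coords[visit["stop_id"]]
        rows.append({
            "shape_id": f"shape_{trip_id}",  # hypothetical naming scheme
            "shape_pt_lat": lat,
            "shape_pt_lon": lon,
            "shape_pt_sequence": sequence,
        })
    return rows


# Made-up sample data in the shape of stops.txt and stop_times.txt rows.
stops = [
    {"stop_id": "A", "stop_lat": "-12.90", "stop_lon": "-38.40"},
    {"stop_id": "B", "stop_lat": "-12.91", "stop_lon": "-38.41"},
    {"stop_id": "C", "stop_lat": "-12.95", "stop_lon": "-38.45"},
]
stop_times = [
    {"trip_id": "T1", "stop_id": "B", "stop_sequence": "2"},
    {"trip_id": "T1", "stop_id": "A", "stop_sequence": "1"},
    {"trip_id": "T2", "stop_id": "C", "stop_sequence": "1"},
]
for row in straight_line_shapes(stops, stop_times, "T1"):
    print(row)
```

The trip still needs its `shape_id` set in trips.txt before the new shape is actually used.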

The result will look something like this in ArcGIS Pro

Time is an illusion. Stop times doubly so.

Apart from the native tools for GTFS in ArcGIS, we can also count on the scripts developed by the GIS community. One example is the Interpolate Blank Stop Times tool, a set of Python scripts that use arcpy to estimate the stop times in trips where you have only the initial departure and final arrival time, a very common issue in GTFS datasets. For instance, let's use the gtfs-kit module to check this trip from the Salvador Subway system's Line 1.

gtfs-kit loads each .txt file into a pandas DataFrame, which is very practical. Here, we are filtering the stop_times DataFrame by trip_id (that is, one single trip from the L1 route), and to no one's surprise, the times for all of the intermediate stops are blank.

Only the initial departure and final arrival times are filled.

There's really no catch in using these scripts, so I won't include a gist here as they're rather lengthy. You can find my own Jupyter Notebook version of them in the repository below. Just run them from ArcGIS or in a Python environment with arcpy.

After running the script, when we check that very same Line 1 subway trip, we'll find that all stop times have been filled in. Take this with a grain of salt though, because this is a simple linear interpolation, and not actual times that would take the road network into account.
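To show what a simple linear interpolation means here, the sketch below fills the blank times of one trip in plain Python, without arcpy. It assumes the blank stops are spaced evenly in time between the known ones, which may differ from how the actual tool weights them. The function names are hypothetical.

```python
def to_seconds(hms):
    """Convert a GTFS 'HH:MM:SS' time (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert seconds since midnight back to 'HH:MM:SS'."""
    hours, remainder = divmod(int(round(total_seconds)), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def interpolate_blank_times(trip_stop_times):
    """Fill blank times for one trip, spacing them evenly between known ones.

    Expects the rows of a single trip sorted by stop_sequence, with the first
    and last rows timed. Returns new rows and leaves the input untouched.
    """
    rows = [dict(row) for row in trip_stop_times]
    known = [
        i for i, row in enumerate(rows)
        if row["arrival_time"] or row["departure_time"]
    ]
    for left, right in zip(known, known[1:]):
        start = to_seconds(rows[left]["departure_time"] or rows[left]["arrival_time"])
        end = to_seconds(rows[right]["arrival_time"] or rows[right]["departure_time"])
        step = (end - start) / (right - left)
        for i in range(left + 1, right):
            filled = to_hms(start + step * (i - left))
            rows[i]["arrival_time"] = filled
            rows[i]["departure_time"] = filled
    return rows


# Made-up trip: only the first and last stops are timed.
trip = [
    {"stop_sequence": "1", "arrival_time": "05:00:00", "departure_time": "05:00:00"},
    {"stop_sequence": "2", "arrival_time": "", "departure_time": ""},
    {"stop_sequence": "3", "arrival_time": "", "departure_time": ""},
    {"stop_sequence": "4", "arrival_time": "05:09:00", "departure_time": "05:09:00"},
]
for row in interpolate_blank_times(trip):
    print(row["stop_sequence"], row["arrival_time"])
```

Weighting each gap by the distance between stops would be a natural refinement if your feed has reliable shapes.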

Now all arrival and departure times are filled in. Magic!

Java. Mostly harmless.

Now that we're done using arcpy to enhance our GTFS feed, let's go back to gtfs-kit and see what it thinks of it. The gtfs-kit module will recognize our GTFS as valid, but will still point out errors and warnings.

I was hoping there would be an error 42 so I could make a pun here. Sigh.

This is perfectly fine, but it gets tricky when you find yourself having to validate more than one feed at once. This is not uncommon, as many transit agencies provide more than one feed for the same city, divided by basin, operator or even transportation mode.

So grab on to your towel, because we're gonna have to make a detour from Python and take a Java bypass. This is because the best tool currently available to merge two or more GTFS feeds is the onebusaway gtfs modules, available exclusively in Java.

In order to use it, you'll need to have the Java 1.6 runtime installed to run the client. Once you have that, place the .jar file and the two GTFS feeds you want to merge in the same folder, navigate to it and run something like this from your terminal window:

java -jar onebusaway-gtfs-merge-cli.jar --duplicateDetection=fuzzy gtfs1.zip gtfs2.zip gtfsmerge.zip

Once you hit enter, there’s gonna be a lot going on under the hood.

The gtfs-merge tool will be making decisions like "Should two very close stops be treated as one and the same?", "Should routes with identical IDs be treated as duplicates?", and so on.

The good news is: the fuzzy duplicate detection option ensures that if two entries have common elements (e.g. stop name or location, route short name, trip stop sequence), then they are considered the same. This is the more lenient matching policy, and is highly dependent on the type of GTFS entry being matched.
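To get a feel for the kind of decision involved, here is a small Python sketch of one possible rule for stops: treat two stops as the same if their names match and they are within a few metres of each other. This is not the onebusaway algorithm. The rule, the 50 m threshold and the function names are arbitrary assumptions for illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6_371_000  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius * math.asin(math.sqrt(a))


def looks_like_same_stop(stop_a, stop_b, max_distance_m=50.0):
    """Return True if two stops share a name and lie within max_distance_m."""
    same_name = (
        stop_a["stop_name"].strip().lower() == stop_b["stop_name"].strip().lower()
    )
    distance = haversine_m(
        float(stop_a["stop_lat"]), float(stop_a["stop_lon"]),
        float(stop_b["stop_lat"]), float(stop_b["stop_lon"]),
    )
    return same_name and distance <= max_distance_m


# Made-up stops from two hypothetical feeds.
first = {"stop_name": "Central Station", "stop_lat": "-12.9800", "stop_lon": "-38.5100"}
second = {"stop_name": "central station ", "stop_lat": "-12.9801", "stop_lon": "-38.5100"}
print(looks_like_same_stop(first, second))
```

A check like this is handy for reviewing what the merge did to your stops, even if the merge itself stays in Java.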

Alright, Mr. Wiseguy … if you’re so clever, you tell us what colour it should be.

Now that we have ourselves a single, functional GTFS feed to work with, let's look at some small but important details. One that comes to mind is colors for the bus routes. Almost every transit agency in the world identifies its routes with specific colors. And almost none of them bother to fill in that information in their GTFS feeds. So let's write a script for that.

GTFS stores route colors as six-digit hexadecimal codes. So we'll start by calling a Color Picker widget to select whichever color we want.

Now that you know the right code for the colors you want, all you need to do is apply the right filters to your routes DataFrame. In the example below, we'll filter by route name, agency ID and route type.

Remember to leave the # out of the hex code.
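Here is a minimal sketch of that filtering step, written on plain dictionaries so it runs without any extra packages. With gtfs-kit you would apply the equivalent boolean mask to the routes DataFrame. The function names, the agency ID and the route values are made up for illustration.

```python
import re

HEX_COLOR = re.compile(r"^[0-9A-Fa-f]{6}$")


def normalize_hex(color):
    """Return a GTFS-style color: six hex digits, upper case, no leading '#'."""
    cleaned = color.strip().lstrip("#").upper()
    if not HEX_COLOR.match(cleaned):
        raise ValueError(f"not a six-digit hex color: {color!r}")
    return cleaned


def set_route_color(routes, color, text_color="FFFFFF", **filters):
    """Set route_color and route_text_color on every route matching the filters.

    routes is a list of dicts, one per row of routes.txt. filters are
    column=value pairs such as route_short_name="L1". Returns the match count.
    """
    color, text_color = normalize_hex(color), normalize_hex(text_color)
    matched = 0
    for route in routes:
        if all(route.get(column) == value for column, value in filters.items()):
            route["route_color"] = color
            route["route_text_color"] = text_color
            matched += 1
    return matched


# Made-up rows in the shape of routes.txt.
routes = [
    {"route_id": "R1", "route_short_name": "L1", "agency_id": "AGENCY_A", "route_type": "1"},
    {"route_id": "R2", "route_short_name": "0101", "agency_id": "AGENCY_B", "route_type": "3"},
]
changed = set_route_color(
    routes, "#e41a1c", route_short_name="L1", agency_id="AGENCY_A", route_type="1"
)
print(changed, routes[0]["route_color"])
```

Returning the number of matched routes is a cheap way to notice a filter that silently matched nothing.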

This will be reflected in routing software, like OpenTripPlanner. Here you see that Line 1 really shows in red when it's part of an itinerary. We also have the option to change the route_text_color field to white, for better contrast between the main color and the overlaid text.

Anything that happens, happens. It doesn’t necessarily do it in chronological order, though.

gtfs-kit may be happy with the results and report no more errors, but if you plan to use your GTFS feeds to generate routes with OpenTripPlanner, you've got yourself a much pickier client.

OpenTripPlanner (if you don't know it, think of it as an open-source Google Maps routing API that you can install on your own machine) will not generate itineraries for trips that don't have calendars.

Wait until you have to determine if a bus route runs on Sundays, Arthur Dent.

That is, every service_id used by the trips should be listed in the calendar.txt file, which states the days of the week on which those particular stop times are valid. This is useful to separate buses that run special times on weekends, for example.

So to create a calendar file, we'll use the trips DataFrame as a base, since it holds every service_id in use, and from there work our way towards deleting unnecessary fields and adding mandatory ones. You can find which are which by checking Google's GTFS Reference page.
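As a rough sketch of that process, the function below builds calendar.txt rows from the rows of trips.txt using plain Python. calendar.txt is keyed by service_id, the column trips use to point at a service pattern, so the sketch emits one row per distinct service_id. The function name, the sample IDs and the dates are assumptions for illustration.

```python
WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def build_calendar(trips, start_date, end_date, running_days=None):
    """Create calendar.txt rows, one per distinct service_id found in trips.

    running_days is a collection of weekday names on which service runs
    (every day by default). Dates use the GTFS YYYYMMDD format.
    """
    running_days = set(WEEKDAYS) if running_days is None else set(running_days)
    service_ids = sorted({trip["service_id"] for trip in trips})
    calendar = []
    for service_id in service_ids:
        row = {"service_id": service_id}
        for day in WEEKDAYS:
            row[day] = 1 if day in running_days else 0
        row["start_date"] = start_date
        row["end_date"] = end_date
        calendar.append(row)
    return calendar


# Made-up rows in the shape of trips.txt.
trips = [
    {"trip_id": "T1", "route_id": "R1", "service_id": "WEEK"},
    {"trip_id": "T2", "route_id": "R1", "service_id": "WEEK"},
    {"trip_id": "T3", "route_id": "R2", "service_id": "EXTRA"},
]
weekdays_only = ["monday", "tuesday", "wednesday", "thursday", "friday"]
for row in build_calendar(trips, "20200101", "20201231", weekdays_only):
    print(row)
```

In a real feed, each service_id would usually need its own set of running days rather than one shared pattern.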

The ultimate answer

Changing colors and adding the calendar DataFrame to our feed is just the tip of the iceberg of what can be done with gtfs-kit. If you decide to use it, you're bound to make minor changes to some fields, delete duplicate entries, and a lot more. Along the way, you may want to use the module's mapping functions to check if your stops and trips are making any sense.

The map_routes() function renders a Leaflet map right in your notebook.
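A call could look roughly like the sketch below. It assumes a gtfs-kit Feed whose `map_routes()` method accepts a list of route IDs as its first argument and returns a map object that Jupyter can display. It also assumes the feed already has shapes. The wrapper function is hypothetical.

```python
def map_first_routes(feed, how_many=5):
    """Draw the first few routes of a gtfs-kit Feed on an interactive map."""
    # feed.routes is the pandas DataFrame built from routes.txt.
    route_ids = feed.routes["route_id"].head(how_many).tolist()
    # Assumption: Feed.map_routes takes the route IDs as its first argument.
    return feed.map_routes(route_ids)


# Usage in a notebook, where feed comes from gtfs-kit's read_feed():
# map_first_routes(feed, how_many=3)
```

Mapping only a handful of routes at a time keeps the notebook responsive on large feeds.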

So long and thanks for all the fish

That's it for this piece! We have successfully enriched and cleaned a GTFS feed with Python in a way we can replicate in the future. You can find this entire workflow on my GitHub page. If you have questions or suggestions, don't hesitate to drop me a line anytime!

Also, if you enjoyed this article, consider buying me a coffee so I keep writing more articles like this!

I bet you saw this one coming.

Guilherme M. Iablonovski
Analytics Vidhya

Geospatial Data Scientist and Developer. GIS, sustainable urban planning, environment, data science and software development. linkedin.com/in/guilhermeiablo