GTFS Public Transit Data¶
The source data used to calculate the Transit Accessibility Index is General Transit Feed Specification (GTFS) data, made publicly available by contributing public transit agencies through the Mobility Database Catalog. Conceived by Portland TriMet, popularized by Google, and now adopted by the US Department of Transportation (National Transit Database Reporting Changes, 2022), GTFS is a standardized data schema that transit agencies use to publish information about their services.
Fig. 1 details the GTFS resources used to calculate the Transit Accessibility Index. Traversing this schema can be daunting at first. The diagram is far from exhaustive, but it helps in understanding and navigating the relationships between the GTFS tables used to create the Transit Accessibility Index. The GTFS schema documentation provides details on the full set of optional columns and tables.

Fig. 1 GTFS Schema Entity Relationship Diagram¶
To get the routes serving a stop, it is necessary to traverse `stop_times` and `trips` to retrieve the related routes. Similarly, to know whether a stop has service on a weekday, it is necessary to traverse `stop_times` and `trips` to look up in `calendar` whether service is available on a given day. When first attempting to understand the input GTFS data, Fig. 1 is extremely useful.
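The two traversals described above can be illustrated with a minimal sketch. The rows, identifiers, and function names below are hypothetical and exist only for illustration; only the column names (`trip_id`, `stop_id`, `route_id`, `service_id`, and the day columns) come from GTFS. It is not the index's actual implementation.

```python
from collections import defaultdict

# Hypothetical rows; only the GTFS columns needed for the traversal are shown.
stop_times = [
    {"trip_id": "t1", "stop_id": "s1"},
    {"trip_id": "t2", "stop_id": "s1"},
    {"trip_id": "t3", "stop_id": "s2"},
]
trips = [
    {"trip_id": "t1", "route_id": "r10", "service_id": "wkdy"},
    {"trip_id": "t2", "route_id": "r20", "service_id": "sat"},
    {"trip_id": "t3", "route_id": "r20", "service_id": "sat"},
]
calendar = [
    {"service_id": "wkdy", "monday": "1", "saturday": "0"},
    {"service_id": "sat", "monday": "0", "saturday": "1"},
]

# Index trips and calendar by their keys so each lookup is a single step.
trip_by_id = {t["trip_id"]: t for t in trips}
calendar_by_service = {c["service_id"]: c for c in calendar}


def routes_serving_stops(stop_times, trip_by_id):
    """stop_times -> trips -> route_id, collected per stop."""
    result = defaultdict(set)
    for st in stop_times:
        result[st["stop_id"]].add(trip_by_id[st["trip_id"]]["route_id"])
    return dict(result)


def stops_with_service(stop_times, trip_by_id, calendar_by_service, day):
    """stop_times -> trips -> calendar: stops with any trip running on `day`."""
    served = set()
    for st in stop_times:
        service_id = trip_by_id[st["trip_id"]]["service_id"]
        if calendar_by_service[service_id][day] == "1":
            served.add(st["stop_id"])
    return served


stop_routes = routes_serving_stops(stop_times, trip_by_id)
monday_stops = stops_with_service(stop_times, trip_by_id, calendar_by_service, "monday")
```

In this sketch the `route_id` found on `trips` is enough to identify the routes; joining on to the `routes` table is only needed to pick up descriptive columns such as route names.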
Data Inferencing¶
The GTFS specification is deliberately flexible to meet the varied needs of transit agencies. This flexibility makes it challenging to create a consistent measure of transit accessibility, since the same information may be presented differently depending on the agency providing the data. Calculating the Transit Accessibility Index requires mitigating these variations in two specific cases: a missing `calendar` file and missing `stop_times` arrival times.
Missing Calendar File¶
The Transit Accessibility Index assesses transit quality by determining the service offered on each day of the week. This is derived from a file defined in the GTFS specification, the `calendar` file (Table 1). Its structure is simple: a boolean column for each day of the week indicates whether service is offered. Each row can be related back to routes, trips and stops through a unique identifier, `service_id` (Fig. 1).
| service_id | monday | tuesday | wednesday | thursday | friday | saturday | sunday | start_date | end_date |
|---|---|---|---|---|---|---|---|---|---|
| AFA23GEN-1038-Sunday-00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20231217 | 20240616 |
| AFA23GEN-1039-Saturday-00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 20231223 | 20240622 |
| AFA23GEN-1092-Weekday-00_C45 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 20231218 | 20240621 |
The GTFS specification also includes another file detailing service days, the `calendar_dates` file. Ideally, `calendar_dates` lists only service exceptions, such as service removed for holidays and other unique circumstances, but some agencies opt to list all service days in `calendar_dates` and omit `calendar` completely.
The `calendar_dates` file comprises only three columns: the unique identifier of the service offered (which relates back to trips, routes, and stops), the date of service, and the exception type (1 for service added and 2 for service removed). In the ideal scenario described above, the `calendar` file details regular service, and `calendar_dates` details only service removed (exception type 2) along with some service added for special events and holidays (exception type 1), as shown in Table 2.
| service_id | date | exception_type |
|---|---|---|
| AFA23GEN-1038-Sunday-00 | 20231225 | 1 |
| AFA23GEN-1038-Sunday-00 | 20240101 | 1 |
| AFA23GEN-1038-Sunday-00 | 20240527 | 1 |
| AFA23GEN-1039-Saturday-00 | 20240219 | 1 |
| AFA23GEN-1092-Weekday-00_C45 | 20231225 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240101 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240212 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240213 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240219 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240527 | 2 |
However, the GTFS specification also allows an agency to list all service explicitly in the `calendar_dates` file and omit the `calendar` file. In this case, most of the entries are exception type 1 (Table 3).
| service_id | date | exception_type |
|---|---|---|
| 1 | 20240210 | 1 |
| 1 | 20240211 | 1 |
| 1 | 20240217 | 1 |
| 1 | 20240218 | 1 |
| 1 | 20240219 | 1 |
| 1 | 20240224 | 1 |
| 1 | 20240225 | 1 |
| 1 | 20240302 | 1 |
| 1 | 20240303 | 1 |
| 1 | 20240309 | 1 |
In these instances, as part of validation, a `calendar` file is constructed by interrogating the `calendar_dates` file using the following logic.
1. Records with exception type `1` are selected.
2. The day of the week is calculated from each listed date offering service.
3. For each service identifier, if any listed date falls on a given day of the week, that day of the week is set to `true` (offering service) for that service identifier.
This constructed `calendar` file is added to the validated data and enables determining the day-of-week service offered for routes, trips and stops.
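The construction logic described above can be sketched as follows, using the rows of Table 3 as input. The function name `build_calendar` is hypothetical, and setting `start_date` and `end_date` to the first and last listed dates is an assumption made here for illustration; the text does not say how those columns are populated.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def build_calendar(calendar_dates):
    """Derive calendar-style rows from calendar_dates rows (dicts of strings)."""
    days_by_service = defaultdict(set)
    dates_by_service = defaultdict(list)
    for row in calendar_dates:
        # Step 1: keep only exception type 1 (service added).
        if str(row["exception_type"]) != "1":
            continue
        # Step 2: day of week from the YYYYMMDD date; Monday is 0.
        weekday = datetime.strptime(row["date"], "%Y%m%d").weekday()
        days_by_service[row["service_id"]].add(weekday)
        dates_by_service[row["service_id"]].append(row["date"])

    calendar = []
    for service_id, weekdays in days_by_service.items():
        # Step 3: a day is flagged 1 if any listed date falls on it.
        record = {"service_id": service_id}
        for i, name in enumerate(DAYS):
            record[name] = 1 if i in weekdays else 0
        # Assumption: the date range spans the first and last listed dates.
        record["start_date"] = min(dates_by_service[service_id])
        record["end_date"] = max(dates_by_service[service_id])
        calendar.append(record)
    return calendar


# The ten dates listed for service_id 1 in Table 3.
table_3_dates = ["20240210", "20240211", "20240217", "20240218", "20240219",
                 "20240224", "20240225", "20240302", "20240303", "20240309"]
calendar_dates = [
    {"service_id": "1", "date": d, "exception_type": "1"} for d in table_3_dates
]
calendar = build_calendar(calendar_dates)
```

Because a single matching date is enough to flag a day, the one Monday in Table 3 (20240219) marks `monday` as offering service alongside `saturday` and `sunday`.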
Missing Arrival Times¶
Time-of-day service (daytime, evening and overnight) for each transit stop is determined from the stop times of the trips serving that stop. The GTFS specification allows null stop times for individual stops provided each trip has at least a starting and an ending time. It is also common for only every nth stop in a trip, such as every fourth or sixth, to have an arrival time. This can be problematic when determining daytime, evening and overnight service metrics for the stops (Table 4).
| trip_id | stop_sequence | stop_id | arrival_time | departure_time |
|---|---|---|---|---|
| PART1-07-02 | 34 | 1a062OC | 10:08:00 | 10:08:00 |
| PART1-07-02 | 35 | ectd | | |
| PART1-07-02 | 36 | PT_113504 | 10:10:00 | 10:10:00 |
| PART1-07-01 | 6 | 2729oc | 08:20:00 | 08:20:00 |
| PART1-07-01 | 7 | PT_113543 | | |
| PART1-07-01 | 8 | PT_113535 | 08:23:00 | 08:23:00 |
This is mitigated by inferring the missing values between known times. The inferred arrival times are distributed equally between the known times on either side, so they do not account for the route traveled or for differences in travel time caused by varied distances between intermediate stops. They do, however, ensure each stop has an arrival time, which enables determining the daytime, evening and overnight service boolean columns and calculating headway descriptive statistics for evaluating service quality. Hence, for the purposes of calculating the Transit Accessibility Index, this approach is sufficient.
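A minimal sketch of this kind of even spacing is shown below, applied to the rows of Table 4. The helper names are hypothetical, and spacing by position in the stop sequence (rather than by the `stop_sequence` values themselves) is an assumption of this sketch, not a confirmed detail of the index's implementation. Only `arrival_time` is filled.

```python
def to_seconds(hms):
    """Convert a GTFS HH:MM:SS string to seconds; hours may exceed 24."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds):
    """Convert seconds back to a zero-padded HH:MM:SS string."""
    seconds = int(round(seconds))
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def fill_arrival_times(trip_stop_times):
    """Fill empty arrival_time values for one trip by even spacing.

    Expects rows sorted by stop_sequence whose first and last rows
    have an arrival_time. Returns new rows; the input is not modified.
    """
    rows = [dict(r) for r in trip_stop_times]
    known = [i for i, r in enumerate(rows) if r["arrival_time"]]
    # Walk each pair of neighbouring known times and space the gaps evenly.
    for left, right in zip(known, known[1:]):
        start = to_seconds(rows[left]["arrival_time"])
        end = to_seconds(rows[right]["arrival_time"])
        step = (end - start) / (right - left)
        for i in range(left + 1, right):
            rows[i]["arrival_time"] = to_hms(start + step * (i - left))
    return rows


# The two trips from Table 4, each with one missing arrival time.
trip_02 = [
    {"stop_sequence": 34, "stop_id": "1a062OC", "arrival_time": "10:08:00"},
    {"stop_sequence": 35, "stop_id": "ectd", "arrival_time": ""},
    {"stop_sequence": 36, "stop_id": "PT_113504", "arrival_time": "10:10:00"},
]
trip_01 = [
    {"stop_sequence": 6, "stop_id": "2729oc", "arrival_time": "08:20:00"},
    {"stop_sequence": 7, "stop_id": "PT_113543", "arrival_time": ""},
    {"stop_sequence": 8, "stop_id": "PT_113535", "arrival_time": "08:23:00"},
]
filled_02 = fill_arrival_times(trip_02)
filled_01 = fill_arrival_times(trip_01)
```

For these rows the missing stop on trip `PART1-07-02` is placed at 10:09:00 and the one on `PART1-07-01` at 08:21:30, the midpoints of their neighbouring known times.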