GTFS Public Transit Data

The Transit Accessibility Index is calculated from General Transit Feed Specification (GTFS) data, which contributing public transit agencies make publicly available through the Mobility Database Catalog. Conceived by Portland TriMet, popularized by Google, and now adopted by the US Department of Transportation (National Transit Database Reporting Changes, 2022), GTFS is a standardized data schema that transit agencies use to publish information about their services.

Fig. 1 details the GTFS resources used to calculate the Transit Accessibility Index. Traversing this schema can be daunting at first. The diagram is far from exhaustive, but it helps in understanding and navigating the relationships between the GTFS tables used to create the Transit Accessibility Index. The GTFS schema documentation provides details on the full set of optional columns and tables.

Fig. 1 GTFS Schema Entity Relationship Diagram

To get the routes serving a stop, it is necessary to traverse stop_times and trips to reach the related routes. Similarly, to know whether a stop has service on a weekday, it is necessary to traverse stop_times and trips, then look up in calendar whether service is available on a given day. Fig. 1 is especially useful when first trying to understand the input GTFS data.
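The traversal described above can be sketched in a few lines of Python. This is a minimal illustration rather than the index's actual implementation: the rows are invented stand-ins for a real feed, and the helper names `trips_at_stop`, `routes_serving_stop` and `stop_has_service` are hypothetical.

```python
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Illustrative stand-ins for GTFS tables (invented rows, not real feed data).
stop_times = [
    {"trip_id": "T1", "stop_id": "S1"},
    {"trip_id": "T2", "stop_id": "S1"},
    {"trip_id": "T3", "stop_id": "S2"},
]
trips = {
    "T1": {"route_id": "R10", "service_id": "WKDY"},
    "T2": {"route_id": "R20", "service_id": "SAT"},
    "T3": {"route_id": "R10", "service_id": "SAT"},
}
calendar = {
    "WKDY": dict(zip(WEEKDAYS, [1, 1, 1, 1, 1, 0, 0])),
    "SAT": dict(zip(WEEKDAYS, [0, 0, 0, 0, 0, 1, 0])),
}


def trips_at_stop(stop_id):
    """stops -> stop_times: collect the trips that call at this stop."""
    return {row["trip_id"] for row in stop_times if row["stop_id"] == stop_id}


def routes_serving_stop(stop_id):
    """stop_times -> trips -> routes: route ids for trips at this stop."""
    return sorted({trips[t]["route_id"] for t in trips_at_stop(stop_id)})


def stop_has_service(stop_id, day):
    """stop_times -> trips -> calendar: does any trip at the stop run on `day`?"""
    return any(
        calendar[trips[t]["service_id"]][day] == 1
        for t in trips_at_stop(stop_id)
    )


print(routes_serving_stop("S1"))         # ['R10', 'R20']
print(stop_has_service("S2", "monday"))  # False
```

In a real feed the same joins would run over the full stop_times, trips, routes and calendar files, keyed on trip_id, route_id and service_id as shown in Fig. 1.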

Data Inferencing

The GTFS specification is deliberately flexible to meet the varied needs of transit agencies. This flexibility makes it challenging to create a consistent measure of transit accessibility, since the same information may be presented differently depending on the agency providing the data. Calculating the Transit Accessibility Index requires mitigating these variations in two specific cases: a missing calendar file and missing arrival times in stop_times.

Missing Calendar File

The Transit Accessibility Index assesses transit quality by determining the service offered by day of the week. This is derived from a file defined in the GTFS specification, the calendar file (Table 1). Its structure is simple: one boolean column for each day of the week, set to 1 when service is offered on that day and 0 otherwise. Each row can be associated back to routes, trips and stops using a unique identifier, service_id (Fig. 1).

Table 1 Example Calendar File Contents

| service_id | monday | tuesday | wednesday | thursday | friday | saturday | sunday | start_date | end_date |
|---|---|---|---|---|---|---|---|---|---|
| AFA23GEN-1038-Sunday-00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20231217 | 20240616 |
| AFA23GEN-1039-Saturday-00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 20231223 | 20240622 |
| AFA23GEN-1092-Weekday-00_C45 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 20231218 | 20240621 |

The GTFS specification also includes another file detailing service days, the calendar_dates file. Ideally, only service exceptions, such as service removed for holidays and other unique circumstances, are detailed in the calendar_dates file, but some agencies opt to list all service days in calendar_dates and omit calendar completely.

The calendar_dates file comprises only three columns: the unique identifier for the service offered (service_id, which relates back to trips, routes, and stops), the date of service, and the exception type (1 for service added and 2 for service removed). In the ideal scenario described above, the calendar file details regular service, and calendar_dates details only service exceptions: service removed (exception type 2), along with some service added for special events and holidays (exception type 1), as shown in Table 2.

Table 2 Calendar Dates with Service Exceptions

| service_id | date | exception_type |
|---|---|---|
| AFA23GEN-1038-Sunday-00 | 20231225 | 1 |
| AFA23GEN-1038-Sunday-00 | 20240101 | 1 |
| AFA23GEN-1038-Sunday-00 | 20240527 | 1 |
| AFA23GEN-1039-Saturday-00 | 20240219 | 1 |
| AFA23GEN-1092-Weekday-00_C45 | 20231225 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240101 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240212 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240213 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240219 | 2 |
| AFA23GEN-1092-Weekday-00_C45 | 20240527 | 2 |
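To make the interaction between the two files concrete, the following sketch checks whether a service runs on a specific date when both calendar and calendar_dates are present. It is an illustration of the general GTFS rule (an exception for the date takes precedence over the weekly pattern), not code from the Transit Accessibility Index itself. The function name `service_runs_on` and the in-memory layout are assumptions; the rows are taken from Tables 1 and 2.

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Regular weekly pattern, from Table 1 (layout is an assumption for this sketch).
calendar = {
    "AFA23GEN-1038-Sunday-00": {
        "days": {"sunday"},
        "start": date(2023, 12, 17), "end": date(2024, 6, 16),
    },
    "AFA23GEN-1092-Weekday-00_C45": {
        "days": {"monday", "tuesday", "wednesday", "thursday", "friday"},
        "start": date(2023, 12, 18), "end": date(2024, 6, 21),
    },
}

# Exceptions, from Table 2: 1 = service added, 2 = service removed.
calendar_dates = {
    ("AFA23GEN-1038-Sunday-00", date(2023, 12, 25)): 1,
    ("AFA23GEN-1092-Weekday-00_C45", date(2023, 12, 25)): 2,
}


def service_runs_on(service_id, day):
    """Return True if the service operates on `day`."""
    exception = calendar_dates.get((service_id, day))
    if exception == 1:   # explicitly added for this date
        return True
    if exception == 2:   # explicitly removed for this date
        return False
    entry = calendar.get(service_id)
    if entry is None:    # no weekly pattern and no exception
        return False
    in_range = entry["start"] <= day <= entry["end"]
    return in_range and WEEKDAYS[day.weekday()] in entry["days"]


christmas = date(2023, 12, 25)  # a Monday
print(service_runs_on("AFA23GEN-1092-Weekday-00_C45", christmas))  # False
print(service_runs_on("AFA23GEN-1038-Sunday-00", christmas))       # True
```

On Christmas Day 2023 the weekday service is removed and the Sunday service is added in its place, which is the holiday pattern Table 2 describes.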

However, the GTFS specification also allows an agency to list all service explicitly in the calendar_dates file and omit the calendar file. In this case, most of the entries have exception type 1 (Table 3).

Table 3 Calendar Dates with All Service

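| service_id | date | exception_type |
|---|---|---|
| 1 | 20240210 | 1 |
| 1 | 20240211 | 1 |
| 1 | 20240217 | 1 |
| 1 | 20240218 | 1 |
| 1 | 20240219 | 1 |
| 1 | 20240224 | 1 |
| 1 | 20240225 | 1 |
| 1 | 20240302 | 1 |
| 1 | 20240303 | 1 |
| 1 | 20240309 | 1 |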

In these instances, as part of validation, a calendar file is constructed by interrogating the calendar_dates file using the following logic.

  1. Records with exception type 1 are selected.

  2. The day of the week is calculated for each listed service date.

  3. For each service identifier, if service is offered on any date falling on a given day of the week, that day of the week is set to true (offering service) for that service identifier.

This constructed calendar file is added to the validated data, making it possible to determine the day-of-week service offered for routes, trips and stops.
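The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's validation code: the function name `build_calendar` is hypothetical, and setting start_date and end_date to the earliest and latest listed dates is an assumption, since the text does not say how those columns are populated.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def build_calendar(calendar_dates):
    """Construct calendar rows from calendar_dates rows (list of dicts)."""
    calendar = {}
    for row in calendar_dates:
        # Step 1: keep only records where service is added.
        if str(row["exception_type"]) != "1":
            continue
        # Step 2: derive the day of the week from the YYYYMMDD date.
        day = datetime.strptime(row["date"], "%Y%m%d").date()
        entry = calendar.setdefault(row["service_id"], {
            "service_id": row["service_id"],
            **{name: 0 for name in WEEKDAYS},
            # Assumption: date range spans the earliest to latest listed date.
            "start_date": row["date"],
            "end_date": row["date"],
        })
        # Step 3: any service on a given weekday marks that weekday as served.
        entry[WEEKDAYS[day.weekday()]] = 1
        entry["start_date"] = min(entry["start_date"], row["date"])
        entry["end_date"] = max(entry["end_date"], row["date"])
    return list(calendar.values())


# First five rows of Table 3.
rows = [
    {"service_id": "1", "date": d, "exception_type": "1"}
    for d in ("20240210", "20240211", "20240217", "20240218", "20240219")
]
for record in build_calendar(rows):
    print(record)
```

For these rows the constructed record marks saturday, sunday and monday as served. Monday is set because of the single date 20240219, which shows that under the "any date" rule in step 3 one holiday or special-event date is enough to flag a day of the week.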

Missing Arrival Times

Time-of-day service (daytime, evening and overnight) for each transit stop is determined from the stop times of the trips serving that stop. The GTFS specification allows null stop times, provided each trip has at least a starting and an ending time; times for the individual stops in between do not have to be listed. It is also not uncommon to see an arrival time only at every nth stop, for example every fourth or sixth stop in a trip. This can be problematic when determining daytime, evening and overnight service metrics for the stops (Table 4).

Table 4 Stop Times with Missing Values

| trip_id | stop_sequence | stop_id | arrival_time | departure_time |
|---|---|---|---|---|
| PART1-07-02 | 34 | 1a062OC | 10:08:00 | 10:08:00 |
| PART1-07-02 | 35 | ectd | | |
| PART1-07-02 | 36 | PT_113504 | 10:10:00 | 10:10:00 |
| PART1-07-01 | 6 | 2729oc | 08:20:00 | 08:20:00 |
| PART1-07-01 | 7 | PT_113543 | | |
| PART1-07-01 | 8 | PT_113535 | 08:23:00 | 08:23:00 |

This is mitigated by inferring the missing values from the known times on either side. The inferred arrival times are distributed evenly between the known times, so they do not account for the route traveled or for differences in travel time due to varying distances between intermediate stops. They do, however, ensure each stop has an arrival time, which is close enough to determine the daytime, evening and overnight service boolean columns. It also enables calculating headway descriptive statistics for evaluating service quality. Hence, for the purposes of calculating the Transit Accessibility Index, this approach is sufficient.
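A minimal sketch of this even distribution between known times is shown below, applied to one trip at a time. The function names (`to_seconds`, `to_hms`, `fill_arrival_times`) are hypothetical, and interpolating on arrival_time alone is an assumption made for brevity. Times are handled as seconds since midnight because GTFS times may exceed 24:00:00 for trips running past midnight.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds (hours may be 24 or more)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(seconds):
    """Convert seconds back to an HH:MM:SS string."""
    seconds = int(round(seconds))
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def fill_arrival_times(trip_stop_times):
    """Fill missing arrival_time values for a single trip.

    Missing times are spaced evenly between the nearest known times,
    ignoring the distance between stops.
    """
    rows = sorted((dict(row) for row in trip_stop_times),
                  key=lambda row: int(row["stop_sequence"]))
    known = [i for i, row in enumerate(rows) if row.get("arrival_time")]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap < 2:
            continue  # adjacent known times, nothing to fill
        start = to_seconds(rows[left]["arrival_time"])
        end = to_seconds(rows[right]["arrival_time"])
        step = (end - start) / gap
        for offset in range(1, gap):
            rows[left + offset]["arrival_time"] = to_hms(start + step * offset)
    return rows


# Trip PART1-07-01 from Table 4.
trip = [
    {"stop_sequence": "6", "stop_id": "2729oc", "arrival_time": "08:20:00"},
    {"stop_sequence": "7", "stop_id": "PT_113543", "arrival_time": ""},
    {"stop_sequence": "8", "stop_id": "PT_113535", "arrival_time": "08:23:00"},
]
for row in fill_arrival_times(trip):
    print(row["stop_id"], row["arrival_time"])
```

For this trip the missing stop PT_113543 receives 08:21:30, the midpoint between 08:20:00 and 08:23:00. A distance-weighted variant would be possible where stop coordinates or shape distances are available, but it is not what the text describes.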

References