A GTFS file is a ZIP archive containing a number of CSV files. A difference can concern a file, a column in a file or a row in a file.
A “GTFS Diff” file is a CSV file with 8 columns that can express any difference found between two GTFS files. Each row describes one difference.
Field Name | Type | Required | Description |
---|---|---|---|
id | String | Required | Uniquely identifies a row in the GTFS Diff file. |
file | String | Required | The name of the file in the GTFS archive concerned by the change. |
action | String. Enum: add, delete, update | Required | The type of change. Has something been added, deleted or updated? Files and columns can be added or deleted. Only rows can be updated. |
target | String. Enum: file, column, row | Required | Specifies what the “action” applies to: a file, a column in a file, or a row in a file. |
identifier | json | Required | Uniquely identifies the part of the data concerned by the change. - If target is set to file, the file is identified using the "filename" key. For example {"filename": "shapes.txt"} - If target is set to column, the column is identified using the "column" key. For example {"column": "bikes_allowed"} - If target is set to row, the row is identified using one or more keys, combined with a logical AND. For example, {"stop_id": "A"} identifies the row whose stop_id is "A", and {"from_stop_id": "A", "to_stop_id": "B"} identifies the row where from_stop_id = "A" AND to_stop_id = "B". |
initial_value | json | Conditionally required | Required in case of an updated row. Lists the initial values. For example {"stop_name": "Xrain station", "stop_lat": ""} |
new_value | json | Conditionally required | Required in case of an added or updated row. A JSON object where the keys are the column names and the values are the row values. - For an added row, it contains all the column names and values. For example, for a new transfer between stations: {"from_stop_id": "A", "to_stop_id": "B", "transfer_type": 1, "min_transfer_time": 2} - For an updated row, it contains only the modified values. For example {"stop_name": "Train station", "stop_lat": "45.1"} |
note | String | Optional | A free field where explanations about the change can be given. |
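The column rules above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the function names `parse_diff_row` and `read_diff` are hypothetical, and it assumes the GTFS Diff file is a standard CSV with a header row. It decodes the three JSON columns and checks the two enumerations and the conditionally required columns.

```python
import csv
import io
import json

ACTIONS = {"add", "delete", "update"}
TARGETS = {"file", "column", "row"}
JSON_COLUMNS = ("identifier", "initial_value", "new_value")


def parse_diff_row(raw):
    """Decode the JSON columns of one diff row and check the rules in the table."""
    row = dict(raw)
    if row["action"] not in ACTIONS:
        raise ValueError(f"row {row['id']}: unknown action {row['action']!r}")
    if row["target"] not in TARGETS:
        raise ValueError(f"row {row['id']}: unknown target {row['target']!r}")
    if row["action"] == "update" and row["target"] != "row":
        raise ValueError(f"row {row['id']}: only rows can be updated")
    # Empty cells become None, non-empty cells are parsed as JSON.
    for name in JSON_COLUMNS:
        text = (row.get(name) or "").strip()
        row[name] = json.loads(text) if text else None
    if row["target"] == "row":
        if row["action"] == "update" and row["initial_value"] is None:
            raise ValueError(f"row {row['id']}: initial_value is required")
        if row["action"] in ("add", "update") and row["new_value"] is None:
            raise ValueError(f"row {row['id']}: new_value is required")
    return row


def read_diff(text):
    """Parse the full text of a GTFS Diff file into a list of checked rows."""
    return [parse_diff_row(raw) for raw in csv.DictReader(io.StringIO(text))]


# Build a one-row sample with csv.writer so the JSON cells are quoted correctly.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "file", "action", "target",
                 "identifier", "initial_value", "new_value", "note"])
writer.writerow(["6", "stops.txt", "update", "row",
                 '{"stop_id": "B"}', '{"stop_name": ""}',
                 '{"stop_name": "station"}', "fill in the stop name"])
rows = read_diff(buffer.getvalue())
print(rows[0]["new_value"])  # {'stop_name': 'station'}
```

Writing the sample through `csv.writer` matters because JSON cells contain commas and double quotes, which must be escaped in CSV.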
The requested order concerns the “target” column. The differences should be listed in the following order:
- rows whose “target” is “file” come first, i.e. modifications to files in the archive
- then rows whose “target” is “column”, i.e. modifications to a column in a file
- then rows whose “target” is “row”, i.e. modifications to a row in a file
This order makes it easier for a human to grasp the differences between files, and for a computer to apply successive patches of changes (first create a file, then populate it).
For each given type of target, the row order is not specified.
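As an illustration of this ordering, here is a small sketch that sorts already-parsed diff rows. The names `TARGET_ORDER` and `sort_diff_rows` are hypothetical, and rows are assumed to be dictionaries with a `target` key. Because Python's `sorted` is stable, rows sharing the same target keep their relative order, which is allowed since that order is unspecified.

```python
# Rank of each target type: files first, then columns, then rows.
TARGET_ORDER = {"file": 0, "column": 1, "row": 2}


def sort_diff_rows(rows):
    """Return the rows ordered file, column, row; ties keep their input order."""
    return sorted(rows, key=lambda row: TARGET_ORDER[row["target"]])


unordered = [
    {"id": "4", "target": "row"},
    {"id": "3", "target": "column"},
    {"id": "1", "target": "file"},
    {"id": "5", "target": "row"},
]
print([row["id"] for row in sort_diff_rows(unordered)])  # ['1', '3', '4', '5']
```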
Here is an example GTFS diff file, with some explanations in the note column about what each row means.
id | file | action | target | identifier | initial_value | new_value | note |
---|---|---|---|---|---|---|---|
1 | shapes.txt | add | file | {"filename": "shapes.txt"} | | | creation of a new file |
2 | readme.pdf | delete | file | {"filename": "readme.pdf"} | | | deletion of a file |
3 | transfers.txt | delete | column | {"column": "internal_id"} | | | delete the column "internal_id" in the "transfers.txt" file |
4 | transfers.txt | add | row | | | {"from_stop_id": "A", "to_stop_id": "B", "transfer_type": 1, "min_transfer_time": 2} | add a row in the transfers.txt file |
5 | stops.txt | delete | row | {"stop_id": "A"} | {"stop_id": "A", "stop_name": "town center", …} | | delete the row in stops.txt where "stop_id" = "A" |
6 | stops.txt | update | row | {"stop_id": "B"} | {"stop_name": ""} | {"stop_name": "station"} | in stops.txt, update the stop_name of the row identified by "stop_id" = "B". The stop_name was empty, now it is "station". |
7 | calendar_dates.txt | update | row | {"service_id": "1", "date": "20220928"} | {"exception_type": "1"} | {"exception_type": "2"} | in calendar_dates.txt, update the exception_type of the row identified by "service_id" = "1" AND "date" = "20220928". The exception_type was 1, now it is 2. |
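To make the row-level entries concrete, the sketch below applies a "delete" and an "update" like those for stops.txt in the example to an in-memory table. It is only an illustration under stated assumptions: the helper names `matches` and `apply_row_change` are hypothetical, a table is modeled as a list of dictionaries with string values (as read from CSV), and identifier values are compared as strings.

```python
def matches(row, identifier):
    """True when every key of the identifier equals the row's value (logical AND)."""
    return all(str(row.get(key, "")) == str(value)
               for key, value in identifier.items())


def apply_row_change(table, change):
    """Return a new table with one row-level diff entry applied."""
    action = change["action"]
    if action == "add":
        # new_value holds the complete new row.
        return table + [{k: str(v) for k, v in change["new_value"].items()}]
    if action == "delete":
        return [row for row in table if not matches(row, change["identifier"])]
    if action == "update":
        # new_value holds only the modified columns.
        patch = {k: str(v) for k, v in change["new_value"].items()}
        return [{**row, **patch} if matches(row, change["identifier"]) else row
                for row in table]
    raise ValueError(f"unknown action {action!r}")


stops = [
    {"stop_id": "A", "stop_name": "town center"},
    {"stop_id": "B", "stop_name": ""},
]
changes = [
    {"action": "delete", "identifier": {"stop_id": "A"}},
    {"action": "update", "identifier": {"stop_id": "B"},
     "new_value": {"stop_name": "station"}},
]
for change in changes:
    stops = apply_row_change(stops, change)
print(stops)  # [{'stop_id': 'B', 'stop_name': 'station'}]
```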
If you shuffle the rows of the stops.txt file in a GTFS archive, the resulting GTFS Diff is empty, because row order carries no meaning in a GTFS file.
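One way to obtain this order independence is to index rows by their identifying columns before comparing them. The sketch below assumes the caller supplies those key columns (for example `stop_id` for stops.txt). The function name `diff_rows` is hypothetical and this is not the method of any particular tool.

```python
def diff_rows(old_rows, new_rows, key_columns):
    """Compare two versions of a file as keyed rows, ignoring row order."""
    def index(rows):
        return {tuple(row[c] for c in key_columns): row for row in rows}

    old, new = index(old_rows), index(new_rows)
    changes = []
    for key in sorted(old.keys() - new.keys()):
        changes.append({"action": "delete", "target": "row",
                        "identifier": dict(zip(key_columns, key)),
                        "initial_value": old[key]})
    for key in sorted(new.keys() - old.keys()):
        changes.append({"action": "add", "target": "row",
                        "identifier": dict(zip(key_columns, key)),
                        "new_value": new[key]})
    for key in sorted(old.keys() & new.keys()):
        # Keep only the columns whose value actually changed.
        changed = [c for c in new[key] if old[key].get(c) != new[key][c]]
        if changed:
            changes.append({"action": "update", "target": "row",
                            "identifier": dict(zip(key_columns, key)),
                            "initial_value": {c: old[key].get(c) for c in changed},
                            "new_value": {c: new[key][c] for c in changed}})
    return changes


old = [
    {"stop_id": "A", "stop_name": "town center"},
    {"stop_id": "B", "stop_name": ""},
]
shuffled = list(reversed(old))
print(diff_rows(old, shuffled, ["stop_id"]))  # []

renamed = [
    {"stop_id": "B", "stop_name": "station"},
    {"stop_id": "A", "stop_name": "town center"},
]
print(len(diff_rows(old, renamed, ["stop_id"])))  # 1
```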
The examples folder contains simple GTFS files and the resulting GTFS Diff listing the differences between them.
- Have a quick overview of the changes made to a GTFS file
- Communicate effectively to someone the changes made to a GTFS file and give an explanation for each change.
- Take two corrected GTFS files and merge them together.
- Using text diff tools. CSV files are just text files, so powerful existing tools can compare them. But while a text diff is easy to produce, its results are harder to interpret. For example, if a column is deleted from a 1000-row file, a text diff will show 1000 differences, whereas the current proposal lists a single column deletion. A text diff is also order dependent, but GTFS files are not.
- At the opposite end from the text diff is using a GTFS library to load the data into a model. The main advantage is that changes can be interpreted in more depth, because the model understands the data: it could distinguish changes that affect routing calculations from changes to visual elements (colors, etc.). But this approach makes it harder to handle unexpected data (for example, a PDF file deleted from the archive) and requires constantly keeping track of GTFS extensions (Fares V2, pathways, etc.).
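The text diff point can be illustrated with a small experiment. The sketch below builds a hypothetical 1000-row trips.txt, removes one column, and compares the number of lines a text diff reports as removed with the number of column-level entries. The helper `column_changes` and the sample data are assumptions made for this illustration only.

```python
import csv
import difflib


def column_changes(filename, old_lines, new_lines):
    """Compare only the header rows and emit one entry per added or deleted column."""
    old_header = next(csv.reader(old_lines[:1]))
    new_header = next(csv.reader(new_lines[:1]))
    changes = []
    for column in old_header:
        if column not in new_header:
            changes.append({"file": filename, "action": "delete",
                            "target": "column", "identifier": {"column": column}})
    for column in new_header:
        if column not in old_header:
            changes.append({"file": filename, "action": "add",
                            "target": "column", "identifier": {"column": column}})
    return changes


# Hypothetical file content: the new version drops the "internal_id" column.
old_lines = ["trip_id,bikes_allowed,internal_id"] + [f"t{i},1,x{i}" for i in range(1000)]
new_lines = ["trip_id,bikes_allowed"] + [f"t{i},1" for i in range(1000)]

# Count the lines a unified text diff marks as removed (skipping the "---" header).
removed = sum(1 for line in difflib.unified_diff(old_lines, new_lines, lineterm="")
              if line.startswith("-") and not line.startswith("---"))
print(removed)  # 1001 (the header line plus 1000 data rows)

changes = column_changes("trips.txt", old_lines, new_lines)
print(len(changes))  # 1
```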