
A GTFS file is a ZIP archive containing a number of CSV files. A difference can concern a file, a column in a file or a row in a file.

A “GTFS Diff” file is a CSV file with 8 columns that can express any difference found between two GTFS files. Each row describes one difference.

Columns description

Field Name Type Required Description
id String Required Uniquely identifies a row in the GTFS Diff file.
file String Required The name of the file in the GTFS archive concerned by the change.
action String. Enum: add, delete, update Required The type of change: has something been added, deleted or updated? Files and columns can only be added or deleted. Rows can also be updated.
target String. Enum: file, column, row Required Specifies what the “action” applies to: a file, a column in a file, or a row.
identifier json Required

Uniquely identifies the part of the data concerned by the change.

- if target is set to file, we identify the file using the “filename” key. For example {"filename": "shapes.txt"}

- if target is set to column, we identify the column using the “column” key. For example {"column": "bikes_allowed"}

- if target is set to row, we identify the row using a set of keys combined with a logical “AND”. For example, {"stop_id": "A"} identifies the row whose stop_id is “A”, and {"from_stop_id": "A", "to_stop_id": "B"} identifies the row where “from_stop_id” = “A” AND “to_stop_id” = “B”.

initial_value json Conditionally required

Required for an updated row. Lists the values before the change.

For example {"stop_name": "Xrain station", "stop_lat": ""}

new_value json Conditionally required

Required for an added or updated row.

A JSON object whose keys are the column names and whose values are the row values.

- For an added row, it contains all the column names and values. For example, for a new transfer between stations: {"from_stop_id": "A", "to_stop_id": "B", "transfer_type": 1, "min_transfer_time": 2}

- For an updated row, it contains only the modified values. For example {"stop_name": "Train station", "stop_lat": "45.1"}

note String Optional A free field where explanations about the change can be given.
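The column rules above can be checked mechanically. The following is a minimal sketch, assuming each row of the GTFS Diff file has already been read into a dictionary keyed by column name (for instance with `csv.DictReader`). The function name `validate_diff_row` and the wording of the messages are illustrative and not part of the specification.

```python
import json

ACTIONS = {"add", "delete", "update"}
TARGETS = {"file", "column", "row"}
JSON_FIELDS = ("identifier", "initial_value", "new_value")


def validate_diff_row(row: dict) -> list:
    """Return the problems found in one GTFS Diff row (empty list if none)."""
    problems = []
    for field in ("id", "file", "action", "target", "identifier"):
        if not row.get(field):
            problems.append(f"missing required field: {field}")

    action, target = row.get("action"), row.get("target")
    if action and action not in ACTIONS:
        problems.append(f"unknown action: {action}")
    if target and target not in TARGETS:
        problems.append(f"unknown target: {target}")
    if action == "update" and target != "row":
        problems.append("only rows can be updated")

    # Non-empty JSON columns must hold a JSON object.
    for field in JSON_FIELDS:
        text = row.get(field)
        if text:
            try:
                parsed = json.loads(text)
            except json.JSONDecodeError:
                parsed = None
            if not isinstance(parsed, dict):
                problems.append(f"{field} is not a JSON object")

    # Conditionally required columns, as described in the table above.
    if target == "row":
        if action == "update" and not row.get("initial_value"):
            problems.append("initial_value is required for an updated row")
        if action in ("add", "update") and not row.get("new_value"):
            problems.append("new_value is required for an added or updated row")
    return problems
```

A row passes when the returned list is empty; otherwise each entry names one rule that was violated.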

Ordering

Rows are ordered by the “target” column. The differences should be listed in the following order:

  1. “target” of type “file” comes first, i.e. modifications on files in the archive
  2. then “target” of type “column”, i.e. modifications on a column in a file
  3. then “target” of type “row”, i.e. modifications on a row in a file

This order makes it easier for a human to grasp the differences between files, and for a computer to apply successive patches of changes (first create a file, then populate it).

For each given type of target, the row order is not specified.
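The ordering rule can be sketched as a sort on the “target” column. This assumes the differences are held as a list of dictionaries with a `target` key; `TARGET_ORDER` and `order_diff_rows` are hypothetical names used only for illustration.

```python
# Rank of each target type in the requested order.
TARGET_ORDER = {"file": 0, "column": 1, "row": 2}


def order_diff_rows(rows: list) -> list:
    """Return the rows sorted so that files come first, then columns, then rows."""
    # sorted() is stable, so rows sharing a target keep their relative order.
    return sorted(rows, key=lambda row: TARGET_ORDER[row["target"]])
```

Because the order within one target type is unspecified, a stable sort is a convenience here rather than a requirement.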

Example

Here is an example GTFS diff file, with some explanations in the note column about what each row means.

id file action target identifier initial_value new_value note
1 shapes.txt add file {"filename": "shapes.txt"} creation of new file
2 readme.pdf delete file {"filename": "readme.pdf"} deletion of a file
3 transfers.txt delete column {"column": "internal_id"} delete the column “internal_id” in the “transfers.txt” file
4 transfers.txt add row {"from_stop_id": "A", "to_stop_id": "B", "transfer_type": 1, "min_transfer_time": 2} add a row in the transfers.txt file
5 stops.txt delete row {"stop_id": "A"} {"stop_id": "A", "stop_name": "town center", …} delete the row in stops.txt where “stop_id” = “A”
6 stops.txt update row {"stop_id": "B"} {"stop_name": ""} {"stop_name": "station"} in stops.txt, update the stop_name of the row identified by “stop_id” = “B”. The stop_name was empty, now it is “station”.
7 calendar_dates.txt update row {"service_id": "1", "date": "20220928"} {"exception_type": "1"} {"exception_type": "2"} in calendar_dates.txt, update the exception_type of the row identified by “service_id” = “1” AND “date” = “20220928”. The exception_type was 1, now it is 2.
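To illustrate how row-level entries such as those above could be applied as a patch, here is a minimal sketch. It assumes a GTFS file is held in memory as a list of dictionaries (one per row) and that values are compared as strings, since CSV cells are text. The helpers `matches` and `apply_row_diff` are hypothetical and not defined by the specification.

```python
import json


def matches(row: dict, identifier: dict) -> bool:
    """True when every identifier key matches the row (logical AND)."""
    return all(k in row and str(row[k]) == str(v) for k, v in identifier.items())


def apply_row_diff(table: list, diff: dict) -> list:
    """Return a new table with one row-level difference applied."""
    action = diff["action"]
    if action == "add":
        # new_value holds all the columns of the added row.
        return table + [json.loads(diff["new_value"])]

    identifier = json.loads(diff["identifier"])
    if action == "delete":
        return [row for row in table if not matches(row, identifier)]
    if action == "update":
        # new_value holds only the modified columns.
        changes = json.loads(diff["new_value"])
        return [{**row, **changes} if matches(row, identifier) else row for row in table]
    raise ValueError(f"unknown action: {action}")
```

Applying entries 5 and 6 of the example to a stops.txt table would remove stop “A” and set the stop_name of stop “B” to “station”.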

Example

If you shuffle the rows of the stops.txt file in a GTFS archive, the resulting GTFS Diff is empty, because row order carries no meaning in a GTFS file.
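One way to obtain this order-independence is to compare rows as a multiset rather than as a sequence. The sketch below assumes both versions of a file share the same header and are available as CSV text; `row_multiset` and `same_rows` are illustrative names, not part of any GTFS Diff tool.

```python
import csv
import io
from collections import Counter


def row_multiset(csv_text: str) -> Counter:
    """Count each distinct row of a CSV file, ignoring row order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # A sorted tuple of (column, value) pairs is hashable and ignores column order.
    return Counter(tuple(sorted(row.items())) for row in reader)


def same_rows(old_csv: str, new_csv: str) -> bool:
    """True when both files contain the same rows, in any order."""
    return row_multiset(old_csv) == row_multiset(new_csv)
```

With this comparison, two stops.txt files that differ only by a shuffle are reported as having the same rows, so no row-level difference is emitted.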

Full example

The examples folder contains simple GTFS files and the resulting GTFS Diff listing the differences between them.

Possible usages

  • Have a quick overview of the changes made to a GTFS file
  • Communicate effectively to someone the changes made to a GTFS file and give an explanation for each change.
  • Take two corrected GTFS files and merge them together.

Possible alternatives we thought about

  • Using text diff tools. CSV files are just text files, so it is possible to use powerful existing tools to compare them. But although a text diff is easy to produce, its results are harder to interpret. For example, if a column is deleted in a 1000-row file, a text diff will show 1000 differences, whereas the current proposal will list a single column deletion. Text diff is also order-dependent, but GTFS files are not.
  • At the opposite end from text diff is the use of a GTFS library to load the data into a model. The main advantage is the possibility to interpret the changes in more depth, because the model understands the data: it could distinguish changes impacting routing calculations from changes to visual elements (colors, etc.). But this approach makes it more difficult to handle unexpected data (for example, a PDF file in the archive has been deleted) and requires constantly keeping track of the GTFS extensions (Fares V2, Pathways, etc.).
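To make the contrast with a text diff concrete, the sketch below compares only the headers of two versions of a file, so a deleted column yields a single entry regardless of how many rows the file has. The function names and the subset of output keys are assumptions made for illustration.

```python
import json


def _column_diff(file_name: str, action: str, column: str) -> dict:
    """Build one column-level GTFS Diff entry (subset of the 8 columns)."""
    return {
        "file": file_name,
        "action": action,
        "target": "column",
        "identifier": json.dumps({"column": column}),
    }


def column_diffs(file_name: str, old_header: list, new_header: list) -> list:
    """List the columns deleted from or added to a file, one entry per column."""
    diffs = []
    for column in old_header:
        if column not in new_header:
            diffs.append(_column_diff(file_name, "delete", column))
    for column in new_header:
        if column not in old_header:
            diffs.append(_column_diff(file_name, "add", column))
    return diffs
```

A full comparison would then diff the rows on the columns common to both versions, so that the removed column does not also show up as a change on every row.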