Add T021–T023 to clean 3 ADB ATO data sets #83

sandraalnajjar · 2023-12-07T11:24:57Z

No description provided.

khaeru · 2023-12-11T01:03:18Z

Hi @sandraalnajjar —thank you very much for your contributions to the iTEM Open Data code! We appreciate that members of the community, such as yourself, make efforts to expand the scope and quality of the data included in this shared resource.

In this review, I will do a few things:

Summarize what I think is being added by the commits on this branch. (This is information that should go in the description of the pull request. If I am guessing incorrectly, please correct me by providing the correct explanation.)
Identify TODOs to merge this branch/PR.
Identify TODOs for further clean-up (after the branch is merged).

To be clear, I understand from you and @soniayeh that the course in which you did this work is coming to an end, so you may not have dedicated time to do (2) or (3). So those TODOs are mentioned without saying who will do them or when—it could be you, me, or someone else; sooner or later.

What is added

Three data-cleaning modules, each cleaning one data flow from the Asian Development Bank (ADB) Asian Transport Outlook (ATO); I guess the 2023 edition:
- item.historical.T021 → TAS-PAG-005(3)
- item.historical.T022 → TAS-VEP-017
- item.historical.T023 → TAS-FRA-007(2)

TODOs to merge the branch

Deduplicate the code in the 3 added modules. I see that the modules T022 and T023 both contain class AtoWorkbook with many common lines; this makes it difficult to identify which steps are common versus specific to each input data flow. T021 appears to contain a class ItemTransformer with a different structure. See also the second TODO in the next section.
Provide a single entry-point in each module. Existing data modules like T000 do not execute any code when they are imported; they instead provide a function like process() that triggers processing of the data.
Add documentation, for instance in the docstring at the top of each module. Explain, at least, which specific data are used: at what URL can the original input data be found?
Ensure the tests pass. This can be started by rebasing the branch on transportenergy:main.

TODOs for follow-up

Consider overlap with the transport_data.adb code maintained by the Transport Data Commons, here; make use of the TDC code where possible, and keep the features specific to iTEM Open Data.
Consider overlap with e.g. the class AtoWorkbook added in Add T015–T017 to clean three ADB ATO data sets #78 and Add T011–T014 to clean 4 ADB ATO data sets #81 and the class ItemTransformer added in Add T018–T020 to clean three ADB ATO data sets #80; merge common actions to common code.
Add the new modules to the test suite. item.tests.test_historical contains tests of the existing (T000–T010, T012) data modules, and these tests should be expanded to test the modules added in this PR.

khaeru · 2024-09-03T10:40:11Z

Hi @mperezbravo —I see you are making some changes here, but I am not sure why in particular or what the objective is.

Please reach out so we can discuss the best way to move this code forward, especially per the "TODOs for follow-up" I added to the description.

mperezbravo · 2024-09-03T10:55:44Z

Hi Paul! I’m Manu, I have been attending iTEM for the last couple of years remotely, and I am now at Chalmers working with Sonia as part of my PhD stay (I am originally doing my PhD in Comillas, Madrid). I will be attending iTEM and TDC in person this year, so, looking forward to meeting you there 😊 Sonia suggested I could take on these PRs made by students and solve them before the Geneva meeting. I have been working with the code in them, taking databases from ABD. After giving it some thought, I found the students made a merge request of a branch of their own repositories into the iTEM main branch, which is not something I can add commits to. So, as a trial, I created a new branch of the main, and merged one of their PR so that I can add commits to it. The traceability is intact, do you think it’s a good idea? Sorry if I jumped into the code without reaching out first, I was hoping to have some solutions in mind first. I saw your TODOs and I was aiming to implementing these changes (using TDC code for ABD) for one of the PRs, so that you could see the result and discuss whether we can do the same for the 4 of them (and possibly future PRs by students). I haven’t finished working on the first one (I guess the following ones will be easier), I can come back to you with a commit when it’s finished and we can discuss how to proceed. Hopefully in the next few days. Just a question: The OECD SDMX databases don’t seem to work anymore, they don’t pass the pytest. It looks like they changed SDMX API a year ago, and these fetch structures are not working. Do you have a new code snippet, or am I missing something? If this is a TODO, I can try to fix it 😊 Kind regards, Manu De: Paul Natsuo Kishimoto ***@***.***> Enviado el: martes, 3 de septiembre de 2024 12:41 Para: transportenergy/database ***@***.***> CC: Manuel Pérez Bravo ***@***.***>; Mention ***@***.***> Asunto: Re: [transportenergy/database] Add T021–T023 to clean 3 ADB ATO data sets (PR #83) No suele recibir correos electrónicos de ***@***.******@***.***>. Por qué esto es importante<https://aka.ms/LearnAboutSenderIdentification> Hi @mperezbravo<https://github.com/mperezbravo> —I see you are making some changes here, but I am not sure why in particular or what the objective is. Please reach out so we can discuss the best way to move this code forward, especially per the "TODOs for follow-up" I added to the description. — Reply to this email directly, view it on GitHub<#83 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BCQC3FCMSBHFPNKARCMHI6LZUWG2FAVCNFSM6AAAAABAK4SYYSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRWGE4TIMBYGE>. You are receiving this because you were mentioned.Message ID: ***@***.******@***.***>> AVISO: Para más información relativa a la confidencialidad y protección de datos de los correos electrónicos pinche aquí: [“Protección para emails”]<https://www.comillas.edu/proteccion-emails#esp> NOTICE: For further information regarding confidentiality and data protection of emails, please click here: ["Data Protection for Emails"]<https://www.comillas.edu/proteccion-emails#eng>

khaeru · 2024-09-03T11:22:05Z

I saw your TODOs and I was aiming to implementing these changes (using TDC code for ABD) for one of the PRs, so that you could see the result and discuss whether we can do the same for the 4 of them (and possibly future PRs by students).

Okay, thanks for replying and clarifying. It's good to see this, as it would have been my first suggestion. You can see the documentation for the transport_data.adb code here and navigate to the underlying code.

In particular, the code in the transport_data package handles entire ATO files (instead of single sheets) at once, retrieves these from the original location, and retains SDMX metadata (structural and otherwise) as a byproduct. So I think it likely does supersede the low-level data handling tasks in these branches and PRs.

However, as you mention, it will be good to check whether Sonia's prior students had identified any key data quality issues with these data flows and implemented corrections for them. If so it will be good to preserve and merge those.

I will be attending iTEM and TDC in person this year, so, looking forward to meeting you there 😊

Good, I hope we all can have a discussion about how to move this repo forward. The intent is that a lot of the technical and low-level "plumbing" will be handled by transport_data, so that this repo can be more narrowly focused on the content and meaning of data; quality issues and their respective corrections; and things like that.

Just a question: The OECD SDMX databases don’t seem to work anymore, they don’t pass the pytest. It looks like they changed SDMX API a year ago, and these fetch structures are not working. Do you have a new code snippet, or am I missing something? If this is a TODO, I can try to fix it 😊

See #88. Per the label there, help would be welcome. I haven't dug in because I think the new OECD/ITF SDMX web service uses different data structures than the previous one, and thus does not align exactly to the old iTEM "T001" etc. designations. We will need to adapt to follow the upstream source.

This is a fair chunk of work, and is not urgent. When it comes to reviewing and merging your improvements (in whichever PR), we can for instance mark those tests as temporary XFAIL, or something like that. Let's discuss when that becomes relevant.

khaeru · 2024-09-12T13:40:40Z

Hi @mperezbravo —I am seeing via LinkedIn that there was a new version of the iTEM Open Data Zenodo record published. It's not clear whether that was produced using this branch, your other un-merged PR #89, or some other code, and it comes as a bit of surprise, at least to me.

This is partly our fault for not having clearly-described process to prepare and publish updates, but again I hope we can discuss next week so that everyone is on the same page.

Add files via upload

a625317

khaeru changed the title ~~Add Py files via upload~~ Add T021–T023 to clean 3 ADB ATO data sets Dec 11, 2023

khaeru mentioned this pull request Dec 11, 2023

T021-23 ATO Sandra #82

Closed

khaeru mentioned this pull request Feb 6, 2024

Move clients, converters from iTEM Open Data repo transport-data/tools#4

Open

khaeru added enh Enhancements & new features help wanted historical Historical transport statistics labels May 22, 2024

mperezbravo changed the base branch from main to T021-T023-Datasets September 2, 2024 15:15

mperezbravo merged commit 9dbd3db into transportenergy:T021-T023-Datasets Sep 2, 2024
1 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add T021–T023 to clean 3 ADB ATO data sets #83

Add T021–T023 to clean 3 ADB ATO data sets #83

sandraalnajjar commented Dec 7, 2023

khaeru commented Dec 11, 2023

khaeru commented Sep 3, 2024

mperezbravo commented Sep 3, 2024 via email

khaeru commented Sep 3, 2024

khaeru commented Sep 12, 2024

Add T021–T023 to clean 3 ADB ATO data sets #83

Add T021–T023 to clean 3 ADB ATO data sets #83

Conversation

sandraalnajjar commented Dec 7, 2023

khaeru commented Dec 11, 2023

What is added

TODOs to merge the branch

TODOs for follow-up

khaeru commented Sep 3, 2024

mperezbravo commented Sep 3, 2024 via email

khaeru commented Sep 3, 2024

khaeru commented Sep 12, 2024