
airflow: operator and dag/tasks to sync NTD data via DOT API and XLSX #3415

Merged · 28 commits · Sep 18, 2024

Conversation

@charlie-costanzo (Member) commented Aug 6, 2024

Description

This PR introduces new NTD data sources made available by the federal Department of Transportation, both through their data API and as XLSX file downloads.

Two Airflow operators were necessary for this work: although many NTD datasets are now available from the NTD API, some important datasets (monthly ridership, certain annual reports) are still published only in XLSX format.

To accomplish this, this PR adds two new Airflow operators (scrape_ntd_api.py and scrape_ntd_xlsx.py), two associated DAGs (sync_ntd_data_api and sync_ntd_data_xlsx), and a selection of NTD table endpoints as DAG tasks.

Both operators follow the PartitionedGCSArtifact class pattern used elsewhere in the pipeline.
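The core idea of that pattern — an artifact addressed in GCS by its table name plus a hive-style date partition — can be sketched roughly as follows. All names and the bucket below are hypothetical illustrations, not the actual PartitionedGCSArtifact implementation from the pipeline:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PartitionedArtifact:
    """Minimal sketch of a partitioned GCS artifact: the object's path
    encodes its table name and a hive-style date partition.
    (Hypothetical names; not the real PartitionedGCSArtifact class.)"""
    bucket: str          # e.g. a GCS bucket URI (example value below)
    table: str
    execution_date: date
    filename: str

    @property
    def path(self) -> str:
        # hive-style partition segment, e.g. dt=2024-09-01
        return (
            f"{self.bucket}/{self.table}/"
            f"dt={self.execution_date.isoformat()}/{self.filename}"
        )

artifact = PartitionedArtifact(
    bucket="gs://example-ntd-bucket",
    table="2022_capital_expenses_by_mode",
    execution_date=date(2024, 9, 1),
    filename="results.jsonl",
)
print(artifact.path)
# → gs://example-ntd-bucket/2022_capital_expenses_by_mode/dt=2024-09-01/results.jsonl
```

Partitioning by execution date is what lets a monthly scrape retain every historical snapshot of a table rather than overwriting it.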

NTD Data Sources scraped and stored in this PR include:

  • 2022 Annual Reporting
  • Monthly Ridership Data
  • Safety, service, and security-related data

We discovered that these tables are retroactively updated on a regular cadence, including annual reports for previous years, so a schedule has been configured to download from these endpoints on the first day of every month.

Resolves #3402, part of Epic #3401

Type of change

  • New feature

How has this been tested?

Successful local Airflow runs, publishing to GCS buckets

Post-merge follow-ups

  • Environment variables need to be added to Composer
  • DAGs need to be manually triggered
  • Observe to verify expected behavior
  • Create a follow-on ticket for exception handling

@charlie-costanzo changed the title from "first take at getting odata api to work" to "modify ntd scraping script to use DOT API" on Aug 13, 2024
@charlie-costanzo self-assigned this on Aug 13, 2024
@charlie-costanzo changed the title from "modify ntd scraping script to use DOT API" to "airflow: operator and dag/task to sync ntd data via DOT API" on Aug 16, 2024
@charlie-costanzo changed the title from "airflow: operator and dag/task to sync ntd data via DOT API" to "airflow: operator and dag/task to sync NTD data via DOT API" on Aug 16, 2024
Comment on lines 80 to 84
self.data = response
self.logger.info(
    f"Downloaded {self.product} data for {self.year} with {len(self.data)} rows!"
)
Contributor

I don't know anything offhand about the failure modes of the NTD API, but it might be useful to check response status and log non-200 responses or something.
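One way such a check could look (a sketch only: `rows_from_response` is a hypothetical helper, and `response` is any object with the `requests.Response` interface):

```python
import logging

def rows_from_response(response) -> list:
    """Parse table rows from an API response, logging and raising on
    non-200 statuses instead of silently storing an error payload.
    (Hypothetical helper; `response` is any requests.Response-like object.)"""
    if response.status_code != 200:
        logging.error(
            "NTD API returned HTTP %s for %s",
            response.status_code,
            getattr(response, "url", "<unknown>"),
        )
        # raise_for_status() turns the bad response into an exception,
        # which fails the Airflow task visibly instead of storing bad data
        response.raise_for_status()
    return response.json()
```

Raising (rather than only logging) means the task shows up red in the Airflow UI and can be retried.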

@mjumbewu (Contributor) left a comment


@charlie-costanzo I took a first pass through this and left just a couple comments. I'd be down to set up a time to talk through it more.

@charlie-costanzo changed the title from "airflow: operator and dag/task to sync NTD data via DOT API" to "airflow: operator and dag/tasks to sync NTD data via DOT API and XLSX" on Sep 3, 2024
@charlie-costanzo marked this pull request as ready for review on September 10, 2024 15:51
@charlie-costanzo marked this pull request as draft on September 10, 2024 16:53
@vevetron (Contributor) commented Sep 10, 2024

It's probably okay, but I'm not entirely sure I understand why we are building operators and Airflow DAGs for some of these data entities, such as "2022_reporting/2022_capital_expenses_by_mode.yml" – is there an expectation the data will change? Shouldn't it just be a one-off data pull?

Edit: Actually I think they keep updating these 2022 datasets for some reason.

operator: operators.NtdDataProductXLSXOperator

product: 'raw_monthly_ridership_no_adjustments_or_estimates'
xlsx_file_url: 'https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/S%26S%20Time%20Series-May%202024-Major%20Only_240903.xlsx'
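As an aside, the percent-encoded URL above decodes to a human-readable spreadsheet filename; a small stdlib sketch for deriving it (the helper name is made up, shown only for illustration):

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def filename_from_url(xlsx_file_url: str) -> str:
    """Derive the decoded spreadsheet filename from a DOT download URL.
    (Hypothetical helper, sketched for illustration.)"""
    # take the last path segment, then decode %26 -> &, %20 -> space, etc.
    return unquote(PurePosixPath(urlparse(xlsx_file_url).path).name)

url = (
    "https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/"
    "S%26S%20Time%20Series-May%202024-Major%20Only_240903.xlsx"
)
print(filename_from_url(url))
# → S&S Time Series-May 2024-Major Only_240903.xlsx
```

Note the `_240903` suffix in the decoded name: DOT stamps a date into the filename, which is one hint that the file at this URL is replaced over time.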
Member


I believe this file changes monthly. Are we planning on accounting for this and creating the needed automation for monthly downloads or is that something for a future effort?

Member Author


Hey Evan! Yep – we saw that all of these datasets are being updated regularly, even annual reports for previous years, so automation is built into this work. I just pushed changes with a placeholder value for the schedule – the first day of the month, every month. We can update that from here, but I figured that was a reasonable starting point. Let me know if you have any other thoughts.

Member Author


Hey @evansiroky – this PR is ready for merge, but my last remaining question relates to the scheduling of the DAG tasks in this PR. Based on the frequency at which we see many of the endpoints updating (~monthly), I configured the DAG tasks to run once a month (first day of the month, every month, 3am PT), but I'm open to suggestions here.

Comment on lines +64 to +67
else:
    logging.info(
        f"Downloaded {self.product} data for {self.year} with {len(response)} rows!"
    )
Contributor


suggestion (non-blocking): We should probably raise an exception when there is a non-200 response from the API (e.g., a 404 response will still return some JSON, but not in the structure that's valid for the table).

This can be done in a follow-on issue.
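The failure mode described above (a 404 body that is valid JSON but not a table) could be guarded against with a shape check like this sketch; `validate_rows` is a hypothetical name, not code from this PR:

```python
def validate_rows(payload) -> list:
    """Raise if an API payload is not the expected list-of-row-dicts shape.
    Error bodies (e.g. from a 404) typically come back as a dict instead.
    (Hypothetical helper, sketched for illustration.)"""
    if not isinstance(payload, list):
        raise ValueError(
            f"Unexpected NTD API payload shape: {type(payload).__name__}"
        )
    if payload and not all(isinstance(row, dict) for row in payload):
        raise ValueError("Expected a list of row dicts")
    return payload
```

Combined with an HTTP status check, this would catch both transport-level failures and structurally invalid payloads before they reach GCS.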

Member Author


Captured in this ticket:

#3474

@mjumbewu (Contributor) left a comment


I'd say just make a follow-on issue related to the suggestion above, and then LGTM.


Successfully merging this pull request may close these issues.

NTD Scraping – Remaining Tables in Dataset
5 participants