
Feature Request: Published data products should patch in earlier data if it's missing #1225

Open
2 of 5 tasks
tiffanychu90 opened this issue Sep 17, 2024 · 2 comments
Labels
feature request Issues to request new features open-data Work related to publishing, ingesting open data

Comments

@tiffanychu90
Member

Where does your feature apply?
Select from the below, and be sure to affix the appropriate label to this issue (e.g. dataset, jupyterhub, metabase, analysis.calitp.org)

  • Data (the warehouse)
  • JupyterHub
  • Metabase
  • analysis.calitp.org
  • Other (add detail)

Is your feature request related to a problem? Please describe.
Our single-day snapshots that support our analytics pipeline can be missing operators. This is expected: day to day, feeds can drop out for a short period and come back soon thereafter. For users, though, it's frustrating to see operators appear and disappear.

Describe the solution you'd like
We'll keep our analytics pipeline as is, pulling the single day and running it through, but add 2 things to help us fill in the blanks:

  • a yaml mapping each operator's schedule_gtfs_dataset_name to its (last available) analysis_date. Use this to check whether we're missing anyone; if we are, we can pull from an earlier cached date of the processed results.
  • right now, our published datasets are named dataset_name_date; we'd add a patched version named dataset_name_date(patched).
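A minimal sketch of those two pieces in Python (the function names, the catalog shape, and how the patched suffix is spelled are assumptions, not decisions):

```python
def find_missing_operators(todays_operators, catalog):
    """catalog: {schedule_gtfs_dataset_name: last available analysis_date},
    e.g. as loaded with yaml.safe_load() from the proposed yaml file.
    Returns the catalog entries absent from today's single-day snapshot."""
    return {name: date for name, date in catalog.items() if name not in todays_operators}


def published_name(dataset_name, date, patched=False):
    # Published datasets are named dataset_name_date; flag patched versions.
    return f"{dataset_name}_{date}" + ("(patched)" if patched else "")
```

For example, if today's pull only contains Big Blue Bus, `find_missing_operators` would return Long Beach with its last available cached date, and the patched export would be written under the `(patched)` name so downstream users can tell it apart from a clean single-day pull.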

Describe alternatives you've considered

We want to consider the following points:

  • a list of dates we support in our analytics pipeline (schedule, speeds, RT vs schedule): treat shared_utils/rt_dates as the list of all dates we support, with all the intermediate outputs saved per gtfs_analytics_data.yaml.
  • filling in a missing date can use either the nearest earlier date or an earlier fully cached date
    • if Aug 15, 2024 is missing, we might find that Aug 11, 2024 (not a Wednesday) has the data, or that Jul 15, 2024 is available. We can patch in an earlier cached date (knowing it may be several months away) or patch in a date closer to Aug 15.
  • Pros of patching in the closer date:
    • it's actually closer in time, even if it's not the same weekday.
    • if the gap is the result of the feed expiring and a new feed being uploaded, then we actually reflect more recent information.
  • Cons of patching in the closer date:
    • if the gap is caused by a bug in our warehouse that we're waiting to resolve (and not inherent to the feed's expiration date), we may hit the warehouse repeatedly looking for something, when we should instead revert to the last cached date that has full data
    • knowing there's a bug in the warehouse doesn't mean we can fix it right away, and we tend not to fix it right away anyway.
    • also, how do we keep track of all the intermediate outputs we're grabbing? Are we creating a second list of dates-with-incomplete-information (only schedule tables, no other outputs: no crosswalks, no vp, no RT vs schedule, since we did not grab RT)?
      • this could become an increasingly unwieldy list to maintain. It would look like a collection of dates that do ad-hoc things, without making clear what those are.
      • the benefit of the gtfs_analytics_data.yml data catalog is knowing which dates are fully supported across all the analytics work, so that we can combine all those sources easily for a given day
  • No matter what, we need an extra heuristic to check that a cached date has "full" data. That is, if we expect an operator to make 1,000 trips a day, we don't want to grab a cached date with only 500 trips (maybe a bug was beginning to appear in our warehouse); we want a rule that only accepts "complete enough" data.
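The "complete enough" heuristic and the candidate-date search could be sketched like this (the 80% threshold, function names, and input shapes are all assumptions to be tuned):

```python
def is_complete_enough(observed_trips, expected_trips, threshold=0.8):
    """Guard against patching in a partially-broken cached date: if an operator
    normally runs ~expected_trips per day, require at least threshold * expected
    before treating that cached date as 'full' data."""
    if expected_trips <= 0:
        return False
    return observed_trips / expected_trips >= threshold


def pick_patch_date(candidates, trip_counts, expected_trips, threshold=0.8):
    """candidates: cached dates sorted closest-first (ISO strings).
    trip_counts: {date: observed trips for this operator on that date}.
    Return the closest candidate whose data is complete enough, else None."""
    for d in candidates:
        if is_complete_enough(trip_counts.get(d, 0), expected_trips, threshold):
            return d
    return None
```

In the Aug 15 example above, if the Aug 11 cache only held 400 of an expected 1,000 trips, this rule would skip it and fall back to Jul 15.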

Additional context

@tiffanychu90 tiffanychu90 added feature request Issues to request new features admin Administrative work labels Sep 17, 2024
@edasmalchi
Member

Thanks for the thorough writeup! My first impression is that going back to the last cached date would be preferable but happy to help brainstorm more.

Stuff like stops/routes is relatively static, and it seems better to have complete data for an operator minus, perhaps, the most recent service change than no/incomplete data... For RT, maybe better to let it go a few months stale than to run an off-cycle date?

Perhaps as part of this tooling we can add a separate alert/reporting mechanism if we have nothing for an operator for, say, 6mos?
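One possible shape for that alert check (hypothetical names; the 180-day cutoff is just an approximation of "6mos"):

```python
from datetime import date


def stale_operators(last_seen, today, max_days=180):
    """last_seen: {schedule_gtfs_dataset_name: date we last had any data}.
    Return operators with nothing for longer than max_days, for a separate
    alert/report rather than silent patching."""
    return [name for name, seen in last_seen.items() if (today - seen).days > max_days]
```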

@tiffanychu90
Member Author

@edasmalchi: Ok! Let me try to get this in for the Sep open data release, with a yaml produced to track what's there, and we can iterate from there? I'm curious how many operators / how far back we'll be patching, but hopefully this means Sep's hqta data will definitely have Long Beach

@tiffanychu90 tiffanychu90 added open-data Work related to publishing, ingesting open data and removed admin Administrative work labels Sep 18, 2024