
Feature Request: Published data products should patch in earlier data if it's missing #1225

Open
2 of 5 tasks
tiffanychu90 opened this issue Sep 17, 2024 · 2 comments
Labels
feature request Issues to request new features open-data Work related to publishing, ingesting open data

Comments

@tiffanychu90
Member

Where does your feature apply?
Select from the below, and be sure to affix the appropriate label to this issue (e.g. dataset, jupyterhub, metabase, analysis.calitp.org)

  • Data (the warehouse)
  • JupyterHub
  • Metabase
  • analysis.calitp.org
  • Other (add detail)

Is your feature request related to a problem? Please describe.
Our single-day snapshots that support our analytics pipeline can be missing operators. This is expected: day to day, feeds can drop out for a short period and come back soon thereafter. For users, though, it's frustrating to see operators appear and disappear.

Describe the solution you'd like
We'll keep our analytics pipeline as is, pulling the single day and running it through, but add 2 things to help us fill in the blanks:

  • a yaml mapping each operator's schedule_gtfs_dataset_name to its (last available) analysis_date. Use this to check whether we're missing anyone; if we are, we can pull from an earlier cached date of the processed results.
  • right now, our published datasets are named dataset_name_date; we'd add a patched version named dataset_name_date(patched).
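A minimal sketch of those two pieces in Python (the function names, the catalog shape, and how the patched suffix is spelled are assumptions, not decisions):

```python
def find_missing_operators(todays_operators, catalog):
    """catalog: {schedule_gtfs_dataset_name: last available analysis_date},
    e.g. as loaded with yaml.safe_load() from the proposed yaml file.
    Returns the catalog entries absent from today's single-day snapshot."""
    return {name: date for name, date in catalog.items() if name not in todays_operators}


def published_name(dataset_name, date, patched=False):
    # Published datasets are named dataset_name_date; flag patched versions.
    return f"{dataset_name}_{date}" + ("(patched)" if patched else "")
```

For example, if today's pull only contains Big Blue Bus, `find_missing_operators` would return Long Beach with its last available cached date, and the patched export would be written under the `(patched)` name so downstream users can tell it apart from a clean single-day pull.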

Describe alternatives you've considered

We want to consider the following points:

  • a list of dates we support in our analytics pipeline (schedule, speeds, RT vs schedule): treat shared_utils/rt_dates as the list of all dates we support, with all the intermediate outputs saved per gtfs_analytics_data.yaml.
  • filling in a missing date can use either the nearest earlier date or an earlier fully cached date
    • if Aug 15, 2024 is missing, we might find that Aug 11, 2024 (not a Wednesday) has the data, or that Jul 15, 2024 is available. We can patch in an earlier cached date (knowing it may be several months away) or patch in a date closer to Aug 15.
  • Pros of patching in the closer date:
    • it's actually closer in time, even if it's not the same weekday.
    • if the gap is the result of the feed expiring and a new feed being uploaded, then we actually reflect more recent information.
  • Cons of patching in the closer date:
    • if the gap is caused by a bug in our warehouse that we're waiting to resolve (and not inherent to the feed's expiration date), we may hit the warehouse repeatedly looking for something, when we should instead revert to the last cached date that has full data
    • knowing there's a bug in the warehouse doesn't mean we can fix it right away, and we tend not to fix it right away anyway.
    • also, how do we keep track of all the intermediate outputs we're grabbing? Are we creating a second list of dates-with-incomplete-information (only schedule tables, no other outputs: no crosswalks, no vp, no RT vs schedule, since we did not grab RT)?
      • this could become an increasingly unwieldy list to maintain. It would look like a collection of dates that do ad-hoc things, without making clear what those are.
      • the benefit of the gtfs_analytics_data.yml data catalog is knowing which dates are fully supported across all the analytics work, so that we can combine all those sources easily for a given day
  • No matter what, we need an extra heuristic to check that a cached date has "full" data. That is, if we expect an operator to make 1,000 trips a day, we don't want to grab a cached date with only 500 trips (maybe a bug was beginning to appear in our warehouse); we want a rule that only accepts "complete enough" data.
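The "complete enough" heuristic and the candidate-date search could be sketched like this (the 80% threshold, function names, and input shapes are all assumptions to be tuned):

```python
def is_complete_enough(observed_trips, expected_trips, threshold=0.8):
    """Guard against patching in a partially-broken cached date: if an operator
    normally runs ~expected_trips per day, require at least threshold * expected
    before treating that cached date as 'full' data."""
    if expected_trips <= 0:
        return False
    return observed_trips / expected_trips >= threshold


def pick_patch_date(candidates, trip_counts, expected_trips, threshold=0.8):
    """candidates: cached dates sorted closest-first (ISO strings).
    trip_counts: {date: observed trips for this operator on that date}.
    Return the closest candidate whose data is complete enough, else None."""
    for d in candidates:
        if is_complete_enough(trip_counts.get(d, 0), expected_trips, threshold):
            return d
    return None
```

In the Aug 15 example above, if the Aug 11 cache only held 400 of an expected 1,000 trips, this rule would skip it and fall back to Jul 15.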

Additional context

@tiffanychu90 tiffanychu90 added feature request Issues to request new features admin Administrative work labels Sep 17, 2024
@edasmalchi
Member

Thanks for the thorough writeup! My first impression is that going back to the last cached date would be preferable but happy to help brainstorm more.

Stuff like stops/routes is relatively static, and it seems better to have complete data for an operator minus, perhaps, the most recent service change than no/incomplete data... For RT, maybe better to let it go a few months stale than to run an off-cycle date?

Perhaps as part of this tooling we can add a separate alert/reporting mechanism if we have nothing for an operator for, say, 6mos?
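One possible shape for that alert check (hypothetical names; the 180-day cutoff is just an approximation of "6mos"):

```python
from datetime import date


def stale_operators(last_seen, today, max_days=180):
    """last_seen: {schedule_gtfs_dataset_name: date we last had any data}.
    Return operators with nothing for longer than max_days, for a separate
    alert/report rather than silent patching."""
    return [name for name, seen in last_seen.items() if (today - seen).days > max_days]
```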

@tiffanychu90
Member Author

@edasmalchi: Ok! Let me try to get this in for the Sep open data release, with a yaml produced to track what's there, and we can iterate from there? I'm curious how many operators / how far back we'll be patching, but hopefully this means Sep's hqta data will definitely have Long Beach

@tiffanychu90 tiffanychu90 added open-data Work related to publishing, ingesting open data and removed admin Administrative work labels Sep 18, 2024