Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backfill data 2022-08-08 through 2022-10-08 into public bucket #27

Merged
merged 6 commits into from
Oct 17, 2022

Conversation

lauriemerrell
Copy link
Member

Description

This PR:

  • Uploads the notebook I used to copy raw data from 5/19 - 10/9 (copies fully raw data from private bucket to public)
    • TODO: I need to do 10/10, I turned on the actual copying in prod today but there are like 10 hours where the data is still only in the private bucket before I turned on writing to the public bucket
  • Refactors rt_daily_aggregation to be a typer command line app that is more oriented towards backfilling data within the buckets (with the deployment of Update scraper script and automate aggregation #11, I kind of feel like the original purpose of rt_daily_aggregation is no longer needed? And at the very least I think it's appropriate to have it call combine_daily_files from scrape_data so that we have only one source of truth for the aggregation code?)

Resolves #17

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

I have already run everything in here to do the backfills today, hope it's ok that I did that unilaterally (did not change any existing data in prod!)

@lauriemerrell lauriemerrell self-assigned this Oct 10, 2022
Copy link
Collaborator

@dcjohnson24 dcjohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. rt_daily_aggregation is probably less relevant now, but we could keep it for access to a sort of local lambda. In that case though, the put operations on lines 84 and 100 would throw errors for the public bucket. This could be avoided by adding an optional save argument.

end_date = "2022-08-07"

data, errors, hourly_summary = compute_hourly_totals(start_date, end_date)
app()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running python data_analysis/rt_daily_aggregations.py from the project root will throw a ModuleNotFoundError: No module named 'scrape_data'. For this to work, you'll have to create a root level module that imports rt_daily_aggregations. You can create a module called application.py, for example, and then add

from data_analysis import rt_daily_aggregations

rt_daily_aggregations.app()

to it. Then running python application.py from the project root will work.

Therefore, I would remove the if __name__ == '__main__' block because this isn't meant to be run from the command line.

Copy link
Member Author

@lauriemerrell lauriemerrell Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was running it as python3 -m data_analysis.rt_daily_aggregations arg arg arg from the root directory and it did work... Open to changes but this was convenient for me when I was actually using it to backfill. Does that make sense? Can also try to document this somewhere

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I didn't have the -m flag so the relative imports didn't work. Leaving it as is is fine, but it would be helpful to document the -m flag.

@lauriemerrell
Copy link
Member Author

Addressed first round of comments, will update to take a "save" parameter next time I have a chance to work on this. I didn't do that originally because I was trying to limit changes to the actual combine_daily_files.py, but I guess I can make more changes that don't have to be deployed to Lambda.

Copy link
Collaborator

@dcjohnson24 dcjohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes.

@lauriemerrell lauriemerrell merged commit 474d9ee into main Oct 17, 2022
@lauriemerrell lauriemerrell deleted the update-backfill-aggregations branch October 17, 2022 02:53
haileyplusplus pushed a commit to haileyplusplus/chn-ghost-buses that referenced this pull request Apr 1, 2024
…aggregations

backfill data 2022-08-08 through 2022-10-08 into public bucket
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deploy updated scraper & aggregator
2 participants