Add incremental MPI seeding & time restraints #324

Merged: 4 commits, Nov 2, 2023

m-goggins
Collaborator

This PR modifies the convertParquetMPI notebook to process seed data incrementally and adds a time restraint on when seeding can run.

Seed data is processed in increments of n_rows (set to 1000). At the end of each 1000-row increment, the notebook writes a file to the patient-data bucket that records the last row processed. At the start of each 1000-row increment, the is_valid_time_window function checks whether the current time is between 9:30 am and 11:30 am PT. If it is, the next 1000 rows are processed; if not, the notebook sleeps for 15 minutes before checking the time again.
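For readers who want a concrete picture of the loop described above, here is a minimal sketch. It is not the notebook's code: the local JSON checkpoint file standing in for the patient-data bucket, the inclusive window boundaries, resuming from the checkpoint, and the helper names `read_last_row`, `seed_incrementally`, and `process_chunk` are all assumptions made for illustration. Only `is_valid_time_window`, `n_rows = 1000`, the 9:30 am to 11:30 am PT window, and the 15-minute sleep come from the description.

```python
import json
import time as time_module
from datetime import datetime, time
from pathlib import Path
from zoneinfo import ZoneInfo

N_ROWS = 1000                      # rows per increment (from the PR description)
SLEEP_SECONDS = 15 * 60            # wait 15 minutes between time checks
PACIFIC = ZoneInfo("America/Los_Angeles")
WINDOW_START = time(9, 30)
WINDOW_END = time(11, 30)


def is_valid_time_window(now=None):
    """Return True if `now` is between 9:30 am and 11:30 am Pacific.

    Inclusive boundaries are an assumption; the PR does not specify them.
    """
    now = now or datetime.now(PACIFIC)
    local = now.astimezone(PACIFIC).time()
    return WINDOW_START <= local <= WINDOW_END


def read_last_row(checkpoint: Path) -> int:
    """Read the last processed row from a hypothetical JSON checkpoint file."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())["last_row_processed"]
    return 0


def seed_incrementally(rows, checkpoint, process_chunk,
                       clock=is_valid_time_window, sleep=time_module.sleep):
    """Process `rows` in N_ROWS increments, checkpointing after each one."""
    start = read_last_row(checkpoint)          # resume where the last run stopped
    while start < len(rows):
        while not clock():                     # outside the window: wait and retry
            sleep(SLEEP_SECONDS)
        chunk = rows[start:start + N_ROWS]
        process_chunk(chunk)
        start += len(chunk)
        # Record progress only after the increment has been processed.
        checkpoint.write_text(json.dumps({"last_row_processed": start}))
    return start
```

Passing `clock` and `sleep` as parameters is a design choice of this sketch so the loop can be exercised without waiting for the real time window.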

Once the full set of MPI data has been processed, the seed file and the file tracking its last processed row are both moved to the archive directory in the patient-data bucket.
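The archive step could look roughly like the sketch below. It uses local paths and `shutil.move` as a stand-in for the bucket operations, and the function name `archive_seed_files` and its parameters are hypothetical; the notebook's actual storage calls are not shown in this PR description.

```python
import shutil
from pathlib import Path


def archive_seed_files(seed_file, checkpoint_file, archive_dir):
    """Move the seed file and its last-row tracking file into `archive_dir`.

    Local filesystem paths are an assumption standing in for the
    patient-data bucket.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in (Path(seed_file), Path(checkpoint_file)):
        dest = archive_dir / src.name
        shutil.move(str(src), str(dest))   # both files leave the working directory
        moved.append(dest)
    return moved
```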

Collaborator

@DanPaseltiner DanPaseltiner left a comment

LGTM! thanks for thinking through how to do this

@m-goggins m-goggins merged commit 5c8d527 into main Nov 2, 2023
@m-goggins m-goggins deleted the seed-MPI-data-incrementally branch November 2, 2023 16:14
2 participants