Add incremental MPI seeding & time restraints #324
Merged
This PR modifies the `convertParquetMPI` notebook to incrementally process data for seeding and adds a window of time during which seeding cannot occur.

Seed data will be processed in `n_rows` increments (set to 1000), and at the end of each 1000-row increment, the notebook writes out a file to the `patient-data` bucket to keep track of the last row processed. At the start of each 1000-row increment, the `is_valid_time_window` function checks whether the current time is between 9:30am and 11:30am PT. If yes, the next 1000 rows are processed; if no, the notebook sleeps for 15 minutes before checking the time validity again.

Once the full set of MPI data has been processed, the seed file and the file keeping track of the last row processed for the seed file are moved to the archive directory in `patient-data`.
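
For reviewers who want the control flow at a glance, here is a minimal sketch of the loop described above. It is not the notebook's actual code. Only `n_rows`, `is_valid_time_window`, the 1000-row increment, the 9:30am to 11:30am PT check, and the 15-minute sleep come from the description. The `seed_incrementally` function, its callback parameters, and the `clock`/`sleep` hooks are hypothetical names added for illustration. The checkpoint callbacks stand in for the file written to the `patient-data` bucket, and the final archive step is omitted. The sketch follows the description's wording literally, processing rows when the time check returns true.

```python
import time
from datetime import datetime, time as dtime
from zoneinfo import ZoneInfo

N_ROWS = 1000              # increment size stated in the PR description
SLEEP_SECONDS = 15 * 60    # 15-minute wait stated in the PR description
WINDOW_START = dtime(9, 30)   # 9:30am PT
WINDOW_END = dtime(11, 30)    # 11:30am PT


def is_valid_time_window(now=None):
    """Return True when `now` is between 9:30am and 11:30am Pacific time."""
    if now is None:
        # Assumption: "PT" maps to the America/Los_Angeles zone.
        now = datetime.now(ZoneInfo("America/Los_Angeles"))
    return WINDOW_START <= now.time() <= WINDOW_END


def seed_incrementally(rows, process_chunk, load_checkpoint, save_checkpoint,
                       n_rows=N_ROWS, clock=None, sleep=time.sleep):
    """Process `rows` in n_rows increments, resuming from the saved checkpoint.

    All four callbacks are hypothetical stand-ins:
      process_chunk(chunk)  -> seeds one increment
      load_checkpoint()     -> index of the first unprocessed row
      save_checkpoint(n)    -> records that rows [0, n) are done
      clock()               -> current Pacific datetime (for testing)
    """
    start = load_checkpoint()
    while start < len(rows):
        now = clock() if clock else None
        if not is_valid_time_window(now):
            # Outside the window: wait, then re-check before doing any work.
            sleep(SLEEP_SECONDS)
            continue
        end = min(start + n_rows, len(rows))
        process_chunk(rows[start:end])
        # Persist progress after every increment so a restart can resume here.
        save_checkpoint(end)
        start = end
    return start
```

Because the checkpoint is saved only after an increment finishes, an interrupted run would reprocess at most one increment on restart, so the seeding step in this sketch is assumed to tolerate repeated rows.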