refactor check_job_status #1507

terrazoon · 2024-12-30T16:06:05Z

Originally check_job_status declared that a job that started 30 minutes ago and has not finished is 'incomplete' and kicked off an attempt to finish this incomplete job. For long running jobs, this means a new task would start to run, trying to insert rows into the database that were already there and triggering an avalanche of IntegrityErrors that would last for hours and bog down the whole system.

As an expedient, we changed the timing to allow 4 hours for the job to finish before it is declared incomplete. This is because we're currently only supporting sends of 25k rows and we know we can send those out in less than 2 hours.

But ideally, we should be using job status and not time ranges. What would cause a job to just stop running? Is it possible? Is it for a random reboot in the middle of a task? If it has to be time based, what should the time be. Someone needs to take a look and rethink this.

ccostino · 2024-12-30T16:13:56Z

That's a great point about if the job itself is somehow interrupted - a server reboot, power outage, etc. Perhaps we do need to keep some semblance of whether or not a job itself is successful in that case so we know where to pick it back up from.

Another factor to consider though is tracking strictly at the message send level and the retries configured with those should they fail or get blocked for whatever reason. With these two status checks in play, we need to make sure they're working in tandem with each other and not clobbering or competing with one another, which would also overwhelm the system.

terrazoon · 2025-01-02T19:23:09Z

We need to keep a time-based check to handle interrupted jobs.
We are currently rate-limiting 800 numbers to roughly 10,000 messages per hour and we currently have an informal limit of 25,000 messages per job, so 4 hours is a reasonable time check.
We have resolved all the retry logic so it is well-behaved now. There were a number of PRs for this, including the one that changed the time check from 30 minutes to 4 hours.

I think the current code is fine, provided that we are continuing to run with 800 numbers, which AWS is going to throttle at roughly 10,000 messages per hour anyway.

I think there is nothing to do here, until we introduce short codes or whatever the other thingy is that users theoretically can send with. The current code is correct for 800 numbers.

ccostino · 2025-01-07T21:47:23Z

Okay, sounds good, thanks @terrazoon! If our testing with the other current fixes and improvements shows that we're still in good shape, if not even better shape, then we can close this one out for now.

terrazoon added this to Notify.gov product board Dec 30, 2024

terrazoon converted this from a draft issue Dec 30, 2024

terrazoon mentioned this issue Dec 27, 2024

Message send performance and stability (issues from Census send) GSA/notifications-admin#2181

Open

10 tasks

ccostino mentioned this issue Dec 30, 2024

Message send processing improvements #1508

Open

7 tasks

terrazoon self-assigned this Jan 2, 2025

terrazoon moved this from ⬇ Up-Next to 🏗 In progress (WIP: ≤ 3 per person) in Notify.gov product board Jan 2, 2025

terrazoon moved this from 🏗 In progress (WIP: ≤ 3 per person) to 👀 In review in Notify.gov product board Jan 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor check_job_status #1507

refactor check_job_status #1507

terrazoon commented Dec 30, 2024 •

edited

Loading

ccostino commented Dec 30, 2024

terrazoon commented Jan 2, 2025

ccostino commented Jan 7, 2025

refactor check_job_status #1507

refactor check_job_status #1507

Comments

terrazoon commented Dec 30, 2024 • edited Loading

ccostino commented Dec 30, 2024

terrazoon commented Jan 2, 2025

ccostino commented Jan 7, 2025

terrazoon commented Dec 30, 2024 •

edited

Loading