Fix "Pending Forever" forever #1453

Open
terrazoon opened this issue Dec 3, 2024 · 3 comments

@terrazoon
Contributor

terrazoon commented Dec 3, 2024

When running jobs for the US Census, we observed that some notifications got stuck in a Pending state past the 4-hour mark, by which point all statuses should have resolved.

We are going to refactor our delivery receipt checking in the near future and allow up to 72 hours for the status to resolve, but we need a "final" way to clear Pending regardless of what happens.

I suggest that we make a task that runs hourly and pushes any notifications more than 72 hours old from Pending to Failed.
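
A minimal sketch of what that hourly task could look like, assuming a Celery task and a SQLAlchemy `Notification` model with `status` and `created_at` columns (all names below are illustrative, not the actual notifications-api code):

```python
# Illustrative sketch only: module paths, task name, and model fields are
# assumptions, not the real notifications-api code.
from datetime import datetime, timedelta

from app import db, notify_celery    # hypothetical app / Celery objects
from app.models import Notification  # hypothetical model

@notify_celery.task(name="timeout-stuck-pending-notifications")
def timeout_stuck_pending_notifications():
    """Hourly sweep: anything still Pending after 72 hours becomes Failed."""
    cutoff = datetime.utcnow() - timedelta(hours=72)
    updated = (
        Notification.query
        .filter(
            Notification.status == "pending",
            Notification.created_at < cutoff,
        )
        .update({"status": "failed"}, synchronize_session=False)
    )
    db.session.commit()
    return updated  # number of rows moved to Failed
```

Running it hourly via the scheduler the app already uses would keep it as a pure backstop behind the normal delivery receipt processing.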

terrazoon converted this from a draft issue Dec 3, 2024
@ccostino
Contributor

ccostino commented Dec 6, 2024

Copying here for more context, @terrazoon!

Right now with delivery receipts we are getting a message ID for each notification and then filtering log events for that message ID, which means that if we have 28,000 notifications we are making 28,000 calls, sometimes multiple times each.

Maybe it would make more sense, while a job is running, to periodically call FilterLogEvents by job_id and "phoneCarrier" and get back 5,000, 10,000, or 20,000 statuses at once and update that way. This might avoid the throttling, which is definitely happening, and it might (?) be faster.

There is probably some way to filter along the lines of "give me all the new ones since the last time I asked."
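
A rough sketch of that batched approach with boto3's `filter_log_events`, using `startTime` as the "only the new ones since last time" filter (the log group name and JSON filter pattern are assumptions about how the SNS delivery-status logs are laid out):

```python
# Sketch of one batched poll instead of one FilterLogEvents call per message ID.
# The log group and filter pattern are assumptions, not verified against the
# actual SNS delivery-status log format.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

def fetch_delivery_statuses_since(last_poll_ms, log_group, next_token=None):
    """Return all delivery-status events newer than the last poll."""
    kwargs = {
        "logGroupName": log_group,
        "startTime": last_poll_ms,  # "give me the new ones since the last time I asked"
        "filterPattern": '{ $.delivery.phoneCarrier = * }',  # assumed JSON field
    }
    if next_token:
        kwargs["nextToken"] = next_token
    response = logs.filter_log_events(**kwargs)
    return response["events"], response.get("nextToken")
```

One call like this can page through thousands of statuses per poll, which is what should cut down the per-message-ID call volume and the throttling that comes with it.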

@terrazoon
Contributor Author

Further load testing indicates that "Pending Forever" may not be happening anymore (at least when we send to the simulated AWS numbers). However, these load tests are supposed to show 5000 delivered and 5000 failed, but they typically show something like 4996 delivered and 5004 failed.

So some messages are failing due to something going wrong in our app, and that is most likely:

  1. The initial problem is throttling from AWS on SNS publish. Even though we are now rate limiting ourselves, we still get throttled once in a while for a message or two.

  2. Once we have been throttled, we are in a state where the message has been saved to the db but not sent, which triggers an IntegrityError on retry. Cliff is working on a fix that hopefully resolves the issue. Basically, if we have already saved the notification to the database we don't want to try to save it again; we just want to send it again (see the sketch below).
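
A hedged sketch of that retry behavior (the helpers here are placeholders, not the actual notifications-api functions):

```python
# Sketch only: get_notification_by_id, persist_notification, and deliver_sms
# are placeholders standing in for whatever the app actually uses.
from sqlalchemy.exc import IntegrityError

from app import db  # hypothetical SQLAlchemy session holder

def save_and_send(notification):
    """On retry, skip the insert if the row already exists and just re-send."""
    if get_notification_by_id(notification.id) is None:
        try:
            persist_notification(notification)  # first attempt: insert the row
        except IntegrityError:
            # A previous attempt already saved it; roll back and carry on.
            db.session.rollback()
    # Whether it was just saved or already present, (re)attempt the SNS publish.
    deliver_sms(notification)
```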

@ccostino
Contributor

Given point number 2, the recent PR merges for #1466 may help resolve this.
