Research why botocore.errorfactory:NoSuchKey errors continue to be thrown and what can be done to prevent them #1392

Open
3 tasks
ccostino opened this issue Nov 5, 2024 · 0 comments

Comments

Contributor

ccostino commented Nov 5, 2024

After adjusting our Celery workers, increasing both the number of worker processes and the RAM available to them, we're seeing a new series of errors that continues to spike: botocore.errorfactory:NoSuchKey (An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.)

For some reason, there are still situations where a job doesn't make it into S3, and we need to figure out why.

There are a few variations of this stack trace depending on what Celery is doing (see New Relic for details), but they all seem to point to the same thing in terms of where the breakdown is happening:

File /home/vcap/app/app/celery/tasks.py, line 471, in process_incomplete_jobs
File /home/vcap/app/app/celery/tasks.py, line 488, in process_incomplete_job
File /home/vcap/app/app/celery/tasks.py, line 104, in get_recipient_csv_and_template_and_sender_id
File /home/vcap/app/app/aws/s3.py, line 307, in get_job_and_metadata_from_s3

Given this, there are a couple of things to look at here and attempt to remediate:

  • Are we not handling this exception properly? Should this be thrown in this fashion or is there a better way of dealing with this scenario?
  • Why are we seeing this exception in the first place? What's happening that's causing this to occur?
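
As a starting point for the first question, here is a minimal sketch of one way to treat a missing object as an expected outcome rather than an unhandled exception. It is not the current implementation: `is_no_such_key` and `fetch_or_none` are hypothetical names, and `fetch` stands in for whatever call in `get_job_and_metadata_from_s3` performs the `GetObject`. The check relies on the `response["Error"]["Code"]` field that botocore's `ClientError` carries, and uses duck typing only so the sketch runs without botocore installed; real code would more likely catch `botocore.exceptions.ClientError` directly.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def is_no_such_key(exc: BaseException) -> bool:
    """Return True if exc carries a botocore-style NoSuchKey error code."""
    # botocore's ClientError exposes the parsed error as exc.response.
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "NoSuchKey"


def fetch_or_none(fetch: Callable[[], T], job_id: str) -> Optional[T]:
    """Run fetch(); return None if the S3 object is missing.

    `fetch` is a hypothetical zero-argument callable wrapping the S3 read.
    Any error other than NoSuchKey is re-raised unchanged.
    """
    try:
        return fetch()
    except Exception as exc:
        if is_no_such_key(exc):
            # Log with the job id so occurrences can be correlated later.
            logger.warning("No S3 object found for job %s", job_id)
            return None
        raise
```

Whether returning `None` is the right behavior depends on what the caller should do with a job that has no file; the point of the sketch is only that the missing-key case can be separated from other S3 failures and logged with the job ID.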

Implementation Sketch and Acceptance Criteria

  • Trace the code path noted in the stack trace excerpts above to see why these errors occur and whether we're handling them appropriately.
  • Continue tracing the code path to see what condition(s) even cause this in the first place: was a CSV not uploaded properly? Did the job not get created properly? Was there an error writing to S3?
    • This will require digging through New Relic and the logs to investigate further and catch any kind of patterns.
    • We know this happened on Friday, 11/1, and Monday, 11/4, so there's a definitive time window to look into.
  • Write up any findings and recommendations, and if there's any work to be done, note that as well (it could be related to other open tickets).
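
One diagnostic that could help while tracing the cause is to distinguish a key that appears shortly afterwards (suggesting the job was picked up before the upload finished) from a key that never appears (suggesting the upload failed or never happened). The sketch below is an assumption about how that might be probed, not existing code: `get_with_retry` and its parameters are hypothetical, and `fetch` again stands in for the S3 read.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def _is_no_such_key(exc: BaseException) -> bool:
    # Matches the error code carried by botocore's ClientError.response.
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "NoSuchKey"


def get_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[T], int]:
    """Retry fetch() on NoSuchKey with exponential backoff.

    Returns (result, attempts_used). result is None if the key was still
    missing after the final attempt. Other errors are re-raised at once.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(), attempt
        except Exception as exc:
            if not _is_no_such_key(exc):
                raise
            if attempt < attempts:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return None, attempts
```

If logged, the `attempts_used` value would show whether missing keys tend to resolve on a later attempt or stay missing, which narrows down which of the questions above to pursue.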

Security Considerations

  • We want to make sure our application is running properly and stable.
@ccostino ccostino moved this from 🌱 New to ⬇ Up-Next in Notify.gov product board Nov 5, 2024
@ccostino ccostino changed the title from "Research why botocore.errorfactory:NoSuchKey errors continue to be thrown and what can be done to prevent it" to "Research why botocore.errorfactory:NoSuchKey errors continue to be thrown and what can be done to prevent them" Nov 5, 2024
@ccostino ccostino added the bug label Nov 5, 2024