Parse Exception Handling #3080

Merged
elipe17 merged 33 commits into develop from 3055-exception-handling
Aug 14, 2024

Conversation

@elipe17 commented Jul 16, 2024

Summary of Changes

  • Updated parse.py to more gracefully handle all types of exceptions
  • Added last-ditch exception handling that sets the state of the DFS to rejected, avoiding the "perpetual pending" issue
  • Also cleaned up old PG deps in the Dockerfile

Pull request closes "Service timeout blocks parsing completion" (#3055)
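
As a rough illustration of the last-ditch handling described in the summary, the sketch below wraps a parse call so that any unexpected exception moves the summary out of a pending state. `Summary`, `parse_with_fallback`, and the `run_parse` callable are hypothetical stand-ins for illustration only; they are not the names used in parse.py.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Summary:
    """Hypothetical stand-in for the DFS; it only tracks a status string."""
    status: str = "Pending"


def parse_with_fallback(summary, run_parse):
    """Run the parser; on any unexpected exception, mark the summary rejected."""
    try:
        # Assumption: the parser returns the final status on success.
        summary.status = run_parse()
    except Exception as e:
        # Last-ditch handler: log the failure and never leave the status pending.
        logger.error(f"Uncaught exception while parsing: {e}")
        summary.status = "Rejected"
    return summary
```

With this shape, a failure anywhere inside `run_parse` still ends in a terminal status instead of a file that stays pending indefinitely.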

How to Test

cd tdrs-frontend && docker-compose up --build
cd tdrs-backend && docker-compose up --build
  1. Open http://localhost:3000/ and sign in.
  2. Submit a datafile and make sure it works as expected.
  3. Now stop the Elasticsearch container and submit the file again. To test this in a deployed environment, unbind the backend from the ES service instance: cf unbind-service <BACKEND_APP_NAME> <ES_SERVICE_NAME>
  4. Verify in the LogEntries in DAC that you see explicit logging surrounding the error and that it was in fact caught.
  5. Verify that the DFS status is rejected.

Deliverables

More details on how the deliverables herein are assessed are included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • Backend handles parser exceptions more gracefully

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

@elipe17 self-assigned this Jul 16, 2024
@elipe17 added the bug, backend, dev, and raft review (This issue is ready for raft review) labels Jul 16, 2024

codecov bot commented Jul 16, 2024

Codecov Report

Attention: Patch coverage is 34.17722% with 52 lines in your changes missing coverage. Please review.

Project coverage is 92.71%. Comparing base (f3f0fa6) to head (f416519).
Report is 2 commits behind head on develop.

Files                                                Patch %   Lines
tdrs-backend/tdpservice/scheduling/parser_task.py    11.53%    23 Missing ⚠️
tdrs-backend/tdpservice/parsers/parse.py             40.74%    16 Missing ⚠️
tdrs-backend/tdpservice/parsers/models.py            27.77%    13 Missing ⚠️
Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3080      +/-   ##
===========================================
- Coverage    93.02%   92.71%   -0.31%     
===========================================
  Files          277      277              
  Lines         7426     7486      +60     
  Branches       661      672      +11     
===========================================
+ Hits          6908     6941      +33     
- Misses         415      443      +28     
+ Partials       103      102       -1     
Flag           Coverage Δ
dev-backend    92.73% <34.17%> (-0.36%) ⬇️
dev-frontend   92.60% <ø> (ø)

Files                                                 Coverage Δ
tdrs-backend/tdpservice/email/email.py                100.00% <100.00%> (+10.25%) ⬆️
tdrs-backend/tdpservice/email/helpers/data_file.py    100.00% <100.00%> (ø)
tdrs-backend/tdpservice/parsers/util.py               93.79% <100.00%> (+0.22%) ⬆️
tdrs-backend/tdpservice/parsers/models.py             78.57% <27.77%> (-13.86%) ⬇️
tdrs-backend/tdpservice/parsers/parse.py              83.73% <40.74%> (-2.08%) ⬇️
tdrs-backend/tdpservice/scheduling/parser_task.py     39.47% <11.53%> (-10.53%) ⬇️

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5b0dec8...f416519.

- updated messaging language a bit

@reitermb left a comment

Error language looks good

Base automatically changed from 3004-clean-and-reparse-cmd to develop July 26, 2024 11:49
@elipe17 added the Deploy with CircleCI-qasp label (Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI) Jul 31, 2024
@ADPennington added and removed the Deploy with CircleCI-qasp label Aug 1, 2024
@ADPennington added and removed the Deploy with CircleCI-qasp label Aug 7, 2024
@ADPennington
Collaborator

Update: still working on reproducing an exception and seeing a non-pending status. The latest behavior is included below:

[Screenshot 2024-08-09 152356]

@ADPennington added the Blocked label (for pull requests that are currently blocked by a dependency) and removed the Deploy with CircleCI-qasp label Aug 9, 2024
@ADPennington added the Deploy with CircleCI-qasp label and removed the Blocked label Aug 12, 2024
    except Exception as e:
        logger.error(f"Encountered error while creating datafile records: {e}")
        log_parser_exception(datafile,
                             f"Encountered generic exception while creating database records: \n{e}",
Collaborator

I was able to trigger this exception, but it doesn't generate an exception-related email notification.

@@ -20,28 +22,51 @@ def parse(data_file_id, should_send_submission_email=True):
    # passing the data file FileField across redis was rendering non-serializable failures, doing the below lookup
    # to avoid those. I suppose good practice to not store/serializer large file contents in memory when stored in redis
    # for undetermined amount of time.
    data_file = DataFile.objects.get(id=data_file_id)
    try:
Collaborator

@ADPennington commented Aug 13, 2024

I want to confirm the logic here. Are the following steps accurate?

  1. File parsing starts, and the status is set to Pending during this process.
  2. Parsing completes, and the parsing status is retrieved from the summary object.
  3. Parsing metadata is retrieved from the summary object.
  4. A data submission email is sent to approved users associated with the file/STT.

If an exception is encountered along the way (I assume during parsing?):

  • If it's a database exception, the exception is logged (in LogEntries?) and the task exits.
  • If it's some other exception, a generic exception-related error message is added to the feedback report, the parsing status is set to Rejected, the exception is logged (in LogEntries?), and the task exits.
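
The flow described in this comment can be sketched roughly as follows. This is a minimal model, not the code in parse.py: `ParseTask`, `DatabaseError`, `run_parse_task`, and the error text are hypothetical names chosen for illustration, and leaving the status untouched in the database branch is an assumption based on the wording "logged and the task exits".

```python
from dataclasses import dataclass, field


class DatabaseError(Exception):
    """Hypothetical stand-in for a database-layer exception."""


@dataclass
class ParseTask:
    """Hypothetical stand-in for the state the parse task touches."""
    status: str = "Pending"  # step 1: Pending while parsing runs
    log_entries: list = field(default_factory=list)
    feedback_errors: list = field(default_factory=list)
    emails_sent: int = 0


def run_parse_task(task, do_parse):
    """Model the two exception branches described in the review comment."""
    try:
        task.status = do_parse()  # steps 2-3: status comes from the summary
        task.emails_sent += 1     # step 4: submission email
    except DatabaseError as e:
        # Database exception: log it and exit (status assumed unchanged).
        task.log_entries.append(f"Database exception: {e}")
    except Exception as e:
        # Any other exception: add a generic error, reject, log, and exit.
        task.feedback_errors.append("Generic parsing error; contact an administrator.")
        task.status = "Rejected"
        task.log_entries.append(f"Uncaught exception: {e}")
    return task
```

Note that in this model only the generic branch changes the status, so a database exception would leave the file in Pending.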

Author

That is correct. Would it make more sense to have both exception blocks perform the same handling, i.e., LogEntry creation, file rejection, and error generation? Just for clarity, we generate the error to ensure the STT has the required info to notify an admin of the issue, since we don't have a monitoring solution or email-based solution in place yet.
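
One way the unified handling proposed here might look is sketched below, with both exception blocks delegating to a single helper. This is only an illustration of the question being asked, not what the PR merged; `ParseTask`, `DatabaseError`, `reject_with_error`, and the message text are all hypothetical.

```python
from dataclasses import dataclass, field


class DatabaseError(Exception):
    """Hypothetical stand-in for a database-layer exception."""


@dataclass
class ParseTask:
    """Hypothetical stand-in for the state the parse task touches."""
    status: str = "Pending"
    log_entries: list = field(default_factory=list)
    feedback_errors: list = field(default_factory=list)


def reject_with_error(task, label, exc):
    """Shared handling: create a log entry, reject the file, generate an error."""
    task.log_entries.append(f"{label}: {exc}")
    task.status = "Rejected"
    task.feedback_errors.append(f"{label} encountered; please notify an administrator.")


def run_parse_task(task, do_parse):
    """Both exception blocks perform the same three actions via the helper."""
    try:
        task.status = do_parse()
    except DatabaseError as e:
        reject_with_error(task, "Database exception", e)
    except Exception as e:
        reject_with_error(task, "Uncaught exception", e)
    return task
```

Keeping two `except` blocks preserves distinct labels in the log while guaranteeing that neither path leaves the file pending.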

@ADPennington removed the Deploy with CircleCI-qasp label Aug 13, 2024
@ADPennington added the Deploy with CircleCI-qasp label Aug 13, 2024
Collaborator

@ADPennington left a comment

@elipe17 this is approved 🚀 Below are some notes from our asyncs:

  • We were able to trigger a generic uncaught exception with a tribal closed file, and another bug ticket is needed to address it. In the meantime, we confirmed that an actionable error message is included in the error report.
  • "Stress-testing" the system by submitting large files with many detected errors did trigger the ConnectionTimeout and TransportError exceptions, which were added to the LogEntries as expected. However, the files remain stuck in a Pending state, and as shown below, this also appears to be associated with the worker terminating prematurely. "Spike - Investigate celery worker terminating prematurely" (#3144) will track investigations into the worker terminating. In the meantime, the LogEntries can help with tracking the incidence of these exceptions and any associations with the worker.
    [image]
  • I have not yet been able to introduce an uncaught exception-related error into the feedback report for STTs. We'll monitor this in prod.
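
As a small sketch of how log entries could be used to track the incidence of the exceptions named in the second note, the snippet below counts messages that mention each exception name. It works on plain strings; the `count_tracked_exceptions` helper is hypothetical, and how messages are pulled from LogEntries is left out because it is not shown in this PR.

```python
from collections import Counter

# Exception names mentioned in the review notes.
TRACKED = ("ConnectionTimeout", "TransportError")


def count_tracked_exceptions(messages):
    """Count how many log messages mention each tracked exception name."""
    counts = Counter()
    for message in messages:
        for name in TRACKED:
            if name in message:
                counts[name] += 1
    return counts
```

Grouping the same counts by date or by worker would be a natural extension when looking for an association with the worker terminating.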

@ADPennington added the Ready to Merge label and removed the QASP Review and Deploy with CircleCI-qasp labels Aug 13, 2024
@elipe17 merged commit 2e954b4 into develop Aug 14, 2024
14 checks passed
@elipe17 deleted the 3055-exception-handling branch August 14, 2024 01:09
Development

Successfully merging this pull request may close these issues.

Service timeout blocks parsing completion
6 participants