Remove unfinished usage job entries of the host on start #10848
Conversation
@blueorangutan package
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #10848 +/- ##
============================================
- Coverage 15.17% 15.17% -0.01%
- Complexity 11339 11342 +3
============================================
Files 5414 5415 +1
Lines 475185 475346 +161
Branches 57991 58006 +15
============================================
+ Hits 72105 72126 +21
- Misses 395018 395155 +137
- Partials 8062 8065 +3
Flags with carried forward coverage won't be shown.
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Force-pushed from 0aff3df to b5b2f11.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13340
code LGTM
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13342
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-13278)
Errors seem unrelated, but running again to verify.
[SF] Trillian test result (tid-13283)
Pull Request Overview
This PR enhances the usage processing by cleaning up unfinished usage job entries on startup and shutdown. Key changes include:
- Invoking removal of unfinished jobs at startup and in a shutdown hook.
- Refining logging messages and variable naming for clarity.
- Adding a new DAO method in the UsageJobDao interface and its implementation to support the cleanup.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
usage/src/main/java/com/cloud/usage/UsageManagerImpl.java | Added removal of open jobs on startup and refined shutdown behavior. |
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDaoImpl.java | Introduced helper methods for retrieving and removing open jobs and improved constant usage. |
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDao.java | Added the interface definition for the new removal method. |
Comments suppressed due to low confidence (1)
usage/src/main/java/com/cloud/usage/UsageManagerImpl.java:322
- Confirm that using 0 for the pid is the intended behavior during startup (to remove all unfinished jobs for the host), as the shutdown hook removes jobs using _pid.
_usageJobDao.removeLastOpenJobsOwned(_hostname, 0);
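For reference, a rough sketch of what such a DAO cleanup method could look like, following CloudStack's GenericDaoBase/SearchBuilder pattern. The method name is taken from the snippet above; the entity field names (host, pid, endDate), the "unfinished means no end date" condition, and the "pid 0 means any pid" interpretation are assumptions, not the PR's actual implementation:

```java
import com.cloud.usage.UsageJobVO;
import com.cloud.utils.db.GenericDaoBase;
import com.cloud.utils.db.SearchBuilder;
import com.cloud.utils.db.SearchCriteria;

// Illustrative sketch only; field names and semantics are assumptions.
public class UsageJobDaoImplSketch extends GenericDaoBase<UsageJobVO, Long> {

    // Removes usage_job rows owned by this host that never recorded an end date
    // (i.e. never finished). A pid of 0 is interpreted here as "any pid on this host",
    // mirroring the startup call _usageJobDao.removeLastOpenJobsOwned(_hostname, 0).
    public int removeLastOpenJobsOwned(String hostname, int pid) {
        SearchBuilder<UsageJobVO> sb = createSearchBuilder();
        sb.and("host", sb.entity().getHost(), SearchCriteria.Op.EQ);
        sb.and("endDate", sb.entity().getEndDate(), SearchCriteria.Op.NULL);
        if (pid > 0) {
            sb.and("pid", sb.entity().getPid(), SearchCriteria.Op.EQ);
        }
        sb.done();

        SearchCriteria<UsageJobVO> sc = sb.create();
        sc.setParameters("host", hostname);
        if (pid > 0) {
            sc.setParameters("pid", pid);
        }
        return remove(sc); // GenericDaoBase.remove(SearchCriteria) returns the row count
    }
}
```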
@DaanHoogland @sureshanaparti I did not see #8896 previously, so I'm commenting for both this PR and #8896, because they are related to the same issue. I think this situation was originally supposed to be covered by the take-over mechanism in the Usage heartbeat thread. It would be good to check first whether there is an issue with that mechanism. Briefly looking at the code, it seems that removing unfinished job entries may (in rare situations) lead to two Usage servers thinking that they own the current job (there will be two "unfinished" job entries), or to a job entry that has already finished being updated. By the way, can you guys provide some more detailed information about how the issue is being reproduced? E.g.: start the Usage server (job is scheduled), change the global setting so that the Usage job executes in the next minute, kill -9 the Usage server and start it again. It would be good to have some idea of how much time passed between when Usage was killed, started, and tried to run the job.
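For context, the take-over mechanism referred to here follows the usual heartbeat pattern: the server that owns the open job periodically refreshes a heartbeat timestamp on it, and a peer may claim the job once that heartbeat goes stale. A simplified, hypothetical illustration of that check (all names invented, not CloudStack's actual heartbeat thread code):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a heartbeat-based take-over check; class, method, and
// threshold are invented for illustration and do not exist in CloudStack.
class HeartbeatTakeOverSketch {

    static final Duration STALE_THRESHOLD = Duration.ofMinutes(10); // assumed threshold

    static class OpenJob {
        final String ownerHost;
        final int ownerPid;
        final Instant lastHeartbeat;

        OpenJob(String ownerHost, int ownerPid, Instant lastHeartbeat) {
            this.ownerHost = ownerHost;
            this.ownerPid = ownerPid;
            this.lastHeartbeat = lastHeartbeat;
        }
    }

    // A peer claims the currently open job only if it belongs to someone else and
    // its heartbeat has not been refreshed within the threshold.
    boolean shouldTakeOver(OpenJob job, String myHost, int myPid) {
        boolean ownedByMe = job.ownerHost.equals(myHost) && job.ownerPid == myPid;
        boolean stale = Duration.between(job.lastHeartbeat, Instant.now())
                .compareTo(STALE_THRESHOLD) > 0;
        return !ownedByMe && stale;
    }
}
```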
@winterhazel, the scenario was on a client production system, and we suspect it was a power-off that did not allow for a graceful shutdown. The heartbeat thread has not been changed since 2015, and before the shutdown hook was implemented, even a single open job would not be replaced. The strange thing we noticed was multiple open jobs pending for different hosts and PIDs. This is what @sureshanaparti is trying to mitigate. We, unfortunately, have no reproduction scenario (other than faking records in the DB before starting).
LGTM, didn't test.
Can you refer to a PR that implements this, @winterhazel?
@sureshanaparti, I think @winterhazel has a point here. Can you comment on this scenario?
@DaanHoogland unfortunately no. This mechanism was introduced alongside Usage in commit ee81905, and there is no documentation about it except the code in the heartbeat thread. Also, just a heads-up that I want to do some tests regarding this mechanism and the possible situations I mentioned previously, to see whether there is a better approach before approving this one.
Can you update, @winterhazel? I think this one is ok as a patch, given that improvements and research into why the original implementation didn't work can be done later.
@DaanHoogland I had a look at the code and found that one situation in which this happens is when the last open job is a "single" job (a job that was created via the …). One way to reproduce:
Doing this will result in the Usage job never executing. I think the ideal approach here would be adapting the take-over mechanism to consider "single" jobs as well, but I am also ok with this patch for the moment. It does allow Usage to execute when the blocking job was assigned to a dead process from the same host, which I think covers most production scenarios. However, if the job was assigned to a host that does not run Usage anymore, or not assigned at all, Usage still won't run. Thus, I will implement my suggestion to also cover these cases when I get some time.
Do you still have access to these entries to confirm whether it was the same situation that I found? Was there a "single" open job (…)?
On a side note, regarding the concurrency problems I mentioned: I checked that they can already happen on the current version. This patch will not make them worse. The way Usage handles the jobs should be addressed instead.
Thanks @winterhazel, we still need testing, so we'll merge after the releases.
code LGTM
Description
This PR removes unfinished usage job entries of the host.
Stale/unfinished usage job entries prevent the latest job from being picked up for processing, which keeps usage from running. To fix this, the unfinished usage jobs of the host are cleaned up during start.
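In rough terms, the wiring described above looks like the sketch below. It is a simplification based on the review snippets; the shutdown-hook mechanism and the method signature are assumptions, not the PR's exact diff:

```java
import com.cloud.usage.dao.UsageJobDao;

// Sketch of the startup/shutdown cleanup wiring; simplified, not the PR's exact code.
public class UsageStartupCleanupSketch {

    private final UsageJobDao usageJobDao;
    private final String hostname;
    private final int pid;

    public UsageStartupCleanupSketch(UsageJobDao usageJobDao, String hostname, int pid) {
        this.usageJobDao = usageJobDao;
        this.hostname = hostname;
        this.pid = pid;
    }

    public void onStart() {
        // On start: clear any unfinished jobs left behind by this host, regardless of pid
        // (matches the reviewed snippet _usageJobDao.removeLastOpenJobsOwned(_hostname, 0)).
        usageJobDao.removeLastOpenJobsOwned(hostname, 0);

        // On shutdown: clear only the job owned by this exact process (_pid in the PR).
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> usageJobDao.removeLastOpenJobsOwned(hostname, pid)));
    }
}
```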
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Tested with some unfinished job entries in the cloud_usage.usage_job table.
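For anyone trying to reproduce this locally, one option mentioned earlier in the thread is to fake open rows in cloud_usage.usage_job before starting the Usage server. A hypothetical JDBC sketch; the column list (host, pid, job_type, scheduled, heartbeat) and the job_type values are assumptions that should be checked against the actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical helper to seed an unfinished usage job entry; column names are
// assumptions and must be verified against the actual cloud_usage.usage_job schema.
public class FakeOpenUsageJob {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/cloud_usage"; // adjust host/credentials
        try (Connection conn = DriverManager.getConnection(url, "cloud", "cloud");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO usage_job (host, pid, job_type, scheduled, heartbeat) "
                             + "VALUES (?, ?, ?, ?, NOW())")) {
            ps.setString(1, "dead-usage-host"); // a host that no longer runs Usage
            ps.setInt(2, 12345);                // a pid that no longer exists
            ps.setInt(3, 0);                    // assumed: 0 = recurring, 1 = single
            ps.setInt(4, 1);
            ps.executeUpdate();                 // end_date stays NULL, i.e. "unfinished"
        }
    }
}
```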
How did you try to break this feature and the system with this change?