Remove unfinished usage job entries of the host on start #10848

Open
wants to merge 1 commit into base: 4.19

Conversation

sureshanaparti
Contributor

@sureshanaparti sureshanaparti commented May 12, 2025

Description

This PR removes unfinished usage job entries of the host.

The stale/unfinished usage job entries prevent the latest job from being picked up for processing, which stops usage from running. To fix this, clean up the unfinished usage jobs during start.
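
For illustration, a minimal sketch of the behaviour described above, assuming a DAO method shaped like the removeLastOpenJobsOwned(hostname, pid) call quoted later in this conversation; the interface and class names below are placeholders, not the PR's actual code.

// Minimal sketch (assumptions marked; not the PR's exact code). On Usage server
// start, remove any usage_job rows that a previous, non-graceful shutdown of this
// host left unfinished, so the scheduler can pick up the latest job again.
interface UsageJobCleanup {
    // Assumed signature, based on the call quoted later in this PR:
    //   _usageJobDao.removeLastOpenJobsOwned(_hostname, 0);
    void removeLastOpenJobsOwned(String hostname, long pid);
}

class UsageServerStart {
    private final UsageJobCleanup usageJobDao;
    private final String hostname;

    UsageServerStart(UsageJobCleanup usageJobDao, String hostname) {
        this.usageJobDao = usageJobDao;
        this.hostname = hostname;
    }

    void start() {
        // pid = 0 is assumed to mean "any process on this host" here,
        // i.e. clean up everything this host left behind before scheduling.
        usageJobDao.removeLastOpenJobsOwned(hostname, 0);
        // ... then continue with the normal scheduling of the usage job
    }
}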

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested with some unfinished job entries in the cloud_usage.usage_job table.

2025-05-12 09:10:25,875 INFO  [cloud.usage.UsageManagerImpl] (main:null) (logid:) Starting Usage Manager
2025-05-12 09:10:25,985 INFO  [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Found 13 opens job, to remove
2025-05-12 09:10:25,986 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17099, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:20:58 UTC 2020
2025-05-12 09:10:25,998 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17100, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:25:58 UTC 2020
2025-05-12 09:10:26,005 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17101, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:30:58 UTC 2020
2025-05-12 09:10:26,014 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17102, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:34:58 UTC 2020
2025-05-12 09:10:26,022 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17103, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:39:58 UTC 2020
2025-05-12 09:10:26,028 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17104, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:44:58 UTC 2020
2025-05-12 09:10:26,034 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17105, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 23:19:58 UTC 2020
2025-05-12 09:10:26,040 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19990, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:16 UTC 2021
2025-05-12 09:10:26,045 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19991, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:19 UTC 2021
2025-05-12 09:10:26,051 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19992, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:19 UTC 2021
2025-05-12 09:10:26,057 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19993, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:19 UTC 2021
2025-05-12 09:10:26,064 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19994, pid: 814, job type: 0, scheduled: 0, heartbeat: Fri Oct 22 22:16:16 UTC 2021
2025-05-12 09:10:26,070 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 26119, pid: 1855871, job type: 0, scheduled: 0, heartbeat: Mon May 12 09:00:08 UTC 2025

How did you try to break this feature and the system with this change?

@sureshanaparti
Contributor Author

@blueorangutan package


codecov bot commented May 12, 2025

Codecov Report

Attention: Patch coverage is 0% with 31 lines in your changes missing coverage. Please review.

Project coverage is 15.17%. Comparing base (f0838cd) to head (b5b2f11).
Report is 18 commits behind head on 4.19.

Files with missing lines Patch % Lines
...main/java/com/cloud/usage/dao/UsageJobDaoImpl.java 0.00% 26 Missing ⚠️
...rc/main/java/com/cloud/usage/UsageManagerImpl.java 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10848      +/-   ##
============================================
- Coverage     15.17%   15.17%   -0.01%     
- Complexity    11339    11342       +3     
============================================
  Files          5414     5415       +1     
  Lines        475185   475346     +161     
  Branches      57991    58006      +15     
============================================
+ Hits          72105    72126      +21     
- Misses       395018   395155     +137     
- Partials       8062     8065       +3     
Flag Coverage Δ
uitests 4.28% <ø> (ø)
unittests 15.89% <0.00%> (-0.01%) ⬇️


@sureshanaparti
Contributor Author

@blueorangutan package

@sureshanaparti sureshanaparti changed the title Remove unfinished usage job entries for the host Remove unfinished usage job entries of the host May 12, 2025
@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@sureshanaparti sureshanaparti force-pushed the remove_unfinished_usage_job_entries branch from 0aff3df to b5b2f11 on May 12, 2025 10:11
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13340

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@winterhazel winterhazel self-requested a review May 12, 2025 11:55
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13342

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13278)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 51263 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10848-t13278-kvm-ol8.zip
Smoke tests completed. 129 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_create_template Error 13.79 test_templates.py
test_CreateTemplateWithDuplicateName Error 25.66 test_templates.py
test_02_create_template_with_checksum_sha1 Error 66.08 test_templates.py
test_03_create_template_with_checksum_sha256 Error 65.91 test_templates.py
test_05_create_template_with_no_checksum Error 121.50 test_templates.py
test_03_delete_template Error 1.13 test_templates.py
test_04_extract_template Error 1.13 test_templates.py
test_01_template_usage Error 13.79 test_usage.py
test_01_volume_usage Error 191.54 test_usage.py
test_04_deploy_vnf_appliance Error 312.40 test_vnf_templates.py
ContextSuite context=TestRVPCSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcRemoteAccessVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py

@DaanHoogland
Contributor

errors seem unrelated but running again to verify

@blueorangutan

[SF] Trillian test result (tid-13283)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 52445 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10848-t13283-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland changed the title Remove unfinished usage job entries of the host Remove unfinished usage job entries of the host on start May 14, 2025
@DaanHoogland DaanHoogland requested a review from Copilot May 14, 2025 05:59

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances the usage processing by cleaning up unfinished usage job entries on startup and shutdown. Key changes include:

  • Invoking removal of unfinished jobs at startup and in a shutdown hook.
  • Refining logging messages and variable naming for clarity.
  • Adding a new DAO method in the UsageJobDao interface and its implementation to support the cleanup.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
usage/src/main/java/com/cloud/usage/UsageManagerImpl.java Added removal of open jobs on startup and refined shutdown behavior.
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDaoImpl.java Introduced helper methods for retrieving and removing open jobs and improved constant usage.
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDao.java Added the interface definition for the new removal method.
Comments suppressed due to low confidence (1)

usage/src/main/java/com/cloud/usage/UsageManagerImpl.java:322

  • Confirm that using 0 for the pid is the intended behavior during startup (to remove all unfinished jobs for the host), as the shutdown hook removes jobs using _pid.
_usageJobDao.removeLastOpenJobsOwned(_hostname, 0);
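
For context, a sketch of one plausible shape of the DAO-side cleanup, inferred from the log lines in the test output above. All type and method names below are hypothetical; the actual UsageJobDaoImpl changes in this PR may differ.

import java.util.Date;
import java.util.List;

// Hypothetical stand-ins for the PR's search/expunge DAO calls.
record OpenUsageJob(long id, long pid, int jobType, int scheduled, Date heartbeat) {}

interface OpenJobStore {
    List<OpenUsageJob> findOpenJobs(String hostname, long pid); // pid == 0: any pid on this host
    void expunge(long jobId);
}

class OpenJobCleaner {
    void removeLastOpenJobsOwned(OpenJobStore store, String hostname, long pid) {
        List<OpenUsageJob> openJobs = store.findOpenJobs(hostname, pid);
        System.out.printf("Found %d open jobs to remove%n", openJobs.size());
        for (OpenUsageJob job : openJobs) {
            System.out.printf("Removing job - id: %d, pid: %d, job type: %d, scheduled: %d, heartbeat: %s%n",
                    job.id(), job.pid(), job.jobType(), job.scheduled(), job.heartbeat());
            store.expunge(job.id());
        }
    }
}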

@DaanHoogland DaanHoogland modified the milestones: 4.19.3, 4.19.4 May 14, 2025
@winterhazel
Member

@DaanHoogland @sureshanaparti I did not see #8896 previously, so I'm commenting for both this PR and #8896 because they are related to the same issue.

I think this situation was originally supposed to be covered by the take-over mechanism in the Usage heartbeat thread. It would be good to check first if there is an issue with this mechanism.

Briefly looking at the code, it seems that removing unfinished job entries may (in rare situations) lead to two Usage servers thinking that they own the current job (there will be two "unfinished" job entries), or to a job entry that has already finished being updated.

By the way, can you guys provide some more detailed information about how the issue is being reproduced? E.g.: start the Usage server (job is scheduled), change the global setting so that the Usage job executes in the next minute, kill -9 the Usage server and start it again. It would be good to have some idea of how much time passed between when Usage was killed, when it was started again, and when it tried to run the job.

@DaanHoogland
Contributor

@winterhazel , the scenario was on a client production system and we guess it was a power-off, not allowing for a graceful shutdown. The heartbeat thread has not been changed since 2015, and before the shutdown hook was implemented even a single open job would not be replaced. The strange thing that we noticed was multiple open jobs for different hosts and PIDs being pending. This is what @sureshanaparti is trying to mitigate. We, unfortunately, have no reproduction scenario (save just faking records in the DB before starting).

@rohityadavcloud rohityadavcloud modified the milestones: 4.19.4, 4.20.2 May 16, 2025
Member

@rohityadavcloud rohityadavcloud left a comment

LGTM, didn't test

@DaanHoogland
Contributor

I think this situation was originally supposed to be covered by the take-over mechanism in the Usage heartbeat thread. It would be good to check first if there is an issue with this mechanism.

Can you refer to a PR that implements this @winterhazel ?

Briefly looking at the code, it seems that removing unfinished job entries may (in rare situations) lead into two Usage servers thinking that they own the current job (there will be two "unfinished" job entries), or a job entry that already finished being updated.

@sureshanaparti , I think @winterhazel has a point here. Can you comment on this scenario?

@winterhazel
Member

Can you refer to a PR that implements this @winterhazel ?

@DaanHoogland unfortunately no. This mechanism was introduced alongside Usage in commit ee81905, and there is no documentation about it except the code in the heartbeat thread.

Also, just notifying that I want to do some tests regarding this mechanism and the possible situations that I mentioned previously to see whether there is a better approach before approving this one.

@DaanHoogland
Contributor

Also, just notifying that I want to do some tests regarding this mechanism and the possible situations that I mentioned previously to see whether there is a better approach before approving this one.

Can you update, @winterhazel? I think this one is ok as a patch, given that improvements and research into why the original implementation didn't work can be done later.

@winterhazel
Member

Can you update, @winterhazel ?

@DaanHoogland I had a look into the code, and found that one situation in which it happens is when the last open job is a "single" job (a job that was created via the generateUsageRecords API) that was assigned by the API either to a Usage server process that is no longer running, or to none. The take-over mechanism does not consider these jobs, so they remain open indefinitely, blocking Usage from executing.

One way to reproduce:

  1. Stop Usage and ensure that there are no open jobs left
  2. Call generateUsageRecords
  3. Start Usage

Doing this will result in the Usage job never executing.

I think that the ideal approach here would be adapting the take-over mechanism to consider "single" jobs as well, but I am also ok with this patch for the moment. It does allow Usage to execute when the blocking job was assigned to a dead process from the same host, which I think covers most production scenarios. However, if it was assigned to a host that does not run Usage anymore, or not assigned at all, Usage still won't run. Thus, I will implement my suggestion to also cover these cases when I get some time.
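
For illustration only, a rough sketch of the adaptation suggested above (not part of this PR): let the take-over check also treat open "single" jobs (job_type = 1, end_millis = 0) as claimable when their owning process is gone or they were never assigned. Every name below is hypothetical; the real heartbeat/take-over code in the Usage server is not shown here and may be structured quite differently.

import java.util.Optional;

// Hypothetical sketch of the suggestion above, not the project's actual code.
class TakeOverSketch {
    static final int JOB_TYPE_RECURRING = 0;
    static final int JOB_TYPE_SINGLE = 1;    // created via the generateUsageRecords API

    record UsageJob(long id, int jobType, long endMillis, String host, long pid) {
        boolean isOpen() { return endMillis == 0; }
    }

    interface JobOwnerCheck {
        boolean ownerStillAlive(String host, long pid); // hypothetical liveness check
    }

    // Should this Usage server take over the last open job instead of waiting on it?
    boolean shouldTakeOver(Optional<UsageJob> lastOpenJob, JobOwnerCheck check) {
        if (lastOpenJob.isEmpty() || !lastOpenJob.get().isOpen()) {
            return false; // nothing blocking
        }
        UsageJob job = lastOpenJob.get();
        if (job.jobType() == JOB_TYPE_SINGLE) {
            // Suggested change: single jobs that were never assigned, or whose
            // owner is dead, should also be claimable instead of blocking forever.
            return job.pid() == 0 || !check.ownerStillAlive(job.host(), job.pid());
        }
        // Per the discussion, recurring jobs are already covered by the existing
        // heartbeat-based take-over; shown here only for completeness.
        return !check.ownerStillAlive(job.host(), job.pid());
    }
}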

The strange thing that we noticed was multiple open jobs for different hosts and PIDs being pending

Do you still have access to these entries to confirm whether it was the same situation that I found? Was there a "single" open job (job_type=1 and end_millis=0)?

@sureshanaparti , I think @winterhazel has a point here. Can you comment on this scenario?

On a side note, regarding the concurrency problems that I mentioned: I checked that they can already happen on the current version. This patch will not make them worse. The way Usage handles the jobs should be addressed instead.

@DaanHoogland
Contributor

tnx @winterhazel , we still need testing so we’ll merge after the releases.

Contributor

@harikrishna-patnala harikrishna-patnala left a comment

code LGTM
