Remove unfinished usage job entries of the host on start #10848

Open
wants to merge 1 commit into base: 4.19

Conversation

sureshanaparti
Contributor

@sureshanaparti sureshanaparti commented May 12, 2025

Description

This PR removes unfinished usage job entries of the host.

The stale/unfinished usage job entries prevent the latest job from being picked up for processing, which stops usage from running. To fix this, clean up the unfinished usage jobs during start.
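
For illustration, a minimal sketch of the behaviour described above, assuming a DAO method shaped like the removeLastOpenJobsOwned(hostname, pid) call quoted later in this conversation; the interface and class names below are placeholders, not the PR's actual code.

// Minimal sketch (assumptions marked; not the PR's exact code). On Usage server
// start, remove any usage_job rows that a previous, non-graceful shutdown of this
// host left unfinished, so the scheduler can pick up the latest job again.
interface UsageJobCleanup {
    // Assumed signature, based on the call quoted later in this PR:
    //   _usageJobDao.removeLastOpenJobsOwned(_hostname, 0);
    void removeLastOpenJobsOwned(String hostname, long pid);
}

class UsageServerStart {
    private final UsageJobCleanup usageJobDao;
    private final String hostname;

    UsageServerStart(UsageJobCleanup usageJobDao, String hostname) {
        this.usageJobDao = usageJobDao;
        this.hostname = hostname;
    }

    void start() {
        // pid = 0 is assumed to mean "any process on this host" here,
        // i.e. clean up everything this host left behind before scheduling.
        usageJobDao.removeLastOpenJobsOwned(hostname, 0);
        // ... then continue with the normal scheduling of the usage job
    }
}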

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested with some unfinished job entries in the cloud_usage.usage_job table.

2025-05-12 09:10:25,875 INFO  [cloud.usage.UsageManagerImpl] (main:null) (logid:) Starting Usage Manager
2025-05-12 09:10:25,985 INFO  [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Found 13 opens job, to remove
2025-05-12 09:10:25,986 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17099, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:20:58 UTC 2020
2025-05-12 09:10:25,998 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17100, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:25:58 UTC 2020
2025-05-12 09:10:26,005 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17101, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:30:58 UTC 2020
2025-05-12 09:10:26,014 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17102, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:34:58 UTC 2020
2025-05-12 09:10:26,022 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17103, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:39:58 UTC 2020
2025-05-12 09:10:26,028 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17104, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 10:44:58 UTC 2020
2025-05-12 09:10:26,034 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 17105, pid: 683, job type: 0, scheduled: 0, heartbeat: Fri Nov 27 23:19:58 UTC 2020
2025-05-12 09:10:26,040 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19990, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:16 UTC 2021
2025-05-12 09:10:26,045 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19991, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:19 UTC 2021
2025-05-12 09:10:26,051 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19992, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:19 UTC 2021
2025-05-12 09:10:26,057 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19993, pid: 814, job type: 0, scheduled: 0, heartbeat: Thu Oct 21 22:17:19 UTC 2021
2025-05-12 09:10:26,064 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 19994, pid: 814, job type: 0, scheduled: 0, heartbeat: Fri Oct 22 22:16:16 UTC 2021
2025-05-12 09:10:26,070 DEBUG [usage.dao.UsageJobDaoImpl] (main:null) (logid:) Removing job - id: 26119, pid: 1855871, job type: 0, scheduled: 0, heartbeat: Mon May 12 09:00:08 UTC 2025

How did you try to break this feature and the system with this change?

@sureshanaparti
Contributor Author

@blueorangutan package


codecov bot commented May 12, 2025

Codecov Report

Attention: Patch coverage is 0% with 31 lines in your changes missing coverage. Please review.

Project coverage is 15.17%. Comparing base (f0838cd) to head (b5b2f11).
Report is 18 commits behind head on 4.19.

Files with missing lines Patch % Lines
...main/java/com/cloud/usage/dao/UsageJobDaoImpl.java 0.00% 26 Missing ⚠️
...rc/main/java/com/cloud/usage/UsageManagerImpl.java 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10848      +/-   ##
============================================
- Coverage     15.17%   15.17%   -0.01%     
- Complexity    11339    11342       +3     
============================================
  Files          5414     5415       +1     
  Lines        475185   475346     +161     
  Branches      57991    58006      +15     
============================================
+ Hits          72105    72126      +21     
- Misses       395018   395155     +137     
- Partials       8062     8065       +3     
Flag Coverage Δ
uitests 4.28% <ø> (ø)
unittests 15.89% <0.00%> (-0.01%) ⬇️


@sureshanaparti
Contributor Author

@blueorangutan package

@sureshanaparti sureshanaparti changed the title Remove unfinished usage job entries for the host Remove unfinished usage job entries of the host May 12, 2025
@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@sureshanaparti sureshanaparti force-pushed the remove_unfinished_usage_job_entries branch from 0aff3df to b5b2f11 on May 12, 2025 10:11
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13340

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@winterhazel winterhazel self-requested a review May 12, 2025 11:55
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13342

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13278)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 51263 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10848-t13278-kvm-ol8.zip
Smoke tests completed. 129 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_create_template Error 13.79 test_templates.py
test_CreateTemplateWithDuplicateName Error 25.66 test_templates.py
test_02_create_template_with_checksum_sha1 Error 66.08 test_templates.py
test_03_create_template_with_checksum_sha256 Error 65.91 test_templates.py
test_05_create_template_with_no_checksum Error 121.50 test_templates.py
test_03_delete_template Error 1.13 test_templates.py
test_04_extract_template Error 1.13 test_templates.py
test_01_template_usage Error 13.79 test_usage.py
test_01_volume_usage Error 191.54 test_usage.py
test_04_deploy_vnf_appliance Error 312.40 test_vnf_templates.py
ContextSuite context=TestRVPCSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcRemoteAccessVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py

@DaanHoogland
Contributor

errors seem unrelated but running again to verify

@blueorangutan

[SF] Trillian test result (tid-13283)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 52445 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10848-t13283-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland changed the title Remove unfinished usage job entries of the host Remove unfinished usage job entries of the host on start May 14, 2025
@DaanHoogland DaanHoogland requested a review from Copilot May 14, 2025 05:59

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances the usage processing by cleaning up unfinished usage job entries on startup and shutdown. Key changes include:

  • Invoking removal of unfinished jobs at startup and in a shutdown hook.
  • Refining logging messages and variable naming for clarity.
  • Adding a new DAO method in the UsageJobDao interface and its implementation to support the cleanup.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
usage/src/main/java/com/cloud/usage/UsageManagerImpl.java Added removal of open jobs on startup and refined shutdown behavior.
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDaoImpl.java Introduced helper methods for retrieving and removing open jobs and improved constant usage.
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDao.java Added the interface definition for the new removal method.
Comments suppressed due to low confidence (1)

usage/src/main/java/com/cloud/usage/UsageManagerImpl.java:322

  • Confirm that using 0 for the pid is the intended behavior during startup (to remove all unfinished jobs for the host), as the shutdown hook removes jobs using _pid.
_usageJobDao.removeLastOpenJobsOwned(_hostname, 0);
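
For context, a sketch of one plausible shape of the DAO-side cleanup, inferred from the log lines in the test output above. All type and method names below are hypothetical; the actual UsageJobDaoImpl changes in this PR may differ.

import java.util.Date;
import java.util.List;

// Hypothetical stand-ins for the PR's search/expunge DAO calls.
record OpenUsageJob(long id, long pid, int jobType, int scheduled, Date heartbeat) {}

interface OpenJobStore {
    List<OpenUsageJob> findOpenJobs(String hostname, long pid); // pid == 0: any pid on this host
    void expunge(long jobId);
}

class OpenJobCleaner {
    void removeLastOpenJobsOwned(OpenJobStore store, String hostname, long pid) {
        List<OpenUsageJob> openJobs = store.findOpenJobs(hostname, pid);
        System.out.printf("Found %d open jobs to remove%n", openJobs.size());
        for (OpenUsageJob job : openJobs) {
            System.out.printf("Removing job - id: %d, pid: %d, job type: %d, scheduled: %d, heartbeat: %s%n",
                    job.id(), job.pid(), job.jobType(), job.scheduled(), job.heartbeat());
            store.expunge(job.id());
        }
    }
}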

@DaanHoogland DaanHoogland modified the milestones: 4.19.3, 4.19.4 May 14, 2025
@winterhazel
Member

@DaanHoogland @sureshanaparti I did not see #8896 previously, so I'm commenting for both this PR and #8896 because they are related to the same issue.

I think this situation was originally supposed to be covered by the take-over mechanism in the Usage heartbeat thread. It would be good to check first if there is an issue with this mechanism.

Briefly looking at the code, it seems that removing unfinished job entries may (in rare situations) lead to two Usage servers thinking that they own the current job (there will be two "unfinished" job entries), or to a job entry that has already finished being updated.

By the way, can you guys provide some more detailed information about how the issue is being reproduced? E.g.: start the Usage server (job is scheduled), change the global setting so that the Usage job executes in the next minute, kill -9 the Usage server and start it again. It would be good to have some idea of how much time passed between when Usage was killed, when it was started again, and when it tried to run the job.

@DaanHoogland
Contributor

@winterhazel , the scenario was on a client production system and we guess it was a power-off, not allowing for a graceful shutdown. The heartbeat thread has not been changed since 2015, and before the shutdown hook was implemented even a single open job would not be replaced. The strange thing that we noticed was multiple open jobs for different hosts and PIDs being pending. This is what @sureshanaparti is trying to mitigate. We, unfortunately, have no reproduction scenario (save just faking records in the DB before starting).

@rohityadavcloud rohityadavcloud modified the milestones: 4.19.4, 4.20.2 May 16, 2025
Member

@rohityadavcloud rohityadavcloud left a comment

LGTM, didn't test

@DaanHoogland
Contributor

I think this situation was originally supposed to be covered by the take-over mechanism in the Usage heartbeat thread. It would be good to check first if there is an issue with this mechanism.

Can you refer to a PR that implements this @winterhazel ?

Briefly looking at the code, it seems that removing unfinished job entries may (in rare situations) lead into two Usage servers thinking that they own the current job (there will be two "unfinished" job entries), or a job entry that already finished being updated.

@sureshanaparti , I think @winterhazel has a point here. Can you comment on this scenario?

@winterhazel
Member

Can you refer to a PR that implements this @winterhazel ?

@DaanHoogland unfortunately no. This mechanism was introduced alongside Usage in commit ee81905, and there is no documentation about it except the code in the heartbeat thread.

Also, just notifying that I want to do some tests regarding this mechanism and the possible situations that I mentioned previously to see whether there is a better approach before approving this one.

@DaanHoogland
Contributor

Also, just notifying that I want to do some tests regarding this mechanism and the possible situations that I mentioned previously to see whether there is a better approach before approving this one.

Can you update, @winterhazel? I think this one is ok as a patch, given that improvements and research into why the original implementation didn't work can be done later.

@winterhazel
Member

Can you update, @winterhazel ?

@DaanHoogland I had a look into the code, and found that one situation in which it happens is when the last open job is a "single" job (a job that was created via the generateUsageRecords API) that was assigned by the API either to a Usage server process that is no longer running, or to none. The take-over mechanism does not consider these jobs, so they remain open indefinitely, blocking Usage from executing.

One way to reproduce:

  1. Stop Usage and ensure that there are no open jobs left
  2. Call generateUsageRecords
  3. Start Usage

Doing this will result in the Usage job never executing.

I think that the ideal approach here would be adapting the take-over mechanism to consider "single" jobs as well, but I am also ok with this patch for the moment. It does allow Usage to execute when the blocking job was assigned to a dead process from the same host, which I think covers most production scenarios. However, if it was assigned to a host that does not run Usage anymore, or not assigned at all, Usage still won't run. Thus, I will implement my suggestion to also cover these cases when I get some time.
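
For illustration only, a rough sketch of the adaptation suggested above (not part of this PR): let the take-over check also treat open "single" jobs (job_type = 1, end_millis = 0) as claimable when their owning process is gone or they were never assigned. Every name below is hypothetical; the real heartbeat/take-over code in the Usage server is not shown here and may be structured quite differently.

import java.util.Optional;

// Hypothetical sketch of the suggestion above, not the project's actual code.
class TakeOverSketch {
    static final int JOB_TYPE_RECURRING = 0;
    static final int JOB_TYPE_SINGLE = 1;    // created via the generateUsageRecords API

    record UsageJob(long id, int jobType, long endMillis, String host, long pid) {
        boolean isOpen() { return endMillis == 0; }
    }

    interface JobOwnerCheck {
        boolean ownerStillAlive(String host, long pid); // hypothetical liveness check
    }

    // Should this Usage server take over the last open job instead of waiting on it?
    boolean shouldTakeOver(Optional<UsageJob> lastOpenJob, JobOwnerCheck check) {
        if (lastOpenJob.isEmpty() || !lastOpenJob.get().isOpen()) {
            return false; // nothing blocking
        }
        UsageJob job = lastOpenJob.get();
        if (job.jobType() == JOB_TYPE_SINGLE) {
            // Suggested change: single jobs that were never assigned, or whose
            // owner is dead, should also be claimable instead of blocking forever.
            return job.pid() == 0 || !check.ownerStillAlive(job.host(), job.pid());
        }
        // Per the discussion, recurring jobs are already covered by the existing
        // heartbeat-based take-over; shown here only for completeness.
        return !check.ownerStillAlive(job.host(), job.pid());
    }
}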

The strange thing that we noticed was multiple open jobs for different hosts and PIDs being pending

Do you still have access to these entries to confirm whether it was the same situation that I found? Was there a "single" open job (job_type=1 and end_millis=0)?

@sureshanaparti , I think @winterhazel has a point here. Can you comment on this scenario?

On a side note, regarding the concurrency problems that I mentioned: I checked that they can already happen on the current version. This patch will not make them worse. The way Usage handles the jobs should be addressed instead.

@DaanHoogland
Contributor

tnx @winterhazel , we still need testing so we’ll merge after the releases.

Contributor

@harikrishna-patnala harikrishna-patnala left a comment

code LGTM
