Disabled the setting reboot.host.and.alert.management.on.heartbeat.timeout by default #10111

Draft: wants to merge 1 commit into base: main

@slavkap (Contributor) commented Dec 16, 2024

Description

This PR disables the setting `reboot.host.and.alert.management.on.heartbeat.timeout` by default. With the setting enabled, CloudStack reboots the host when there is a storage issue, even if high availability isn't enabled.
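
The behaviour described here can be modelled roughly as follows. This is a minimal Python sketch, not CloudStack's code: the function `on_heartbeat_timeout`, the `TimeoutAction` type, and the assumption that a disabled setting results in neither a reboot nor an alert are all hypothetical and inferred from the setting's name. Only the setting name and the point that HA status does not influence the reboot come from this PR.

```python
from dataclasses import dataclass

# Setting name taken from the PR title; everything else is illustrative.
SETTING = "reboot.host.and.alert.management.on.heartbeat.timeout"


@dataclass(frozen=True)
class TimeoutAction:
    """What the host does after a storage heartbeat timeout (hypothetical type)."""
    reboot_host: bool
    alert_management: bool


def on_heartbeat_timeout(settings, ha_enabled, default=False):
    """Hypothetical model of the decision taken on a heartbeat timeout.

    `default` stands for the value used when the setting is absent:
    True models the behaviour before this PR, False the proposed default.
    """
    # ha_enabled is deliberately unused: per the description, the reboot
    # happens whether or not high availability is enabled.
    enabled = bool(settings.get(SETTING, default))
    return TimeoutAction(reboot_host=enabled, alert_management=enabled)


if __name__ == "__main__":
    # Old default (True) versus proposed default (False), HA disabled in both.
    print(on_heartbeat_timeout({}, ha_enabled=False, default=True))
    print(on_heartbeat_timeout({}, ha_enabled=False, default=False))
```

In this model an operator who still wants the old behaviour sets the value explicitly, for example `on_heartbeat_timeout({SETTING: True}, ha_enabled=False)`.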

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@DaanHoogland DaanHoogland added this to the 4.19.2 milestone Dec 16, 2024

codecov bot commented Dec 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 15.12%. Comparing base (a2f2e87) to head (79a5f78).

Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10111      +/-   ##
============================================
- Coverage     15.13%   15.12%   -0.01%     
+ Complexity    11268    11262       -6     
============================================
  Files          5408     5408              
  Lines        473867   473867              
  Branches      57778    57778              
============================================
- Hits          71700    71684      -16     
- Misses       394165   394185      +20     
+ Partials       8002     7998       -4     
Flag Coverage Δ
uitests 4.30% <ø> (ø)
unittests 15.84% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@DaanHoogland (Contributor)

@slavkap, have you tested this with HA enabled?

@weizhouapache (Member)

@slavkap, can you start a discussion on the dev/user mailing list?

This changes the current behaviour.
IMHO, if there are no objections, we could merge it in 4.21 (the next major release), but not in 4.20/4.19.

`reboot.host.and.alert.management.on.heartbeat.timeout` has to be
disabled. Even when high availability isn't enabled, CloudStack will
reboot the host when there is an issue with a storage

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@slavkap slavkap changed the base branch from 4.19 to main December 17, 2024 09:52
@slavkap (Contributor, Author) commented Dec 17, 2024

@DaanHoogland, I've tested this with and without HA
@weizhouapache, sure, I'll start a discussion for this

@slavkap slavkap marked this pull request as draft December 17, 2024 09:56
@DaanHoogland DaanHoogland changed the title from "Disabled the setting do-not-reboot-host-on-heartbeat-timeout to not reboot a host on heartbeat timeout" to "Disabled the setting reboot.host.and.alert.management.on.heartbeat.timeout by default" Jan 8, 2025
@DaanHoogland (Contributor)

@slavkap, I changed the title. Hope you don't mind. It was a bit confusing to me.
Are you still looking into this?

@slavkap (Contributor, Author) commented Jan 10, 2025

@DaanHoogland, I don't mind the change, thanks!
Yes, I opened a discussion in the mailing list for this

@DaanHoogland DaanHoogland modified the milestones: 4.19.2, 4.19.3 Feb 3, 2025
@DaanHoogland (Contributor)

moved forward

@slavkap slavkap modified the milestones: 4.19.3, 4.21.0 Feb 3, 2025
@slavkap (Contributor, Author) commented Feb 3, 2025

@DaanHoogland, I rebased it on main, as @weizhouapache suggested possibly merging it in a major release.

@boubouX commented Mar 28, 2025

We were unfortunate enough to hit this issue: it caused cascading reboots of all our hosts while the NFS server had no running VM. It was an operational nightmare that resulted in approximately 45 minutes of downtime. Changing the default value to false offers us more gain than loss. We have changed it in our own settings; thank you, Wei. This was simply catastrophic!

@hanisirfan

As someone who works with VMware products, I have never seen a host reboot when datastores are inaccessible. I believe changing the default for CloudStack to "false" is a great move.

@sureshanaparti (Contributor)

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13621

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13671

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13677

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@weizhouapache (Member) left a comment

code lgtm

@DaanHoogland (Contributor)

@sureshanaparti, I think we can merge this one, pending smoke tests. But it merits a note in the release notes page for the next version.

@blueorangutan

[SF] Trillian test result (tid-13502)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 55426 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10111-t13502-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Projects
Status: In Progress
7 participants