-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics for other unhealthy VM states #2597
Add metrics for other unhealthy VM states #2597
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Looks like this is breaking integration specs in CI starting at
Edit: trimmed CI logs, see more relevant data in #2598 |
if !request_path.nil? | ||
request_path = request_path.sub(endpoint, '') | ||
end | ||
parsed_endpoint = URI.parse(endpoint + (request_path || '')) | ||
headers = options['headers'] || {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is causing invalid requests, from integration spec logs:
==> sandbox/logs/health_monitor.out-7b5e96eb <==
E, [2025-02-01T02:17:48.089550 #17476] ERROR : Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
E, [2025-02-01T02:17:48.089706 #17476] ERROR : /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:115:in `rescue in perform_request'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:114:in `perform_request'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:48:in `block in get_deployment_instances_full'
<internal:kernel>:187:in `loop'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:46:in `get_deployment_instances_full'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:37:in `block in fetch_deployments'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:30:in `each'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:30:in `fetch_deployments'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/runner.rb:182:in `fetch_deployments'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/runner.rb:131:in `block in poll_director'
/usr/local/bundle/ruby/3.3.0/gems/async-2.21.3/lib/async/task.rb:200:in `block in run'
/usr/local/bundle/ruby/3.3.0/gems/async-2.21.3/lib/async/task.rb:438:in `block in schedule'
I, [2025-02-01T02:17:48.090186 #17476] INFO : [ALERT] Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
D, [2025-02-01T02:17:48.090430 #17476] DEBUG : (Resurrector) ignoring event of category '': Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
I, [2025-02-01T02:17:48.090784 #17476] INFO : (Event logger) notifying director about event: Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
D, [2025-02-01T02:17:48.090854 #17476] DEBUG : sending HTTP POST to: https://127.0.0.1:61002/events
@anshrupani it would be good if you could look at this and try to see why this change is causing the invalid (ex: https://127.0.0.1:61002https://127.0.0.1/tasks/151
) requests to be generated.
I'm going to revert this PR for now. The broken tests are fixable, but another issue was raised that this causes the health monitor to query the director, then the director to query all the VMs for data. This goes against the design goals of the health monitor. Health monitor is supposed to gather VM information from the heartbeats that come across the NATS message bus. It does gather basic facts from the bosh director, such as which deployments exist. Essentially things the bosh director can answer from its database. The new get_deployment_instances_full method causes the director to create a task which then polls all the VMs for their instance data. If we want to emit a metric like this, we'd need to be able to extract that information from the heartbeat data. At least some of that data already exists, although if it's not enough, you might need to first make a PR to the agent to add info to that payload. |
Sorry, I was away on vacation. I agree with @jpalermo, we'll try to figure out another way to keep the design goals intact. Thanks a lot for pointing this out. @aramprice: once we redo this PR, I guess the issue with the tasks would be taken care of as well. Thanks :) |
What is this change about?
As of now, bosh does not collect metrics regarding VMs which are unhealthy (basically any state that is not running). With this PR, we now introduce a new metric: bosh_unhealthy_instances, which basically reports the same.
What tests have you run against this PR?
We tested a bosh dev release on our development environments for the new metric, and it worked fine. We also ran the bosh-monitor unit tests, and added more as needed.
How should this change be described in bosh release notes?
Bosh now emits a new metric: "bosh_unhealthy_instances", which can be used to track unhealthy instances per deployment over time.
Does this PR introduce a breaking change?
No
Tag your pair, your PM, and/or team!
@DennisAhaus