Add metrics for other unhealthy VM states #2597

anshrupani · 2025-01-22T07:46:00Z

What is this change about?

As of now, bosh does not collect metrics regarding VMs which are unhealthy (basically any state that is not running). With this PR, we now introduce a new metric: bosh_unhealthy_instances, which basically reports the same.

What tests have you run against this PR?

We tested a bosh dev release on our development environments for the new metric, and it worked fine. We also ran the bosh-monitor unit tests, and added more as needed.

How should this change be described in bosh release notes?

Bosh now emits a new metric: "bosh_unhealthy_instances", which can be used to track unhealthy instances per deployment over time.

Does this PR introduce a breaking change?

No

Tag your pair, your PM, and/or team!

@DennisAhaus

src/bosh-monitor/lib/bosh/monitor/director.rb

src/bosh-monitor/spec/unit/bosh/monitor/director_spec.rb

src/bosh-monitor/lib/bosh/monitor/director.rb

src/bosh-monitor/spec/unit/bosh/monitor/director_spec.rb

src/bosh-monitor/lib/bosh/monitor/director.rb

xtreme-nitin-ravindran

lgtm

aramprice · 2025-01-31T01:30:19Z

Looks like this is breaking integration specs in CI starting at

https://bosh.ci.cloudfoundry.org/teams/main/pipelines/bosh-director/jobs/integration-postgres-hotswap/builds/559#L678a4bbc:5981

Failures:

  1) using director with config server when config server certificates are trusted when deployment manifest has variables when all variables are set in config server when health monitor is around and resurrector is enabled interpolates values correctly when resurrector kicks in
     Failure/Error: if vm.vm_cid == resurrected_vm.vm_cid
     
     NoMethodError:
       undefined method `vm_cid' for nil
     # ./spec/integration_support/director.rb:170:in `kill_vm_and_wait_for_resurrection'
     # ./spec/integration/config_server/config_server_spec.rb:401:in `block (6 levels) in <top (required)>'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:263:in `instance_exec'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:263:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:511:in `block in with_around_and_singleton_context_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:468:in `block in with_around_example_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/hooks.rb:486:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/hooks.rb:624:in `run_around_example_hooks_for'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/hooks.rb:486:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:468:in `with_around_example_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:511:in `with_around_and_singleton_context_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:259:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:646:in `block in run_examples'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:642:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:642:in `run_examples'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:607:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:121:in `block (3 levels) in run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:121:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:121:in `block (2 levels) in run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/configuration.rb:2097:in `with_suite_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:116:in `block in run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/reporter.rb:74:in `report'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:115:in `run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:89:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:71:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:45:in `invoke'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/exe/rspec:4:in `<top (required)>'
     # /usr/local/bundle/ruby/3.3.0/bin/rspec:25:in `load'
     # /usr/local/bundle/ruby/3.3.0/bin/rspec:25:in `<top (required)>'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli/exec.rb:59:in `load'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli/exec.rb:59:in `kernel_load'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli/exec.rb:23:in `run'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli.rb:452:in `exec'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor/command.rb:28:in `run'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor.rb:538:in `dispatch'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli.rb:35:in `dispatch'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor/base.rb:584:in `start'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli.rb:29:in `start'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/exe/bundle:28:in `block in <top (required)>'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/exe/bundle:20:in `<top (required)>'
     # /usr/local/bundle/bin/bundle:108:in `load'
     # /usr/local/bundle/bin/bundle:108:in `<main>'

Edit: trimmed CI logs, see more relevant data in #2598

aramprice · 2025-02-01T02:25:36Z

src/bosh-monitor/lib/bosh/monitor/director.rb

+      if !request_path.nil?
+       request_path = request_path.sub(endpoint, '')
+      end
+      parsed_endpoint = URI.parse(endpoint + (request_path || ''))
+      headers = options['headers'] || {}


This code is causing invalid requests, from integration spec logs:

==> sandbox/logs/health_monitor.out-7b5e96eb <== E, [2025-02-01T02:17:48.089550 #17476] ERROR : Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151 E, [2025-02-01T02:17:48.089706 #17476] ERROR : /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:115:in `rescue in perform_request' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:114:in `perform_request' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:48:in `block in get_deployment_instances_full' <internal:kernel>:187:in `loop' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:46:in `get_deployment_instances_full' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:37:in `block in fetch_deployments' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:30:in `each' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:30:in `fetch_deployments' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/runner.rb:182:in `fetch_deployments' /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/runner.rb:131:in `block in poll_director' /usr/local/bundle/ruby/3.3.0/gems/async-2.21.3/lib/async/task.rb:200:in `block in run' /usr/local/bundle/ruby/3.3.0/gems/async-2.21.3/lib/async/task.rb:438:in `block in schedule' I, [2025-02-01T02:17:48.090186 #17476] INFO : [ALERT] Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151 D, [2025-02-01T02:17:48.090430 #17476] DEBUG : (Resurrector) ignoring event of category '': Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151 I, [2025-02-01T02:17:48.090784 #17476] INFO : (Event logger) notifying director about event: Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151 D, [2025-02-01T02:17:48.090854 #17476] DEBUG : sending HTTP POST to: https://127.0.0.1:61002/events

@anshrupani it would be good if you could look at this and try to see why this change is causing the invalid (ex: https://127.0.0.1:61002https://127.0.0.1/tasks/151) requests to be generated.

jpalermo · 2025-02-03T22:41:23Z

I'm going to revert this PR for now. The broken tests are fixable, but another issue was raised that this causes the health monitor to query the director, then the director to query all the VMs for data.

This goes against the design goals of the health monitor. Health monitor is supposed to gather VM information from the heartbeats that come across the NATS message bus. It does gather basic facts from the bosh director, such as which deployments exist. Essentially things the bosh director can answer from its database.

The new get_deployment_instances_full method causes the director to create a task which then polls all the VMs for their instance data.

If we want to emit a metric like this, we'd need to be able to extract that information from the heartbeat data. At least some of that data already exists, although if it's not enough, you might need to first make a PR to the agent to add info to that payload.

anshrupani · 2025-02-04T11:00:53Z

Sorry, I was away on vacation. I agree with @jpalermo, we'll try to figure out another way to keep the design goals intact. Thanks a lot for pointing this out. @aramprice: once we redo this PR, I guess the issue with the tasks would be taken care of as well. Thanks :)

anshrupani added 4 commits January 3, 2025 10:36

Add metrics for other vm states

7f1d49b

rename detached to inactive instances

4966102

add metric for unhealthy_instances

af15b81

add unit tests for changes to the metrics

4cff26b

anshrupani changed the title ~~Add metrics for other vm states~~ Add metrics for other unhealthy VM states Jan 22, 2025

DennisAhaus suggested changes Jan 22, 2025

View reviewed changes

resolve comments

afb0eb0

jpalermo requested review from a team, aramprice and lnguyen and removed request for a team January 23, 2025 15:53

xtreme-nitin-ravindran approved these changes Jan 23, 2025

View reviewed changes

aramprice approved these changes Jan 23, 2025

View reviewed changes

lnguyen approved these changes Jan 30, 2025

View reviewed changes

fmoehler approved these changes Jan 30, 2025

View reviewed changes

benjaminguttmann-avtq approved these changes Jan 30, 2025

View reviewed changes

jpalermo merged commit e1cdc79 into cloudfoundry:main Jan 30, 2025
9 checks passed

aramprice reviewed Feb 1, 2025

View reviewed changes

aramprice mentioned this pull request Feb 1, 2025

PR #2597 "Add metrics for other unhealthy VM states" broke health monitor #2598

Closed

jpalermo mentioned this pull request Feb 3, 2025

Revert "Add metrics for other unhealthy VM states" #2600

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add metrics for other unhealthy VM states #2597

Add metrics for other unhealthy VM states #2597

anshrupani commented Jan 22, 2025

xtreme-nitin-ravindran left a comment

aramprice commented Jan 31, 2025 •

edited

Loading

aramprice Feb 1, 2025 •

edited

Loading

jpalermo commented Feb 3, 2025

anshrupani commented Feb 4, 2025

Add metrics for other unhealthy VM states #2597

Add metrics for other unhealthy VM states #2597

Conversation

anshrupani commented Jan 22, 2025

What is this change about?

What tests have you run against this PR?

How should this change be described in bosh release notes?

Does this PR introduce a breaking change?

Tag your pair, your PM, and/or team!

xtreme-nitin-ravindran left a comment

Choose a reason for hiding this comment

aramprice commented Jan 31, 2025 • edited Loading

aramprice Feb 1, 2025 • edited Loading

Choose a reason for hiding this comment

jpalermo commented Feb 3, 2025

anshrupani commented Feb 4, 2025

aramprice commented Jan 31, 2025 •

edited

Loading

aramprice Feb 1, 2025 •

edited

Loading