Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics for other unhealthy VM states #2597

Merged

Conversation

anshrupani
Copy link
Contributor

What is this change about?

As of now, bosh does not collect metrics regarding VMs which are unhealthy (basically any state that is not running). With this PR, we now introduce a new metric: bosh_unhealthy_instances, which basically reports the same.

What tests have you run against this PR?

We tested a bosh dev release on our development environments for the new metric, and it worked fine. We also ran the bosh-monitor unit tests, and added more as needed.

How should this change be described in bosh release notes?

Bosh now emits a new metric: "bosh_unhealthy_instances", which can be used to track unhealthy instances per deployment over time.

Does this PR introduce a breaking change?

No

Tag your pair, your PM, and/or team!

@DennisAhaus

@anshrupani anshrupani changed the title Add metrics for other vm states Add metrics for other unhealthy VM states Jan 22, 2025
src/bosh-monitor/lib/bosh/monitor/director.rb Outdated Show resolved Hide resolved
src/bosh-monitor/spec/unit/bosh/monitor/director_spec.rb Outdated Show resolved Hide resolved
src/bosh-monitor/lib/bosh/monitor/director.rb Outdated Show resolved Hide resolved
src/bosh-monitor/lib/bosh/monitor/director.rb Outdated Show resolved Hide resolved
src/bosh-monitor/spec/unit/bosh/monitor/director_spec.rb Outdated Show resolved Hide resolved
src/bosh-monitor/lib/bosh/monitor/director.rb Outdated Show resolved Hide resolved
@jpalermo jpalermo requested review from a team, aramprice and lnguyen and removed request for a team January 23, 2025 15:53
Copy link
Member

@xtreme-nitin-ravindran xtreme-nitin-ravindran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jpalermo jpalermo merged commit e1cdc79 into cloudfoundry:main Jan 30, 2025
9 checks passed
@aramprice
Copy link
Member

aramprice commented Jan 31, 2025

Looks like this is breaking integration specs in CI starting at

Failures:

  1) using director with config server when config server certificates are trusted when deployment manifest has variables when all variables are set in config server when health monitor is around and resurrector is enabled interpolates values correctly when resurrector kicks in
     Failure/Error: if vm.vm_cid == resurrected_vm.vm_cid
     
     NoMethodError:
       undefined method `vm_cid' for nil
     # ./spec/integration_support/director.rb:170:in `kill_vm_and_wait_for_resurrection'
     # ./spec/integration/config_server/config_server_spec.rb:401:in `block (6 levels) in <top (required)>'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:263:in `instance_exec'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:263:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:511:in `block in with_around_and_singleton_context_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:468:in `block in with_around_example_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/hooks.rb:486:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/hooks.rb:624:in `run_around_example_hooks_for'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/hooks.rb:486:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:468:in `with_around_example_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:511:in `with_around_and_singleton_context_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example.rb:259:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:646:in `block in run_examples'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:642:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:642:in `run_examples'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:607:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `block in run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/example_group.rb:608:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:121:in `block (3 levels) in run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:121:in `map'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:121:in `block (2 levels) in run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/configuration.rb:2097:in `with_suite_hooks'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:116:in `block in run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/reporter.rb:74:in `report'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:115:in `run_specs'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:89:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:71:in `run'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/lib/rspec/core/runner.rb:45:in `invoke'
     # /usr/local/bundle/ruby/3.3.0/gems/rspec-core-3.13.2/exe/rspec:4:in `<top (required)>'
     # /usr/local/bundle/ruby/3.3.0/bin/rspec:25:in `load'
     # /usr/local/bundle/ruby/3.3.0/bin/rspec:25:in `<top (required)>'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli/exec.rb:59:in `load'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli/exec.rb:59:in `kernel_load'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli/exec.rb:23:in `run'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli.rb:452:in `exec'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor/command.rb:28:in `run'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor.rb:538:in `dispatch'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli.rb:35:in `dispatch'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/vendor/thor/lib/thor/base.rb:584:in `start'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/cli.rb:29:in `start'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/exe/bundle:28:in `block in <top (required)>'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
     # /usr/local/lib/ruby/gems/3.3.0/gems/bundler-2.6.3/exe/bundle:20:in `<top (required)>'
     # /usr/local/bundle/bin/bundle:108:in `load'
     # /usr/local/bundle/bin/bundle:108:in `<main>'

Edit: trimmed CI logs, see more relevant data in #2598

Comment on lines +99 to +103
if !request_path.nil?
request_path = request_path.sub(endpoint, '')
end
parsed_endpoint = URI.parse(endpoint + (request_path || ''))
headers = options['headers'] || {}
Copy link
Member

@aramprice aramprice Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is causing invalid requests, from integration spec logs:

==> sandbox/logs/health_monitor.out-7b5e96eb <==
E, [2025-02-01T02:17:48.089550 #17476] ERROR : Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
E, [2025-02-01T02:17:48.089706 #17476] ERROR : /tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:115:in `rescue in perform_request'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:114:in `perform_request'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:48:in `block in get_deployment_instances_full'
<internal:kernel>:187:in `loop'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/director.rb:46:in `get_deployment_instances_full'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:37:in `block in fetch_deployments'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:30:in `each'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/instance_manager.rb:30:in `fetch_deployments'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/runner.rb:182:in `fetch_deployments'
/tmp/build/e55deab7/bosh/src/bosh-monitor/lib/bosh/monitor/runner.rb:131:in `block in poll_director'
/usr/local/bundle/ruby/3.3.0/gems/async-2.21.3/lib/async/task.rb:200:in `block in run'
/usr/local/bundle/ruby/3.3.0/gems/async-2.21.3/lib/async/task.rb:438:in `block in schedule'
I, [2025-02-01T02:17:48.090186 #17476]  INFO : [ALERT] Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
D, [2025-02-01T02:17:48.090430 #17476] DEBUG : (Resurrector) ignoring event of category '': Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
I, [2025-02-01T02:17:48.090784 #17476]  INFO : (Event logger) notifying director about event: Alert @ 2025-02-01 02:17:48 UTC, severity 3: Invalid URI: https://127.0.0.1:61002https://127.0.0.1/tasks/151
D, [2025-02-01T02:17:48.090854 #17476] DEBUG : sending HTTP POST to: https://127.0.0.1:61002/events

@anshrupani it would be good if you could look at this and try to see why this change is causing the invalid (ex: https://127.0.0.1:61002https://127.0.0.1/tasks/151) requests to be generated.

@jpalermo
Copy link
Member

jpalermo commented Feb 3, 2025

I'm going to revert this PR for now. The broken tests are fixable, but another issue was raised that this causes the health monitor to query the director, then the director to query all the VMs for data.

This goes against the design goals of the health monitor. Health monitor is supposed to gather VM information from the heartbeats that come across the NATS message bus. It does gather basic facts from the bosh director, such as which deployments exist. Essentially things the bosh director can answer from its database.

The new get_deployment_instances_full method causes the director to create a task which then polls all the VMs for their instance data.

If we want to emit a metric like this, we'd need to be able to extract that information from the heartbeat data. At least some of that data already exists, although if it's not enough, you might need to first make a PR to the agent to add info to that payload.

@anshrupani
Copy link
Contributor Author

Sorry, I was away on vacation. I agree with @jpalermo, we'll try to figure out another way to keep the design goals intact. Thanks a lot for pointing this out. @aramprice: once we redo this PR, I guess the issue with the tasks would be taken care of as well. Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

8 participants