Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[V2] Enable metrics server #706

Merged
merged 8 commits into from
Oct 10, 2024
Merged

[V2] Enable metrics server #706

merged 8 commits into from
Oct 10, 2024

Conversation

rquitales
Copy link
Member

Proposed changes

This PR exposes the metrics service in our manifests and adds new unit tests to ensure the metrics logic is correct.

Technical changes

  1. Added sigs.k8s.io/controller-runtime/pkg/metrics/filters dependency to allow metrics to be secure by default (guidance from kubebuilder book/scaffold)
  2. Adds new kustomize patches and bases to expose the metrics server in a typical PKO deployment
  3. Refactor the updateStackCallback function
  4. Added the stacks_reconciling metric
  5. Added ginkgo tests to exercise the Program and Stack metrics
  6. Fixed some bugs in our metrics as determined through the unit tests

Testing

  • Manually tested a deployed PKO installation with the Prometheus Operator
  • Added unit tests (using ginkgo)

Note, e2e tests will be added in a followup PR.

Related issues (optional)

Closes: #576

@rquitales rquitales added the impact/no-changelog-required This issue doesn't require a CHANGELOG update label Oct 4, 2024
@rquitales rquitales requested review from blampe and EronWright October 4, 2024 20:03
Copy link

codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 67.27273% with 18 lines in your changes missing coverage. Please review.

Project coverage is 50.22%. Comparing base (2a67d4c) to head (2bc40ee).
Report is 1 commits behind head on v2.

Files with missing lines Patch % Lines
operator/cmd/main.go 0.00% 12 Missing ⚠️
...erator/internal/controller/pulumi/metrics_stack.go 88.23% 2 Missing and 2 partials ⚠️
operator/internal/controller/pulumi/utils.go 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v2     #706      +/-   ##
==========================================
+ Coverage   49.57%   50.22%   +0.65%     
==========================================
  Files          27       27              
  Lines        2911     2919       +8     
==========================================
+ Hits         1443     1466      +23     
+ Misses       1290     1272      -18     
- Partials      178      181       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@EronWright EronWright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool.

The "filter" seems to be a perfect compliment to our plan for authenticating the Automation RPC endpoint. It imports the "apiserver" library and that will probably be useful for that effort.

@@ -29,6 +29,29 @@ require (
sigs.k8s.io/controller-runtime v0.18.4
)

require (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit -

There should only be two require blocks (for direct and indirect) but I'm guessing go is getting confused by the replace directive? I wonder if go mod tidy consolidates these after the replace is moved to the top or bottom of the file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the go.mod file by manually deleting the indirect block and letting it regenerate.

flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
"Enable leader election for controller manager. "+
"Enabling this will ensure there is only one active controller manager.")
flag.BoolVar(&secureMetrics, "metrics-secure", false,
"If set the metrics endpoint is served securely")
flag.BoolVar(&secureMetrics, "metrics-secure", true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -43,7 +43,6 @@ spec:
args:
- --leader-elect
- --health-probe-bind-address=:8081
- --metrics-bind-address=:8383
Copy link
Contributor

@blampe blampe Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an argument for keeping the port as-is considering the v1 chart spits out a ServiceMonitor that targets :8383 instead of the port name. It would be nice for the user's metrics to continue to Just Work™ -- although if we're changing up the keys significantly then maybe there's no point.

In any case we should probably try to keep the metrics service as close as possible to what we have in v1, to at least keep selectors the same.

Maybe unrelated - it seems like the v1 operator has two metrics ports, 8383 and 8686? Do you know why that is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PKO v1 was scaffolded with operator-sdk pre v1, which did not use controller-runtime/kubebuilder under the hood. It seems like the older version of the framework was modelled using kube-state-metrics (where 2 ports were exposed, one for the operator, and another for metrics on the CRs itself, ref). Kubebuilder uses a more simplified approach which exposes a globabl prometheus registry that we can add additional metrics to.

With the implementation of metrics being consolidated into 1 port, and industry standards moving to secure metrics by default, now seemed like a good time to make this change. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah totally agree one port is the most reasonable. AFAICT the other port wasn't actually used in v1. I'm still wondering if it's OK to change the port, though. Existing service monitors would stop scraping metrics, right?

return collectors
}

func init() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌 🙌 🙌

Comment on lines +18 to +19
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like the linting sacrifice we take on by using these libraries. Especially in this case I think it makes things harder to read and more complicated than necessary. You're free to take this or leave it!

package pulumi

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/assert"
)

func TestProgramMetrics(t *testing.T) {
	numPrograms.Set(0)

	// it should increment the programs_active metric when a new Program is created
	newProgramCallback(nil)
	assert.Equal(t, 1.0, testutil.ToFloat64(numPrograms))

	// it should increment the programs_active metric when another Program is created
	newProgramCallback(nil)
	assert.Equal(t, 2.0, testutil.ToFloat64(numPrograms))

	// it should decrement the programs_active metric when a Program is deleted
	deleteProgramCallback(nil)
	assert.Equal(t, 1.0, testutil.ToFloat64(numPrograms))

	// it should decrement the programs_active metric when another Program is deleted
	deleteProgramCallback(nil)
	assert.Equal(t, 0.0, testutil.ToFloat64(numPrograms))

	// it should not decrement the programs_active metric when a Program is deleted and the metric is already at 0
	deleteProgramCallback(nil)
	assert.Equal(t, 0.0, testutil.ToFloat64(numPrograms))
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, and in fact, I originally wrote this test using std Golang. However, I switched to using ginkgo ultimately, for consistency with other tests that we already have that test metrics. Also, this appeared to be more of a BDD style test, which is why I ultimately aligned with using Ginkgo.

Since we're already taking linting sacrifices in other files due to ginkgo, I'll leave this as is since I believe consistency here is more important.

Copy link
Contributor

@blampe blampe Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: whoops, that comment was in response to the metrics port.

@rquitales rquitales force-pushed the rquitales/enable-metrics-server branch from 4b44748 to 3554cd5 Compare October 7, 2024 23:32
@rquitales rquitales merged commit d52dbe7 into v2 Oct 10, 2024
6 of 7 checks passed
@rquitales rquitales deleted the rquitales/enable-metrics-server branch October 10, 2024 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
impact/no-changelog-required This issue doesn't require a CHANGELOG update
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants