Prometheus: ingress_controller_configuration_push_count should tell between intermittent errors and lock-ups #2484

Closed
4 tasks done
mflendrich opened this issue May 11, 2022 · 4 comments
Labels: area/feature (New feature or request), priority/medium

Comments

@mflendrich
Contributor

mflendrich commented May 11, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Problem Statement

@seh reported that they'd like to use the ingress_controller_configuration_push_count Prometheus metric to alert when a configuration lock-up (see #2195) happens. Today, however, that metric does not distinguish between transient errors (e.g. a network disconnect) and errors that require fixing a config conflict (e.g. conflicting consumers, #2324 and #680).

Proposed Solution

Add a label to the ingress_controller_configuration_push_count metric that distinguishes failures that require fixing a conflict from those that do not.
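
To make the proposal concrete, here is a minimal sketch of how push failures could be classified into a label value. It is not the controller's implementation: the `failure_reason` label name, its values, the `PushConflictError` type, and the in-memory `PushCounter` (a stand-in for a real labeled Prometheus counter) are all assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class FailureReason(str, Enum):
    """Hypothetical label values; the issue does not name them."""
    NONE = ""                 # push succeeded
    TRANSIENT = "transient"   # e.g. network disconnect; may resolve on retry
    CONFLICT = "conflict"     # e.g. conflicting consumers; needs a config fix


class PushConflictError(Exception):
    """Hypothetical error raised when a push is rejected due to a config conflict."""


def classify_push_error(err):
    """Map a push error (or None on success) to a failure-reason label value."""
    if err is None:
        return FailureReason.NONE
    if isinstance(err, PushConflictError):
        return FailureReason.CONFLICT
    # Assumption: every other error (timeouts, connection resets) is transient.
    return FailureReason.TRANSIENT


class PushCounter:
    """In-memory stand-in for a counter with `success` and `failure_reason` labels."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, err):
        """Record one push attempt; `err` is None when the push succeeded."""
        reason = classify_push_error(err)
        success = "true" if err is None else "false"
        self._counts[(success, reason.value)] += 1

    def value(self, success, failure_reason=""):
        """Return the count for one label combination (0 if never observed)."""
        return self._counts[(success, failure_reason)]


push_count = PushCounter()
push_count.observe(None)
push_count.observe(ConnectionError("connection reset"))
push_count.observe(PushConflictError("duplicate consumer username"))
push_count.observe(PushConflictError("duplicate consumer username"))
```

With a split like this, an alert for lock-ups could watch only the series whose failure reason is the conflict value and ignore transient failures that clear on retry.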

Additional information

Note that the existing metric ingress_controller_translation_count[success=true|false] may already answer this question.

Acceptance Criteria

  • Our Prometheus integration includes a new label on the metric ingress_controller_translation_count[success=true|false], where true means the config was pushed successfully and false means the push failed
  • Our Prometheus OOTB dashboard is updated with a new widget that uses the new label
  • Documentation is updated for the new metric/label

No response

@mflendrich mflendrich changed the title Prometheus: ingress_controller_configuration_push_count should tell between intermittent errors and lock-ups Prometheus: ingress_controller_configuration_push_count should tell between intermittent errors and configuration errors May 11, 2022
@mflendrich mflendrich changed the title Prometheus: ingress_controller_configuration_push_count should tell between intermittent errors and configuration errors Prometheus: ingress_controller_configuration_push_count should tell between intermittent errors and lock-ups May 11, 2022
@mflendrich mflendrich added the area/feature New feature or request label May 13, 2022
@shaneutt shaneutt added this to the Release v2.5.0 milestone May 24, 2022
@mflendrich mflendrich removed this from the Release v2.5.0 milestone Jun 7, 2022
@mflendrich mflendrich removed their assignment Aug 2, 2022
@mflendrich mflendrich added this to the KIC v2.7.0 milestone Aug 2, 2022
@czeslavo czeslavo self-assigned this Sep 21, 2022
@czeslavo czeslavo modified the milestones: KIC v2.7.0, KIC v2.8.0 Sep 26, 2022
@pmalek pmalek modified the milestones: KIC v2.7.0, KIC v2.8.0 Sep 27, 2022
@czeslavo czeslavo modified the milestones: KIC v2.8.0, KIC v2.7.0 Sep 27, 2022
@czeslavo
Contributor

I changed the milestone back to KIC v2.7.0 because the functional changes made it into that release.

@czeslavo
Contributor

The Grafana dashboard has been updated in the KIC repository (https://github.com/Kong/kubernetes-ingress-controller/blob/main/grafana.json), but I'm waiting for Grafana (the company) to respond to my question about problems with the dashboard's visibility in their catalog.

@jrsmroz
Contributor

jrsmroz commented Sep 28, 2022

So, what's left to close the issue? We should close the KIC 2.7 milestone soon.

@czeslavo
Contributor

We're missing a review on Kong/docs.konghq.com#4500. The dashboard JSON in our repository has been updated, so I'd consider the acceptance criterion for it complete.

Regarding the dashboard on grafana.com, I think we can handle it separately. I've created an issue to track this: #2991
