Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Helm Chart Prometheus Adapter Metrics #53

Merged
merged 2 commits into from
Dec 18, 2024
Merged

Conversation

paulliwog
Copy link
Contributor

Proposed changes

The x100 multiplier in the existing metrics is not compatible with the scaling ratio from 0.0 to 1.0 in the chart. I'm removing the multiplier to make it consistent with the comments in the values file for scaling.auto.engine.metrics.requestCapacityRatio

This could be a breaking change for any users that have changed the ratio to use a scale from 0 to 100 rather than changing the custom metrics.

Types of changes

What types of changes does your code introduce to the Deepgram self-hosted resources?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update or tests (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have tested my changes in my local self-hosted environment
    • Please describe your testing setup and methodology here
  • I have added necessary documentation (if appropriate)

Further comments

@paulliwog paulliwog requested a review from a team as a code owner December 16, 2024 18:21
@bd-g
Copy link
Collaborator

bd-g commented Dec 16, 2024

@paulliwog You mentioned you tested the above - do you have screenshots or text output of the prometheus output and/or the HPA with the scaling metrics off by two orders of magnitude?

FWIW a quick read suggest this is correct, and we'll verify on our end, but it's helpful to document in the PR the evidence of the above changes being needed! 😄

@paulliwog
Copy link
Contributor Author

Here are screenshots of the custom metric and it's components:
Screenshot 2024-12-16 at 12 49 06 PM
Screenshot 2024-12-16 at 12 49 45 PM
Screenshot 2024-12-16 at 12 50 15 PM

And here is the HPA using a scaling factor of 0.5, you can see that the values match the current load show in the custom metrics above

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/instance: deepgram-self-hosted
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: deepgram-self-hosted
    app.kubernetes.io/version: release-241121
    helm.sh/chart: deepgram-self-hosted-0.8.0
  name: deepgram-engine-hpa
  namespace: dg-self-hosted
spec:
  behavior:
    scaleDown:
      policies:
        - periodSeconds: 60
          type: Pods
          value: 1
        - periodSeconds: 60
          type: Percent
          value: 25
      selectPolicy: Max
    scaleUp:
      policies:
        - periodSeconds: 15
          type: Pods
          value: 4
        - periodSeconds: 15
          type: Percent
          value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 0
  maxReplicas: 20
  metrics:
    - external:
        metric:
          name: engine_requests_active_to_max_ratio
        target:
          type: Value
          value: 500m
      type: External
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepgram-engine
status:
  conditions:
    - lastTransitionTime: '2024-12-10T04:28:37Z'
      message: recommended size matches current size
      reason: ReadyForNewScale
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-12-14T19:13:19Z'
      message: >-
        the HPA was able to successfully calculate a replica count from external
        metric engine_requests_active_to_max_ratio(nil)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-12-14T21:37:54Z'
      message: the desired replica count is less than the minimum replica count
      reason: TooFewReplicas
      status: 'True'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          value: 3m
        metric:
          name: engine_requests_active_to_max_ratio
      type: External
  currentReplicas: 2
  desiredReplicas: 2
  lastScaleTime: '2024-12-14T21:39:54Z'

We identified this issue after identifying that only a few concurrent requests would cause the engine to scale up rather than ~100 requests that we were expecting to trigger an increase in scale. I don't have the previous HPA, but it would be 300m in this case with a single concurrent request.

@bd-g
Copy link
Collaborator

bd-g commented Dec 16, 2024

Great. That all tracks. Could you add a line to CHANGELOG.md describing your changes? Nest it under ## Unreleased, with a subheader of ## Changed.

That will also pass the PR checks.

@bd-g bd-g merged commit 9aec42d into deepgram:main Dec 18, 2024
2 checks passed
@bd-g
Copy link
Collaborator

bd-g commented Dec 18, 2024

Looks great! Thanks so much. I'll put out a new version with a minor version bump for you to ingest.

@bd-g
Copy link
Collaborator

bd-g commented Dec 19, 2024

@paulliwog Released in 0.8.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants