
Added readiness and liveness probes #8488

Open: azhavoro wants to merge 6 commits into develop

Conversation

@azhavoro azhavoro commented Sep 30, 2024

Motivation and context

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)
  • I have increased versions of npm packages if it is necessary (cvat-canvas, cvat-core, cvat-data and cvat-ui)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a management command workerprobe to check the health of workers in specified queues.
    • Enhanced health monitoring with readiness and liveness probes across multiple components including backend, workers, frontend, OPA, and kvrocks.
  • Chores

    • Updated Helm chart version from 0.13.2 to 0.14.0.
    • Modified configuration files to improve process management and health checks.

coderabbitai bot commented Sep 30, 2024

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce a new management command, workerprobe, to monitor the health and liveness of workers across specified queues in the CVAT application. Additionally, the Helm chart has been updated to version 0.14.0, incorporating readiness and liveness probes for various components, including the backend, frontend, and workers. The supervisord configuration has also been modified to enhance process management by adding autorestart capabilities and dynamic process count settings.
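
For orientation, a minimal sketch of what such a workerprobe management command (cvat/apps/health/management/commands/workerprobe.py) can look like, assembled from the code fragments quoted in the review comments below; the argument parsing and the "Unknown queue" message are assumptions rather than the exact implementation, and the heartbeat check already uses the timedelta comparison suggested later in the review:

from datetime import datetime, timedelta
import os
import platform

import django_rq
from django.conf import settings
from django.core.management.base import BaseCommand, CommandError
from rq.worker import Worker


class Command(BaseCommand):
    help = "Fail if the RQ workers serving the given queues are missing or stale on this host"

    def add_arguments(self, parser):
        parser.add_argument("queue_names", nargs="+")

    def handle(self, *args, **options):
        allowed_queue_names = [q.value for q in settings.CVAT_QUEUES]
        hostname = platform.node()

        for queue_name in options["queue_names"]:
            if queue_name not in allowed_queue_names:
                raise CommandError(f"Unknown queue: {queue_name}")

            queue = django_rq.get_queue(queue_name)
            # only consider workers that serve this queue on the current host
            workers = [
                w for w in Worker.all(queue.connection)
                if queue.name in w.queue_names() and w.hostname == hostname
            ]

            expected = int(os.getenv("NUMPROCS", 1))
            if len(workers) != expected or not all(
                datetime.now() - w.last_heartbeat < timedelta(seconds=w.worker_ttl)
                for w in workers
            ):
                raise CommandError(f"Unhealthy workers in the {queue_name} queue")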

Changes

Files and change summary:

  • cvat/apps/health/management/commands/workerprobe.py: Introduced a new management command workerprobe to check worker health in specified queues.
  • helm-chart/Chart.yaml: Updated chart version from 0.13.2 to 0.14.0.
  • helm-chart/templates/cvat_backend/*.yml: Added readiness and liveness probes for backend services, including analytics, annotation, export, and webhooks.
  • helm-chart/templates/cvat_frontend/deployment.yml: Introduced readiness and liveness probes for the frontend application.
  • helm-chart/templates/cvat_kvrocks/statefulset.yml: Added readiness and liveness probes for the kvrocks StatefulSet.
  • helm-chart/templates/cvat_opa/deployment.yml: Introduced readiness and liveness probes for the OPA component.
  • helm-chart/values.yaml: Updated to include configurations for readiness and liveness probes across multiple components.
  • supervisord/utils.conf: Enhanced process management with autorestart=true for specific programs and dynamic numprocs settings.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant WorkerProbe
    participant QueueManager
    participant Worker

    User->>WorkerProbe: Run command with queue names
    WorkerProbe->>QueueManager: Validate queue names
    QueueManager->>WorkerProbe: Return valid queues
    WorkerProbe->>Worker: Check health status
    Worker->>WorkerProbe: Return health status
    WorkerProbe->>User: Report health status

🐰 In the meadow, where bunnies hop,
New commands and probes make our hearts stop!
With readiness checks and liveness too,
Our workers are healthy, all thanks to you!
So let's dance in the fields, let joy abound,
For our changes bring happiness all around! 🎉



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (17)
helm-chart/Chart.yaml (1)

Line range hint 24-61: Consider using consistent versioning patterns for dependencies.

While the current dependency versioning works, there's an inconsistency in how versions are specified:

  • Some use exact versions (e.g., nuclio: 0.19.0, traefik: 25.0.0)
  • Others use version ranges (e.g., postgresql: "12.1.*", vector: "0.19.*")

Consider adopting a consistent approach across all dependencies. Using version ranges (e.g., "X.Y.*") for all dependencies could help automatically incorporate bug fixes while still maintaining control over major and minor version updates.

helm-chart/templates/cvat_frontend/deployment.yml (1)

48-59: LGTM! Consider differentiating readiness and liveness probes.

The addition of readiness and liveness probes for the frontend container is a great improvement for monitoring the health and availability of the service. The implementation is flexible, allowing for easy configuration through Helm values.

Some suggestions for potential improvements:

  1. Consider differentiating the readiness and liveness probes. While TCP checks are good, you might want to use an HTTP GET request for the readiness probe to ensure the application is fully initialized and ready to serve traffic.
  2. For the liveness probe, you could keep the TCP check or use a simple HTTP endpoint that doesn't put much load on the system.

Example:

readinessProbe:
  httpGet:
    path: /
    port: 80
  # ... other configurations

livenessProbe:
  tcpSocket:
    port: 80
  # ... other configurations

This differentiation can provide more accurate information about the application's state.

helm-chart/templates/cvat_opa/deployment.yml (2)

56-61: Readiness probe configuration looks good.

The readiness probe is well-configured using a tcpSocket check on port 8181, which is appropriate for an OPA server. The conditional inclusion and use of toYaml and omit functions provide flexibility for different deployment scenarios.

Consider adding a comment explaining the purpose of the readiness probe and the significance of port 8181 for better maintainability. For example:

# Readiness probe to ensure OPA server is listening on port 8181
{{- if .Values.cvat.opa.readinessProbe.enabled }}
readinessProbe:
  tcpSocket:
    port: 8181  # Default OPA server port
  {{- toYaml (omit .Values.cvat.opa.readinessProbe "enabled") | nindent 12 }}
{{- end }}

62-67: Liveness probe configuration is correct.

The liveness probe is well-configured, mirroring the readiness probe setup. Using a tcpSocket check on port 8181 is appropriate for monitoring the OPA server's health.

For consistency with the readiness probe suggestion, consider adding a comment explaining the purpose of the liveness probe:

# Liveness probe to ensure OPA server remains responsive
{{- if .Values.cvat.opa.livenessProbe.enabled }}
livenessProbe:
  tcpSocket:
    port: 8181  # Default OPA server port
  {{- toYaml (omit .Values.cvat.opa.livenessProbe "enabled") | nindent 12 }}
{{- end }}
helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml (2)

63-72: Readiness probe implementation looks good.

The readiness probe is well-implemented:

  • It's conditionally included, allowing flexibility in deployment.
  • Uses an exec probe with a custom management command, which is appropriate for checking application-specific health.
  • Allows for additional configuration through Helm values.

Consider adding a comment explaining the purpose of the workerprobe webhooks command for better maintainability.


73-82: Consider differentiating between readiness and liveness probes.

The liveness probe implementation is correct, but it's identical to the readiness probe. While this can be valid in some cases, it's often beneficial to have different checks for readiness and liveness:

  • Readiness typically checks if the application is ready to receive traffic.
  • Liveness checks if the application is functioning correctly.

Using the same check for both might lead to unnecessary restarts if the application is temporarily unable to handle requests but is still running.

If possible, consider implementing a separate check for the liveness probe that focuses on the overall health of the application rather than its readiness to handle requests.

helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml (2)

63-72: Readiness probe implementation looks good.

The readiness probe is well-implemented, using a custom management command workerprobe to check the worker's readiness for the quality_reports queue. The conditional inclusion and use of the omit function allow for flexible configuration through Helm values.

Consider adding a comment explaining the purpose of the workerprobe command for better clarity:

readinessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe  # Custom command to check worker readiness
    - quality_reports

73-82: Liveness probe implementation is correct.

The liveness probe is correctly implemented, using the same custom management command workerprobe as the readiness probe. This is appropriate for checking the worker's ability to handle tasks in the quality_reports queue.

For consistency with the readiness probe, consider adding a similar comment:

livenessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe  # Custom command to check worker liveness
    - quality_reports
helm-chart/templates/cvat_backend/worker_analyticsreports/deployment.yml (1)

64-83: Excellent addition of readiness and liveness probes!

The introduction of readiness and liveness probes for the CVAT backend worker handling analytics reports is a great improvement for reliability and observability. Here are some observations and suggestions:

  1. The conditional inclusion of probes based on their enabled status in the values file provides flexibility in configuration.
  2. Using the same command (python manage.py workerprobe analytics_reports) for both readiness and liveness probes ensures consistency in health checks.

Suggestions for consideration:

  1. Consider if different checks for readiness and liveness might be more appropriate. Readiness typically checks if the application is ready to handle requests, while liveness checks if it's functioning correctly.
  2. Document the implementation and behavior of the workerprobe command, as it's not visible in this file.
  3. Monitor the impact of these probes on worker performance, as frequent health checks might increase the load.

To further improve this implementation, consider the following:

  1. Document the workerprobe command's behavior and any potential side effects.
  2. Set up monitoring to track the frequency and duration of probe checks to ensure they don't impact worker performance significantly.
  3. Consider implementing different checks for readiness and liveness if the application's startup and runtime health criteria differ.
helm-chart/templates/cvat_backend/utils/deployment.yml (1)

63-73: Readiness probe configuration looks good, consider explicit key definitions.

The addition of the readiness probe is a good practice for ensuring the backend utilities are ready to handle requests. The use of a custom management command (workerprobe) to check specific worker tasks (notifications and cleaning) is appropriate for this deployment.

Consider explicitly defining the keys you expect from .Values.cvat.backend.worker.readinessProbe instead of using toYaml for the entire object. This would provide better clarity and control over the probe configuration. For example:

readinessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe
    - notifications
    - cleaning
  initialDelaySeconds: {{ .Values.cvat.backend.worker.readinessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.cvat.backend.worker.readinessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.cvat.backend.worker.readinessProbe.timeoutSeconds }}
  successThreshold: {{ .Values.cvat.backend.worker.readinessProbe.successThreshold }}
  failureThreshold: {{ .Values.cvat.backend.worker.readinessProbe.failureThreshold }}

This approach would make it clearer which configuration options are available and expected.

helm-chart/templates/cvat_backend/worker_import/deployment.yml (1)

63-72: LGTM! Consider adding a comment for clarity.

The readiness probe configuration looks good. It's conditionally added and uses a custom management command to check the worker's readiness, which is a robust approach.

Consider adding a brief comment explaining the purpose of the workerprobe command for better maintainability:

readinessProbe:
  exec:
    command:
    # Check if the import worker is ready to process tasks
    - python
    - manage.py
    - workerprobe
    - import
helm-chart/templates/cvat_backend/worker_annotation/deployment.yml (1)

63-82: LGTM! Readiness and liveness probes added correctly.

The implementation of readiness and liveness probes for the worker annotation container looks good. A few observations:

  1. Both probes use the same command python manage.py workerprobe annotation, which is appropriate for checking the worker's ability to process annotation tasks.
  2. The probes are conditionally included based on the enabled flags in the values file, providing flexibility in deployment configurations.
  3. The use of the omit function ensures that the enabled flag is not included in the actual probe configuration.

For improved clarity, consider adding a comment explaining the purpose of the workerprobe command and why it's suitable for both readiness and liveness checks. This would help future maintainers understand the rationale behind this implementation.

helm-chart/templates/cvat_backend/worker_export/deployment.yml (1)

64-73: LGTM! Consider adding a comment for clarity.

The addition of the readiness probe is well-implemented. It's conditionally included and uses a custom management command for health checking, which is a good practice for ensuring the export worker is ready to handle requests.

Consider adding a brief comment explaining the purpose of the workerprobe export command for better maintainability. For example:

readinessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe  # Custom management command to check worker health
    - export       # Specific to the export worker
helm-chart/values.yaml (4)

45-53: LGTM: Backend server probe configuration looks good.

The readiness and liveness probe configurations for the backend server are well-structured and reasonable. The longer initial delay (60s) for the liveness probe allows sufficient time for the server to start up, while the failure threshold of 10 provides a buffer against transient issues.

Consider adding a failureThreshold to the readiness probe as well, to provide more resilience against temporary network issues or brief spikes in server load. For example:

readinessProbe:
  enabled: true
  periodSeconds: 15
  initialDelaySeconds: 15
  failureThreshold: 3

55-64: LGTM: Worker probe configuration is appropriate.

The readiness and liveness probe configurations for the worker are well-suited for background processes. The longer period (120s) allows for less frequent checks, which is appropriate for workers that might take longer to complete tasks.

Consider adding a failureThreshold to both probes to provide more resilience against temporary issues. For example:

readinessProbe:
  enabled: true
  periodSeconds: 120
  initialDelaySeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

livenessProbe:
  enabled: true
  periodSeconds: 120
  initialDelaySeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

242-249: LGTM: OPA probe configuration looks good with room for minor improvements.

The readiness and liveness probe configurations for the Open Policy Agent (OPA) are well-structured. The frequent checks (every 15 seconds) ensure quick detection of any issues, and the initial delay of 15 seconds should be sufficient for OPA to start up.

Consider the following improvements:

  1. Add a failureThreshold to both probes for better resilience against transient issues.
  2. Include a httpGet section to specify the health check endpoint for OPA.

Here's a suggested configuration:

readinessProbe:
  enabled: true
  httpGet:
    path: /health
    port: 8181
  periodSeconds: 15
  initialDelaySeconds: 15
  failureThreshold: 3

livenessProbe:
  enabled: true
  httpGet:
    path: /health
    port: 8181
  periodSeconds: 15
  initialDelaySeconds: 15
  failureThreshold: 3

These changes will make the probes more specific to OPA and provide better control over failure conditions.


300-307: LGTM: Kvrocks probe configuration is good with room for minor enhancements.

The readiness and liveness probe configurations for kvrocks are well-structured. The frequent checks (every 10 seconds) ensure quick detection of any issues, and the initial delay of 30 seconds should be sufficient for kvrocks to start up.

Consider the following improvements:

  1. Add a failureThreshold to both probes for better resilience against transient issues.
  2. Include a tcpSocket check instead of the default HTTP check, as kvrocks is a Redis-compatible key-value database.

Here's a suggested configuration:

readinessProbe:
  enabled: true
  tcpSocket:
    port: 6666  # Default kvrocks port
  periodSeconds: 10
  initialDelaySeconds: 30
  failureThreshold: 3

livenessProbe:
  enabled: true
  tcpSocket:
    port: 6666  # Default kvrocks port
  periodSeconds: 10
  initialDelaySeconds: 30
  failureThreshold: 3

These changes will make the probes more specific to kvrocks and provide better control over failure conditions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between commits 0b7fc51 and d12ad5b.

📒 Files selected for processing (15)
  • cvat/apps/health/management/commands/workerprobe.py (1 hunks)
  • helm-chart/Chart.yaml (1 hunks)
  • helm-chart/templates/cvat_backend/server/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/utils/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/worker_analyticsreports/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/worker_annotation/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/worker_export/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/worker_import/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_frontend/deployment.yml (1 hunks)
  • helm-chart/templates/cvat_kvrocks/statefulset.yml (1 hunks)
  • helm-chart/templates/cvat_opa/deployment.yml (1 hunks)
  • helm-chart/values.yaml (4 hunks)
  • supervisord/utils.conf (1 hunks)
🔇 Additional comments (16)
supervisord/utils.conf (4)

27-27: LGTM: Improved robustness for rqscheduler

The addition of autorestart=true for the rqscheduler program is a good improvement. This ensures that the scheduler process will automatically restart if it exits unexpectedly, enhancing the overall reliability of the system.


34-35: LGTM: Enhanced scalability and reliability for rqworker-notifications

The changes to the rqworker-notifications configuration are well-considered:

  1. Using numprocs=%(ENV_NUMPROCS)s allows for dynamic scaling of notification workers based on environment variables, improving deployment flexibility.
  2. Adding autorestart=true ensures that notification workers automatically restart if they exit unexpectedly, enhancing system reliability.

These modifications will contribute to a more robust and scalable notification processing system.


Line range hint 44-44: LGTM: Improved reliability for rqworker-cleaning

The addition of autorestart=true for the rqworker-cleaning program is a positive change. This ensures that the cleaning worker process will automatically restart if it exits unexpectedly, enhancing the overall reliability of the system.

It's worth noting that the numprocs setting for this worker was already using the environment variable %(ENV_NUMPROCS)s, which is consistent with the change made to the notifications worker. This maintains a uniform approach to worker process scaling across different worker types.


Line range hint 27-44: Summary: Improved robustness and scalability of worker processes

The changes made to the supervisord configuration file consistently enhance the reliability and scalability of the CVAT worker processes:

  1. All worker types (rqscheduler, rqworker-notifications, and rqworker-cleaning) now have autorestart=true, ensuring automatic recovery from unexpected exits.
  2. The rqworker-notifications configuration now uses a dynamic numprocs setting, aligning with the existing configuration of rqworker-cleaning.

These improvements contribute to a more robust and flexible CVAT application, aligning well with the PR objectives of enhancing reliability and introducing better health management. The consistent approach across different worker types is commendable and promotes maintainability.
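
For reference, the supervisord stanzas being discussed look roughly like the following; the program name and the two settings come from the diff summary above, while the command line is an illustrative placeholder rather than the actual utils.conf contents:

[program:rqworker-notifications]
; illustrative command line, not the actual utils.conf entry
command=python3 manage.py rqworker notifications
; scale the number of worker processes from the environment
numprocs=%(ENV_NUMPROCS)s
process_name=%(program_name)s-%(process_num)d
; restart the worker automatically if it exits unexpectedly
autorestart=true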

helm-chart/Chart.yaml (1)

19-19: LGTM: Version bump is appropriate for the changes.

The increment of the chart version from 0.13.2 to 0.14.0 is consistent with Semantic Versioning for adding new features (readiness and liveness probes). This change aligns well with the PR objectives.

As a reminder, please ensure that you've updated the changelog to reflect this version change and the new features added. You can verify this with the following script:

helm-chart/templates/cvat_opa/deployment.yml (1)

56-67: Summary: Readiness and liveness probes enhance OPA deployment reliability.

The addition of readiness and liveness probes to the OPA deployment is a valuable improvement. These probes will help Kubernetes better manage the lifecycle and health of the OPA container, ensuring that traffic is only routed to ready instances and that unresponsive instances are restarted.

Key points:

  1. Both probes use tcpSocket checks on port 8181, which is appropriate for the OPA server.
  2. The probes are conditionally included, allowing for flexible configuration.
  3. Additional probe settings can be easily added through the values file.

These changes align well with the PR objective of adding readiness and liveness probes to improve the reliability of the CVAT application in containerized environments.

helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml (1)

63-82: Request for additional information and consistency check.

The implementation of the readiness and liveness probes looks good, but I have a few questions and suggestions:

  1. Could you provide more information about the workerprobe command? It would be helpful to have a brief explanation of what it checks and how it determines the health of the webhooks worker.

  2. Have similar probes been added to other deployments in the CVAT application? It would be good to ensure consistency across all components.

  3. The configuration of the probes (like initial delay, timeout, etc.) is not visible in this file. Are these configured elsewhere, or are you using default values?

To check for consistency across the application, you can run the following command:

This will help verify if similar probes have been added to other components of the application.

✅ Verification successful

Probes Implemented Consistently Across Workers with Centralized Configuration

  1. workerprobe Command: The workerprobe command is consistently used across all worker deployments (e.g., analytics_reports, annotation, export, import, quality_reports, webhooks) to perform health checks. This standardized approach ensures reliable monitoring of worker components.

  2. Consistency Across Deployments: Similar readiness and liveness probes are implemented in all relevant deployments within the CVAT application, maintaining consistency and reliability across different services.

  3. Centralized Probe Configuration: The configurations for the probes, such as periodSeconds, initialDelaySeconds, and timeoutSeconds, are centralized in the helm-chart/values.yaml file. This ensures that probe settings are consistent and easily manageable across the entire codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for readiness and liveness probes in other deployment files
fd -e yml -e yaml | xargs grep -n -A 10 'readinessProbe\|livenessProbe'

Length of output: 33719

helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml (1)

63-82: Verify probe implementation and consider default parameters.

The readiness and liveness probes are well-configured, but there are two points to consider:

  1. Verify that the workerprobe management command performs appropriate checks for both readiness and liveness of the quality_reports worker. Ensure it checks connectivity to required services and the ability to process tasks.

  2. Consider specifying default values for probe parameters such as initialDelaySeconds, periodSeconds, timeoutSeconds, etc. This can help fine-tune the probe behavior and prevent potential issues during container startup or under high load.

Example:

readinessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe
    - quality_reports
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

To verify the workerprobe command implementation, you can run:

This will help ensure that the probe performs appropriate checks for the worker's health.

helm-chart/templates/cvat_kvrocks/statefulset.yml (1)

63-71: Readiness probe implementation looks good!

The readiness probe for the kvrocks container is well-implemented:

  • It uses an appropriate exec probe to check the Redis server's status.
  • The command correctly checks for both "PONG" and "NOAUTH" responses, covering different server states.
  • The probe is conditionally included, allowing for flexible deployment configurations.
  • Additional configuration can be customized through the Values file, promoting reusability.
helm-chart/templates/cvat_backend/server/deployment.yml (2)

68-74: Readiness probe implementation looks good.

The readiness probe is well-implemented:

  1. It's conditionally included, allowing for flexibility in deployment configurations.
  2. The /api/server/about endpoint is a suitable choice for checking readiness.
  3. The use of the omit function ensures clean configuration by excluding the enabled flag.

75-81: Liveness probe implementation is correct.

The liveness probe is correctly implemented:

  1. It's conditionally included, providing deployment flexibility.
  2. The /api/server/health/ endpoint is an appropriate choice for checking the server's health.
  3. The omit function is used to exclude the enabled flag from the probe configuration.
helm-chart/templates/cvat_backend/utils/deployment.yml (1)

Line range hint 1-84: Overall, the addition of health checks improves deployment reliability.

The introduction of readiness and liveness probes to the CVAT backend utilities deployment is a positive change that enhances the reliability and manageability of the application in a Kubernetes environment. These probes allow for better health monitoring and automatic recovery in case of issues.

The rest of the deployment configuration remains unchanged, which maintains consistency with the existing setup. This focused change aligns well with the PR objectives of adding readiness and liveness probes to the CVAT project.

To further improve the deployment, consider the following:

  1. Implement different checks for readiness and liveness probes as suggested earlier.
  2. Ensure that the workerprobe management command is efficient and doesn't cause significant overhead when called frequently.
  3. Review and adjust the probe timing parameters (e.g., initialDelaySeconds, periodSeconds) based on your application's specific behavior and requirements.
  4. Document the purpose and expected behavior of these probes in your project documentation for future reference and maintenance.
helm-chart/templates/cvat_backend/worker_import/deployment.yml (2)

63-82: Summary: Improved worker health monitoring

The addition of readiness and liveness probes enhances the reliability and observability of the worker deployment. These changes allow Kubernetes to better manage the lifecycle of the worker pods, ensuring they are ready to accept work and are functioning correctly.

Key points:

  1. Both probes use a custom workerprobe command, allowing for specific health checks.
  2. Probes are conditionally added, providing flexibility in configuration.
  3. The probes specifically target the 'import' queue, which is appropriate for this worker.

These changes align well with the PR objectives of adding readiness and liveness probes to the CVAT project.


73-82: LGTM! Verify the workerprobe command implementation.

The liveness probe configuration looks good and is consistent with the readiness probe.

Since both readiness and liveness probes use the same workerprobe command, it's important to ensure that this command appropriately distinguishes between readiness and liveness checks. Please verify the implementation of the workerprobe command:

Ensure that the command performs different checks for readiness (e.g., can accept new tasks) and liveness (e.g., is running and not deadlocked).

helm-chart/templates/cvat_backend/worker_export/deployment.yml (1)

64-83: Summary: Good addition of health probes with room for refinement.

The introduction of readiness and liveness probes to the CVAT backend worker export deployment is a valuable improvement. These probes will enhance the reliability and manageability of the application in a Kubernetes environment. The implementation is flexible, allowing for easy configuration through Helm values.

Key points:

  1. Both probes are conditionally included, providing deployment flexibility.
  2. The use of a custom management command (workerprobe) allows for tailored health checks.
  3. The omit function is used correctly to exclude the enabled key from the final configuration.

Suggestions for further improvement:

  1. Differentiate between readiness and liveness probe commands or parameters for more accurate health monitoring.
  2. Add brief comments explaining the purpose of the workerprobe command and its parameters.

Overall, these changes will significantly improve the robustness of the CVAT deployment in Kubernetes.

helm-chart/values.yaml (1)

Line range hint 1-307: Overall, the addition of readiness and liveness probes is a significant improvement.

The introduction of readiness and liveness probes for the backend, worker, frontend, OPA, and kvrocks components is a valuable enhancement to the CVAT Helm chart. These probes will significantly improve the health monitoring and reliability of the deployed services.

Key points:

  1. The backend and worker configurations are well-thought-out, with appropriate timing parameters.
  2. The frontend probe configuration could benefit from more specific parameters.
  3. The OPA and kvrocks configurations are good but could be further improved with component-specific checks.

These changes will enable better orchestration of the CVAT components in a Kubernetes environment, allowing for more robust deployments and easier troubleshooting.

To further enhance this PR, consider implementing the suggested improvements for each component, particularly the frontend probe configuration. These refinements will ensure that each component has tailored health checks that accurately reflect its specific characteristics and requirements.

Comment on lines +72 to +80
{{- if .Values.cvat.kvrocks.livenessProbe.enabled }}
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - ./bin/redis-cli -p 6666 PING | grep -E '(PONG|NOAUTH)' || exit 1
  {{- toYaml (omit .Values.cvat.kvrocks.livenessProbe "enabled") | nindent 12 }}
{{- end }}

🛠️ Refactor suggestion

Consider differentiating the liveness probe from the readiness probe

While the liveness probe implementation is functional, it's identical to the readiness probe. This approach might not be ideal:

  1. Readiness and liveness probes serve different purposes:

    • Readiness determines if the pod is ready to receive traffic.
    • Liveness determines if the pod needs to be restarted.
  2. Using the same check for both could lead to unnecessary restarts if the pod is temporarily unable to serve requests but is still running.

  3. The current check might be too strict for a liveness probe, as it requires the Redis server to be fully operational.

Consider implementing a simpler liveness probe that checks if the kvrocks process is running, rather than checking the Redis server's response. For example:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pgrep kvrocks || exit 1
  {{- toYaml (omit .Values.cvat.kvrocks.livenessProbe "enabled") | nindent 12 }}

This change would make the liveness probe less likely to cause unnecessary restarts while still detecting if the kvrocks process has crashed.

Comment on lines +68 to +81
{{- if $localValues.readinessProbe.enabled }}
readinessProbe:
  httpGet:
    path: /api/server/about
    port: 8080
  {{- toYaml (omit $localValues.readinessProbe "enabled") | nindent 12 }}
{{- end }}
{{- if $localValues.livenessProbe.enabled }}
livenessProbe:
  httpGet:
    path: /api/server/health/
    port: 8080
  {{- toYaml (omit $localValues.livenessProbe "enabled") | nindent 12 }}
{{- end }}

💡 Codebase verification

Missing timing parameters in probe configurations.

The following readinessProbe and livenessProbe definitions are missing essential timing parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold):

  • helm-chart/templates/cvat_kvrocks/statefulset.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/values.yaml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_opa/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/worker_import/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/server/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/worker_annotation/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/utils/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/worker_analyticsreports/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_frontend/deployment.yml
    • readinessProbe
    • livenessProbe
  • helm-chart/templates/cvat_backend/worker_export/deployment.yml
    • readinessProbe
    • livenessProbe

Please update each probe configuration to include the necessary timing parameters to ensure proper functionality and avoid potential deployment issues.

🔗 Analysis chain

Verify probe timing parameters.

The implementation of both readiness and liveness probes looks good. However, it's important to ensure that appropriate timing parameters (such as initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold) are configured in the $localValues.readinessProbe and $localValues.livenessProbe objects.

These parameters are crucial for fine-tuning the behavior of the probes and avoiding premature restarts or false positives/negatives.

To confirm the presence of these parameters, you can run the following command:

Please review the output to ensure that appropriate timing parameters are set for both probes.


Probe configurations not found in values.yaml.

To locate and verify the timing parameters for both readiness and liveness probes, please run the following command:


Verify the presence of timing parameters in all probe configurations.

To ensure that both readiness and liveness probes have appropriate timing parameters configured, please run the following command:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for probe timing parameters in values.yaml

# Test: Search for readinessProbe and livenessProbe configurations
rg --type yaml 'readinessProbe:|livenessProbe:' -A 10 values.yaml

Length of output: 119


Script:

#!/bin/bash
# Description: Search for readinessProbe and livenessProbe across all YAML files

rg --type yaml '(readinessProbe:|livenessProbe:)' --multiline

Length of output: 2373


Script:

#!/bin/bash
# Description: Check for timing parameters in all readinessProbe and livenessProbe definitions

# Define the timing parameters to check
TIMING_PARAMS=("initialDelaySeconds" "periodSeconds" "timeoutSeconds" "successThreshold" "failureThreshold")

# Search for readinessProbe and livenessProbe in all YAML files
probes=$(rg --type yaml '(readinessProbe:|livenessProbe:)' --multiline)

# Iterate over each probe found
echo "$probes" | while read -r line; do
  file=$(echo "$line" | cut -d':' -f1)
  probe=$(echo "$line" | cut -d':' -f2 | tr -d ' ')
  echo "Checking $probe in $file"
  for param in "${TIMING_PARAMS[@]}"; do
    if ! rg --type yaml "$param:" "$file" >/dev/null; then
      echo "  - Missing $param in $probe"
    fi
  done
done

Length of output: 50850

Comment on lines +74 to +84
{{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
livenessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe
    - notifications
    - cleaning
  {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
{{- end }}

🛠️ Refactor suggestion

Consider differentiating between readiness and liveness probes.

While the addition of the liveness probe is beneficial, using the same command for both readiness and liveness probes might not be ideal. The liveness probe should check if the application is alive, while the readiness probe checks if it's ready to accept traffic.

Consider implementing a simpler check for the liveness probe that doesn't depend on the worker tasks. For example, you could create a simple endpoint that returns a 200 OK status if the application is running. This would prevent unnecessary restarts if the worker tasks are temporarily unavailable.

Example:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: {{ .Values.cvat.backend.worker.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.cvat.backend.worker.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.cvat.backend.worker.livenessProbe.timeoutSeconds }}
  successThreshold: {{ .Values.cvat.backend.worker.livenessProbe.successThreshold }}
  failureThreshold: {{ .Values.cvat.backend.worker.livenessProbe.failureThreshold }}

This assumes you've implemented a /healthz endpoint in your application that performs a basic aliveness check.

Comment on lines +74 to +83
{{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
livenessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe
    - export
  {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
{{- end }}

🛠️ Refactor suggestion

Consider differentiating between readiness and liveness probes.

The liveness probe is well-implemented and follows the same pattern as the readiness probe. However, using the same command for both readiness and liveness probes might not be ideal.

Consider the following suggestions:

  1. Differentiate the commands used for readiness and liveness probes. For example, the readiness probe could check if the worker is ready to accept new tasks, while the liveness probe could check if the worker is still running and responsive.
  2. If different commands are not feasible, consider adding parameters to the workerprobe command to distinguish between readiness and liveness checks.

Example:

livenessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe
    - export
    - --liveness  # Add a parameter to distinguish liveness check

This differentiation will provide more accurate health monitoring for your Kubernetes deployment.

Comment on lines +194 to +197
readinessProbe:
  enabled: true
livenessProbe:
  enabled: true

⚠️ Potential issue

Frontend probe configuration needs more specific parameters.

While enabling readiness and liveness probes for the frontend is a good practice, the current configuration lacks specific parameters. This may lead to using default values that might not be optimal for a frontend service.

Consider adding more specific parameters to the frontend probes. Here's a suggested configuration:

readinessProbe:
  enabled: true
  httpGet:
    path: /
    port: 80
  periodSeconds: 10
  initialDelaySeconds: 30
  failureThreshold: 3

livenessProbe:
  enabled: true
  httpGet:
    path: /
    port: 80
  periodSeconds: 30
  initialDelaySeconds: 60
  failureThreshold: 3

These settings provide a more tailored approach for a frontend service, with appropriate paths, ports, and timing parameters.

workers = [w for w in Worker.all(queue.connection) if queue.name in w.queue_names() and w.hostname == hostname]

if len(workers) != int(os.getenv("NUMPROCS", 1)) or \
        not all((datetime.now() - w.last_heartbeat).seconds < w.worker_ttl for w in workers):
@coderabbitai coderabbitai bot Sep 30, 2024

⚠️ Potential issue

Use total_seconds() instead of seconds for accurate time difference calculation

The use of .seconds in the time difference calculation only returns the number of seconds modulo 24 hours, which can lead to incorrect results if the time difference exceeds one day. To ensure the liveness check is accurate regardless of the time difference, use .total_seconds() instead.

Apply this diff to fix the issue:

-                 not all((datetime.now() - w.last_heartbeat).seconds < w.worker_ttl for w in workers):
+                 not all((datetime.now() - w.last_heartbeat).total_seconds() < w.worker_ttl for w in workers):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
not all((datetime.now() - w.last_heartbeat).seconds < w.worker_ttl for w in workers):
not all((datetime.now() - w.last_heartbeat).total_seconds() < w.worker_ttl for w in workers):
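
To illustrate the pitfall with a standalone example (not code from the PR): .seconds only reflects the seconds component of a timedelta, so a gap of more than a day can still look small:

from datetime import timedelta

gap = timedelta(days=1, seconds=5)
print(gap.seconds)          # 5 -- only the seconds component; the day is silently ignored
print(gap.total_seconds())  # 86405.0 -- the full elapsed time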

SpecLad (Contributor) commented:

The rabbit is correct here, though IMO a better fix would be datetime.now() - w.last_heartbeat < timedelta(seconds=w.worker_ttl).


@SpecLad You're absolutely right! Using datetime.now() - w.last_heartbeat < timedelta(seconds=w.worker_ttl) provides a cleaner and more accurate comparison by directly comparing timedelta objects. Here's the updated code:

not all(datetime.now() - w.last_heartbeat < timedelta(seconds=w.worker_ttl) for w in workers):

Thanks for the suggestion!

(^_^)/


codecov-commenter commented Sep 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.33%. Comparing base (0572fa1) to head (1106a39).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8488      +/-   ##
===========================================
+ Coverage    74.31%   74.33%   +0.01%     
===========================================
  Files          397      398       +1     
  Lines        43139    43140       +1     
  Branches      3905     3905              
===========================================
+ Hits         32058    32067       +9     
+ Misses       11081    11073       -8     
Components (Coverage Δ):

  • cvat-ui: 78.68% <ø> (+0.04%) ⬆️
  • cvat-server: 70.62% <ø> (+<0.01%) ⬆️

@azhavoro azhavoro changed the title Added readiness and liveness probes [WIP] Added readiness and liveness probes Oct 1, 2024
@azhavoro azhavoro changed the title [WIP] Added readiness and liveness probes Added readiness and liveness probes Oct 8, 2024

sonarcloud bot commented Oct 8, 2024

allowed_queue_names = list(q.value for q in settings.CVAT_QUEUES)
hostname = platform.node()
for queue_name in options["queue_names"]:
    if queue_name not in allowed_queue_names:

Seems like you could replace this with if queue_name not in settings.RQ_QUEUES, making allowed_queue_names unnecessary.
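
Roughly, continuing the command's handle() body (a sketch of that suggestion; RQ_QUEUES is the django_rq queue mapping, and the error handling shown is assumed, not taken from the PR):

for queue_name in options["queue_names"]:
    # membership test against the django_rq queue configuration dictionary
    if queue_name not in settings.RQ_QUEUES:
        raise CommandError(f"Unknown queue: {queue_name}")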


queue = django_rq.get_queue(queue_name)

workers = [w for w in Worker.all(queue.connection) if queue.name in w.queue_names() and w.hostname == hostname]

If you just need the queue for the connection, it would make more sense to get the connection directly (django_rq.get_connection).
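
That simplification could look like this, within the same loop (a sketch; django_rq.get_connection accepts a configured queue name):

connection = django_rq.get_connection(queue_name)
workers = [
    w for w in Worker.all(connection)
    if queue_name in w.queue_names() and w.hostname == hostname
]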


I think you also need to add an __init__.py to commands.


if len(workers) != int(os.getenv("NUMPROCS", 1)) or \
        not all((datetime.now() - w.last_heartbeat).seconds < w.worker_ttl for w in workers):
    raise CommandError(f"Unhealthy workers in the {queue_name} queue")

It would be useful to print what exactly is wrong here (expected number vs actual number, etc.).
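
For example, something along these lines, reusing the variables from the surrounding command (illustrative messages only, not the PR's code):

expected = int(os.getenv("NUMPROCS", 1))
if len(workers) != expected:
    raise CommandError(
        f"Queue {queue_name}: expected {expected} worker(s) on {hostname}, found {len(workers)}"
    )

stale = [
    w.name for w in workers
    if datetime.now() - w.last_heartbeat >= timedelta(seconds=w.worker_ttl)
]
if stale:
    raise CommandError(f"Queue {queue_name}: stale heartbeats from workers: {', '.join(stale)}")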

    path: /api/server/health/
    port: 8080
  {{- toYaml (omit $localValues.livenessProbe "enabled") | nindent 12 }}
{{- end }}

This liveness probe is IMO not suitable. /api/server/health/ returns the aggregate status of the whole system, which includes services that the server has no control over (like the database and Redis). The purpose of the liveness probe is to tell Kubernetes when to restart the pod, but it doesn't make sense to restart the server when, e.g., the database is down.

I think that we should follow the documentation here:

A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold. This ensures that the pod is observed as not-ready for some period of time before it is hard killed.
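
Applied to this template, that pattern could look roughly like the following, reusing the low-cost readiness endpoint and relying on a higher failureThreshold configured in values.yaml (a sketch, not the final choice of endpoint):

{{- if $localValues.livenessProbe.enabled }}
livenessProbe:
  httpGet:
    # same low-cost endpoint as the readiness probe; a higher failureThreshold
    # in values.yaml keeps the pod not-ready for a while before it is hard killed
    path: /api/server/about
    port: 8080
  {{- toYaml (omit $localValues.livenessProbe "enabled") | nindent 12 }}
{{- end }}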

    - workerprobe
    - analytics_reports
  {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
{{- end }}

This section is duplicated everywhere. Can you factor it out into a template?
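
One option is a named template in _helpers.tpl along these lines; the template name, the dict parameters, and the nindent values are illustrative assumptions rather than code from the PR, and the liveness probe could be factored out the same way:

{{- define "cvat.backend.worker.readinessProbe" -}}
{{- $probe := .probe -}}
{{- if $probe.enabled }}
readinessProbe:
  exec:
    command:
    - python
    - manage.py
    - workerprobe
    {{- range .queues }}
    - {{ . }}
    {{- end }}
  {{- toYaml (omit $probe "enabled") | nindent 2 }}
{{- end }}
{{- end }}

Each worker deployment would then contain a single include, for example:

{{- include "cvat.backend.worker.readinessProbe" (dict "probe" .Values.cvat.backend.worker.readinessProbe "queues" (list "analytics_reports")) | nindent 10 }}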

command:
- /bin/sh
- -c
- ./bin/redis-cli -p 6666 PING | grep -E '(PONG|NOAUTH)' || exit 1

I don't think || exit 1 does anything useful here. It just overwrites the previous error code.

@@ -53,6 +53,18 @@ spec:
env:
  {{- toYaml . | nindent 10 }}
{{- end }}
{{- if .Values.cvat.opa.readinessProbe.enabled }}
readinessProbe:
  tcpSocket:

OPA has an actual health endpoint, so perhaps you could use that?
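
For example, OPA's documented health API could be used instead of the TCP check (a sketch; the optional bundle=true query parameter additionally requires bundle activation):

{{- if .Values.cvat.opa.readinessProbe.enabled }}
readinessProbe:
  httpGet:
    # OPA's built-in health endpoint
    path: /health
    port: 8181
  {{- toYaml (omit .Values.cvat.opa.readinessProbe "enabled") | nindent 12 }}
{{- end }}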

readinessProbe:
  enabled: true
livenessProbe:
  enabled: true

Why don't you have other options here, like in the other probe settings?

"sh",
"-c",
"for p in rq:finished:* rq:job:* rq:wip:* rq:finished:* rq:failed:*; "
'do redis-cli -e --scan --pattern "$p" | xargs -r redis-cli -e del ; done',

Could you explain what you did here?
