Added readiness and liveness probes #8488
base: develop
Changes from all commits: 1621d16, d12ad5b, b240baf, f77ee89, 92c36ae, 1106a39
New file: a Django management command implementing the worker probe (@@ -0,0 +1,29 @@):

```python
import os
import platform
from datetime import datetime
from django.core.management.base import BaseCommand, CommandError
from django.conf import settings
from rq.worker import Worker
import django_rq


class Command(BaseCommand):
    help = "Check worker liveness in specified queues"

    def add_arguments(self, parser):
        parser.add_argument("queue_names", nargs="+", type=str)

    def handle(self, *args, **options):
        allowed_queue_names = list(q.value for q in settings.CVAT_QUEUES)
        hostname = platform.node()
        for queue_name in options["queue_names"]:
            if queue_name not in allowed_queue_names:
                raise CommandError(f"Queue {queue_name} is not defined")

            queue = django_rq.get_queue(queue_name)

            workers = [w for w in Worker.all(queue.connection) if queue.name in w.queue_names() and w.hostname == hostname]

            if len(workers) != int(os.getenv("NUMPROCS", 1)) or \
                not all((datetime.now() - w.last_heartbeat).seconds < w.worker_ttl for w in workers):
                raise CommandError(f"Unhealthy workers in the {queue_name} queue")
```

Review comments:

- On the `allowed_queue_names` check: Seems like you could replace this with …
- On `queue = django_rq.get_queue(queue_name)`: If you just need the queue for the connection, it would make more sense to get the connection directly (…
- On the heartbeat comparison: Use `total_seconds()` instead of `.seconds`. The `.seconds` attribute of a `timedelta` only reflects the seconds component within the current day (0–86399) and ignores whole days, so a worker that has been silent for more than a day could still pass the check. Apply this diff to fix the issue:

  ```diff
  - not all((datetime.now() - w.last_heartbeat).seconds < w.worker_ttl for w in workers):
  + not all((datetime.now() - w.last_heartbeat).total_seconds() < w.worker_ttl for w in workers):
  ```

- Follow-up (SpecLad): The rabbit is correct here, though IMO a better fix would be …
- Author reply: @SpecLad You're absolutely right! Using

  ```python
  not all(datetime.now() - w.last_heartbeat < timedelta(seconds=w.worker_ttl) for w in workers):
  ```

  Thanks for the suggestion! (^_^)/
- On the final `CommandError`: It would be useful to print what exactly is wrong here (expected number vs actual number, etc.).
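The `.seconds` pitfall flagged in the review is easy to demonstrate in isolation. A minimal sketch (the timestamps are made up; the 420-second TTL mirrors rq's default worker TTL):

```python
from datetime import datetime, timedelta

# A worker whose last heartbeat was over a day ago: clearly stale.
last_heartbeat = datetime(2024, 1, 1, 12, 0, 0)
now = datetime(2024, 1, 2, 12, 0, 5)  # 1 day and 5 seconds later
worker_ttl = 420  # seconds

elapsed = now - last_heartbeat

# .seconds holds only the sub-day remainder of the delta (0..86399),
# so the full day is silently discarded and the stale worker looks healthy.
print(elapsed.seconds)                       # 5
print(elapsed.seconds < worker_ttl)          # True  (wrong: worker looks alive)

# .total_seconds() accounts for the whole duration.
print(elapsed.total_seconds())               # 86405.0
print(elapsed.total_seconds() < worker_ttl)  # False (correct: worker is stale)

# Comparing timedelta objects directly, as suggested in the review thread,
# avoids the unit conversion entirely.
print(elapsed < timedelta(seconds=worker_ttl))  # False
```

The direct `timedelta` comparison is arguably the clearest of the three, since it never leaves the duration type.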
Server deployment template, @@ -65,6 +65,20 @@ spec:

```yaml
          {{- end }}
          ports:
          - containerPort: 8080
          {{- if $localValues.readinessProbe.enabled }}
          readinessProbe:
            httpGet:
              path: /api/server/about
              port: 8080
            {{- toYaml (omit $localValues.readinessProbe "enabled") | nindent 12 }}
          {{- end }}
          {{- if $localValues.livenessProbe.enabled }}
          livenessProbe:
            httpGet:
              path: /api/server/health/
              port: 8080
            {{- toYaml (omit $localValues.livenessProbe "enabled") | nindent 12 }}
          {{- end }}
```

Review comments on lines +68 to +81:

- 💡 Codebase verification: Missing timing parameters in probe configurations. The implementation of both readiness and liveness probes looks good, but appropriate timing parameters (such as `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `successThreshold`, and `failureThreshold`) should be set for each probe; these are crucial for fine-tuning probe behavior and avoiding premature restarts or false positives/negatives. Please update each probe configuration to include the necessary timing parameters. The following scripts were used to verify which probe definitions are missing them:

  ```shell
  # Check for probe timing parameters in values.yaml
  rg --type yaml 'readinessProbe:|livenessProbe:' -A 10 values.yaml

  # Search for readinessProbe and livenessProbe across all YAML files
  rg --type yaml '(readinessProbe:|livenessProbe:)' --multiline

  # Check for timing parameters in all readinessProbe and livenessProbe definitions
  TIMING_PARAMS=("initialDelaySeconds" "periodSeconds" "timeoutSeconds" "successThreshold" "failureThreshold")
  probes=$(rg --type yaml '(readinessProbe:|livenessProbe:)' --multiline)
  echo "$probes" | while read -r line; do
    file=$(echo "$line" | cut -d':' -f1)
    probe=$(echo "$line" | cut -d':' -f2 | tr -d ' ')
    echo "Checking $probe in $file"
    for param in "${TIMING_PARAMS[@]}"; do
      if ! rg --type yaml "$param:" "$file" >/dev/null; then
        echo "  - Missing $param in $probe"
      fi
    done
  done
  ```

- (SpecLad) This liveness probe is IMO not suitable. I think that we should follow the documentation here: …

```yaml
          volumeMounts:
          {{- if not .Values.cvat.backend.disableDistinctCachePerService }}
          - mountPath: /home/django/data/cache
```
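The timing parameters the review asks for directly determine how long an unhealthy container keeps running before kubelet acts. A rough back-of-the-envelope sketch (this ignores `timeoutSeconds` and probe execution time, so it is an approximation of kubelet's behavior, not a specification):

```python
def max_seconds_until_restart(initial_delay: int, period: int, failure_threshold: int) -> int:
    # kubelet marks a container as failed only after `failure_threshold`
    # consecutive probe failures, spaced `period` seconds apart; before the
    # first probe runs it also waits `initial_delay` seconds.
    return initial_delay + failure_threshold * period

# Values from this PR's server livenessProbe settings
# (initialDelaySeconds: 60, periodSeconds: 15, failureThreshold: 10).
print(max_seconds_until_restart(60, 15, 10))  # 210

# failureThreshold defaults to 3 in Kubernetes when left unset, so the
# server readinessProbe above (15/15, no threshold) works out to roughly:
print(max_seconds_until_restart(15, 15, 3))   # 60
```

This kind of arithmetic is worth doing per service: a worst case of 210 seconds may be fine for a stateful server but far too slow, or too aggressive, for a worker.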
Worker deployment template (queues passed to `workerprobe`: notifications, cleaning), @@ -60,8 +60,17 @@ spec:

```yaml
          {{- with concat .Values.cvat.backend.additionalEnv $localValues.additionalEnv }}
          {{- toYaml . | nindent 10 }}
          {{- end }}
          ports:
          - containerPort: 8080
          {{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
          livenessProbe:
            exec:
              command:
                - python
                - manage.py
                - workerprobe
                - notifications
                - cleaning
            {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
          {{- end }}
```

Review comment on lines +63 to +73 (🛠️ Refactor suggestion): Consider differentiating between readiness and liveness probes. While the addition of the liveness probe is beneficial, using the same command for both probes might not be ideal: the liveness probe should check if the application is alive, while the readiness probe checks if it's ready to accept traffic. Consider implementing a simpler check for the liveness probe that doesn't depend on the worker tasks; for example, a simple endpoint that returns a 200 OK status if the application is running. This would prevent unnecessary restarts if the worker tasks are temporarily unavailable. Example:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: {{ .Values.cvat.backend.worker.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.cvat.backend.worker.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.cvat.backend.worker.livenessProbe.timeoutSeconds }}
  successThreshold: {{ .Values.cvat.backend.worker.livenessProbe.successThreshold }}
  failureThreshold: {{ .Values.cvat.backend.worker.livenessProbe.failureThreshold }}
```

This assumes you've implemented a …

```yaml
          volumeMounts:
          {{- if not .Values.cvat.backend.disableDistinctCachePerService }}
          - mountPath: /home/django/data/cache
```
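The reason `manage.py workerprobe …` works as an exec probe command is kubelet's contract for exec probes: exit status 0 is success, anything else is failure, and Django exits non-zero when a management command raises `CommandError`. A small sketch of that contract (the subprocesses here stand in for the probe command):

```python
import subprocess
import sys

# Stand-ins for a healthy and an unhealthy probe run: the first exits 0,
# the second exits 1, the way `manage.py` does on CommandError.
healthy = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
unhealthy = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])

print(healthy.returncode)    # 0 -> kubelet counts the probe as passed
print(unhealthy.returncode)  # 1 -> kubelet counts the probe as failed
```

Anything the command prints goes into the probe's failure message in the pod events, which is why the review also asks the command to print what exactly is wrong.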
Worker deployment template (analytics_reports queue), @@ -61,6 +61,16 @@ spec:

```yaml
          {{- with concat .Values.cvat.backend.additionalEnv $localValues.additionalEnv }}
          {{- toYaml . | nindent 10 }}
          {{- end }}
          {{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
          livenessProbe:
            exec:
              command:
                - python
                - manage.py
                - workerprobe
                - analytics_reports
            {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
          {{- end }}
```

Review comment: This section is duplicated everywhere. Can you factor it out into a template?

```yaml
          {{- with concat .Values.cvat.backend.additionalVolumeMounts $localValues.additionalVolumeMounts }}
          volumeMounts:
          {{- toYaml . | nindent 10 }}
```
Worker deployment template (export queue), @@ -61,6 +61,16 @@ spec:

```yaml
          {{- with concat .Values.cvat.backend.additionalEnv $localValues.additionalEnv }}
          {{- toYaml . | nindent 10 }}
          {{- end }}
          {{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
          livenessProbe:
            exec:
              command:
                - python
                - manage.py
                - workerprobe
                - export
            {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
          {{- end }}
```

Review comment on lines +64 to +73 (🛠️ Refactor suggestion): Consider differentiating between readiness and liveness probes. The liveness probe is well-implemented and follows the same pattern as the readiness probe; however, using the same command for both might not be ideal. Consider the following suggestions: … Example:

```yaml
livenessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe
      - export
      - --liveness  # Add a parameter to distinguish liveness check
```

This differentiation will provide more accurate health monitoring for your Kubernetes deployment.

```yaml
          volumeMounts:
          {{- if not .Values.cvat.backend.disableDistinctCachePerService }}
          - mountPath: /home/django/data/cache
```
Kvrocks deployment template, @@ -59,6 +59,25 @@ spec:

```yaml
          {{- with .Values.cvat.kvrocks.additionalEnv }}
          {{- toYaml . | nindent 10 }}
          {{- end }}
          # https://github.com/apache/kvrocks/blob/unstable/Dockerfile
          {{- if .Values.cvat.kvrocks.readinessProbe.enabled }}
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - ./bin/redis-cli -p 6666 PING | grep -E '(PONG|NOAUTH)' || exit 1
            {{- toYaml (omit .Values.cvat.kvrocks.readinessProbe "enabled") | nindent 12 }}
          {{- end }}
          {{- if .Values.cvat.kvrocks.livenessProbe.enabled }}
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - ./bin/redis-cli -p 6666 PING | grep -E '(PONG|NOAUTH)' || exit 1
            {{- toYaml (omit .Values.cvat.kvrocks.livenessProbe "enabled") | nindent 12 }}
          {{- end }}
```

Review comments:

- On the probe command: I don't think …
- On lines +72 to +80 (🛠️ Refactor suggestion): Consider differentiating the liveness probe from the readiness probe. While the liveness probe implementation is functional, it's identical to the readiness probe, and this approach might not be ideal: … Consider implementing a simpler liveness probe that checks if the kvrocks process is running, rather than checking the Redis server's response. For example:

  ```yaml
  livenessProbe:
    exec:
      command:
        - /bin/sh
        - -c
        - pgrep kvrocks || exit 1
    {{- toYaml (omit .Values.cvat.kvrocks.livenessProbe "enabled") | nindent 12 }}
  ```

  This change would make the liveness probe less likely to cause unnecessary restarts while still detecting if the kvrocks process has crashed.

```yaml
          volumeMounts:
          - name: {{ .Release.Name }}-kvrocks-data
            mountPath: /var/lib/kvrocks/data
```
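The intent of the `grep -E '(PONG|NOAUTH)'` check above can be made explicit by mirroring it in plain Python: the probe passes when kvrocks answers at all, even with an authentication error, since either reply proves the server is up and responding. A small sketch (`probe_ok` is an illustrative helper, not part of the chart):

```python
import re

# Same alternation as the shell pipeline's grep -E '(PONG|NOAUTH)':
# a normal PONG or a NOAUTH error both count as "alive".
ALIVE = re.compile(r"PONG|NOAUTH")

def probe_ok(reply: str) -> bool:
    """Return True when a redis-cli PING reply indicates a live server."""
    return ALIVE.search(reply) is not None

print(probe_ok("PONG"))                             # True
print(probe_ok("NOAUTH Authentication required."))  # True
print(probe_ok(""))                                 # False: no reply, probe fails
print(probe_ok("Could not connect to Redis"))       # False
```

Accepting `NOAUTH` is the design choice being debated in the thread above: it keeps the probe working when auth is enabled, at the cost of not verifying that commands can actually be executed.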
OPA deployment template, @@ -53,6 +53,18 @@ spec:

```yaml
          env:
          {{- toYaml . | nindent 10 }}
          {{- end }}
          {{- if .Values.cvat.opa.readinessProbe.enabled }}
          readinessProbe:
            tcpSocket:
              port: 8181
            {{- toYaml (omit .Values.cvat.opa.readinessProbe "enabled") | nindent 12 }}
          {{- end }}
          {{- if .Values.cvat.opa.livenessProbe.enabled }}
          livenessProbe:
            tcpSocket:
              port: 8181
            {{- toYaml (omit .Values.cvat.opa.livenessProbe "enabled") | nindent 12 }}
          {{- end }}
          {{- with .Values.cvat.opa.additionalVolumeMounts }}
          volumeMounts:
          {{- toYaml . | nindent 10 }}
```

Review comment on the `tcpSocket` probe: OPA has an actual health endpoint, so perhaps you could use that?
values.yaml, @@ -42,7 +42,21 @@ cvat:

```yaml
    additionalEnv: []
    additionalVolumes: []
    additionalVolumeMounts: []
    readinessProbe:
      enabled: true
      periodSeconds: 15
      initialDelaySeconds: 15
    livenessProbe:
      enabled: true
      periodSeconds: 15
      failureThreshold: 10
      initialDelaySeconds: 60
    worker:
      livenessProbe:
        enabled: true
        periodSeconds: 120
        initialDelaySeconds: 30
        timeoutSeconds: 10
      export:
        replicas: 2
        labels: {}
```

@@ -172,6 +186,10 @@ cvat:

```yaml
    #   - mountPath: /tmp
    #     name: tmp
    #     subPath: test
    readinessProbe:
      enabled: true
    livenessProbe:
      enabled: true
```

Review comments on lines +189 to +192:

- Frontend probe configuration needs more specific parameters. While enabling readiness and liveness probes for the frontend is a good practice, the current configuration lacks specific parameters, which may lead to using default values that might not be optimal for a frontend service. Here's a suggested configuration:

  ```yaml
  readinessProbe:
    enabled: true
    httpGet:
      path: /
      port: 80
    periodSeconds: 10
    initialDelaySeconds: 30
    failureThreshold: 3
  livenessProbe:
    enabled: true
    httpGet:
      path: /
      port: 80
    periodSeconds: 30
    initialDelaySeconds: 60
    failureThreshold: 3
  ```

  These settings provide a more tailored approach for a frontend service, with appropriate paths, ports, and timing parameters.
- (SpecLad) Why don't you have other options here, like in the other probe settings?

```yaml
    service:
      type: ClusterIP
      ports:
```
@@ -216,6 +234,14 @@ cvat:

```yaml
    #     name: tmp
    #     subPath: test
    composeCompatibleServiceName: true # Sets service name to opa in order to be compatible with Docker Compose. Necessary because changing IAM_OPA_DATA_URL via environment variables in current images. Hinders multiple deployment due to duplicate name
    readinessProbe:
      enabled: true
      periodSeconds: 15
      initialDelaySeconds: 15
    livenessProbe:
      enabled: true
      periodSeconds: 15
      initialDelaySeconds: 15
    service:
      type: ClusterIP
      ports:
```

@@ -266,6 +292,14 @@ cvat:

```yaml
    #   - mountPath: /tmp
    #     name: tmp
    #     subPath: test
    readinessProbe:
      enabled: true
      periodSeconds: 10
      initialDelaySeconds: 30
    livenessProbe:
      enabled: true
      periodSeconds: 10
      initialDelaySeconds: 30
    defaultStorage:
      enabled: true
      # storageClassName: default
```
Python test utilities, @@ -240,11 +240,26 @@ def kube_restore_clickhouse_db():

```python
def docker_restore_redis_inmem():
    docker_exec_redis_inmem(["redis-cli", "-e", "flushall"])
    docker_exec_redis_inmem(
        [
            "sh",
            "-c",
            "for p in rq:finished:* rq:job:* rq:wip:* rq:finished:* rq:failed:*; "
            'do redis-cli -e --scan --pattern "$p" | xargs -r redis-cli -e del ; done',
        ]
    )


def kube_restore_redis_inmem():
    kube_exec_redis_inmem(["sh", "-c", 'redis-cli -e -a "${REDIS_PASSWORD}" flushall'])
    kube_exec_redis_inmem(
        [
            "sh",
            "-c",
            "for p in rq:finished:* rq:job:* rq:wip:* rq:finished:* rq:failed:*; "
            'do redis-cli -e -a "${REDIS_PASSWORD}" --scan --pattern "$p" | '
            'xargs -r redis-cli -e -a "${REDIS_PASSWORD}" del ; done',
        ]
    )


def docker_restore_redis_ondisk():
```

Review comment on the pattern-deletion loop: Could you explain what you did here?
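The shell loop in the fixture deletes only keys matching the rq bookkeeping patterns rather than flushing everything. Its selection logic can be sketched in Python with `fnmatch`, whose globbing is close to (though not identical to) redis `--scan --pattern` semantics; the key names below are illustrative:

```python
from fnmatch import fnmatch

# Patterns cleaned between tests, as in the fixture above (the original
# list also repeats rq:finished:*, which is harmless but redundant).
PATTERNS = ["rq:finished:*", "rq:job:*", "rq:wip:*", "rq:failed:*"]

def keys_to_delete(keys: list[str], patterns: list[str] = PATTERNS) -> list[str]:
    """Return the keys that the fixture's scan/del loop would remove."""
    return [k for k in keys if any(fnmatch(k, p) for p in patterns)]

keys = ["rq:job:abc123", "rq:wip:export", "rq:workers", "session:42"]
print(keys_to_delete(keys))  # ['rq:job:abc123', 'rq:wip:export']
```

Using `--scan` rather than `KEYS` matters in the real fixture: `SCAN` iterates incrementally instead of blocking the server on one large keyspace walk, and `xargs -r` skips the `DEL` entirely when a pattern matches nothing.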
Final review comment: I think you also need to add an `__init__.py` to `commands`.