New Celery queue to debug CPU/memory issues #2693
Comments
Before starting to work on this card I took a deeper look into the OpenShift metrics and realized that, at the moment, the short-running workers aren't restarting due to memory usage.

[screenshot: short-running workers memory usage (never reaching the threshold) & restarts]

I think the liveness probe check is the reason why they are restarting, even though there is no evidence of it in the logs. However, you can see in the logs that when a long-running worker detects a missed heartbeat from a short-running one, then, usually, the short-running worker is restarted.

[screenshots: short running 0/1/2, missed heartbeat & restarts]

I can't see a strong correlation between CPU or bandwidth usage and the short-running workers' restarts.

[screenshots: short running workers memory usage & restarts; cpu usage short running workers; cpu usage all pods stacked; cpu usage all pods; tx rate short running workers; rx rate short running workers]

I can't even see a correlation between the tasks executed in a worker and its restarts.
When a worker container is restarted due to the liveness probe there should be …
Yes, I don't know either. In OpenShift, for all the pods, I could see …
It seems that the OOMKilled+137 error code could also be related to restarts due to a failed liveness probe check.
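For context, a liveness probe for a Celery worker typically just pings the worker and fails when there is no reply in time; below is a minimal sketch of such a check, assuming a ping via Celery's inspect API (the app name, broker URL and worker names are placeholders, not the probe actually used in the deployment):

```python
import sys

from celery import Celery

# Placeholder app/broker; the real Celery app lives in packit-service.
celery_app = Celery("packit-service", broker="redis://redis:6379/0")


def worker_is_alive(worker_name: str, timeout: float = 10.0) -> bool:
    # inspect().ping() returns e.g. {"celery@short-running-0": {"ok": "pong"}}
    # for workers that replied within the timeout, or None if nobody did.
    replies = celery_app.control.inspect(destination=[worker_name], timeout=timeout).ping()
    return bool(replies) and worker_name in replies


if __name__ == "__main__":
    worker = sys.argv[1] if len(sys.argv) > 1 else "celery@short-running-0"
    # A non-zero exit makes the probe fail, which triggers a container restart.
    sys.exit(0 if worker_is_alive(worker) else 1)
```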
Did you zoom all the way in? The highest peaks may not be visible otherwise. I also think the graphs may not be 100% accurate.
No, I didn't zoom in 🙈 and this can actually explain the missing heartbeat. But I would say these graphs don't show a memory leak (yet), more of a spike? In the first graph I shared there are smaller spikes, and the memory usage after them goes down again and seems to settle around 500M. So probably I would just adjust the memory limit to see if they really are just spikes? And if they aren't, I will create a new worker later. WDYT?
In theory it could even be a combination of both.
Well, we request 320M for the short running workers, so if 500M is the "norm" we should increase that.
I suppose it's worth a try. I was thinking whether maybe the …
Definitely worth a try, but at the same time I would say there is still some kind of issue with the memory/concurrency, as Matej described in the original issue, so to pinpoint it I would also try creating a new queue.
Linked above is the PR for adjusting the memory in the short-running workers. Regarding the new queue: I'm not saying we don't have a memory leak problem, I just believe it is not yet evident in the short-running workers' memory graph. Their memory is quite steady around 500MiB (with some spikes), and I think this is because the workers are restarted too often. I believe the leak now shows more clearly in the long-running workers' memory graph, so I am not sure it is related to a single task's code. I think we can try adding some memory profile logging. On the other hand, since we almost always restart the workers every week anyway, it does not seem like a big problem, at least for the long-running workers, and from what I see in the short-running workers' graph I don't expect their leak to be much worse.
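To illustrate the memory profile logging idea, here is a minimal sketch (not existing packit-service code; the logger name and log format are assumptions) that records the worker process's peak RSS around every task via Celery signals:

```python
import logging
import resource

from celery.signals import task_prerun, task_postrun

logger = logging.getLogger(__name__)


def _peak_rss_mib() -> float:
    # On Linux, ru_maxrss is the peak resident set size of the process in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


@task_prerun.connect
def log_memory_before_task(task_id=None, task=None, **_):
    logger.info("task=%s id=%s peak_rss_before=%.1f MiB", task.name, task_id, _peak_rss_mib())


@task_postrun.connect
def log_memory_after_task(task_id=None, task=None, **_):
    # A peak RSS that keeps growing across tasks hints at a leak in the worker process.
    logger.info("task=%s id=%s peak_rss_after=%.1f MiB", task.name, task_id, _peak_rss_mib())
```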
The tasks that succeeded while memory was increasing are no different than usual:

[screenshot: short-running-0 tasks succeeded]

[screenshot: short-running-1 tasks succeeded]
We still have restarts caused by the OOMKiller. Here are the logs before the OOMKiller restart for short-running-0:

[logs: short-running-0 before the OOMKiller restart]

And here are the logs before the OOMKiller restart for short-running-1:

[logs: short-running-1 before the OOMKiller restart]
For the first time I could also spot a worker restarting not because of the OOMKiller but because of a Celery error. When pod 2 restarted it had used just half of its memory limit. It took exactly 30 minutes for the worker to become responsive again (so I think this log is related to #2697).

[logs: the Celery error before the restart of pod 2]
Merged #2708, #2709 and packit/deployment#642.
Related to #2522
Context
Based on the discussion at the arch meeting regarding the CPU and memory issues. The issue is tied to the production deployment (higher load) and to the short-running workers (tasks are run concurrently).
As for the memory issues, the best “guess” is failing cleanup; therefore, let's set up a new queue that will run in the same way as the short-running one, but with just a subset of tasks, to try to pinpoint the specific handlers that are causing issues.
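A minimal sketch of what such a queue could look like at the Celery configuration level; the queue name ("debug") and the routed task name are placeholders, not decisions made in this issue:

```python
# Sketch only: queue and task names are placeholders.
from kombu import Queue

# Declare the existing queues plus the new debugging queue; a separate
# worker deployment would consume only the "debug" queue.
task_queues = (
    Queue("short-running"),
    Queue("long-running"),
    Queue("debug"),
)

# Route just the suspected task(s) to the new queue so their CPU/memory
# behaviour can be observed in isolation.
task_routes = {
    # placeholder name; use whatever process_message is registered as
    "packit_service.worker.tasks.process_message": {"queue": "debug"},
}
```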
TODO
- Create a new queue
- Pick a task (e.g., `process_message`, since there's a high amount of those) or a subset of tasks (e.g., less frequent tasks that could be filtered further on) that will run in that queue
- (optionally) Unify the way tasks are split between the queues (see the sketch after this list); currently we declare the queue both in the decorator:
  packit-service/packit_service/worker/tasks.py (line 413 in 2c46677)
  and also in the global Celery config:
  packit-service/packit_service/celery_config.py (lines 18 to 23 in 2c46677)
- (optionally) Improve the docs on which tasks are supposed to run where; currently, by default, everything runs in the short-running queue unless specified otherwise, and there are also some tasks that stand out, e.g., the VM Image build being triggered from the short-running queue
- Based on the time spent on the previous points, either stalk the OpenShift/Celery metrics or create a follow-up card
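To make the duplication in the unification bullet concrete, here is an illustrative snippet (not the actual project code) showing the same queue being named both on the task decorator and in the global routing configuration; unifying would mean keeping only one of the two, most likely the config-level `task_routes`:

```python
from celery import Celery

celery_app = Celery("packit-service")


# 1) The queue is named on the task itself...
@celery_app.task(queue="short-running")
def process_message(event: dict):
    ...


# 2) ...and again in the global Celery configuration.
celery_app.conf.task_routes = {
    # glob pattern so the route matches regardless of the task's full dotted name
    "*.process_message": {"queue": "short-running"},
}
```

Keeping the routing in a single place (either only the decorators or only `celery_config.py`) would give one source of truth per task, which would also make adding a temporary debug queue a one-line change.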