[BUG] picamera2 locks up after hours, receiving no more buffers #1090
Comments
Can confirm this is happening to me as well.
I can also confirm that I'm encountering the same issue.
So far, I have not been able to reproduce the bug with a minimal version that just features the video capture. But the bug is intermittent, so maybe I just haven't run the test code long enough for it to crop up. I will keep testing. @djhanove, are you able to share how you handle it using the watchdog thread?
@dmunnet something like this, I'll let you figure out the rest:
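The original snippet didn't survive in this thread, so the following is only a rough sketch of the watchdog-thread idea being described: a capture thread timestamps every frame, and a watchdog exits the process when frames stop arriving so a supervisor (systemd, a shell loop, etc.) can relaunch it. All names and the 10-second timeout are made up for illustration.

```python
import os
import threading
import time

from picamera2 import Picamera2

WATCHDOG_TIMEOUT = 10  # seconds without a frame before we assume a lockup (arbitrary)

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration())
picam2.start()

last_frame = time.monotonic()

def capture_loop():
    global last_frame
    while True:
        picam2.capture_array()          # blocks until the next frame is delivered
        last_frame = time.monotonic()

def watchdog():
    while True:
        time.sleep(1)
        if time.monotonic() - last_frame > WATCHDOG_TIMEOUT:
            # No frames for too long: exit so a supervisor (e.g. systemd)
            # can relaunch the whole process.
            os._exit(1)

threading.Thread(target=capture_loop, daemon=True).start()
threading.Thread(target=watchdog, daemon=True).start()

while True:
    time.sleep(60)
```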
@djhanove, thanks for this!
I can also confirm I've seen what appears to be the same problem. While developing updated Python code using Picamera2 to support the Pi Camera Module 3 as part of the "My Naturewatch Camera Server", I ran into this issue. I'm using a Pi Zero 2W and ended up testing on two separate Pi Zero 2Ws, both of which are running identical code and both have the problem. Interestingly, I tested the same code using an older Pi Camera module and couldn't reproduce the problem. I tried to find the simplest code that would reproduce the problem, but given that the failure can occur anywhere between 1-7 days on average, it's not been easy to say for sure whether the problem is present in test code or not. The app I've been developing also uses the H.264 encoder with a circular buffer to write out to a file once motion has been detected. Usually when the problem occurs I see "…". I shall keep an eye on this thread with interest as it would be good to resolve this issue properly.
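For context, the circular-buffer-to-file pattern mentioned above is essentially the one from Picamera2's capture_circular example; the sketch below assumes a hypothetical motion_detected() hook and output filename, purely for illustration.

```python
import time

from picamera2 import Picamera2
from picamera2.encoders import H264Encoder
from picamera2.outputs import CircularOutput

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration(main={"size": (1920, 1080)}))

encoder = H264Encoder()
# Keep roughly the last 5 seconds of encoded frames in memory (at ~30 fps).
circular = CircularOutput(buffersize=150)
picam2.start_recording(encoder, circular)

def motion_detected():
    ...  # placeholder for the app's motion-detection logic

while True:
    if motion_detected():
        # Dump the buffered pre-roll plus the ongoing frames to a file.
        circular.fileoutput = "motion.h264"
        circular.start()
        time.sleep(10)
        circular.stop()
    time.sleep(0.1)
```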
@caracoluk I have around 55 3B+s with V1 camera modules where this doesn't happen (obviously older kernel and firmware).
I did try lowering the frame rate to 15fps (from 25fps), thinking that the issue might be related to the increased resources required to handle the larger frames from the Camera Module 3. That didn't seem to make any difference. I enabled debugging using the "vclog -m" and "vclog -a" commands and nothing at all was shown in these logs when the problem occurs. The only logging I've found at all is in dmesg, and from some of the other threads it seems like these are just symptoms of the problem and offer little insight into the underlying cause. I realise it's not going to be easy for the Pi developers to resolve without a means of easily/quickly reproducing the problem.
@caracoluk I ran a max stress test with 4 threads for days at a time to try and induce it and also did not see increased frequency. Agreed, not an easy one to debug.
I can confirm that. I also tried a maximum CPU stress test with no difference. As for memory, I have 2/3 of CMA and half of system memory free while the application is running.
Looking back at my alert logs, I did have some Pi 4Bs with V1 camera modules that also exhibited this, so it is not exclusive to the V3 module.
@djhanove that's useful to know. In that case this problem is likely to be affecting quite a few people. Did you notice whether the Pis with the V1 camera modules had been running for a lot longer than those with the V3 modules before they failed?
Didn't seem to make a difference for me. I just swapped some back to the V1 camera yesterday and had 2 lockups in 24 hours on one device.
What I have noticed is that I get the lockups more frequently if the Pis are running at a higher temperature. Over the warmer weeks (when the CPU temperature was around 72-80C) I'd see the lockups on average once every 2 days. Now it's a bit cooler, the CPU temperature is showing 65-70C, and I haven't had a lockup on either of my Pis for the last 5 days so far. That might explain why I was failing to create the shortest segment of code to reproduce the problem: not because the problem was no longer present, but because it would take longer to occur. I see that some of you have run CPU stress tests to try and cause the problem, and that should have pushed the temperature up a fair bit, so it's difficult to say if this is just a coincidence or not.
@caracoluk all of my devices have heat sinks on them, which keeps the temps in the 60-65C range even under full load.
I am experiencing the same issue on a CM4 running up-to-date Bookworm. It never showed these errors while running the same Picamera2 apps on Bullseye. I have a Raspberry Pi Camera Module 3 on one port and an OV5647 on the other. The same fault occurs when running with only the Camera Module 3, although less frequently.

Headless CM4 running 6.6.31-1+rpt1 (2024-05-29) aarch64, fresh install then sudo apt update, sudo apt upgrade. Official Raspberry Pi 5 power supply (25W). vcgencmd get_throttled always gives 0x0. Both camera cables are short, about 15cm. There is a possible source of radio interference from the WiFi antenna connected to the CM4, which runs hostapd for a local WiFi hotspot.

I'm working on a minimal program that exhibits the fault. I am also trying to see if the WiFi hotspot makes any difference (so far no obvious difference). I will also try more cooling on the CM4.

The program is based on the very wonderful Picamera2 examples: a combination of mjpeg_server, capture_circular and capture_video_multiple together with OpenCV object detection. Most of the time each instance uses less than 30% of 1 CPU, so top rarely shows less than 80% idle. In normal use no swap is used, and 1400MBytes are available on the 2GByte CM4. Normally there are no clients for the HTTP MJPEG servers, and there is no correlation with the occurrence of the fault.

Every few days one or other of the picamera2 processes hangs and receives no frames. I use systemd notify with a 10-second watchdog to kill and relaunch; see the sketch after this comment. Most of the time after a hang, systemd manages to relaunch successfully. Sometimes it throws kernel errors as reported above. Once or twice both processes have stalled at the same time (both receiving no frames) and that seems to lead to a disaster: systemd can't kill either process, so stale zombies pile up and eventually the 2GByte swapfile is all used up. The only way out is to reboot. I'll keep you posted.
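A sketch of the systemd-notify watchdog arrangement being described, assuming the python3-systemd bindings and a service unit with Type=notify and WatchdogSec=10; the details are illustrative rather than the poster's actual setup.

```python
# Intended to run under a systemd service along the lines of:
#   [Service]
#   Type=notify
#   WatchdogSec=10
#   Restart=always
# so that systemd kills and relaunches the process if it stops petting the watchdog.
from systemd import daemon  # from the python3-systemd package

from picamera2 import Picamera2

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration())
picam2.start()

daemon.notify("READY=1")

while True:
    picam2.capture_array()        # blocks until a frame is delivered
    daemon.notify("WATCHDOG=1")   # only pet the watchdog while frames keep arriving
```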
I have the same issue. I am using two cameras for MJPEG live streaming. The problem has not occurred with Bullseye and has been a problem since the migration to Bookworm. The lockups seem to happen within half a day to a day. I restart the script every hour as a workaround. I have obtained the kernel error logs and have uploaded them to the following gist URL.
Actually I am finding it hard to narrow this down. I can rule out the possible interference of the WiFi hotspot: the errors turn up just the same with it switched off. The frame rate makes a big difference. With both processes running at 20 frames per second, the errors come every few hours on both cameras, and the disaster (both processes hanging and failing to be killed, so the system runs out of memory) happens after a few days. With one process running at 10 frames per second and the other at 20, the errors are less frequent, and with both processes at 10 frames per second the errors are much less frequent (and so far no disasters). Still working on it...
@lowflyerUK I find that my app runs between 2-10 days without failing with the frame rate set to 20fps and a resolution of 1920x1080, using the Camera Module 3 on a Pi Zero 2W. I have two duplicate sets of hardware and both seem to exhibit the problem to the same degree. I did notice that on hot days the lockups occur more frequently, as I'm only using a heatsink with no active cooling. The reported CPU temperature has gone above 80C on hot days, which seems to reduce the mean time between failures for me.
@caracoluk Thanks for your comment. The CPU in this CM4 does run rather warm, between 70-80C, although it normally runs at around 90% idle. Last night I added a permanent stress process that uses 100% of one core. This pushed the temperature to around 83C, so the CPU was occasionally throttled, but there were no errors for 4 hours, so no dramatic difference (but only 4 hours). I'll see if I can think of a way to improve the cooling. Update: I tried with 2 cores stressed. It failed within 2 hours on both camera processes, leading to my previously mentioned disaster, where defunct processes fill up the memory. So temperature could be a contributory factor for me, and/or possibly my picamera2 process fails if the CPU is busy.
@lowflyerUK if we could find a way to reliably reproduce the problem within as short a time frame as possible, it would make it easier for the Pi developers to investigate. Ideally we'd need the sample code to be as simple as possible, as I see from other threads that is usually their first request when looking into a problem.
@caracoluk Thanks for your encouragement! Yes, that is exactly what I am trying to do.
Hi, it looks like we are dealing with a similar issue in our custom app (https://forums.raspberrypi.com/viewtopic.php?t=359992), and we are still debugging.
@naroin Many thanks for pointing out that thread. My errors do indeed seem remarkably similar. Typically I get:
after which systemd is able to relaunch the app. Fully up-to-date 64-bit Raspberry Pi OS on a CM4 with 2GBytes RAM and 2GBytes swap, running 2 separate picamera2 processes. Most of the time top shows around 80% idle. If my sums are right, that adds up to around 27MPix/sec for all 4 encodes, so less than half of 1080p30 and about an eighth of the 220MPix/sec that @6by9 told us was the rated maximum for the ISP. Maybe the CM4 can't reliably encode 4 outputs at once, even at 10 frames per second? Should I try a Raspberry Pi 5?
I haven't found simple sample code that replicates my issue, so I am inclined to feel that the ISP rate limit is the cause. In my case I think I can make a workaround by only issuing frames to the MJPEGEncoder when a client is actually watching the stream; see the sketch below. As I am the only client and I hardly ever watch the realtime stream, the probability of failure will be a lot less. This obviously won't be a solution for everybody.
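A sketch of that workaround, assuming a recent Picamera2 where start_encoder() accepts a name= argument to select the lores stream; the client-tracking hooks are invented for illustration and would be called from the HTTP streaming layer.

```python
import io
import threading

from picamera2 import Picamera2
from picamera2.encoders import MJPEGEncoder
from picamera2.outputs import FileOutput


class StreamingOutput(io.BufferedIOBase):
    """Keeps only the most recent JPEG frame for the HTTP handler to serve."""
    def __init__(self):
        self.frame = None
        self.condition = threading.Condition()

    def write(self, buf):
        with self.condition:
            self.frame = buf
            self.condition.notify_all()


picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration(
    main={"size": (1920, 1080)},
    lores={"size": (640, 480)},
))
picam2.start()

mjpeg_encoder = MJPEGEncoder()
streaming_output = StreamingOutput()
clients = 0                    # hypothetical viewer count maintained by the web server
clients_lock = threading.Lock()


def client_connected():
    """Called by the (hypothetical) HTTP layer when a viewer attaches."""
    global clients
    with clients_lock:
        clients += 1
        if clients == 1:
            # First viewer: only now start feeding frames to the MJPEG encoder.
            picam2.start_encoder(mjpeg_encoder, FileOutput(streaming_output), name="lores")


def client_disconnected():
    """Called by the HTTP layer when a viewer goes away."""
    global clients
    with clients_lock:
        clients -= 1
        if clients == 0:
            # Nobody watching: stop the encoder so it no longer receives frames.
            picam2.stop_encoder()
```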
Interestingly, my two Pi Zero 2Ws haven't had a lockup in the last 3 weeks, after previously running for 2-5 days before it happened. There have been no software updates on either of them and I've not changed the code in any way, just left them running in the same positions. The only thing I'm aware of that has changed is the temperature, as it has been quite a bit cooler recently. I remember reading somewhere that the Pi starts to throttle the CPU if the temperature increases above 80C, and I was seeing temperatures reach this level. Perhaps the CPU throttling makes it more likely for this issue to occur?
I don't think this issue is isolated to picamera2. I'm having the exact same issue with rpicam-vid, which I've detailed here: https://forums.raspberrypi.com/viewtopic.php?t=376766
I'd like to try and reproduce this. Does anyone have reasonably simple code that I can run on the latest Raspberry Pi OS on a Pi 4 (and preferably avoiding mediamtx, because I know nothing about it) that is likely to provoke the problem? Thanks.
I think no one here has been able to reliably provoke the problem. The issue is still persistent for me across a massive fleet of Pi 4s, with no way to predictably reproduce it. I have to believe that others are facing the same challenges based on the dialogue here. I have a thread running picamera2 to just capture frames, but my program is also running several other tasks in parallel.
Yes, we can share one example, but we have modified the kernel logs to try to investigate the issue... In this example the duplicated response is: vc_sm_cma_vchi_rx_ack: received response 102259, throw away...
I think I've managed to reproduce this now, and I see the same errors.
I'm slightly surprised that you're mapping and unmapping as many buffers as that. The dmabuf is meant to be held on to until the VPU has responded saying that it has unmapped it. Actually the size of the allocation is only 32kB. That's not an image buffer, so what is it? Lens shading table?
The LS table is indeed 32k. By mapping/unmapping do you mean calling mmap/munmap from userland? That only happens once per configure() cycle in the IPA. Perhaps the 32k LS table size is a coincidence and it's actually another buffer? If it was the LS table, I expect we can reproduce this without running the encoder.
However, this happens on every frame in the kernel driver ctrl handler:
Perhaps this is not right?
An experiment might be to change …
Perhaps it's better to do something like this in the ISP driver: naushir/linux@de2c0b3? Completely untested!
We do not know either how we've done that, but the unexpected message has a trans_id index equal to the expected (lost message) trans_id index minus 128, and 128 matches the width of the vchiq_slot_info[] circular buffer.

```diff
@@ -721,6 +731,7 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
 	struct sg_table *sgt = NULL;
 	dma_addr_t dma_addr;
 	u32 cache_alias;
+	u32 trans_id;
 	int ret = 0;
 	int status;

@@ -783,21 +794,23 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
 		 __func__, import.name, import.type, &dma_addr, import.size);

 	/* Allocate the videocore buffer. */
-	status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result,
-				       &sm_state->int_trans_id);
+	status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result, &trans_id);
 	if (status == -EINTR) {
 		pr_debug("[%s]: requesting import memory action restart (trans_id: %u)\n",
-			 __func__, sm_state->int_trans_id);
+			 __func__, trans_id);
 		ret = -ERESTARTSYS;
 		private->restart_sys = -EINTR;
 		private->int_action = VC_SM_MSG_TYPE_IMPORT;
+		private->int_trans_id = trans_id;
 		goto error;
 	} else if (status || !result.res_handle) {
 		pr_debug("[%s]: failed to import memory on videocore (status: %u, trans_id: %u)\n",
-			 __func__, status, sm_state->int_trans_id);
+			 __func__, status, trans_id);
 		ret = -ENOMEM;
 		goto error;
 	}
+	pr_debug("[%s]: requesting import memory (trans_id: %u)\n",
+		 __func__, trans_id);

 	mutex_init(&buffer->lock);
 	INIT_LIST_HEAD(&buffer->attachments);
```
Is libcamera resubmitting the same dmabuf every frame? Your patch is close, but we actually want to store/compare the … I did wonder if we were running out of buffering in VCHI. With the ISP and encoding going on, it is shovelling a fair number of commands around the place.
Yes, libcamera uses a single dmabuf handle for the LS table. I'll rework the change on Monday and give it some testing.
@davidplowman Sorry I missed this. Mine throws this, then one of the threads never gets any more frames.
I think this script is even quicker at causing it. Yes, sorry about my filenames.
I've got an updated fix for the LS dmabuf caching. I seem to have improved behavior on my system so far, but it would be nice to have it tested by other folks. The change is here: raspberrypi/linux#6429. Once the CI has run on the commit, you can pull the kernel with …
Running this currently on a unit that has frequently been rebooting -- will keep you posted after a few days. I have some stats on number of lockups prior to the kernel patch. |
Something has changed! I took the plunge on October 19th and upgraded my remote CM4 to 6.6.51+rpt-rpi-v8 using sudo apt update/sudo apt upgrade. I haven't seen any problems since then! The simplified script ran without failure for a few days, then my full scripts ran at 30 frames per second for a few days, also without failure. In August the failures came every few hours when running at 20 frames per second. Things that are different:
- software update
- temperature
- amount of movement to be processed
- darker
- no one at home

I'll let you know if anything changes.
@naushir I've been running the kernel PR on one device that has been problematic. Have not had a thread lockup since.
I'm testing the pulls/6429 kernel on my environment (the first two cameras).
This looks promising! I patched the kernel (currently 6.6.54) in my Yocto build with the changes from the PR, leaving everything else unchanged. In the last few weeks I had a lockup every day running my complex application, usually after a few hours. With the patch, my test system has been running flawlessly for almost two days. I will continue testing over a longer term. Thank you!
So my CM4 had the issue again after 3 days. Obviously this is without pulls/6429.
Had a lockup last night on one of the devices with the patched kernel. I patched 2 problematic devices with the PR and am definitely seeing improvements. Curious whether the buffer_count parameter may impact this; see the note below.
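For reference, buffer_count can be set when building the video configuration; whether a larger value actually helps with this lockup is untested, and the numbers below are arbitrary.

```python
from picamera2 import Picamera2

picam2 = Picamera2()
# Video configurations default to 6 buffers; raising it gives the pipeline more
# slack at the cost of extra CMA memory (the value 8 here is just an experiment).
config = picam2.create_video_configuration(
    main={"size": (1920, 1080), "format": "YUV420"},
    buffer_count=8,
)
picam2.configure(config)
picam2.start()
```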
I am testing on two devices with raspberrypi/linux#6429, which before had a lockup every day, usually after a few hours.
There is a clear improvement, but something still seems to be going on. dmesg shows known output:
Seeing the same thing as @nzottmann across 2 devices.
My environment is still working fine without error messages (up 6 days, 16:48). I'll continue to monitor it :)
Hard to say if the above lockups are related to the same issue or not. Regardless, I've merged the LS updates to the kernel tree as they do seem to improve things.
I would suggest we close this issue down and start a new issue for any further lockups that need investigating.
My environment is still running. Looks good. @naushir Thank you!
I'm wanting to pull this update on some of my devices that have the boot loader locked by a private key. Will this update modify the boot loader at all? I ran it on one of my other devices with a non-locked boot loader and it reported:
I'm not sure, to be honest. But perhaps you can build the kernel yourself instead of using …
Closing this one down now. We can track other potential lockups in a new issue. |
Okay, that's great! Do you know when it will be available in the main image, i.e. if I do a clean install of Raspberry Pi OS via the Pi Imager?
During the development of an application that streams continuously using picamera2, I cannot get rid of a bug that leads to a lockup of picamera2, which stops receiving buffers after multiple hours of runtime.
I tried to reproduce this with a minimal example on a Pi 4 with a current Raspberry Pi OS, but without success yet. Thus I hesitated for a long time to file a bug report, but I have several clues and found multiple similar issues, so I decided to collect my findings here, perhaps to receive some help or to help others seeing the same symptoms.
I know I cannot really ask for help until I have a working minimal example on a current RPi OS, but I am happy to collect more debugging information that could help track this down or find other impacted users who can share relevant findings.
To Reproduce
Minimal example
I created a minimal example that encodes the main and lores streams to H.264 and forwards them to ffmpeg outputs; see the sketch below. ffmpeg sends them via RTSP to a https://github.com/bluenviron/mediamtx instance, which includes a webserver to open the stream in a browser using WebRTC.
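The minimal example itself isn't included in this report; a rough reconstruction matching that description might look like the following, assuming a recent Picamera2 where start_encoder() accepts a name= argument and a mediamtx instance listening on port 8554 (bitrates, sizes and stream paths are illustrative).

```python
import signal

from picamera2 import Picamera2
from picamera2.encoders import H264Encoder
from picamera2.outputs import FfmpegOutput

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration(
    main={"size": (1920, 1080), "format": "YUV420"},
    lores={"size": (640, 360), "format": "YUV420"},
))
picam2.start()

# One H.264 encoder per stream, each feeding an ffmpeg process that pushes RTSP
# to a local mediamtx instance.
picam2.start_encoder(H264Encoder(bitrate=4_000_000),
                     FfmpegOutput("-f rtsp rtsp://127.0.0.1:8554/main"),
                     name="main")
picam2.start_encoder(H264Encoder(bitrate=1_000_000),
                     FfmpegOutput("-f rtsp rtsp://127.0.0.1:8554/lores"),
                     name="lores")

# Run until the lockup described above (typically 10-20 hours) or Ctrl-C.
signal.pause()
```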
After multiple hours, usually in the range of 10-20h, the mediamtx logs show that the RTSP streams have stopped. While debugging, I could see that the picamera2 event loops only contained empty events from that point on.
Context
This minimal example is a reduced version of my more complex application, which is running on a custom Yocto build on a CM4 (currently on the CM4IO board) with a Pi Camera Module 3.
Currently, I cannot reproduce the bug on a Pi 4 with this minimal example. But I can reproduce the bug with:
Although I cannot prove it, I think the bug is present in Raspberry Pi OS too, but subtle differences lead to a much longer runtime before failure than on my custom build. Over the last months, I observed that multiple factors changed the time to failure:
On my custom build, I try to follow the Raspberry Pi OS versions of related software as closely as possible, currently:
Symptoms
As described, when the failure happens, the outputs stop producing frames; picamera2 stops supplying raw frames to the encoders.
dmesg
In most cases, dmesg shows that some kind of V4L2 frame polling method seems to block forever. The same message appears twice at the same time for two different Python tasks, presumably one for each stream.
Sometimes, but more rarely, there is nothing in dmesg.
Sometimes I see a …
load
Looking at top, I saw a load of 2.00, but no processes showing more than a few % of CPU. Quick research pointed to IO workload as a possible cause, perhaps the never-returning V4L2 poll.
Resolving the situation
After killing and restarting the application, it works again for a few hours. A reboot does not change anything.
Related issues
Lots of research led me to related issues, perhaps having their origin in the same hard-to-track problem.
gstreamer: v4l2h264enc stops processing frames