[BUG] picamera2 locks up after hours, receiving no more buffers #1090

Closed
nzottmann opened this issue Aug 9, 2024 · 66 comments

@nzottmann

During the development of an application that streams continuously using picamera2, I cannot get rid of a bug that locks up picamera2: after multiple hours of runtime it stops receiving buffers.

I tried to reproduce this with a minimal example on a Pi 4 with a current Raspberry Pi OS, but without success so far. I therefore hesitated for a long time to file a bug report, but I have several indications and found multiple similar issues, so I decided to collect my findings here, perhaps to receive some help or to help others seeing the same symptoms.

I know I cannot really ask for help until I have a working minimal example on a current RPi OS, but I am happy to collect more debugging information that could help track this down or find other impacted users who can share relevant findings.

To Reproduce

Minimal example

I created a minimal example which encodes the main and lores streams to h264 and forwards them to ffmpeg outputs. ffmpeg sends them via RTSP to a https://github.com/bluenviron/mediamtx instance, which includes a webserver to open the stream in a browser using WebRTC.
After multiple hours, usually in the range of 10-20 h, the mediamtx logs show that the RTSP streams stopped. While debugging, I could see that the picamera2 event loops only contained empty events from this point on.

from picamera2.encoders import H264Encoder
from picamera2.outputs import FfmpegOutput
from picamera2 import Picamera2
import time

picam2 = Picamera2()

video_config = picam2.create_video_configuration(
    main={"size": (1920, 1080), "format": "RGB888"},
    lores={"size": (1280, 720), "format": "YUV420"},
    raw={"size": (2304, 1296)},
    controls={"FrameDurationLimits": (int(1e6/24), int(1e6/24))},
    display=None,
    use_case="video"
)

picam2.align_configuration(video_config)
picam2.configure(video_config)

encoder1 = H264Encoder(bitrate=10000000, framerate=24, enable_sps_framerate=True)
output1 = FfmpegOutput("-r 24 -f rtsp rtsp://127.0.0.1:8554/stream1")
picam2.start_recording(encoder1, output1, name="main")

encoder2 = H264Encoder(bitrate=10000000, framerate=24, enable_sps_framerate=True)
output2 = FfmpegOutput("-r 24 -f rtsp rtsp://127.0.0.1:8554/stream2")
picam2.start_recording(encoder2, output2, name="lores")

while True:
    time.sleep(60)

picam2.stop_recording()

Context

This minimal example is a reduced version of my more complex application, which is running on a custom yocto build on a cm4 (currently on the cm4io board) with pi camera v3.

Currently, I can not reproduce the bug on a pi4 with this minimal example. But I can reproduce the bug with

  • the minimal example on my custom build
  • my complex application on a pi4 with an up to date Raspberry Pi OS

Although I cannot prove it, I think the bug is present in Raspberry Pi OS too, but subtle differences lead to a much longer runtime than on my custom build. During the last months, I observed that multiple factors change the time to fail:

  • The simpler the application, the longer it runs
  • With one stream instead of two, the application runs for a longer time
  • Time to fail on Raspberry Pi OS is longer than on my custom build
  • libcamera, firmware and kernel versions seem to have an impact on time to fail, but I could not clearly identify a pattern here
  • I had a case of my application running on Bookworm for 28 days, then after a manual reboot failing in less than 24h.

On my custom build, I try to follow the Raspberry Pi OS versions of the related software as closely as possible, currently:

  • Linux 6.6.40-v8
  • raspberrypi-firmware from 2024-05-24T15:30:04 (4942b7633c0ff1af1ee95a51a33b56a9dae47529)
  • libcamera v0.3.0+rpt20240617
  • picamera2 0.3.19

Symptoms

As described, when the failure happens, the outputs stop producing frames: picamera2 stops supplying raw frames to the encoders.

dmesg

In most cases, dmesg shows that some kind of v4l2 frame polling method blocks forever. The same message appears twice at the same time for two different Python tasks, presumably one per stream.
Sometimes, but more rarely, there is nothing in dmesg.

[Fri Aug  9 08:32:04 2024] vc_sm_cma_vchi_rx_ack: received response 18856330, throw away...
[Fri Aug  9 08:35:29 2024] INFO: task python3:265886 blocked for more than 120 seconds.
[Fri Aug  9 08:35:29 2024]       Tainted: G         C         6.6.40-v8 #1
[Fri Aug  9 08:35:29 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug  9 08:35:29 2024] task:python3         state:D stack:0     pid:265886 ppid:1      flags:0x0000020c
[Fri Aug  9 08:35:29 2024] Call trace:
[Fri Aug  9 08:35:29 2024]  __switch_to+0xe0/0x160
[Fri Aug  9 08:35:29 2024]  __schedule+0x37c/0xd68
[Fri Aug  9 08:35:29 2024]  schedule+0x64/0x108
[Fri Aug  9 08:35:29 2024]  schedule_preempt_disabled+0x2c/0x50
[Fri Aug  9 08:35:29 2024]  __mutex_lock.constprop.0+0x2f4/0x5a8
[Fri Aug  9 08:35:29 2024]  __mutex_lock_slowpath+0x1c/0x30
[Fri Aug  9 08:35:29 2024]  mutex_lock+0x50/0x68
[Fri Aug  9 08:35:29 2024]  v4l2_m2m_fop_poll+0x38/0x80 [v4l2_mem2mem]
[Fri Aug  9 08:35:29 2024]  v4l2_poll+0x54/0xc8 [videodev]
[Fri Aug  9 08:35:29 2024]  do_sys_poll+0x2b0/0x5c8
[Fri Aug  9 08:35:29 2024]  __arm64_sys_ppoll+0xb4/0x148
[Fri Aug  9 08:35:29 2024]  invoke_syscall+0x50/0x128
[Fri Aug  9 08:35:29 2024]  el0_svc_common.constprop.0+0xc8/0xf0
[Fri Aug  9 08:35:29 2024]  do_el0_svc+0x24/0x38
[Fri Aug  9 08:35:29 2024]  el0_svc+0x40/0xe8
[Fri Aug  9 08:35:29 2024]  el0t_64_sync_handler+0x120/0x130
[Fri Aug  9 08:35:29 2024]  el0t_64_sync+0x190/0x198

Sometimes I see a

[Thu Aug  8 08:56:34 2024] vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512

load

Looking at top, I saw a load of 2.00, but no process showing more than a few % of CPU. A quick search pointed to I/O wait, i.e. tasks stuck in uninterruptible sleep, as a possible cause, perhaps the never-returning v4l2 poll.
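
Tasks stuck in uninterruptible sleep (state D) count towards the load average without using any CPU, which matches a load of 2.00 with an idle top. As a purely illustrative sketch (not from the original report), such tasks can be listed from /proc like this:

import pathlib

# Illustrative sketch: list tasks in uninterruptible sleep ("D" state).
# These count towards the load average without consuming any CPU time.
for stat in pathlib.Path("/proc").glob("[0-9]*/stat"):
    try:
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
        # spaces, so split on the last closing parenthesis first.
        pid_comm, rest = stat.read_text().rsplit(")", 1)
        if rest.split()[0] == "D":
            print(pid_comm + ")")   # e.g. "265886 (python3)"
    except (OSError, IndexError):
        continue  # process exited while scanning, or unexpected format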

Resolving the situation

After killing and restarting the application, it works again for a few hours. A reboot does not change anything.

Related issues

A lot of research led me to related issues, which perhaps have their origin in the same hard-to-track root cause.

@djhanove

Can confirm this is happening to me as well.
I'm running a large fleet of Pi 4s with a complex application. Completely at random I stop receiving buffers, and it hangs indefinitely until the camera is closed and restarted or the system is rebooted. I can now handle it gracefully via a watchdog thread, but it is quite concerning.
I am not seeing any dmesg logs (likely because logging goes to a ramdisk to save SD cards) and I get no errors in the program when this happens.

@dmunnet

dmunnet commented Aug 14, 2024

I can also confirm that I'm encountering the same issue.
Pi 4 with 4GB RAM running an up-to-date Bookworm OS. In my application I periodically capture video footage when sensor input is detected on a GPIO line. When the bug occurs, I do not get any error message and the program appears to be running as it should, but the recorded video files are zero-length. After forcing the program to quit, I get the same dmesg output that others have posted...

[ 9970.424030] vc_sm_cma_vchi_rx_ack: received response 850337, throw away...
[59407.677412] vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
[59407.677432] bcm2835_mmal_vchiq: vchiq_mmal_submit_buffer: vc_sm_import_dmabuf_fd failed, ret -512
[59407.677437] bcm2835-codec bcm2835-codec: device_run: Failed submitting ip buffer
[59407.696712] ------------[ cut here ]------------
[59407.696731] WARNING: CPU: 1 PID: 3746 at drivers/media/common/videobuf2/videobuf2-core.c:2024 __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696775] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep hci_uart btbcm brcmfmac_wcc bluetooth brcmfmac brcmutil cfg80211 ov5647 ecdh_generic ecc bcm2835_unicam libaes rfkill v4l2_dv_timings v4l2_fwnode raspberrypi_hwmon v4l2_async binfmt_misc bcm2835_codec(C) bcm2835_v4l2(C) rpivid_hevc(C) bcm2835_isp(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 raspberrypi_gpiomem videodev vc_sm_cma(C) videobuf2_common snd_bcm2835(C) mc nvmem_rmem uio_pdrv_genirq uio i2c_dev fuse dm_mod ip_tables x_tables ipv6 rtc_ds1307 spidev regmap_i2c vc4 snd_soc_hdmi_codec drm_display_helper cec v3d drm_dma_helper gpu_sched i2c_mux_pinctrl drm_shmem_helper i2c_mux drm_kms_helper i2c_brcmstb spi_bcm2835 drm i2c_bcm2835 drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight
[59407.696947] CPU: 1 PID: 3746 Comm: python3.11 Tainted: G         C         6.6.31+rpt-rpi-v8 #1  Debian 1:6.6.31-1+rpt1
[59407.696955] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
[59407.696958] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[59407.696964] pc : __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696980] lr : __vb2_queue_cancel+0x38/0x2a0 [videobuf2_common]
[59407.696994] sp : ffffffc0831f3a70
[59407.696997] x29: ffffffc0831f3a70 x28: ffffff80ba3887f8 x27: 0000000000000009
[59407.697006] x26: 0000000000000001 x25: 00000000f7b98094 x24: ffffff80ba388600
[59407.697015] x23: ffffff80401c92e0 x22: ffffff80bae57a98 x21: ffffff8041e3cd30
[59407.697024] x20: ffffff80bae57b40 x19: ffffff80bae57a98 x18: 0000000000000003
[59407.697032] x17: 0000000000000000 x16: ffffffd7102d4348 x15: ffffffc0831f35c0
[59407.697039] x14: 0000000000000004 x13: ffffff8043740028 x12: 0000000000000000
[59407.697047] x11: ffffff80b57462f8 x10: ffffff80b5746238 x9 : ffffffd71032baf8
[59407.697055] x8 : ffffffc0831f3970 x7 : 0000000000000000 x6 : 0000000000000228
[59407.697063] x5 : ffffff80459f4e40 x4 : fffffffe01167d20 x3 : 0000000080150013
[59407.697071] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000001
[59407.697079] Call trace:
[59407.697085]  __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.697101]  vb2_core_queue_release+0x2c/0x60 [videobuf2_common]
[59407.697115]  vb2_queue_release+0x18/0x30 [videobuf2_v4l2]
[59407.697136]  v4l2_m2m_ctx_release+0x30/0x50 [v4l2_mem2mem]
[59407.697164]  bcm2835_codec_release+0x64/0x110 [bcm2835_codec]
[59407.697178]  v4l2_release+0xec/0x100 [videodev]
[59407.697282]  __fput+0xbc/0x288
[59407.697292]  ____fput+0x18/0x30
[59407.697296]  task_work_run+0x80/0xe0
[59407.697306]  do_exit+0x30c/0x988
[59407.697311]  do_group_exit+0x3c/0xa0
[59407.697315]  get_signal+0x980/0x9b0
[59407.697320]  do_notify_resume+0x318/0x1370
[59407.697325]  el0_svc_compat+0x78/0x88
[59407.697335]  el0t_32_sync_handler+0x98/0x140
[59407.697341]  el0t_32_sync+0x194/0x198
[59407.697346] ---[ end trace 0000000000000000 ]---
[59407.697353] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000680707d2 in active state

So far, I have not been able to trigger the bug with a minimal version that just does the video capture. But the bug is intermittent, so maybe I just haven't run the test code long enough for it to crop up. I will keep testing.

@djhanove, are you able to share how you handle it using the watchdog thread?

@djhanove

@dmunnet something like this, I'll let you figure out the rest

import time
from picamera2 import Picamera2
import threading

import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CameraManager:
    def __init__(self):

        self.picam2 = Picamera2()
        capture_config = self.picam2.create_video_configuration(
            main={"size": (640, 480), "format": "RGB888"},
            raw={"size": self.picam2.sensor_resolution},
            buffer_count=6,
            queue=True
        )
        self.picam2.configure(capture_config)
        self.picam2.start(show_preview=False)

        self.heartbeat_lock = threading.Lock()
        self.last_heartbeat = time.time()

        self.watchdog_thread = threading.Thread(target=self._watchdog)
        self.watchdog_thread.start()

        self.camera_thread = threading.Thread(target=self._capture_continuous)
        self.camera_thread.start()

    def _watchdog(self, timeout=10):  # Timeout in seconds
        while True:
            with self.heartbeat_lock:
                last_heartbeat = self.last_heartbeat
            if time.time() - last_heartbeat > timeout:
                logger.error("Camera capture thread unresponsive, attempting to reset.")
            time.sleep(timeout)

    def _capture_continuous(self):
        while True:
            try:
                frame = self.picam2.capture_array()
                self.signal_heartbeat()  # Signal that the thread is alive
            except Exception as e:
                logger.error(f"Failed to capture frame: {e}")
                break

            time.sleep(1 / 25)

    def signal_heartbeat(self):
        with self.heartbeat_lock:
            self.last_heartbeat = time.time()


if __name__ == "__main__":
    camera_manager = CameraManager()
    while True:
        time.sleep(1)
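
The watchdog above only logs when the capture thread goes quiet; the actual recovery is the part left to the reader. Below is a minimal sketch of one way to complete it, with a hypothetical _reset_camera() method on CameraManager that the watchdog could call instead of just logging. Note that if the capture thread is wedged in an uninterruptible kernel poll, stop()/close() may block as well, which is why several people in this thread restart the whole process from an external watchdog instead.

    def _reset_camera(self):
        # Hypothetical recovery routine (not from the original post): tear
        # down the wedged camera, build a fresh Picamera2 instance and
        # restart the capture thread. The old capture thread may still be
        # blocked in the kernel; it is simply left behind here.
        try:
            self.picam2.stop()
            self.picam2.close()
        except Exception as e:
            logger.error(f"Error while closing camera: {e}")
        self.picam2 = Picamera2()
        capture_config = self.picam2.create_video_configuration(
            main={"size": (640, 480), "format": "RGB888"},
            buffer_count=6,
        )
        self.picam2.configure(capture_config)
        self.picam2.start(show_preview=False)
        self.signal_heartbeat()
        self.camera_thread = threading.Thread(target=self._capture_continuous)
        self.camera_thread.start()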

@dmunnet

dmunnet commented Aug 14, 2024

@djhanove, thanks for this!
If I learn more about this issue in the course of my testing, I'll post an update.

@caracoluk

caracoluk commented Aug 16, 2024

I can also confirm I've seen what appears to be the same problem. While developing updated Python code using PiCamera2 to support the Pi Camera Module 3 as part of the "My Naturewatch Camera Server" I ran into this issue. I'm using a Pi Zero 2W and ended up testing on two separate Pi Zero 2Ws, both of which are running identical code and both have the problem. Interestingly I tested the same code using an older Pi Camera module and couldn't reproduce the problem.

I tried to find the simplest code that would reproduce the problem, but given that the failure can occur anywhere between 1-7 days on average it's not been easy to say for sure whether the problem is present in test code or not. The app I've been developing also uses the H.264 encoder using a circular buffer to write out to a file once motion has been detected. Usually when the problem occurs I see "imported vc_sm_cma_get_buffer failed -512" written in the dmesg log, but often this is only after I've restarted the Python process (the date stamp still shows the time the application stopped functioning). If I carry out the command "libcamera-hello --list-devices" once the system has entered the failed state it will hang indefinitely. I ended up creating a watchdog to restart the Python process instead as a workaround. There's clearly a problem, and I'd come across the same issues that others have linked to above, but in each case those seemed to be either related to a few kernel versions that had a separate problem, or were caused by faulty SD cards.

I shall keep an eye on this thread with interest as it would be good to resolve this issue properly.

@djhanove

@caracoluk I have around 55 3b+'s with V1 camera modules where this doesn't happen (obviously older kernel and firmware)
I didn't think that it could be the camera module version, but you may be onto something. We swapped to pi4's and the v3 camera module at the same time and it is only happening on our newer builds.

@caracoluk

@caracoluk I have around 55 3b+'s with V1 camera modules where this doesn't happen (obviously older kernel and firmware) I didn't think that it could be the camera module version, but you may be onto something. We swapped to pi4's and the v3 camera module at the same time and it is only happening on our newer builds.

I did try lowering the frame rate to 15fps (from 25fps), thinking that the issue might be related to the increased resources required handling the larger frames from the Camera module 3. That didn't seem to make any difference. I enabled debugging using the "vclog -m" and "vclog -a" commands and nothing at all was shown in these logs when the problem occurs. The only logging I've found at all is in dmesg and from some of the other threads it seems like these are just symptoms of the problem and offer little insight into the underlying cause. I realise it's not going to be easy for the Pi developers to resolve without them having a means of easily/quickly reproducing the problem.

@djhanove

@caracoluk I ran with 4 threads on max stress test for days at a time to try and induce it and also did not see increased frequency.

agreed, not an easy one to debug

@nzottmann
Author

I can confirm that. I also tried max CPU stress without any difference. As for memory, I have 2/3 of CMA and half of system memory free while the application is running.

@djhanove

@caracoluk I have around 55 3b+'s with V1 camera modules where this doesn't happen (obviously older kernel and firmware) I didn't think that it could be the camera module version, but you may be onto something. We swapped to pi4's and the v3 camera module at the same time and it is only happening on our newer builds.

I did try lowering the frame rate to 15fps (from 25fps), thinking that the issue might be related to the increased resources required handling the larger frames from the Camera module 3. That didn't seem to make any difference. I enabled debugging using the "vclog -m" and "vclog -a" commands and nothing at all was shown in these logs when the problem occurs. The only logging I've found at all is in dmesg and from some of the other threads it seems like these are just symptoms of the problem and offer little insight into the underlying cause. I realise it's not going to be easy for the Pi developers to resolve without them having a means of easily/quickly reproducing the problem.

Looking back at my alert logs, I did have some pi 4b's with V1 camera modules that also exhibited this so it is not exclusive to the v3 module.

@caracoluk

@djhanove that's useful to know. In that case this problem is likely to be affecting quite a few people. Did you notice whether the pi's with the V1 camera modules had been running for a lot longer than those with the V3 modules before they failed?

@djhanove

@djhanove that's useful to know. In that case this problem is likely to be affecting quite a few people. Did you notice whether the pi's with the V1 camera modules had been running for a lot longer than those with the V3 modules before they failed?

Didn't seem to make a difference for me. I just swapped some back to the V1 camera yesterday and had 2 lockups in 24 hours on one device.

@caracoluk

What I have noticed is that I get the lockups more frequently if the Pis are running at a higher temperature. Over the warmer weeks (when the CPU temperature was around 72-80C) I'd see the lockups on average once every 2 days. Now that it's a bit cooler the CPU temperature is showing 65-70C, and I haven't had a lockup on either of my Pis for the last 5 days so far. That might explain why my attempts to create the shortest segment of code to reproduce the problem were failing: not because the problem was no longer present, but because it would take longer to occur. I see that some of you have run CPU stress tests to try and cause the problem, which should have pushed the temperature up a fair bit, so it's difficult to say if this is just a coincidence or not.

@djhanove

@caracoluk all of my devices have heat sinks on them which keeps the temps in the 60-65C range even under full load.

@lowflyerUK

I am experiencing the same issue on a CM4 running up to date bookworm. It never showed these errors while running the same picamera2 apps on bullseye. I have a raspberry pi camera module 3 on one port and an ov5647 on the other. The same fault occurs when running with only the camera module 3, although less frequently.

Headless CM4 running 6.6.31-1+rpt1 (2024-05-29) aarch64. Fresh install then sudo apt update, sudo apt upgrade.

Official Raspberry pi 5 power supply 25W. vcgencmd get_throttled always gives 0x0.

Both camera cables short - about 15cm. Possible source of radio interference from WiFi antenna connected to the CM4 running hostapd for local WiFi hotspot.

I'm working on a minimal program that exhibits the fault. I am also trying to see if the WiFi hotspot makes any difference (so far no obvious difference). I will also try more cooling on the CM4.

The program is based on the very wonderful picamera2 examples. Combination of mjpeg_server, capture_circular, capture_video_multiple together with opencv object detection. Most of the time each instance uses less than 30% of 1 cpu, so top rarely shows less than 80% idle. In normal use no swap is used. 1400MBytes available on 2GByte CM4. Normally there are no clients for the http mjpeg servers - and no correlation with the occurrence of the fault.

Every few days one or other of the picamera2 processes hangs - no frames received. I use systemd notify with a 10sec watchdog to kill and relaunch. Most of the time after a hang, systemd managed to relaunch successfully. Sometimes it throws kernel errors as reported above.

vc_sm_cma_vchi_rx_ack: received response 78529294, throw away...

Once or twice both processes have stalled at the same time (both receiving no frames) and that seems to lead to a disaster. Systemd can't kill either process, so stale zombies pile up and eventually the 2GByte swapfile is all used up. The only way out is to reboot.

I'll keep you posted.
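
For reference, a minimal sketch of the systemd-notify watchdog approach described above: the watchdog is only petted after a frame actually arrives, so a hang in capture_array() lets systemd kill and relaunch the service. This assumes a unit with Type=notify, WatchdogSec=10 and a suitable Restart= setting; the notification protocol is spoken directly so no extra package is needed.

import os
import socket
from picamera2 import Picamera2

def sd_notify(message: bytes) -> None:
    # Send a message to the systemd notification socket, if the process
    # was started by systemd; otherwise this is a no-op.
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return
    if addr.startswith("@"):              # abstract namespace socket
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, addr)

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration(main={"size": (1280, 720)}))
picam2.start()
sd_notify(b"READY=1")

while True:
    picam2.capture_array()                # blocks indefinitely once the bug hits
    sd_notify(b"WATCHDOG=1")              # only reached while frames keep arriving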

@Akkiesoft

I have the same issue.

I am using two cameras for MJPEG live streaming. The problem never occurred with Bullseye and has been present since the migration to Bookworm.

The lockups seem to happen within half a day to a day. I restart the script every hour as a workaround.

I have obtained the kernel error logs and have uploaded them to the following gist URL.
https://gist.github.com/Akkiesoft/4400a32e4488bf58b7ce891781019399
https://gist.github.com/Akkiesoft/95e9cfcd24023dd9b3b8009fc07a472c

@lowflyerUK

Actually I am finding it hard to narrow this down.

I can rule out the possible interference of the WiFi hotspot - the errors turn up the same with it switched off.

The frame rate makes a big difference. With both processes running at 20 frames per second, the errors come every few hours on both cameras, and the disaster (both processes hanging and failing to be killed, so runs out of memory) happens after a few days. With one process running at 10 frames per second and the other at 20, the errors are less frequent and with both processes at 10 frames per second the errors are much less frequent (and so far no disasters).

Still working on it...

@caracoluk

@lowflyerUK I find that my app runs between 2-10 days without failing with the frame rate set to 20fps and a resolution of 1920x1080 using the camera module 3 on a Pi Zero 2W. I have two duplicate sets of hardware and both seem to exhibit the problem to the same degree. I did notice that on hot days the lock ups occur more frequently as I'm only using a heatsink with no active cooling. The reported CPU temperature has gone above 80C on hot days which seems to reduce the mean time between failures for me.

@lowflyerUK

lowflyerUK commented Sep 1, 2024

@caracoluk Thanks for your comment. The cpu in this CM4 does run rather warm - between 70-80deg, although it runs normally with 90% idle. Last night I added a permanent stress process that uses 100% of one core. This pushed the temperature to around 83deg so the cpu was occasionally throttled, but no errors for 4 hours - so no dramatic difference (but only for 4 hours). I'll see if I can think of a way to improve the cooling.

Update... I tried with 2 cores stressed. It failed within 2 hours on both camera processes, leading to my previously mentioned disaster, where defunct processes fill up the memory. So temperature could be a contributory factor for me, and/or possibly my picamera2 process fails if the cpu is busy.

@caracoluk

@lowflyerUK if we could find a way to reliably reproduce the problem within as short a time frame as possible it would make it easier for the Pi developers to investigate. Ideally we'd need the sample code to be as simple as possible as I see from other threads that is usually their first request when looking into a problem.

@lowflyerUK

@caracoluk Thanks for your encouragement! Yes, that is exactly what I am trying to do.

@naroin

naroin commented Sep 6, 2024

Hi,

It looks like we are dealing with a similar issue in our custom app (https://forums.raspberrypi.com/viewtopic.php?t=359992), and we are still debugging.
For us, the received response seems to be a repetition of an old message received 128 messages earlier. This corresponds to the depth of the vchiq_slot_info[] array.

@lowflyerUK

@naroin Many thanks for pointing out that thread. My errors do indeed seem remarkably similar. Typically I get:

Aug 30 02:01:53 nestboxcam3 systemd[1]: gardencam.service: Watchdog timeout (limit 10s)!
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512

after which systemd is able to relaunch the app.

Fully up to date 64 bit Raspberry Pi OS on CM4 with 2GBytes RAM and 2GBytes swap. 2 separate picamera2 processes:
One (ov5647) with 2 streams at 10 frames per sec: 320x240 with MJPEGEncoder and 1296x972 with H264Encoder. The other (imx708) with 2 streams at 10 frames per sec: 384x216 with MJPEGEncoder and 1536x864 with H264Encoder.

Most of the time top shows around 80% idle.

If my sums are right, that adds up to around 27 MPix/sec for all 4 encodes. So less than half of 1080p30 and about an eighth of the 220 MPix/sec that @6by9 told us was the rated maximum for the ISP.
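
For what it's worth, a quick check of that arithmetic with the stream sizes listed above (illustrative only):

# Four encodes, all at 10 frames per second, as listed above.
streams = [(320, 240), (1296, 972), (384, 216), (1536, 864)]
fps = 10
total = sum(w * h for w, h in streams) * fps
print(total / 1e6)   # ~27.5 MPix/s, i.e. roughly the 27 MPix/sec quoted above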

Maybe the CM4 can't reliably encode 4 outputs at once, even at 10 frames per sec? Should I try a Raspberry Pi 5?

@lowflyerUK

I haven't found simple sample code that replicates my issue, so I am inclined to feel that the ISP rate limit is the cause. In my case I think I can work around it by only issuing frames to the MJPEGEncoder when a client is actually watching the stream. As I am the only client and I hardly ever watch the realtime stream, the probability of failure will be a lot lower. This obviously won't be a solution for everybody.
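
A rough sketch of that on-demand idea, assuming a recent picamera2 where start_encoder()/stop_encoder() accept an output and a stream name in the same way start_recording() does in the example at the top of this issue; streaming_output stands in for something like the StreamingOutput from the mjpeg_server example:

from picamera2.encoders import MJPEGEncoder

mjpeg_encoder = MJPEGEncoder()

def on_client_connected(picam2, streaming_output):
    # Only run the MJPEG encode on the lores stream while a client is watching.
    picam2.start_encoder(mjpeg_encoder, streaming_output, name="lores")

def on_client_disconnected(picam2):
    picam2.stop_encoder(mjpeg_encoder)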

@caracoluk

Interestingly my two Pi Zero 2Ws haven't had a lockup in the last 3 weeks, whereas they would previously run for 2-5 days before it happened. There have been no software updates on either of them and I've not changed the code in any way, just left them running in the same positions. The only thing I'm aware of that has changed is the temperature, as it has been quite a bit cooler recently. I remember reading somewhere that the Pi starts to throttle the CPU if the temperature rises above 80C, and I was seeing temperatures reach this level. Perhaps the CPU throttling makes it more likely for this issue to occur?

@jwainwright87

jwainwright87 commented Sep 17, 2024

I don't think this issue is isolated to picamera2. I'm having the exact same issue with rpicam-vid that I've detailed here. https://forums.raspberrypi.com/viewtopic.php?t=376766

@davidplowman
Collaborator

I'd like to try and reproduce this. Does anyone have reasonably simple code that I can run on the latest Raspberry Pi OS on a Pi 4 (and preferably avoiding mediamtx because I know nothing about it) that is likely to provoke the problem? Thanks.

@djhanove

I think no one here has been able to reliably provoke the problem. The issue is still persistent for me across a massive fleet of Pi 4s, with no ability to reproduce it predictably. I have to believe that others are facing the same challenges based on the dialog here. I have a thread running picamera2 just to capture frames, but my program is also running several other tasks in parallel.

@naroin

naroin commented Oct 17, 2024

Yes, we can share one example but we have modified the kernel logs to try to investigate the issue...

7,920587,527773748,-;vc_sm_cma_free: dmabuf 3af4fc43, buffer c3ea9700
7,920588,527773939,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf a5075a10/fd -1
7,920589,527774099,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d9d00000, size 3133440, id 39.
7,920590,527774222,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102259)
7,920591,527774240,-;[vc_sm_add_resource]: added buffer 6dfe0219 (name sm-host-resource, size 3133440)
7,920592,527774248,-;[vc_sm_cma_import_dmabuf_internal]: buffer 6dfe0219 (name sm-host-resource, size 3133440), imported 4294967295 vc_handle c000002e
7,920593,527774258,-;vc_sm_cma_import_dmabuf: dma_buf a5075a10/fd -1 imported to ptr 9a7fd358
7,920594,527774312,-;vc_sm_dma_buf_release dmabuf 3af4fc43, buffer 4d404806
7,920595,527774322,-;vc_sm_vpu_free: mem_handle c000002d
7,920596,527774335,-;vc_sm_dma_buf_release vpu_free done
7,920597,527774341,-;vc_sm_clean_up_dmabuf: buffer 4d404806, imported 4294967295
7,920598,527774563,-;vc_sm_dma_buf_release clean_up dmabuf done
7,920599,527774570,-;[vc_sm_release_resource]: buffer 4d404806 (name sm-host-resource, size 3133440), imported 4294967295
7,920600,527774578,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 4d404806 vc_handle c000002d
7,920601,527774585,-;vc_sm_dma_buf_release done
7,920602,527774586,-;vc_sm_vpu_event: Released addr d7600000, size 3133440, id 00000026, mem_handle c000002d
7,920603,527774598,-;[vc_sm_release_resource]: buffer 4d404806 (name sm-host-resource, size 3133440), imported 4294967295
7,920604,527774607,-;vc_sm_release_resource: Release our allocation - done
7,920605,527778251,-;vc_sm_cma_free: dmabuf b0254adc, buffer c3f1de80
7,920606,527778580,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 35385238/fd -1
7,920607,527778838,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d5f00000, size 2088960, id 38.
7,920608,527779293,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102261)
7,920609,527779393,-;[vc_sm_add_resource]: added buffer 2ab20279 (name sm-host-resource, size 2088960)
7,920610,527779405,-;[vc_sm_cma_import_dmabuf_internal]: buffer 2ab20279 (name sm-host-resource, size 2088960), imported 4294967295 vc_handle c000002d
7,920611,527779415,-;vc_sm_cma_import_dmabuf: dma_buf 35385238/fd -1 imported to ptr 6db0a007
7,920612,527779704,-;vc_sm_dma_buf_release dmabuf b0254adc, buffer 5df0e4a8
7,920613,527779716,-;vc_sm_vpu_free: mem_handle c0000023
7,920614,527779728,-;vc_sm_dma_buf_release vpu_free done
7,920615,527779740,-;vc_sm_clean_up_dmabuf: buffer 5df0e4a8, imported 4294967295
7,920616,527780187,-;vc_sm_dma_buf_release clean_up dmabuf done
7,920617,527780207,-;[vc_sm_release_resource]: buffer 5df0e4a8 (name sm-host-resource, size 2088960), imported 4294967295
7,920618,527780221,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 5df0e4a8 vc_handle c0000023
7,920619,527780235,-;vc_sm_dma_buf_release done

...

7,921721,528330015,-;vc_sm_release_resource: Release our allocation - done
7,921722,528330052,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 7270b242/fd -1
7,921723,528330259,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d3600000, size 3133440, id 38.
7,921724,528330457,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102385)
7,921725,528330510,-;[vc_sm_add_resource]: added buffer 05d2ba0b (name sm-host-resource, size 3133440)
7,921726,528330519,-;[vc_sm_cma_import_dmabuf_internal]: buffer 05d2ba0b (name sm-host-resource, size 3133440), imported 4294967295 vc_handle c000002d
7,921727,528330528,-;vc_sm_cma_import_dmabuf: dma_buf 7270b242/fd -1 imported to ptr 0a6c9f91
7,921728,528330663,-;vc_sm_dma_buf_release dmabuf 9a7fd358, buffer 9f498656
7,921729,528330678,-;vc_sm_vpu_free: mem_handle c000002b
7,921730,528330693,-;vc_sm_dma_buf_release vpu_free done
7,921731,528330700,-;vc_sm_clean_up_dmabuf: buffer 9f498656, imported 4294967295
7,921732,528331129,-;vc_sm_dma_buf_release clean_up dmabuf done
7,921733,528331136,-;[vc_sm_release_resource]: buffer 9f498656 (name sm-host-resource, size 3133440), imported 4294967295
7,921734,528331145,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 9f498656 vc_handle c000002b
7,921735,528331155,-;vc_sm_dma_buf_release done
7,921736,528331175,-;vc_sm_vpu_event: Released addr d2a00000, size 3133440, id 00000025, mem_handle c000002b
7,921737,528331191,-;[vc_sm_release_resource]: buffer 9f498656 (name sm-host-resource, size 3133440), imported 4294967295
7,921738,528331203,-;vc_sm_release_resource: Release our allocation - done
7,921739,528343688,-;vc_sm_cma_free: dmabuf 3af4fc43, buffer c3ea9a80
7,921740,528343885,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf d756d9a7/fd -1
7,921741,528344062,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d6d00000, size 3133440, id 37.
7,921742,528345597,-;vc_sm_cma_free: dmabuf b0254adc, buffer c3f1df00
7,921743,528346155,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 56e2da0e/fd -1
7,921744,528346400,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d5b00000, size 2088960, id 40.
3,921745,528346680,-;vc_sm_cma_vchi_rx_ack: received response 102259, throw away...
7,921746,528346829,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102388)
7,921747,528353952,-;[vc_sm_add_resource]: added buffer 5df0e4a8 (name sm-host-resource, size 2088960)
7,921748,528353972,-;[vc_sm_cma_import_dmabuf_internal]: buffer 5df0e4a8 (name sm-host-resource, size 2088960), imported 4294967295 vc_handle c0000023
7,921749,528353987,-;vc_sm_cma_import_dmabuf: dma_buf 56e2da0e/fd -1 imported to ptr 6db0a007
7,921750,528354112,-;vc_sm_dma_buf_release dmabuf b0254adc, buffer e456977f
7,921751,528354126,-;vc_sm_vpu_free: mem_handle c000002e
7,921752,528354138,-;vc_sm_dma_buf_release vpu_free done
7,921753,528354146,-;vc_sm_clean_up_dmabuf: buffer e456977f, imported 4294967295

In this example the duplicated response is: vc_sm_cma_vchi_rx_ack: received response 102259, throw away...
It seems to appear when two buffers are requested within a short period of time.

@naushir
Collaborator

naushir commented Oct 17, 2024

I think I've managed to reproduce this now, and I see the same vc_sm_cma_vchi_rx_ack: received response 127885, throw away... I have no idea if this message is meant to be fatal, but things seem to die very soon after.

[ 1161.598289] vc_sm_cma_free: handle 00000000363ec871/dmabuf 00000000363ec871
[ 1161.598572] vc_sm_cma_import_dmabuf_internal: importing dma_buf 000000006f5ccf39/fd -1
[ 1161.598732] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d0200000, size 3112960.
[ 1161.598791] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.598810] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.598832] vc_sm_cma_vchi_rx_ack: received response 127885, throw away...
[ 1161.598914] [vc_sm_add_resource]: added buffer 0000000009e567f0 (name sm-host-resource, size 32768)
[ 1161.598924] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.599015] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.599029] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.599036] vc_sm_release_resource: Release our allocation - done
[ 1161.599047] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.599088] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 0000000009e567f0
[ 1161.599097] vc_sm_dma_buf_release vpu_free done
[ 1161.599107] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.599110] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.599115] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000009e567f0
[ 1161.599119] vc_sm_dma_buf_release done
[ 1161.631760] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.631828] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.631977] [vc_sm_add_resource]: added buffer 000000003a9099b5 (name sm-host-resource, size 32768)
[ 1161.631989] vc_sm_cma_import_dmabuf: imported to ptr 00000000c5b2d219
[ 1161.632083] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001d, mem_handle c0000010
[ 1161.632097] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.632105] vc_sm_release_resource: Release our allocation - done
[ 1161.632118] vc_sm_cma_free: handle 00000000c5b2d219/dmabuf 00000000c5b2d219
[ 1161.632165] vc_sm_dma_buf_release dmabuf 00000000c5b2d219, buffer 000000003a9099b5
[ 1161.632176] vc_sm_dma_buf_release vpu_free done
[ 1161.632190] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.632193] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.632197] [vc_sm_release_resource]: Waiting for VPU unmap response on 000000003a9099b5
[ 1161.632203] vc_sm_dma_buf_release done
[ 1161.663490] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.663534] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.663649] [vc_sm_add_resource]: added buffer 0000000072e4c532 (name sm-host-resource, size 32768)
[ 1161.663657] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.663733] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.663744] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.663750] vc_sm_release_resource: Release our allocation - done
[ 1161.663761] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.663794] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 0000000072e4c532
[ 1161.663803] vc_sm_dma_buf_release vpu_free done
[ 1161.663812] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.663816] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.663821] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000072e4c532
[ 1161.663827] vc_sm_dma_buf_release done
[ 1161.696589] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.696636] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.696749] [vc_sm_add_resource]: added buffer 0000000009e567f0 (name sm-host-resource, size 32768)
[ 1161.696756] vc_sm_cma_import_dmabuf: imported to ptr 00000000c5b2d219
[ 1161.696832] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001d, mem_handle c0000010
[ 1161.696843] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.696850] vc_sm_release_resource: Release our allocation - done
[ 1161.696860] vc_sm_cma_free: handle 00000000c5b2d219/dmabuf 00000000c5b2d219
[ 1161.696900] vc_sm_dma_buf_release dmabuf 00000000c5b2d219, buffer 0000000009e567f0
[ 1161.696915] vc_sm_dma_buf_release vpu_free done
[ 1161.696934] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.696936] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.696942] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000009e567f0
[ 1161.696947] vc_sm_dma_buf_release done
[ 1161.729894] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.729936] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.730103] [vc_sm_add_resource]: added buffer 000000003a9099b5 (name sm-host-resource, size 32768)
[ 1161.730115] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.730618] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.730631] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.730638] vc_sm_release_resource: Release our allocation - done
[ 1161.730654] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.730694] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 000000003a9099b5
[ 1161.730708] vc_sm_dma_buf_release vpu_free done
[ 1161.730735] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.730738] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.730744] [vc_sm_release_resource]: Waiting for VPU unmap response on 000000003a9099b5
[ 1161.730749] vc_sm_dma_buf_release done

@6by9
Contributor

6by9 commented Oct 17, 2024

I'm slightly surprised that you're mapping and unmapping as many buffers as that.
Ideally you want the same number of buffers on the output of libcamera as being fed to the encoder, so the buffers get mapped once and then the mappings retained for the duration. Mapping and unmapping every time will have a performance impact.

The dmabuf is meant to be held on to until the VPU has responded saying that it has unmapped it.
The log you're seeing means that we've got a response for a message that isn't expected, and I don't know how you've managed that.

Actually the size of the allocation is only 32kB. That's not an image buffer, so what is it? Lens shading table?

@naushir
Collaborator

naushir commented Oct 18, 2024

const uint32_t MaxLsGridSize = 0x8000;

The LS table is indeed 32k.

By mapping/unmapping do you mean calling mmap/munmap from userland? That only happens once per configure() cycle in the IPA. Perhaps the 32k LS table size is a coincidence and it's actually another buffer? If it was the LS table, I expect we can reproduce this without running the encoder.

@naushir
Collaborator

naushir commented Oct 18, 2024

However, this happens on every frame in the kernel driver ctrl handler:

        case V4L2_CID_USER_BCM2835_ISP_LENS_SHADING:
	{
		struct bcm2835_isp_lens_shading *v4l2_ls;
		struct mmal_parameter_lens_shading_v2 ls;
		struct dma_buf *dmabuf;
		void *vcsm_handle;

		v4l2_ls = (struct bcm2835_isp_lens_shading *)ctrl->p_new.p_u8;
		/*
		 * struct bcm2835_isp_lens_shading and struct
		 * mmal_parameter_lens_shading_v2 match so that we can do a
		 * simple memcpy here.
		 * Only the dmabuf to the actual table needs any manipulation.
		 */
		memcpy(&ls, v4l2_ls, sizeof(ls));

		dmabuf = dma_buf_get(v4l2_ls->dmabuf);
		if (IS_ERR_OR_NULL(dmabuf))
			return -EINVAL;

		ret = vc_sm_cma_import_dmabuf(dmabuf, &vcsm_handle);
		if (ret) {
			dma_buf_put(dmabuf);
			return -EINVAL;
		}

		ls.mem_handle_table = vc_sm_cma_int_handle(vcsm_handle);
		if (ls.mem_handle_table)
			/* The VPU will take a reference on the vcsm handle,
			 * which in turn will retain a reference on the dmabuf.
			 * This code can therefore safely release all
			 * references to the buffer.
			 */
			ret = set_isp_param(node,
					    MMAL_PARAMETER_LENS_SHADING_OVERRIDE,
					    &ls,
					    sizeof(ls));
		else
			ret = -EINVAL;

		vc_sm_cma_free(vcsm_handle);
		dma_buf_put(dmabuf);
		break;
	}

Perhaps this is not right?

@davidplowman
Collaborator

An experiment might be to change "rpi.alsc" in the tuning file to "x.rpi.alsc" (i.e. comment it out), and see if the problem goes away? (BTW, the Python script I was running overnight seems to be OK still.)

@naushir
Collaborator

naushir commented Oct 18, 2024

Perhaps it's better to do something like this in the isp driver: naushir/linux@de2c0b3?

Completely untested!

@naroin

naroin commented Oct 18, 2024

The log you're seeing means that we've got a response for a message that isn't expected, and I don't know how you've managed that.

We do not know either how we've done that, but the unexpected message has a trans_id equal to the expected (lost) message's trans_id minus 128, and 128 matches the depth of the vchiq_slot_info[] circular buffer.
Please note that we added a local variable to get the vc_sm_cma_vchi_import() trans_id, as in our case and in naushir's case two vc_sm_cma_import_dmabuf_internal() calls occur at the same time:

@@ -721,6 +731,7 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
        struct sg_table *sgt = NULL;
        dma_addr_t dma_addr;
        u32 cache_alias;
+       u32 trans_id;
        int ret = 0;
        int status;
 
@@ -783,21 +794,23 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
                 __func__, import.name, import.type, &dma_addr, import.size);
 
        /* Allocate the videocore buffer. */
-       status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result,
-                                      &sm_state->int_trans_id);
+       status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result, &trans_id);
        if (status == -EINTR) {
                pr_debug("[%s]: requesting import memory action restart (trans_id: %u)\n",
-                        __func__, sm_state->int_trans_id);
+                        __func__, trans_id);
                ret = -ERESTARTSYS;
                private->restart_sys = -EINTR;
                private->int_action = VC_SM_MSG_TYPE_IMPORT;
+               private->int_trans_id = trans_id;
                goto error;
        } else if (status || !result.res_handle) {
                pr_debug("[%s]: failed to import memory on videocore (status: %u, trans_id: %u)\n",
-                        __func__, status, sm_state->int_trans_id);
+                        __func__, status, trans_id);
                ret = -ENOMEM;
                goto error;
        }
+       pr_debug("[%s]: requesting import memory (trans_id: %u)\n",
+                __func__, trans_id);
 
        mutex_init(&buffer->lock);
        INIT_LIST_HEAD(&buffer->attachments);

@6by9
Contributor

6by9 commented Oct 18, 2024

Perhaps it's better to do something like this in the isp driver: naushir/linux@de2c0b3?

Completely untested!

Is libcamera resubmitting the same dmabuf every frame?
The VPU is mapping the dmabuf, but I suspect that the set call makes it reprocess the table.

Your patch is close, but we actually want to store/compare the struct dmabuf * from dma_buf_get, otherwise a dup on the fd will give us a new fd for the same dmabuf.

I did wonder if we were running out of buffering in VCHI. With the ISP and encoding going on it is shovelling a fair number of commands around the place.

@naushir
Collaborator

naushir commented Oct 18, 2024

Yes, libcamera uses a single dmabuf handle for the ls table. I'll rework the change on Monday and give it some testing.

@lowflyerUK

lowflyerUK commented Oct 20, 2024

OK, thanks. Had to mess about with some of the file names, but hopefully not with any material effect. It's running now so I'll leave it and see if anything happens. Is there any obvious sign if things go wrong, or do I need to keep an eye on dmesg?

@davidplowman Sorry I missed this. Mine throws this, then one of the threads never gets any more frames.

Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512

I think this script is even quicker at causing it. Yes, sorry about my filenames.
capTest0.txt

@naushir
Collaborator

naushir commented Oct 22, 2024

I've got an updated fix for the LS dmabuf caching. I seem to have improved behavior on my system so far, but would be nice to have it tested by other folks.

The change is here: raspberrypi/linux#6429. Once the CI has run on the commit, you can pull the kernel with

sudo rpi-update pulls/6429

@djhanove

Running this currently on a unit that has frequently been rebooting -- will keep you posted after a few days. I have some stats on number of lockups prior to the kernel patch.

@lowflyerUK

Something has changed! I took the plunge on October 19th and upgraded my remote CM4 to 6.6.51+rpt-rpi-v8 using sudo apt update/sudo apt upgrade. I haven't seen any problems since then! The simplified script ran without failure for a few days then my full scripts ran at 30 frames per second for a few days also without failure. In August the failures came every few hours when running at 20 frames per second.

Things that are different:

Software update
See full /var/log/apt/term.log in updated file - these seem the most likely to be relevant:
raspi-firmware (1:1.20240924-1) over (1:1.20240902-1)
libcamera0.3:arm64 (0.3.2+rpt20240927-1) over (0.3.1+rpt20240906-1)
linux-image-6.6.51+rpt-rpi-v8 (1:6.6.51-1+rpt3) ----- it was 6.6.31-1+rpt1 in August
rpicam-apps-lite (1.5.2-2) over (1.5.1-1)
python3-picamera2 (0.3.22-2) over (0.3.21-1)
term.log

Temperature
Over the summer vcgencmd measure_temp was showing high 70s regularly and occasionally vcgencmd get_throttled showed that it had been throttled. However there were times in August that it failed when it hadn't apparently been throttled. This week vcgencmd measure_temp was showing low 60s and vcgencmd get_throttled stayed at 0x0.

Amount of movement to be processed
We were not at home, so less movement in the garden camera. It is hard to see that this is important as many of the failures earlier were overnight. Not many birds were visiting the nest both earlier and now.

Darker
The garden scene was generally darker (during the day) and the days were shorter. Less video processing?

No one at home
Hard to see how this could influence it. Less WiFi usage (although the CM4 is connected by ethernet and WiFi hostapd was not running)? Less electrical interference (although 5V supply seems fine and never had any undervoltage warnings. No USB devices connected and on-board WiFi but not being used)? Possibly fewer client accesses to webserver and streamed video (although there was never any correlation in August between access and failures)?

I'll let you know if anything changes.

@djhanove

@naushir I've been running the kernel PR on one device that has been problematic. Have not had a thread lockup since.
Have some other devices that are problematic at customer locations, but want to be sure this kernel modification doesn't have unintended consequences elsewhere before rolling out to them...

@Akkiesoft

I'm testing pulls/6429 kernel on my environment (the first two cameras).
It's been almost two days since the Pi started up, and so far the problem hasn't reproduced.
I will continue to check it for about a week. :)

@nzottmann
Author

This looks promising!

I patched the kernel (currently 6.6.54) in my yocto build with the changes from the PR, leaving everything else unchanged. In the last few weeks I had a lockup every day running my complex application, usually after a few hours. With the patch, my test system has been running flawlessly for almost two days. I will continue testing over a longer term.

Thank you!

@lowflyerUK

So my CM4 had the issue again after 3 days. Obviously this is without pulls/6429.

@djhanove

Had a lockup last night on one of the devices with the patched kernel. I patched 2 problematic devices with the PR and definitely seeing improvements.
1 lockup in ~4 days - previously was having lockups 2-4 times per day on that device.

Curious if the buffer_count parameter may impact this...

        video_config = self.picam2.create_video_configuration(
            main={"size": size, "format": "RGB888"},
            lores={"size": (640, 480), "format": "YUV420"},
            controls={"FrameRate": self.fps},
            buffer_count=2,
        )

@nzottmann
Author

nzottmann commented Oct 28, 2024

I am testing on two devices with raspberrypi/linux#6429, which before had a lockup every day, usually after a few hours.

  • Device 1 was still running after almost 6 days, but then had a lockup after 7 days
  • Device 2 had a lockup after 5 days

There is a clear improvement, but something still seems to be going on.

dmesg shows known output:

[464254.288131] vc_sm_cma_vchi_rx_ack: received response 69659464, throw away...
[464480.462015] INFO: task python3:61110 blocked for more than 120 seconds.
[464480.475916]       Tainted: G        WC         6.6.54-v8 #1
[464480.481600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[464480.489547] task:python3         state:D stack:0     pid:61110 ppid:1      flags:0x0000020c
[464480.498050] Call trace:
[464480.500584]  __switch_to+0xe0/0x160
[464480.504188]  __schedule+0x37c/0xd68
[464480.507780]  schedule+0x64/0x108
[464480.511108]  schedule_preempt_disabled+0x2c/0x50
[464480.515826]  __mutex_lock.constprop.0+0x2f4/0x5a8
[464480.520634]  __mutex_lock_slowpath+0x1c/0x30
[464480.524999]  mutex_lock+0x50/0x68
[464480.528410]  v4l2_m2m_fop_poll+0x38/0x80 [v4l2_mem2mem]
[464480.533761]  v4l2_poll+0x54/0xc8 [videodev]
[464480.538119]  do_sys_poll+0x2b0/0x5c8
[464480.541789]  __arm64_sys_ppoll+0xb4/0x148
[464480.545890]  invoke_syscall+0x50/0x128
[464480.549740]  el0_svc_common.constprop.0+0xc8/0xf0
[464480.554540]  do_el0_svc+0x24/0x38
[464480.557946]  el0_svc+0x40/0xe8
[464480.561088]  el0t_64_sync_handler+0x120/0x130
[464480.565532]  el0t_64_sync+0x190/0x198

@djhanove

Seeing the same thing as @nzottmann across 2 devices.

@Akkiesoft

My environment is still working fine without error messages (up 6 days, 16:48). I'll continue to monitor it :)

@naushir
Collaborator

naushir commented Nov 1, 2024

Hard to say if the above lockups are related to the same issue or not. Regardless, I've merged the LS updates to the kernel tree as they do seem to improve things.

@naushir
Collaborator

naushir commented Nov 1, 2024

I would suggest we close this issue down and start a new issue for any further lockups that need investigating.

@Akkiesoft

My environment is still running. Looks good.

@naushir Thank you!

@jwainwright87

I've got an updated fix for the LS dmabuf caching. I seem to have improved behavior on my system so far, but would be nice to have it tested by other folks.

The change is here: raspberrypi/linux#6429. Once the CI has run on the commit, you can pull the kernel with

sudo rpi-update pulls/6429

I'm wanting to pull this update on some of my devices that have the boot loader locked by a private key.

Will this update modify the boot loader at all?

I ran it on one of my other devices with a non-locked boot loader and it reported "Downloading bootloader tools" and "Downloading bootloader images".

@naushir
Collaborator

naushir commented Nov 4, 2024

I'm not sure to be honest. But perhaps you can build the kernel yourself instead of using rpi-update to avoid this?

@naushir
Collaborator

naushir commented Nov 4, 2024

Closing this one down now. We can track other potential lockups in a new issue.

@naushir naushir closed this as completed Nov 4, 2024
@jwainwright87

I've got an updated fix for the LS dmabuf caching. I seem to have improved behavior on my system so far, but would be nice to have it tested by other folks.
The change is here: raspberrypi/linux#6429. Once the CI has run on the commit, you can pull the kernel with

sudo rpi-update pulls/6429

I'm wanting to pull this update on some of my devices that have the boot loader locked by a private key.

Will this update modify the boot loader at all?

I ran it on one of my other devices with a non-locked boot loader and it reported "Downloading bootloader tools" and "Downloading bootloader images".

Okay that's great! Do you know when it will be available in the main image i.e. if I do a clean install of the Raspberry Pi OS via the Pi Imager?
