gpu related crashes with kernel >= 6.9.7 #309
Comments
There isn't much of a change between asahi-6.9.5-1 and asahi-6.9.6-1, and I don't see any relevant changes. It looks like there is an issue with handling failing |
I was running into the same (or similar)
|
asahi-6.9.7-1 contains @asahilina's GPUVM changes so a regression caused by that is at least possible |
Hi, I see that @mkurz runs a MacBook Pro, so for information: I got this issue on an M2 Air. It is totally random but happens several times per day. |
This is in drm/sched so it's less likely to be GPUVM related...
This is an impossible condition, since the job credit count is always 1 and the credit limit is 1280 or something like that. So I think there is some kind of memory corruption... |
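To make the "impossible condition" concrete, here is a minimal Python model of a credit check of this kind. It is only a sketch: the class and field names are made up for illustration, the limit of 1280 is the approximate value quoted above, and the warn-and-clamp behaviour is an assumption about how the scheduler reacts to an oversized job, not a copy of the kernel code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Per the comment above, every job submitted by this driver costs 1 credit.
    credits: int = 1


@dataclass
class Scheduler:
    credit_limit: int = 1280  # approximate value quoted in the thread
    credit_count: int = 0     # credits held by jobs currently in flight
    warnings: int = 0         # how often the "impossible" branch was hit

    def available_credits(self) -> int:
        return self.credit_limit - self.credit_count

    def can_queue(self, job: Optional[Job]) -> bool:
        if job is None:
            return False
        if job.credits > self.credit_limit:
            # Assumed behaviour: warn, then clamp. Note that clamping
            # WRITES to the job, so a stale pointer gets scribbled on.
            self.warnings += 1
            job.credits = self.credit_limit
        return self.available_credits() >= job.credits


sched = Scheduler()
healthy = Job()
assert sched.can_queue(healthy) and sched.warnings == 0

# Simulate the job's memory having been overwritten by unrelated data.
clobbered = Job(credits=0xDEADBEEF)
sched.can_queue(clobbered)
```

With honest inputs the warning branch is unreachable, so hitting it suggests the job memory was overwritten. The clamp then writes into that same memory, which would fit a follow-up crash somewhere unrelated such as the allocator.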
The realloc crash has some interesting strings...
This string is not from the kernel... @oliverbestmann, do you have any idea where this came from? |
Also are we sure this is reproducible with v6.9.6 in at least some cases? Because then it can't be the GPUVM stuff... |
If it's reproducible with asahi-6.9.6-1 there's no obvious change which would explain why it's not in asahi-6.9.5-1 as well. Nothing in |
Are these kernels built with clang/llvm by any chance? So far everyone reporting this is on something other than Fedora, and Ella specifically pointed this out on Discord:
|
@cyrinux please describe which systems you use. Do you use Fedora-Asahi-Remix? @mkurz / @oliverbestmann do you use LLVM or gcc to build the kernel? |
I use nixos unstable with https://github.com/tpwrules/nixos-apple-silicon/ overlay. 😸
|
My kernel is built with GCC.
|
Please also report your Mesa versions, and the Rust version used for the kernel compile too. At this point I'm pretty sure this is random memory corruption, but none of us on Fedora can reproduce it so far... |
|
I am running Arch Linux ARM with all packages up to date, thanks to @joske's pull requests: https://github.com/AsahiLinux/PKGBUILDs/pulls/joske You find the
So for me this happened when going from 6.9.6-1 to 6.9.7-1 |
btw. after upgrading llvm/clang I had to re-compile mesa. |
I'm bisecting configs and running into some scary mm-related crashes that have nothing to do with the GPU. I think there is some horrible regression here that affects some kernel configs... Everyone, please post the value of these kernel configs:
For reference, on Fedora we have:
|
From https://github.com/joske/PKGBUILDs/blob/kernel/linux-asahi/config:
Both are the same when building 6.9.6-1 or 6.9.7-1. The only difference in the config between the two kernels is: joske/PKGBUILDs@14913f3#diff-3a3fd6cbc5653e937609572c62143e181842a4a1ebdc1b55e9e2e34e6aa6c5fc |
I just ran into this, also using https://github.com/tpwrules/nixos-apple-silicon/tree/6015c1e2f91896e0b7a983c2824c665af32f568a
|
Sorry, I really need a consistent way to reproduce this to track it down. So far I've been unable to repro the
The crash itself makes no sense. It's memory corruption, where the drm_sched job gets clobbered with something else, and then somehow consistently after that the changes made by drm_sched directly cause a crash in the allocator, in what has to be a subsequent ioctl call because the drm_sched stuff is the last thing the ioctl does. That it's somehow this consistent is very, very strange. I would have expected heap corruption to manifest in more varied ways after the fact.
The actual lifetimes of the allocations involved are extremely simple, so I'm 99% sure this isn't a silly lifetime problem in my code (at least not as it relates to the specific structures referenced in the crashes). The code in both the
I tried running the same kernel under kASAN and came up with nothing. I also tried Ella's config with kASAN, still nothing.
Best guess is there is a spurious page being freed or something like that, so memory is reused while it is still in use. I actually already ran into one of these before (fixed in 2bb1499) which would perfectly explain this kind of behavior, except for the fact that that particular one only happened on DART pagetable freeing which only really happens when unbinding drivers (which is why we didn't notice for so long). If there is a similar bug lurking somewhere else, but it only happens sometimes, then that might explain this and the other badness.
Edit: The 52-bit VA thing is unrelated unfortunately. |
Sorry, seems like I am a bit late now, probably nothing new, but still: |
Unfortunately, I just confirmed that the 52-bit problem is completely unrelated. Upstream Linux is just broken with the combination of LPA2 (52-bit support), 16K pages, and non-LPA2 hardware. Please don't build with 52-bit support. So now we're back to square one... I have no idea how to repro the GPU issue ;; |
This implies that it is working fine for you on a macbook pro m1 with wayland and gnome? Running chromium also works? What information would be helpful to you? |
That's the first time I hear gnome is involved, and also nobody mentioned chromium until your previous post ^^;; (the OP does in fact mention the process name is chromium in the oops log, but I missed that bit...) The more info about the setup I get the better, and if you can try more workloads (for example, webgl tests and other browsery things) and see if you can find something that reproduces it fast that would be very useful... Right now I'm testing chromium on an M2 Pro Mac Mini and a bunch of maps and webGL stuff doesn't seem to cause any issues, but this is on Fedora. If there's something about the userspace build that matters here, maybe I need to install another distro... |
You are right, I only mentioned wayland and gnome in the issue tpwrules/nixos-apple-silicon#218, I am sorry for that. I just checked my previous boot logs to find everything I can. Here is a different stack trace. This one does not contain the warning about a kernel paging request:
but then a few minutes later:
Then I have one from 6.9.7:
I got this warning from chromium in the log 3260 times:
It looks like it is not only chromium; here I have one crash in Xwayland on 6.9.7:
Running a video conference on zoom triggered the freeze the fastest for me: it takes only a few minutes for the system to freeze. -- Regarding the build: You could probably just follow the installation instructions here to get the exact same kernel build, chromium, wayland + gnome (well, at least that's what nix promises you): https://github.com/tpwrules/nixos-apple-silicon/blob/main/docs/uefi-standalone.md |
Can also confirm stability with 50+ hours uptime. Love your hard work on this project. No plans on going back to macOS. 🙂 |
Also testing asahi-6.9.9-7 on ALARM and so far looks good. Thanks! |
Actually the title should be changed from `gpu related crashes with kernel >= 6.9.6` to `gpu related crashes with kernel >= 6.9.7` IMHO |
The bug actually affects all of 6.9.x and probably a few earlier versions too; it's just a coincidence that it apparently only manifested starting with 6.9.7. |
I've renamed it anyways, as it was a pretty consistent coincidence. |
looks like the regression was introduced in:
I don't see any upstream users of |
@robclark nouveau is using variable credits, just not |
hmm, if nouveau is using that, it makes it more complicated to revert. But that patch is fatally flawed, the whole point of a single-producer-single-consumer queue is that you have just a single producer and single consumer. That patch violates this rule. |
I suspect the correct fix is to remove the Edit: In fact this was already proposed here but for some reason Luben never implemented the proposed simplified |
I've only looked briefly at the credit patches, but the call in |
Fixes a race condition reported here: AsahiLinux#309 (comment)
The whole premise of lockless access to a single-producer-single-consumer queue is that there is just a single producer and single consumer. That means we can't call drm_sched_can_queue() (which is about queueing more work to the hw, not to the spsc queue) from anywhere other than the consumer (wq).
This call in the producer is just an optimization to avoid scheduling the consuming worker if it cannot yet queue more work to the hw. It is safe to drop this optimization to avoid the race condition.
Suggested-by: Asahi Lina <[email protected]>
Fixes: a78422e ("drm/sched: implement dynamic job-flow control")
Signed-off-by: Rob Clark <[email protected]>
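The shape of the fix described in this commit message can be sketched with a toy single-threaded Python model. Everything here is hypothetical (the `Sched` and `Entity` classes, the `deque` standing in for the lockless spsc queue, the `wakeups` counter). It only illustrates the rule that the producer pushes and wakes the worker unconditionally, while the consumer is the only code that ever looks at the queue head.

```python
from collections import deque


class Entity:
    """Toy stand-in for a scheduler entity that owns a job queue."""

    def __init__(self):
        self.queue = deque()  # models the spsc queue; entries are job credits


class Sched:
    def __init__(self, credit_limit):
        self.credit_limit = credit_limit
        self.credit_count = 0  # credits held by jobs running on the "hw"
        self.wakeups = 0       # how often the worker was scheduled
        self.ran = []          # credits of jobs handed to the "hw", in order

    # Producer side: push, then wake the worker unconditionally.
    def push_job(self, entity, job_credits):
        entity.queue.append(job_credits)
        self.wakeups += 1  # deliberately no peek at the queue head here

    # Consumer side: the only place that peeks at or pops the queue head.
    def worker(self, entity):
        while entity.queue:
            head = entity.queue[0]
            if self.credit_limit - self.credit_count < head:
                break  # hw is full; try again after a job completes
            self.credit_count += entity.queue.popleft()
            self.ran.append(head)

    def job_done(self, job_credits):
        self.credit_count -= job_credits


sched = Sched(credit_limit=2)
entity = Entity()
for _ in range(3):
    sched.push_job(entity, 1)
sched.worker(entity)
```

The cost of dropping the optimization is visible in `wakeups`: the worker is scheduled once per push, even when it cannot queue anything more yet. In exchange, the producer no longer dereferences a queue head that the consumer may already have popped and freed.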
commit 440d52b370b03b366fd26ace36bab20552116145 upstream.
Fixes a race condition reported here: AsahiLinux/linux#309 (comment)
The whole premise of lockless access to a single-producer-single-consumer queue is that there is just a single producer and single consumer. That means we can't call drm_sched_can_queue() (which is about queueing more work to the hw, not to the spsc queue) from anywhere other than the consumer (wq).
This call in the producer is just an optimization to avoid scheduling the consuming worker if it cannot yet queue more work to the hw. It is safe to drop this optimization to avoid the race condition.
Suggested-by: Asahi Lina <[email protected]>
Fixes: a78422e ("drm/sched: implement dynamic job-flow control")
Closes: AsahiLinux/linux#309
Cc: [email protected]
Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Danilo Krummrich <[email protected]>
Tested-by: Janne Grunau <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Since updating from 6.9.5 to 6.9.6 (and 6.9.9) I get random gpu/drm related crashes after a few minutes of usage.
Going back to 6.9.5 brings back a stable system.