Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add error checking to command buffer completion handler #521

Merged
merged 6 commits into from
Feb 3, 2025

Conversation

vovw
Copy link
Contributor

@vovw vovw commented Jan 20, 2025

solves #510

Copy link
Contributor

github-actions bot commented Jan 20, 2025

Your PR no longer requires formatting changes. Thank you for your contribution!

Copy link

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 75.32%. Comparing base (52d7056) to head (3c9c87a).
Report is 410 commits behind head on main.

Files with missing lines Patch % Lines
src/compiler/execution.jl 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #521      +/-   ##
==========================================
+ Coverage   71.04%   75.32%   +4.28%     
==========================================
  Files          36       57      +21     
  Lines        1143     2703    +1560     
==========================================
+ Hits          812     2036    +1224     
- Misses        331      667     +336     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metal Benchmarks

Benchmark suite Current: 3c9c87a Previous: 1b811cb Ratio
private array/construct 25461.833333333332 ns 26482.666666666668 ns 0.96
private array/broadcast 462917 ns 457584 ns 1.01
private array/random/randn/Float32 799458.5 ns 857895.5 ns 0.93
private array/random/randn!/Float32 634083 ns 657709 ns 0.96
private array/random/rand!/Int64 567062.5 ns 564458 ns 1.00
private array/random/rand!/Float32 591791 ns 610416 ns 0.97
private array/random/rand/Int64 769520.5 ns 805354 ns 0.96
private array/random/rand/Float32 607271 ns 602791.5 ns 1.01
private array/copyto!/gpu_to_gpu 666833 ns 672437.5 ns 0.99
private array/copyto!/cpu_to_gpu 814833 ns 635271 ns 1.28
private array/copyto!/gpu_to_cpu 613375 ns 815104.5 ns 0.75
private array/accumulate/1d 1327375 ns 1324625 ns 1.00
private array/accumulate/2d 1380978.5 ns 1393666.5 ns 0.99
private array/iteration/findall/int 2109104.5 ns 2070667 ns 1.02
private array/iteration/findall/bool 1803854 ns 1827208 ns 0.99
private array/iteration/findfirst/int 1682875 ns 1688333.5 ns 1.00
private array/iteration/findfirst/bool 1665584 ns 1671750 ns 1.00
private array/iteration/scalar 3796333 ns 3643750 ns 1.04
private array/iteration/logical 3186292 ns 3166208 ns 1.01
private array/iteration/findmin/1d 1755604 ns 1758542 ns 1.00
private array/iteration/findmin/2d 1344792 ns 1357792 ns 0.99
private array/reductions/reduce/1d 1026604.5 ns 1043270.5 ns 0.98
private array/reductions/reduce/2d 659417 ns 663333 ns 0.99
private array/reductions/mapreduce/1d 1026395.5 ns 1043084 ns 0.98
private array/reductions/mapreduce/2d 662709 ns 665896 ns 1.00
private array/permutedims/4d 2528334 ns 2529437.5 ns 1.00
private array/permutedims/2d 1016208 ns 1024875 ns 0.99
private array/permutedims/3d 1575021 ns 1585208 ns 0.99
private array/copy 590749.5 ns 592958 ns 1.00
latency/precompile 5911672333 ns 8799224667 ns 0.67
latency/ttfp 3460967895.5 ns 3600655125 ns 0.96
latency/import 1125913500 ns 1231127083 ns 0.91
integration/metaldevrt 710167 ns 701416 ns 1.01
integration/byval/slices=1 1564562.5 ns 1566354 ns 1.00
integration/byval/slices=3 8508104 ns 10376042 ns 0.82
integration/byval/reference 1557000 ns 1610875 ns 0.97
integration/byval/slices=2 2533167 ns 2715041 ns 0.93
kernel/indexing 482750 ns 474291.5 ns 1.02
kernel/indexing_checked 481667 ns 475895.5 ns 1.01
kernel/launch 37073 ns 8208 ns 4.52
metal/synchronization/stream 14500 ns 15041 ns 0.96
metal/synchronization/context 14667 ns 15000 ns 0.98
shared array/construct 24409.666666666668 ns 24145.833333333332 ns 1.01
shared array/broadcast 460375 ns 461145.5 ns 1.00
shared array/random/randn/Float32 812875.5 ns 813750 ns 1.00
shared array/random/randn!/Float32 635958 ns 636542 ns 1.00
shared array/random/rand!/Int64 562771 ns 568417 ns 0.99
shared array/random/rand!/Float32 595292 ns 603334 ns 0.99
shared array/random/rand/Int64 795542 ns 778417 ns 1.02
shared array/random/rand/Float32 605042 ns 616250 ns 0.98
shared array/copyto!/gpu_to_gpu 81125 ns 84667 ns 0.96
shared array/copyto!/cpu_to_gpu 82958 ns 83375 ns 0.99
shared array/copyto!/gpu_to_cpu 81958 ns 84625 ns 0.97
shared array/accumulate/1d 1346708 ns 1346167 ns 1.00
shared array/accumulate/2d 1390833 ns 1397291.5 ns 1.00
shared array/iteration/findall/int 1807667 ns 1800250 ns 1.00
shared array/iteration/findall/bool 1592292 ns 1590437.5 ns 1.00
shared array/iteration/findfirst/int 1381042 ns 1408125 ns 0.98
shared array/iteration/findfirst/bool 1366334 ns 1364584 ns 1.00
shared array/iteration/scalar 156375 ns 153583 ns 1.02
shared array/iteration/logical 2954791.5 ns 2981833.5 ns 0.99
shared array/iteration/findmin/1d 1466375 ns 1467625 ns 1.00
shared array/iteration/findmin/2d 1365375 ns 1369416 ns 1.00
shared array/reductions/reduce/1d 727958 ns 738166.5 ns 0.99
shared array/reductions/reduce/2d 667750 ns 663208.5 ns 1.01
shared array/reductions/mapreduce/1d 738083 ns 746583 ns 0.99
shared array/reductions/mapreduce/2d 658333 ns 670917 ns 0.98
shared array/permutedims/4d 2551125 ns 2524958.5 ns 1.01
shared array/permutedims/2d 1017458 ns 1028125 ns 0.99
shared array/permutedims/3d 1583584 ns 1588833 ns 1.00
shared array/copy 246625 ns 239042 ns 1.03

This comment was automatically generated by workflow using github-action-benchmark.

@christiangnrd
Copy link
Contributor

Thank you! Once you add tests and fix formatting this should be good to merge.

@vovw
Copy link
Contributor Author

vovw commented Jan 25, 2025

i cant figure out how to get tests working, any ideas how do i solve the err?

@tgymnich
Copy link
Member

I believe your illegal load gets optimized away and thus never triggers the error.
The problem with making it observable by for example adding a store is that the kernel then hangs which is a known problem with Metal.jl.

@vovw
Copy link
Contributor Author

vovw commented Jan 25, 2025

hmm what can i try to do then?

@christiangnrd
Copy link
Contributor

Have you looked at #416 and the links to the resources in #510? They might point you in the right direction.

As for the kernel, you could write one that tries to set the value in clearly out of bounds memory (like x[i+100000] or something). That way you'll be certain that the kernel isn't the issue if errors aren't showing up.

@maleadt
Copy link
Member

maleadt commented Jan 27, 2025

This will need some thinking; apparently we can't switch tasks on the callback threads or risk a deadlock.

Main thread, waits until everything is done:

frame #0: 0x000000018ed0e6ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ed4c894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
frame #2: 0x0000000199c0b19c Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
frame #3: 0x000000010331c074
frame #4: 0x000000010068d360 libjulia-internal.1.11.3.dylib`eval_stmt_value(stmt=<unavailable>, s=<unavailable>) at interpreter.c:174:23 [opt]
frame #5: 0x000000010068c018 libjulia-internal.1.11.3.dylib`eval_body(stmts=<unavailable>, s=0x000000016fdfd430, ip=<unavailable>, toplevel=<unavailable>) at interpreter.c:663:21 [opt]
frame #6: 0x000000010068c634 libjulia-internal.1.11.3.dylib`jl_interpret_toplevel_thunk(m=0x000000012e7c7180, src=<unavailable>) at interpreter.c:821:21 [opt]
frame #7: 0x00000001006a5c10 libjulia-internal.1.11.3.dylib`jl_toplevel_eval_flex(m=0x000000012e7c7180, e=0x000000010b807a50, fast=1, expanded=<unavailable>) at toplevel.c:943:18 [opt]
frame #8: 0x00000001006a5990 libjulia-internal.1.11.3.dylib`jl_toplevel_eval_flex(m=0x000000012e7c7180, e=0x000000010b807c50, fast=1, expanded=<unavailable>) at toplevel.c:886:19 [opt]
frame #9: 0x00000001006a6924 libjulia-internal.1.11.3.dylib`ijl_toplevel_eval_in [inlined] ijl_toplevel_eval(m=0x000000012e7c7180, v=0x000000010b807c50) at toplevel.c:952:12 [opt]
frame #10: 0x00000001006a6918 libjulia-internal.1.11.3.dylib`ijl_toplevel_eval_in(m=0x000000012e7c7180, ex=0x000000010b807c50) at toplevel.c:994:13 [opt]
frame #11: 0x000000012ad3b620 sys.dylib`japi1_include_string_72509.1 at boot.jl:430
frame #12: 0x000000012a99be94 sys.dylib`japi1__include_72518.1 at loading.jl:2794
frame #13: 0x000000012b399488 sys.dylib`julia_include_47041.1 at Base.jl:557
frame #14: 0x000000012ac7b16c sys.dylib`jfptr_include_47042.1 + 60
frame #15: 0x000000012b2fed74 sys.dylib`julia_exec_options_73759.1 at client.jl:323
frame #16: 0x000000012ac02368 sys.dylib`julia__start_73899.1 at client.jl:531
frame #17: 0x000000012af915dc sys.dylib`jfptr__start_73900.1 + 44
frame #18: 0x00000001006d73bc libjulia-internal.1.11.3.dylib`true_main [inlined] jl_apply(args=0x000000016fdfe560, nargs=1) at julia.h:2157:12 [opt]
frame #19: 0x00000001006d73b0 libjulia-internal.1.11.3.dylib`true_main(argc=<unavailable>, argv=<unavailable>) at jlapi.c:900:29 [opt]
frame #20: 0x00000001006d72c8 libjulia-internal.1.11.3.dylib`jl_repl_entrypoint(argc=<unavailable>, argv=<unavailable>) at jlapi.c:1059:15 [opt]
frame #21: 0x0000000100003f6c julia`main + 12
frame #22: 0x000000018e9cc274 dyld`start + 2840

Callback thread, triggers a task switch by doing print in the callback:

frame #0: 0x000000018ed0e6ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ed4c894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
frame #2: 0x000000010074d678 libjulia-internal.1.11.3.dylib`uv_cond_wait(cond=0x0000000121073e28, mutex=0x0000000121073de8) at thread.c:806:7
frame #3: 0x00000001006bf200 libjulia-internal.1.11.3.dylib`ijl_task_get_next(trypoptask=0x000000012b9a1890, q=0x00000001726f4d90, checkempty=0x000000012bbd4190) at scheduler.c:584:21 [opt]
frame #4: 0x000000012ac43e94 sys.dylib`julia_poptask_66995.1 at task.jl:1012
frame #5: 0x000000012af9ebc0 sys.dylib`julia_wait_66501.1 at task.jl:1021
frame #6: 0x000000016a9a8130
frame #7: 0x000000016a98c12c
frame #8: 0x0000000199c06d7c Metal`-[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:] + 608
frame #9: 0x00000001afd213fc IOGPU`-[IOGPUMetalCommandBuffer didCompleteWithStartTime:endTime:error:] + 216
frame #10: 0x0000000199c069c0 Metal`-[_MTLCommandQueue commandBufferDidComplete:startTime:completionTime:error:] + 108
frame #11: 0x00000001afd2e500 IOGPU`IOGPUNotificationQueueDispatchAvailableCompletionNotifications + 136
frame #12: 0x00000001afd2e610 IOGPU`__IOGPUNotificationQueueSetDispatchQueue_block_invoke + 64
frame #13: 0x000000018eb99674 libdispatch.dylib`_dispatch_client_callout4 + 20
frame #14: 0x000000018ebb5c88 libdispatch.dylib`_dispatch_mach_msg_invoke + 464
frame #15: 0x000000018eba0a38 libdispatch.dylib`_dispatch_lane_serial_drain + 352
frame #16: 0x000000018ebb69dc libdispatch.dylib`_dispatch_mach_invoke + 456
frame #17: 0x000000018eba0a38 libdispatch.dylib`_dispatch_lane_serial_drain + 352
frame #18: 0x000000018eba1764 libdispatch.dylib`_dispatch_lane_invoke + 432
frame #19: 0x000000018eba0a38 libdispatch.dylib`_dispatch_lane_serial_drain + 352
frame #20: 0x000000018eba1730 libdispatch.dylib`_dispatch_lane_invoke + 380
frame #21: 0x000000018ebac9a0 libdispatch.dylib`_dispatch_root_queue_drain_deferred_wlh + 288
frame #22: 0x000000018ebac1ec libdispatch.dylib`_dispatch_workloop_worker_thread + 540
frame #23: 0x000000018ed483d8 libsystem_pthread.dylib`_pthread_wqthread + 288

JuliaInterop/ObjectiveC.jl#43 made it possible to allocate, but I guess task switches need additional work.

@maleadt
Copy link
Member

maleadt commented Jan 27, 2025

Also, even switching this to a proper null pointer load doesn't trigger a command buffer error:

@testset "error handling" begin
    function failing_kernel(arr, ptr)
        x = unsafe_load(reinterpret(Ptr{Int}, ptr))
        @inbounds arr[1] = x
        return nothing
    end

    @test_logs (:error, r"GPU kernel execution failed") @metal failing_kernel(mtl([0], 0)
end

I would have expected this to trigger a pageFault error: https://developer.apple.com/documentation/metal/mtlcommandbuffererror-swift.struct/code/pagefault

@vovw
Copy link
Contributor Author

vovw commented Jan 27, 2025

i experimented with aggressive OOB accesses, pointer dereferences via unsafe_load, Forcing observable side effects to prevent optimization. Nothing seems to work

I'm happy to implement to test any other specific approaches you recommend ?

@maleadt
Copy link
Member

maleadt commented Jan 27, 2025

I'm not sure of any test that triggers an error here, but regardless it's probably good to have the reporting in place. Because of the task switching issue mentioned above, we'll have to stick to Core.println though.

@vovw
Copy link
Contributor Author

vovw commented Jan 28, 2025

Agreed, it's better to have the reporting in place as a safeguard. And yes, sticking to Core.println makes sense given the task switching issue.
Let's proceed with that approach.

@christiangnrd christiangnrd linked an issue Jan 29, 2025 that may be closed by this pull request
@vovw
Copy link
Contributor Author

vovw commented Jan 29, 2025

what could be the next move ?

@maleadt
Copy link
Member

maleadt commented Feb 3, 2025

Let's simplify for now and create an issue to track proper error reporting.

@maleadt maleadt merged commit ca092c8 into JuliaGPU:main Feb 3, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle MTLCommandBuffer Error Logs
4 participants