[0.10] Forbid divergent execution of work-group barriers #558

Merged Feb 17, 2025 (1 commit)


vchuravy
Member

As noted by @maleadt in JuliaGPU/OpenCL.jl#283 (comment)

Several backends strictly require that barriers such as @synchronize be executed convergently, that is, reached by every work-item in the work-group,
and the automatic bounds check that KA inserts around the kernel body violates that.

Before, GPU kernels lowered to:

if __validindex(__ctx__)
   # A
   @synchronize
   # B
end

Now they lower to:

__active_lane__ = __validindex(__ctx__)
if __active_lane__
   # A
end
@synchronize
if __active_lane__
   # B
end
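
To make the difference between the two forms concrete, here is a toy model in plain Julia (no GPU and no KernelAbstractions needed). It counts how many lanes of one work-group reach the barrier when the ndrange does not fill the group. The function `barrier_arrivals` and its arguments are made up for illustration; real barriers are implemented by the backend or the hardware.

```julia
# Toy model: count the lanes of one work-group that reach the barrier.
# `groupsize` lanes are launched, but only the first `nvalid` pass the bounds check.
function barrier_arrivals(groupsize::Int, nvalid::Int; barrier_inside_guard::Bool)
    arrivals = 0
    for lane in 1:groupsize
        active = lane <= nvalid          # stands in for __validindex(__ctx__)
        if barrier_inside_guard
            # old lowering: the barrier sits inside `if __validindex(__ctx__)`
            active && (arrivals += 1)
        else
            # new lowering: every lane reaches the barrier, active or not
            arrivals += 1
        end
    end
    return arrivals
end

arrived_old = barrier_arrivals(4, 3; barrier_inside_guard = true)
arrived_new = barrier_arrivals(4, 3; barrier_inside_guard = false)
println("old lowering: ", arrived_old, " of 4 lanes reach the barrier")
println("new lowering: ", arrived_new, " of 4 lanes reach the barrier")
```

With the old form, one lane of the group never reaches the barrier, which is exactly the divergent execution that strict backends do not allow. With the new form, all lanes arrive and only the work around the barrier is guarded.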

@vchuravy
Member Author

julia> @macroexpand @kernel cpu=false function f()
           @print "A"
           @synchronize
           @print "B"
       end
quote
    function gpu_f(__ctx__; )
        let
            $(Expr(:aliasscope))
            __active_lane__ = (KernelAbstractions.__validindex)(__ctx__)
            if __active_lane__
                #= REPL[7]:1 =#
                #= REPL[7]:2 =#
                begin
                    #= /home/vchuravy/src/KernelAbstractions/src/KernelAbstractions.jl:380 =#
                    (KernelAbstractions.__print)(Val{:A}())
                end
                #= REPL[7]:3 =#
            end
            begin
                #= /home/vchuravy/src/KernelAbstractions/src/KernelAbstractions.jl:293 =#
                (KernelAbstractions.__synchronize)()
            end
            if __active_lane__
                #= REPL[7]:4 =#
                begin
                    #= /home/vchuravy/src/KernelAbstractions/src/KernelAbstractions.jl:380 =#
                    (KernelAbstractions.__print)(Val{:B}())
                end
            end
            $(Expr(:popaliasscope))
            return nothing
        end
    end
end

Contributor

github-actions bot commented Jan 28, 2025

Benchmark Results

| Benchmark | main | 1163b32... | main/1163b3252b24af... |
|---|---|---|---|
| saxpy/default/Float16/1024 | 0.0437 ± 0.025 ms | 0.0424 ± 0.026 ms | 1.03 |
| saxpy/default/Float16/1048576 | 0.897 ± 0.027 ms | 0.897 ± 0.029 ms | 1 |
| saxpy/default/Float16/16384 | 0.0654 ± 0.029 ms | 0.0593 ± 0.028 ms | 1.1 |
| saxpy/default/Float16/2048 | 0.0501 ± 0.023 ms | 0.0446 ± 0.018 ms | 1.12 |
| saxpy/default/Float16/256 | 0.0626 ± 0.027 ms | 0.0447 ± 0.028 ms | 1.4 |
| saxpy/default/Float16/262144 | 0.274 ± 0.025 ms | 0.269 ± 0.026 ms | 1.02 |
| saxpy/default/Float16/32768 | 0.0785 ± 0.03 ms | 0.0719 ± 0.029 ms | 1.09 |
| saxpy/default/Float16/4096 | 0.0644 ± 0.027 ms | 0.0492 ± 0.025 ms | 1.31 |
| saxpy/default/Float16/512 | 0.0532 ± 0.027 ms | 0.0414 ± 0.027 ms | 1.29 |
| saxpy/default/Float16/64 | 0.0599 ± 0.027 ms | 0.0425 ± 0.029 ms | 1.41 |
| saxpy/default/Float16/65536 | 0.111 ± 0.029 ms | 0.0996 ± 0.029 ms | 1.12 |
| saxpy/default/Float32/1024 | 0.0441 ± 0.026 ms | 0.0414 ± 0.027 ms | 1.07 |
| saxpy/default/Float32/1048576 | 0.49 ± 0.027 ms | 0.482 ± 0.053 ms | 1.02 |
| saxpy/default/Float32/16384 | 0.0537 ± 0.028 ms | 0.0518 ± 0.025 ms | 1.04 |
| saxpy/default/Float32/2048 | 0.0461 ± 0.025 ms | 0.0427 ± 0.02 ms | 1.08 |
| saxpy/default/Float32/256 | 0.0614 ± 0.028 ms | 0.0392 ± 0.028 ms | 1.57 |
| saxpy/default/Float32/262144 | 0.169 ± 0.035 ms | 0.151 ± 0.036 ms | 1.12 |
| saxpy/default/Float32/32768 | 0.0594 ± 0.029 ms | 0.0585 ± 0.028 ms | 1.02 |
| saxpy/default/Float32/4096 | 0.0502 ± 0.026 ms | 0.0456 ± 0.024 ms | 1.1 |
| saxpy/default/Float32/512 | 0.0622 ± 0.028 ms | 0.0439 ± 0.029 ms | 1.41 |
| saxpy/default/Float32/64 | 0.0624 ± 0.028 ms | 0.0418 ± 0.029 ms | 1.49 |
| saxpy/default/Float32/65536 | 0.0824 ± 0.03 ms | 0.0716 ± 0.03 ms | 1.15 |
| saxpy/default/Float64/1024 | 0.0432 ± 0.026 ms | 0.0418 ± 0.027 ms | 1.04 |
| saxpy/default/Float64/1048576 | 0.508 ± 0.041 ms | 0.523 ± 0.056 ms | 0.971 |
| saxpy/default/Float64/16384 | 0.0563 ± 0.028 ms | 0.0523 ± 0.026 ms | 1.08 |
| saxpy/default/Float64/2048 | 0.0445 ± 0.024 ms | 0.0431 ± 0.021 ms | 1.03 |
| saxpy/default/Float64/256 | 0.0621 ± 0.028 ms | 0.0476 ± 0.029 ms | 1.31 |
| saxpy/default/Float64/262144 | 0.177 ± 0.027 ms | 0.168 ± 0.033 ms | 1.05 |
| saxpy/default/Float64/32768 | 0.0657 ± 0.028 ms | 0.061 ± 0.027 ms | 1.08 |
| saxpy/default/Float64/4096 | 0.05 ± 0.026 ms | 0.0461 ± 0.024 ms | 1.09 |
| saxpy/default/Float64/512 | 0.0616 ± 0.028 ms | 0.0425 ± 0.028 ms | 1.45 |
| saxpy/default/Float64/64 | 0.0617 ± 0.028 ms | 0.043 ± 0.029 ms | 1.43 |
| saxpy/default/Float64/65536 | 0.0923 ± 0.028 ms | 0.0794 ± 0.028 ms | 1.16 |
| saxpy/static workgroup=(1024,)/Float16/1024 | 0.0431 ± 0.026 ms | 0.0413 ± 0.026 ms | 1.04 |
| saxpy/static workgroup=(1024,)/Float16/1048576 | 0.906 ± 0.025 ms | 0.9 ± 0.028 ms | 1.01 |
| saxpy/static workgroup=(1024,)/Float16/16384 | 0.0604 ± 0.027 ms | 0.056 ± 0.026 ms | 1.08 |
| saxpy/static workgroup=(1024,)/Float16/2048 | 0.0494 ± 0.023 ms | 0.0461 ± 0.022 ms | 1.07 |
| saxpy/static workgroup=(1024,)/Float16/256 | 0.0604 ± 0.027 ms | 0.0408 ± 0.027 ms | 1.48 |
| saxpy/static workgroup=(1024,)/Float16/262144 | 0.273 ± 0.028 ms | 0.267 ± 0.027 ms | 1.02 |
| saxpy/static workgroup=(1024,)/Float16/32768 | 0.0755 ± 0.028 ms | 0.0707 ± 0.027 ms | 1.07 |
| saxpy/static workgroup=(1024,)/Float16/4096 | 0.0499 ± 0.027 ms | 0.0458 ± 0.027 ms | 1.09 |
| saxpy/static workgroup=(1024,)/Float16/512 | 0.0527 ± 0.026 ms | 0.0408 ± 0.026 ms | 1.29 |
| saxpy/static workgroup=(1024,)/Float16/64 | 0.0626 ± 0.026 ms | 0.0407 ± 0.027 ms | 1.54 |
| saxpy/static workgroup=(1024,)/Float16/65536 | 0.109 ± 0.028 ms | 0.0975 ± 0.028 ms | 1.12 |
| saxpy/static workgroup=(1024,)/Float32/1024 | 0.0448 ± 0.026 ms | 0.0398 ± 0.027 ms | 1.13 |
| saxpy/static workgroup=(1024,)/Float32/1048576 | 0.476 ± 0.032 ms | 0.46 ± 0.056 ms | 1.03 |
| saxpy/static workgroup=(1024,)/Float32/16384 | 0.0521 ± 0.027 ms | 0.049 ± 0.025 ms | 1.06 |
| saxpy/static workgroup=(1024,)/Float32/2048 | 0.0446 ± 0.023 ms | 0.0411 ± 0.017 ms | 1.08 |
| saxpy/static workgroup=(1024,)/Float32/256 | 0.0614 ± 0.026 ms | 0.0442 ± 0.027 ms | 1.39 |
| saxpy/static workgroup=(1024,)/Float32/262144 | 0.164 ± 0.035 ms | 0.149 ± 0.036 ms | 1.1 |
| saxpy/static workgroup=(1024,)/Float32/32768 | 0.0577 ± 0.028 ms | 0.0544 ± 0.027 ms | 1.06 |
| saxpy/static workgroup=(1024,)/Float32/4096 | 0.0486 ± 0.027 ms | 0.0433 ± 0.026 ms | 1.12 |
| saxpy/static workgroup=(1024,)/Float32/512 | 0.061 ± 0.026 ms | 0.0413 ± 0.026 ms | 1.48 |
| saxpy/static workgroup=(1024,)/Float32/64 | 0.0621 ± 0.025 ms | 0.044 ± 0.026 ms | 1.41 |
| saxpy/static workgroup=(1024,)/Float32/65536 | 0.0749 ± 0.03 ms | 0.0679 ± 0.029 ms | 1.1 |
| saxpy/static workgroup=(1024,)/Float64/1024 | 0.0412 ± 0.026 ms | 0.039 ± 0.027 ms | 1.06 |
| saxpy/static workgroup=(1024,)/Float64/1048576 | 0.506 ± 0.041 ms | 0.5 ± 0.04 ms | 1.01 |
| saxpy/static workgroup=(1024,)/Float64/16384 | 0.0554 ± 0.027 ms | 0.0522 ± 0.026 ms | 1.06 |
| saxpy/static workgroup=(1024,)/Float64/2048 | 0.046 ± 0.025 ms | 0.0407 ± 0.017 ms | 1.13 |
| saxpy/static workgroup=(1024,)/Float64/256 | 0.063 ± 0.025 ms | 0.0416 ± 0.027 ms | 1.52 |
| saxpy/static workgroup=(1024,)/Float64/262144 | 0.174 ± 0.03 ms | 0.165 ± 0.033 ms | 1.05 |
| saxpy/static workgroup=(1024,)/Float64/32768 | 0.0625 ± 0.028 ms | 0.0598 ± 0.027 ms | 1.05 |
| saxpy/static workgroup=(1024,)/Float64/4096 | 0.0494 ± 0.026 ms | 0.0431 ± 0.024 ms | 1.15 |
| saxpy/static workgroup=(1024,)/Float64/512 | 0.0589 ± 0.026 ms | 0.0409 ± 0.027 ms | 1.44 |
| saxpy/static workgroup=(1024,)/Float64/64 | 0.0624 ± 0.026 ms | 0.042 ± 0.027 ms | 1.49 |
| saxpy/static workgroup=(1024,)/Float64/65536 | 0.0843 ± 0.029 ms | 0.0752 ± 0.029 ms | 1.12 |
| time_to_load | 1.16 ± 0.018 s | 1.18 ± 0.0075 s | 0.983 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
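
The last column of the results is the first timing divided by the second (main over the PR commit), so values above 1 mean the PR commit was faster in that run. As a quick sanity check on one row, the sketch below recomputes the ratio and compares the gap between the two timings with their spread. It assumes the ± values are a spread such as a standard deviation (the bot output does not say which statistic it reports); the helper name `ratio_and_gap` is made up for illustration.

```julia
# One row from the results: saxpy/default/Float32/256, times in ms.
main_t, main_spread = 0.0614, 0.028
pr_t, pr_spread = 0.0392, 0.028

# Hypothetical helper: the ratio as reported in the last column, the absolute gap,
# and a combined spread (assumption: the two ± values are independent spreads).
function ratio_and_gap(a, sa, b, sb)
    ratio = a / b
    gap = abs(a - b)
    combined = sqrt(sa^2 + sb^2)
    return (; ratio, gap, combined)
end

r = ratio_and_gap(main_t, main_spread, pr_t, pr_spread)
println("ratio = ", round(r.ratio; digits = 2))
println("gap = ", round(r.gap; digits = 4), " ms")
println("combined spread = ", round(r.combined; digits = 4), " ms")
```

Under that assumption, the gap for this row is smaller than the combined spread, so the large ratios at small problem sizes are best read with some caution.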

@vchuravy
Member Author

vchuravy commented Feb 2, 2025

This currently doesn't lower correctly:

julia> @macroexpand1 @kernel cpu=false function private(A)
           @uniform N = prod(@groupsize())
           I = @index(Global, Linear)
           i = @index(Local, Linear)
           priv = @private Int (1,)
           @inbounds begin
               priv[1] = N - i + 1
               @synchronize
               A[I] = priv[1]
           end
       end
quote
    function gpu_private(__ctx__, A; )
        let
            $(Expr(:aliasscope))
            __active_lane__ = (KernelAbstractions.__validindex)(__ctx__)
            #= REPL[4]:2 =# @uniform N = prod(#= REPL[4]:2 =# @groupsize())
            if __active_lane__
                #= REPL[4]:1 =#
                #= REPL[4]:2 =#
                #= REPL[4]:3 =#
                I = #= REPL[4]:3 =# @index(Global, Linear)
                #= REPL[4]:4 =#
                i = #= REPL[4]:4 =# @index(Local, Linear)
                #= REPL[4]:5 =#
                priv = #= REPL[4]:5 =# @private(Int, (1,))
                #= REPL[4]:6 =#
            end
            #= REPL[4]:6 =# @inbounds begin
                    #= REPL[4]:7 =#
                    priv[1] = (N - i) + 1
                    #= REPL[4]:8 =#
                    #= REPL[4]:8 =# @synchronize
                    #= REPL[4]:9 =#
                    A[I] = priv[1]
                end
            if __active_lane__
            end
            $(Expr(:popaliasscope))
            return nothing
        end
    end
end
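
In the expansion above, the `@inbounds begin ... end` block, including its `@synchronize`, is emitted outside any `__active_lane__` guard, and the trailing guard is empty, so out-of-range lanes would run the body unguarded. One way the splitting could see through such a wrapper is sketched below in plain Julia, working on unexpanded expressions only. This is a hypothetical illustration, not the code in src/macros.jl: `is_macro`, `flatten_body`, and `lower_guarded` are made-up names, and only `@inbounds begin ... end` and top-level `@uniform` are handled.

```julia
# Hypothetical sketch using Base Julia only; nothing here is evaluated as a kernel.
is_macro(ex, name) = ex isa Expr && ex.head === :macrocall && ex.args[1] === Symbol(name)

# Flatten `body` into a list of statements and barriers. Statements found inside
# `@inbounds begin ... end` are re-wrapped one by one so the barrier can be lifted out.
function flatten_body(body::Expr, wrap = identity)
    items = Any[]
    for stmt in body.args
        stmt isa LineNumberNode && continue
        if is_macro(stmt, "@synchronize")
            push!(items, stmt)
        elseif is_macro(stmt, "@inbounds") && stmt.args[end] isa Expr &&
               stmt.args[end].head === :block
            rewrap = ex -> wrap(Expr(:macrocall, Symbol("@inbounds"), nothing, ex))
            append!(items, flatten_body(stmt.args[end], rewrap))
        else
            push!(items, wrap(stmt))
        end
    end
    return items
end

# Guard every run of ordinary statements; emit barriers and @uniform unguarded.
function lower_guarded(body::Expr)
    out = Any[:(__active_lane__ = __validindex(__ctx__))]
    segment = Any[]
    function flush!()
        isempty(segment) && return
        push!(out, Expr(:if, :__active_lane__, Expr(:block, segment...)))
        empty!(segment)
    end
    for item in flatten_body(body)
        if is_macro(item, "@synchronize") || is_macro(item, "@uniform")
            flush!()
            push!(out, item)
        else
            push!(segment, item)
        end
    end
    flush!()
    return Expr(:block, out...)
end

# The kernel body from the comment above, quoted (macros are not expanded here).
body = quote
    @uniform N = prod(@groupsize())
    I = @index(Global, Linear)
    i = @index(Local, Linear)
    priv = @private Int (1,)
    @inbounds begin
        priv[1] = N - i + 1
        @synchronize
        A[I] = priv[1]
    end
end

lowered = lower_guarded(body)
println(lowered)
```

Splitting the `@inbounds` block into one `@inbounds` per statement keeps the bounds-check elision while letting the barrier sit between two guards. Because `if` does not introduce a new scope in Julia, `I`, `i`, and `priv` assigned in the first guard remain visible in the second.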


codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (45b74d9) to head (1163b32).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/macros.jl | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff          @@
##            main    #558   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         21      21           
  Lines       1584    1575    -9     
=====================================
+ Misses      1584    1575    -9     


@vchuravy vchuravy changed the base branch from main to vc/pocl February 7, 2025 13:51
Member Author

vchuravy commented Feb 7, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vchuravy
Member Author

With pocl#main, only two tests are still failing:

  1. test/private.jl
  2. examples/histogram.jl

@vchuravy vchuravy force-pushed the vc/barriers branch 2 times, most recently from a5f740a to 210658c Compare February 13, 2025 07:59
@vchuravy vchuravy changed the title Forbid divergent execution of work-group barriers [0.10] Forbid divergent execution of work-group barriers Feb 13, 2025
Member Author

vchuravy commented Feb 17, 2025

Merge activity

  • Feb 17, 7:42 AM EST: A user started a stack merge that includes this pull request via Graphite.
  • Feb 17, 7:44 AM EST: Graphite rebased this pull request as part of a merge.
  • Feb 17, 7:46 AM EST: A user merged this pull request with Graphite.

@vchuravy vchuravy changed the base branch from vc/pocl to graphite-base/558 February 17, 2025 12:42
@vchuravy vchuravy changed the base branch from graphite-base/558 to main February 17, 2025 12:43
@vchuravy vchuravy merged commit 9741962 into main Feb 17, 2025
7 of 21 checks passed
@vchuravy vchuravy deleted the vc/barriers branch February 17, 2025 12:46
1 participant