[0.10] Forbid divergent execution of work-group barriers #558

vchuravy · 2025-01-28T14:54:01Z

As noted by @maleadt in JuliaGPU/OpenCL.jl#283 (comment)

Several backends strictly require that barriers such as @synchronize are executed convergently, i.e. reached by every work-item in the work-group,
and the automatic bounds-checking in KA violates that requirement.

Previously, GPU kernels were lowered like this:

if __validindex(__ctx__)
   # A
   @synchronize
   # B
end

Now they lower to:

__active_lane__ = __validindex(__ctx__)
if __active_lane__
   # A
end
@synchronize
if __active_lane__
   # B
end
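
The split shown above can be sketched as a small transformation on Julia expressions. This is only a toy model, not the code in src/macros.jl: the helper names `is_sync` and `split_at_barriers` are hypothetical, and it only handles barriers that appear at the top level of the kernel body.

```julia
# Toy model of the new lowering; NOT KernelAbstractions' actual implementation.
# `is_sync` and `split_at_barriers` are hypothetical helper names.

# True if `ex` is a bare `@synchronize` macro call.
is_sync(ex) = ex isa Expr && ex.head === :macrocall &&
              ex.args[1] === Symbol("@synchronize")

# Wrap every run of non-barrier statements in `if __active_lane__ ... end`
# and leave each barrier unguarded between the runs.
function split_at_barriers(body::Expr)
    @assert body.head === :block
    out = Any[:(__active_lane__ = __validindex(__ctx__))]
    segment = Any[]

    # Emit the statements collected so far as one guarded block.
    function flush!()
        isempty(segment) && return
        push!(out, Expr(:if, :__active_lane__, Expr(:block, segment...)))
        empty!(segment)
        return
    end

    for stmt in body.args
        if stmt isa LineNumberNode
            continue            # drop source locations to keep the output short
        elseif is_sync(stmt)
            flush!()
            push!(out, stmt)    # the barrier itself stays outside any guard
        else
            push!(segment, stmt)
        end
    end
    flush!()
    return Expr(:block, out...)
end

body = quote
    println("A")
    @synchronize
    println("B")
end

lowered = split_at_barriers(body)
```

Here `lowered` holds four statements: the `__active_lane__` assignment, a guarded block for A, the unguarded `@synchronize`, and a guarded block for B. Because Julia's `if` does not introduce a new scope, variables assigned in the first guarded block remain visible in the second.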

vchuravy · 2025-01-28T14:59:22Z

julia> @macroexpand @kernel cpu=false function f()
           @print "A"
           @synchronize
           @print "B"
       end
quote
    function gpu_f(__ctx__; )
        let
            $(Expr(:aliasscope))
            __active_lane__ = (KernelAbstractions.__validindex)(__ctx__)
            if __active_lane__
                #= REPL[7]:1 =#
                #= REPL[7]:2 =#
                begin
                    #= /home/vchuravy/src/KernelAbstractions/src/KernelAbstractions.jl:380 =#
                    (KernelAbstractions.__print)(Val{:A}())
                end
                #= REPL[7]:3 =#
            end
            begin
                #= /home/vchuravy/src/KernelAbstractions/src/KernelAbstractions.jl:293 =#
                (KernelAbstractions.__synchronize)()
            end
            if __active_lane__
                #= REPL[7]:4 =#
                begin
                    #= /home/vchuravy/src/KernelAbstractions/src/KernelAbstractions.jl:380 =#
                    (KernelAbstractions.__print)(Val{:B}())
                end
            end
            $(Expr(:popaliasscope))
            return nothing
        end
    end
end

github-actions · 2025-01-28T15:08:05Z

Benchmark Results

	main	`1163b32`...	main/1163b3252b24af...
saxpy/default/Float16/1024	0.0437 ± 0.025 ms	0.0424 ± 0.026 ms	1.03
saxpy/default/Float16/1048576	0.897 ± 0.027 ms	0.897 ± 0.029 ms	1
saxpy/default/Float16/16384	0.0654 ± 0.029 ms	0.0593 ± 0.028 ms	1.1
saxpy/default/Float16/2048	0.0501 ± 0.023 ms	0.0446 ± 0.018 ms	1.12
saxpy/default/Float16/256	0.0626 ± 0.027 ms	0.0447 ± 0.028 ms	1.4
saxpy/default/Float16/262144	0.274 ± 0.025 ms	0.269 ± 0.026 ms	1.02
saxpy/default/Float16/32768	0.0785 ± 0.03 ms	0.0719 ± 0.029 ms	1.09
saxpy/default/Float16/4096	0.0644 ± 0.027 ms	0.0492 ± 0.025 ms	1.31
saxpy/default/Float16/512	0.0532 ± 0.027 ms	0.0414 ± 0.027 ms	1.29
saxpy/default/Float16/64	0.0599 ± 0.027 ms	0.0425 ± 0.029 ms	1.41
saxpy/default/Float16/65536	0.111 ± 0.029 ms	0.0996 ± 0.029 ms	1.12
saxpy/default/Float32/1024	0.0441 ± 0.026 ms	0.0414 ± 0.027 ms	1.07
saxpy/default/Float32/1048576	0.49 ± 0.027 ms	0.482 ± 0.053 ms	1.02
saxpy/default/Float32/16384	0.0537 ± 0.028 ms	0.0518 ± 0.025 ms	1.04
saxpy/default/Float32/2048	0.0461 ± 0.025 ms	0.0427 ± 0.02 ms	1.08
saxpy/default/Float32/256	0.0614 ± 0.028 ms	0.0392 ± 0.028 ms	1.57
saxpy/default/Float32/262144	0.169 ± 0.035 ms	0.151 ± 0.036 ms	1.12
saxpy/default/Float32/32768	0.0594 ± 0.029 ms	0.0585 ± 0.028 ms	1.02
saxpy/default/Float32/4096	0.0502 ± 0.026 ms	0.0456 ± 0.024 ms	1.1
saxpy/default/Float32/512	0.0622 ± 0.028 ms	0.0439 ± 0.029 ms	1.41
saxpy/default/Float32/64	0.0624 ± 0.028 ms	0.0418 ± 0.029 ms	1.49
saxpy/default/Float32/65536	0.0824 ± 0.03 ms	0.0716 ± 0.03 ms	1.15
saxpy/default/Float64/1024	0.0432 ± 0.026 ms	0.0418 ± 0.027 ms	1.04
saxpy/default/Float64/1048576	0.508 ± 0.041 ms	0.523 ± 0.056 ms	0.971
saxpy/default/Float64/16384	0.0563 ± 0.028 ms	0.0523 ± 0.026 ms	1.08
saxpy/default/Float64/2048	0.0445 ± 0.024 ms	0.0431 ± 0.021 ms	1.03
saxpy/default/Float64/256	0.0621 ± 0.028 ms	0.0476 ± 0.029 ms	1.31
saxpy/default/Float64/262144	0.177 ± 0.027 ms	0.168 ± 0.033 ms	1.05
saxpy/default/Float64/32768	0.0657 ± 0.028 ms	0.061 ± 0.027 ms	1.08
saxpy/default/Float64/4096	0.05 ± 0.026 ms	0.0461 ± 0.024 ms	1.09
saxpy/default/Float64/512	0.0616 ± 0.028 ms	0.0425 ± 0.028 ms	1.45
saxpy/default/Float64/64	0.0617 ± 0.028 ms	0.043 ± 0.029 ms	1.43
saxpy/default/Float64/65536	0.0923 ± 0.028 ms	0.0794 ± 0.028 ms	1.16
saxpy/static workgroup=(1024,)/Float16/1024	0.0431 ± 0.026 ms	0.0413 ± 0.026 ms	1.04
saxpy/static workgroup=(1024,)/Float16/1048576	0.906 ± 0.025 ms	0.9 ± 0.028 ms	1.01
saxpy/static workgroup=(1024,)/Float16/16384	0.0604 ± 0.027 ms	0.056 ± 0.026 ms	1.08
saxpy/static workgroup=(1024,)/Float16/2048	0.0494 ± 0.023 ms	0.0461 ± 0.022 ms	1.07
saxpy/static workgroup=(1024,)/Float16/256	0.0604 ± 0.027 ms	0.0408 ± 0.027 ms	1.48
saxpy/static workgroup=(1024,)/Float16/262144	0.273 ± 0.028 ms	0.267 ± 0.027 ms	1.02
saxpy/static workgroup=(1024,)/Float16/32768	0.0755 ± 0.028 ms	0.0707 ± 0.027 ms	1.07
saxpy/static workgroup=(1024,)/Float16/4096	0.0499 ± 0.027 ms	0.0458 ± 0.027 ms	1.09
saxpy/static workgroup=(1024,)/Float16/512	0.0527 ± 0.026 ms	0.0408 ± 0.026 ms	1.29
saxpy/static workgroup=(1024,)/Float16/64	0.0626 ± 0.026 ms	0.0407 ± 0.027 ms	1.54
saxpy/static workgroup=(1024,)/Float16/65536	0.109 ± 0.028 ms	0.0975 ± 0.028 ms	1.12
saxpy/static workgroup=(1024,)/Float32/1024	0.0448 ± 0.026 ms	0.0398 ± 0.027 ms	1.13
saxpy/static workgroup=(1024,)/Float32/1048576	0.476 ± 0.032 ms	0.46 ± 0.056 ms	1.03
saxpy/static workgroup=(1024,)/Float32/16384	0.0521 ± 0.027 ms	0.049 ± 0.025 ms	1.06
saxpy/static workgroup=(1024,)/Float32/2048	0.0446 ± 0.023 ms	0.0411 ± 0.017 ms	1.08
saxpy/static workgroup=(1024,)/Float32/256	0.0614 ± 0.026 ms	0.0442 ± 0.027 ms	1.39
saxpy/static workgroup=(1024,)/Float32/262144	0.164 ± 0.035 ms	0.149 ± 0.036 ms	1.1
saxpy/static workgroup=(1024,)/Float32/32768	0.0577 ± 0.028 ms	0.0544 ± 0.027 ms	1.06
saxpy/static workgroup=(1024,)/Float32/4096	0.0486 ± 0.027 ms	0.0433 ± 0.026 ms	1.12
saxpy/static workgroup=(1024,)/Float32/512	0.061 ± 0.026 ms	0.0413 ± 0.026 ms	1.48
saxpy/static workgroup=(1024,)/Float32/64	0.0621 ± 0.025 ms	0.044 ± 0.026 ms	1.41
saxpy/static workgroup=(1024,)/Float32/65536	0.0749 ± 0.03 ms	0.0679 ± 0.029 ms	1.1
saxpy/static workgroup=(1024,)/Float64/1024	0.0412 ± 0.026 ms	0.039 ± 0.027 ms	1.06
saxpy/static workgroup=(1024,)/Float64/1048576	0.506 ± 0.041 ms	0.5 ± 0.04 ms	1.01
saxpy/static workgroup=(1024,)/Float64/16384	0.0554 ± 0.027 ms	0.0522 ± 0.026 ms	1.06
saxpy/static workgroup=(1024,)/Float64/2048	0.046 ± 0.025 ms	0.0407 ± 0.017 ms	1.13
saxpy/static workgroup=(1024,)/Float64/256	0.063 ± 0.025 ms	0.0416 ± 0.027 ms	1.52
saxpy/static workgroup=(1024,)/Float64/262144	0.174 ± 0.03 ms	0.165 ± 0.033 ms	1.05
saxpy/static workgroup=(1024,)/Float64/32768	0.0625 ± 0.028 ms	0.0598 ± 0.027 ms	1.05
saxpy/static workgroup=(1024,)/Float64/4096	0.0494 ± 0.026 ms	0.0431 ± 0.024 ms	1.15
saxpy/static workgroup=(1024,)/Float64/512	0.0589 ± 0.026 ms	0.0409 ± 0.027 ms	1.44
saxpy/static workgroup=(1024,)/Float64/64	0.0624 ± 0.026 ms	0.042 ± 0.027 ms	1.49
saxpy/static workgroup=(1024,)/Float64/65536	0.0843 ± 0.029 ms	0.0752 ± 0.029 ms	1.12
time_to_load	1.16 ± 0.018 s	1.18 ± 0.0075 s	0.983

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

vchuravy · 2025-02-02T14:37:05Z

This currently doesn't lower correctly:

julia> @macroexpand1 @kernel cpu=false function private(A)
           @uniform N = prod(@groupsize())
           I = @index(Global, Linear)
           i = @index(Local, Linear)
           priv = @private Int (1,)
           @inbounds begin
               priv[1] = N - i + 1
               @synchronize
               A[I] = priv[1]
           end
       end
quote
    function gpu_private(__ctx__, A; )
        let
            $(Expr(:aliasscope))
            __active_lane__ = (KernelAbstractions.__validindex)(__ctx__)
            #= REPL[4]:2 =# @uniform N = prod(#= REPL[4]:2 =# @groupsize())
            if __active_lane__
                #= REPL[4]:1 =#
                #= REPL[4]:2 =#
                #= REPL[4]:3 =#
                I = #= REPL[4]:3 =# @index(Global, Linear)
                #= REPL[4]:4 =#
                i = #= REPL[4]:4 =# @index(Local, Linear)
                #= REPL[4]:5 =#
                priv = #= REPL[4]:5 =# @private(Int, (1,))
                #= REPL[4]:6 =#
            end
            #= REPL[4]:6 =# @inbounds begin
                    #= REPL[4]:7 =#
                    priv[1] = (N - i) + 1
                    #= REPL[4]:8 =#
                    #= REPL[4]:8 =# @synchronize
                    #= REPL[4]:9 =#
                    A[I] = priv[1]
                end
            if __active_lane__
            end
            $(Expr(:popaliasscope))
            return nothing
        end
    end
end
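
In the expansion above, the whole `@inbounds begin ... end` block ends up outside the `__active_lane__` guard, apparently because the `@synchronize` sits one level down inside it. A minimal sketch of how a lowering pass could detect such nested barriers follows; `contains_sync` is a hypothetical helper, not part of KernelAbstractions.

```julia
# Hypothetical helper, not part of KernelAbstractions.
# Recursively check whether an expression contains `@synchronize` anywhere.
function contains_sync(ex)
    ex isa Expr || return false
    if ex.head === :macrocall && ex.args[1] === Symbol("@synchronize")
        return true
    end
    return any(contains_sync, ex.args)
end

# A plain statement without a barrier.
flat = :(A[I] = priv[1])

# A barrier nested inside an `@inbounds` block, as in the kernel above.
nested = quote
    @inbounds begin
        priv[1] = N - i + 1
        @synchronize
        A[I] = priv[1]
    end
end

contains_sync(flat)    # false
contains_sync(nested)  # true
```

Detection alone is not enough: once a nested barrier is found, the pass presumably has to split inside that block, guarding the statements before and after the barrier, rather than emitting the block whole.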

codecov · 2025-02-07T13:49:57Z

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (45b74d9) to head (1163b32).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
src/macros.jl	0.00%	6 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff          @@
##            main    #558   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         21      21           
  Lines       1584    1575    -9     
=====================================
+ Misses      1584    1575    -9


vchuravy · 2025-02-07T13:51:22Z

Stack (managed by Graphite):

[0.10] Forbid divergent execution of work-group barriers #558 (this PR)
Implement a CPU backend using POCL #556: 1 other dependent PR (#562)
main

vchuravy · 2025-02-10T16:14:47Z

With pocl#main, only two tests are still failing:

test/private.jl
examples/histogram.jl

vchuravy · 2025-02-17T12:42:36Z

Merge activity

Feb 17, 7:42 AM EST: A user started a stack merge that includes this pull request via Graphite.
Feb 17, 7:44 AM EST: Graphite rebased this pull request as part of a merge.
Feb 17, 7:46 AM EST: A user merged this pull request with Graphite.

This was referenced Feb 3, 2025

Implement groupreduce API #559

Draft

ndrange provided in KernelAbstractions kernels is broken JuliaGPU/OpenCL.jl#283

Closed

KernelIntrinsics #562

Open

vchuravy force-pushed the vc/barriers branch 2 times, most recently from 5e3e1f4 to a48a158 Compare February 7, 2025 13:47

vchuravy changed the base branch from main to vc/pocl February 7, 2025 13:51

This was referenced Feb 7, 2025

Implement a CPU backend using POCL #556

Merged

Allow opt-out of implicit bounds-checking #563

Merged

vchuravy force-pushed the vc/pocl branch from a6ae55b to 90a10d7 Compare February 7, 2025 13:51

vchuravy force-pushed the vc/barriers branch 3 times, most recently from 8072f4c to f014e38 Compare February 10, 2025 14:39

vchuravy force-pushed the vc/pocl branch from 90a10d7 to f038d8c Compare February 10, 2025 15:08

vchuravy force-pushed the vc/barriers branch from f014e38 to cd60145 Compare February 10, 2025 15:08

vchuravy force-pushed the vc/pocl branch from f038d8c to 777c099 Compare February 10, 2025 16:11

vchuravy force-pushed the vc/barriers branch from cd60145 to 7b88e3e Compare February 10, 2025 16:11

maleadt mentioned this pull request Feb 11, 2025

Add Julia CI pocl/pocl#1769

Open

vchuravy mentioned this pull request Feb 11, 2025

[0.9] Forbid divergent execution of work-group barriers #564

Merged

vchuravy force-pushed the vc/pocl branch from 777c099 to 3bb80ac Compare February 12, 2025 15:23

vchuravy force-pushed the vc/barriers branch from 7b88e3e to 31f8f5f Compare February 12, 2025 15:32

vchuravy force-pushed the vc/pocl branch from 3bb80ac to f88ee87 Compare February 12, 2025 15:39

vchuravy force-pushed the vc/barriers branch 2 times, most recently from a5f740a to 210658c Compare February 13, 2025 07:59

vchuravy changed the title ~~Forbid divergent execution of work-group barriers~~ [0.10] Forbid divergent execution of work-group barriers Feb 13, 2025

vchuravy force-pushed the vc/pocl branch from f88ee87 to 0121280 Compare February 13, 2025 17:58

vchuravy force-pushed the vc/barriers branch from 210658c to 7e448d1 Compare February 13, 2025 17:58

vchuravy force-pushed the vc/pocl branch from 0121280 to ed2ee63 Compare February 14, 2025 15:31

vchuravy force-pushed the vc/barriers branch from 7e448d1 to b58c830 Compare February 14, 2025 15:31

vchuravy force-pushed the vc/pocl branch from ed2ee63 to 0dcdc8b Compare February 17, 2025 12:41

vchuravy force-pushed the vc/barriers branch from b58c830 to 58ed8cc Compare February 17, 2025 12:42

vchuravy changed the base branch from vc/pocl to graphite-base/558 February 17, 2025 12:42

vchuravy changed the base branch from graphite-base/558 to main February 17, 2025 12:43

Forbid divergent execution of work-group barriers

1163b32

vchuravy force-pushed the vc/barriers branch from 58ed8cc to 1163b32 Compare February 17, 2025 12:44

vchuravy merged commit 9741962 into main Feb 17, 2025
7 of 21 checks passed

vchuravy deleted the vc/barriers branch February 17, 2025 12:46
