-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Splat gpu launch #539
Splat gpu launch #539
Conversation
I don't get why this is needed. We construct |
The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object]. As a result it's impossible to mark some arguments as differentiable and others as non-differentiable [leading to errors if differentiating x .+ y, where x is constant and y is diferentiable]. Preserving the separate arguments remedies this |
Right, and we do that for all broadcasts. So why is this only a problem for
FWIW, If this kind of pattern is an issue for Enzyme, you'll run into it a lot, because passing arguments through structures is how closures work in Julia. |
It may not be just a problem for map, but map is for sure at a critical juncture point. In particular, Enzyme often doesn't care if things are put into closures, if it gets optimized out (and potentially separated/etc). One significant issue here, however, is that the custom rule for the kernel call forces us to see the entire closure at the Julia level [and thus we can't separate the args out even if we wanted to]. Since this creates the kernel which is called, I'm not sure of a way to resolve the issue without this. It also doesn't matter if all the variables are differentiable [but in the context of a broadcast it is much more common for one to say x .+ y where one isn't differentiable]. Doing something like this also might have other downstream benefits as well. For example arrays being passed separately may permit alias information to be propagated, whereas indirection via a closure would likely drop alias info. |
bump @maleadt for thoughts? |
CI failures are related, so I figured you were still working on this.
I don't get your subsequent reasoning. Enzyme obviously seems to care here, and I don't understand what the difference is between the broadcast kernel, and the following pattern we use all over CUDA.jl: function outer(x::CuArray)
function kernel()
# do something with x, captured through the kernel closure
return
end
@cuda kernel()
end The |
Oh yeah sorry I wanted to finish discussing before continuing work on to make sure the strategy was fine with you. Ah I was assuming the closure was within the cuda kernel. However your case doesn’t present an issue. The problem Enzyme has is creating a data structure with one variable being a differentiable cuarray and a second variable being non differentiable cuarray — and that data structure being passed into a custom rule (specifically the custom rule for cuda kernel launch). your case above is fine since the closure only has one cuarray and thus can’t contain both a differentiable and non differentiable cuarray. It’ll be only one or the other |
Yeah that was just for simplicity of the example, we often capture more than one input. So while I'm not particularly opposed to this workaround, even though I do dislike it, you'll be running into other situations like this. I would assume that the situation will also arise when, e.g., passing tuples or CuArrays? |
partially it depends on context. It's really common or one to differentiate active x .+ inactve y [e.g. one is constant other is not], but doinga tuple kernel where these are user-speciied different activities feels sufficiently uncommon that I'm willing to figure that out when we get there |
@maleadt mind giving permission to run the pr? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked the style handling and indeed before 1.10 we need to do the ugly typeof/typeof dance.
Had to revert: julia> using ForwardDiff, CUDA
julia> x = CUDA.ones(4,4)
4×4 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
julia> y = ForwardDiff.Dual.(x, true)
ERROR: GPU compilation of MethodInstance for (::GPUArrays.var"#35#37")(::CUDA.CuKernelContext, ::CuDeviceMatrix{…}, ::Int64, ::CUDA.CuArrayStyle{…}, ::Type{…}, ::Tuple{…}, ::Base.Broadcast.Extruded{…}, ::Bool) failed
KernelError: passing and using non-bitstype argument
Argument 6 to your kernel function is of type Type{ForwardDiff.Dual}, which is not isbits |
Attempt fix for #535
I think this should be cost free (in fact ideally slightly improve alias analysis/etc)
@maleadt @vchuravy