PTX compilation errors with GPU AbstractOperations #3008

glwagner · 2021-07-22T13:52:24Z

glwagner
Jul 22, 2021
Maintainer

This is a summary of the current salient issues discussed on #1241. Much of that discussion is out of date; however, one issue that remains is that complex GPU AbstractOperations can produce PTX code whose function signatures consume too much "parameter space".

To reproduce this issue:

using Oceananigans
model = NonhydrostaticModel(architecture=GPU(), grid=RegularRectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1)))
u, v, w = model.velocities

and then

julia> compute!(ComputedField(∂x(u)^2 + ∂y(v)^2 + ∂z(w)^2 + ∂x(w)^2 + ∂y(w)^2))
ERROR: Failed to compile PTX code (ptxas exited with code 255)
ptxas /tmp/jl_IGwXuE.ptx, line 1951; error   : Entry function '_Z19julia_gpu__compute_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE14_gpu__compute_16CompilerMetadataI10StaticSizeI9_1__1__1_E12DynamicCheckvv7NDRangeILi3ES5_I9_1__1__1_ES5_I9_1__1__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17MultiaryOperationI6CenterS12_S12_Li5E2__5TupleI15BinaryOperationIS12_S12_S12_S13_10DerivativeIS12_S12_S12_6__x___S8_IS9_Li3ES10_IS9_Li3ELi1EEE10_identity5vv22RegularRectilinearGridIS9_8PeriodicS20_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES23_IS9_EEEES9_E5Int6410_identity110_identity2vS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES15_IS12_S12_S12_S13_S16_IS12_S12_S12_6__y___S8_IS9_Li3ES10_IS9_Li3ELi1EEE10_identity3vvS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES24_10_identity4S18_vS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES15_IS12_S12_S12_S13_S16_IS12_S12_S12_6__z___S8_IS9_Li3ES10_IS9_Li3ELi1EEES25_vvS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES24_S26_S28_vS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES15_I4FaceS12_S31_S13_S16_IS31_S12_S31_S17_S8_IS9_Li3ES10_IS9_Li3ELi1EEES29_vvS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES24_S18_S25_vS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES15_IS12_S31_S31_S13_S16_IS12_S31_S31_S27_S8_IS9_Li3ES10_IS9_Li3ELi1EEES26_vvS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_ES24_S28_S29_vS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_EES14_IS18_S25_S26_7__xz___7__yz___EvS19_IS9_S20_S20_S21_S8_IS9_Li1ES22_IS9_S23_IS9_ES23_IS9_EEEES9_E' uses too much parameter space (0x1438 bytes, 0x1100 max).
ptxas fatal   : Ptx assembly aborted due to errors
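
For scale, the two hexadecimal sizes in the ptxas message can be converted to decimal. This is plain arithmetic on the numbers printed in the error, not an Oceananigans or CUDA.jl API; the variable names are only illustrative.

```julia
# Sizes reported by ptxas in the error message (hexadecimal byte counts).
used  = 0x1438   # parameter space requested by the kernel signature
limit = 0x1100   # maximum reported by ptxas

println("used:   ", Int(used), " bytes")
println("limit:  ", Int(limit), " bytes")
println("excess: ", Int(used) - Int(limit), " bytes")
```

The five-term kernel requests 5176 bytes against a reported maximum of 4352 bytes, so it overshoots by 824 bytes.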

A possible solution is proposed at JuliaGPU/CUDA.jl#267.

One workaround within Oceananigans is to "stage" the computation:

julia> uxvywz = ComputedField(∂x(u)^2 + ∂y(v)^2 + ∂z(w)^2)
ComputedField located at (Center, Center, Center) of MultiaryOperation at (Center, Center, Center)
├── data: OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3}}, size: (1, 1, 1)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
├── operand: MultiaryOperation at (Center, Center, Center)
└── status: time=0.0

julia> compute!(ComputedField(uxvywz + ∂x(w)^2 + ∂y(w)^2, data=uxvywz.data))

By sharing memory between the ComputedFields, this solution avoids allocating more memory. It may still be more computationally expensive, though benchmarking is needed to confirm that.

Another solution is to hand-write the kernel operation using KernelFunctionOperation.

cc @tomchor

francispoulin · 2021-07-22T15:00:11Z

francispoulin
Jul 22, 2021
Collaborator

So it's possible to use ComputedField on 3 terms but not 5?


glwagner · 2021-07-22T19:06:34Z

glwagner
Jul 22, 2021
Maintainer Author

> So it's possible to use ComputedField on 3 terms but not 5?

Something like that...

It's not strictly the number of terms. Here the terms are differentiated, so there are two levels of nesting. I think it has to do with the total "complexity" of the term, somehow.


francispoulin · 2021-07-22T20:24:43Z

francispoulin
Jul 22, 2021
Collaborator

Thanks @glwagner . Yes, complexity is certainly the right word for this problem.


glwagner · 2024-11-19T16:03:40Z

glwagner
Nov 19, 2024
Maintainer Author

Okay, I think I may understand why this is a problem. I believe this is because such operation chains pass the grid into a kernel multiple times. grid is a relatively large object in terms of parameter space (essentially, because OffsetArray requires storing the integer offsets, which are each Int64 by default, and there are many offsets in a grid as well as in a field).
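
A minimal sketch of that effect, using toy structs that are assumptions for illustration only (ToyGrid, ToyDerivative, and ToySum are hypothetical and are not Oceananigans types): when every term in an operation tree stores its own copy of an isbits grid, the size of the object passed to the kernel grows with the number of terms.

```julia
# Hypothetical stand-in for a grid: a few integer offsets plus some spacings.
struct ToyGrid
    offsets  :: NTuple{3, Int64}
    spacings :: NTuple{3, Float64}
end

# Hypothetical stand-in for a derivative operation that carries its grid.
struct ToyDerivative{G}
    grid :: G
end

# Hypothetical stand-in for a sum of N operations.
struct ToySum{N, T}
    terms :: NTuple{N, T}
end

grid = ToyGrid((-1, -1, -1), (1.0, 1.0, 1.0))

# Each term holds its own copy of the grid, since isbits structs are stored inline.
three_terms = ToySum(ntuple(_ -> ToyDerivative(grid), 3))
five_terms  = ToySum(ntuple(_ -> ToyDerivative(grid), 5))

println("grid:        ", sizeof(grid), " bytes")
println("three terms: ", sizeof(three_terms), " bytes")
println("five terms:  ", sizeof(five_terms), " bytes")
```

In this toy the argument size scales linearly with the number of terms. The real operation tree also carries fields and nested operations, so the actual byte counts differ.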

Here are two solutions:

1. Develop OffsetArray so that the integer type can be set, and use Int8 for the offset integer type, which is plenty big for most of the data we have.
2. Figure out if there is a way to let a kernel know that arguments are "aliased", so that repeated arguments need only be passed once.
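
To give a rough sense of what the first option could save, here is a sketch with a hypothetical ToyOffsets struct whose integer type is a parameter. It is an assumption for illustration and is not the OffsetArrays.jl implementation.

```julia
# Hypothetical container for three array offsets with a configurable integer type.
struct ToyOffsets{I <: Integer}
    offsets :: NTuple{3, I}
end

# Offsets of -3 correspond to a halo of three cells; the value is illustrative.
wide   = ToyOffsets{Int64}((-3, -3, -3))
narrow = ToyOffsets{Int8}((-3, -3, -3))

println("Int64 offsets: ", sizeof(wide), " bytes")
println("Int8 offsets:  ", sizeof(narrow), " bytes")

# Int8 can represent offsets between -128 and 127.
println("Int8 range: ", typemin(Int8), " to ", typemax(Int8))
```

Per three-dimensional array this shrinks the offsets from 24 bytes to 3 bytes. Alignment padding in a real struct that mixes field types may reduce the saving.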

cc @simone-silvestri
