Hi, everyone

I am trying to switch my env from torch 1.13.1 / CUDA 11.7 to torch 2.1.1 / CUDA 12.1. After installation, I trained a NeRF with `--tracer.raymarch-type uniform`, but it failed with an error message like the one below:
```
[i] Using PYGLFW_IMGUI (GL 3.3)
2024-06-29 01:08:12,092| INFO| [i] Using PYGLFW_IMGUI (GL 3.3)
[i] Running at 60 frames/second
2024-06-29 01:08:12,110| INFO| [i] Running at 60 frames/second
rays.origins=tensor([[6., 6., 6.],
[6., 6., 6.],
[6., 6., 6.],
...,
[6., 6., 6.],
[6., 6., 6.],
[6., 6., 6.]], device='cuda:0')
((rays.origins < -1) | (1 < rays.origins)).all(dim=1)=([True, True, True, ..., True, True, True], device='cuda:0')
ridx=tensor([], device='cuda:0', dtype=torch.int32)
pidx=tensor([], device='cuda:0', dtype=torch.int32)
depth=tensor([], device='cuda:0', size=(0, 2))
non_zero_elements=tensor([], device='cuda:0', dtype=torch.bool)
filtered_ridx.shape=torch.Size([0])
filtered_depth.shape=torch.Size([0, 2])
insum.shape=torch.Size([0])
Traceback (most recent call last):
File "/home/atsushi/workspace/wisp211/app/nerf/main_nerf.py", line 131, in <module>
app.run() # Run in interactive mode
File "/home/atsushi/workspace/wisp211/wisp/renderer/app/wisp_app.py", line 267, in run
app.run() # App clock should always run as frequently as possible (background tasks should not be limited)
...
other traceback lines
...
File "/home/atsushi/workspace/wisp211/wisp/tracers/base_tracer.py", line 161, in forward
rb = self.trace(nef, rays, requested_channels, requested_extra_channels, **input_args)
File "/home/atsushi/workspace/wisp211/wisp/tracers/packed_rf_tracer.py", line 117, in trace
raymarch_results = nef.grid.raymarch(rays,
File "/home/atsushi/workspace/wisp211/wisp/models/grids/hash_grid.py", line 240, in raymarch
return self.blas.raymarch(rays, raymarch_type=raymarch_type, num_samples=num_samples, level=self.blas.max_level)
File "/home/atsushi/workspace/wisp211/wisp/accelstructs/octree_as.py", line 436, in raymarch
raymarch_results = self._raymarch_uniform(rays=rays, num_samples=num_samples, level=level)
File "/home/atsushi/workspace/wisp211/wisp/accelstructs/octree_as.py", line 365, in _raymarch_uniform
results = wisp_C.ops.uniform_sample_cuda(scale, filtered_ridx.contiguous(), filtered_depth, insum)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
I looked into `_raymarch_uniform` and found that `uniform_sample_cuda` fails when `spc_render.unbatched_raytrace` returns empty tensors for `ridx`, `pidx`, and `depth`, as you can see in the first half of the output above. I also confirmed that `ridx`, `pidx`, and `depth` can be empty with torch 1.13.1 / CUDA 11.7 as well, yet I never hit this error in that setup. On the other hand, I do hit the error with torch 1.13.1 / CUDA 11.8. So I believe that `uniform_sample_cuda`'s behaviour differs between CUDA 11.7 and later versions. If I had experience with CUDA coding I could debug the method myself, but I do not know how to write CUDA right now. Could anybody help debug it?
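For readers less familiar with CUDA: this is only my reading of the error, but the failure is consistent with a kernel being launched with a zero-sized grid. When the filtered intersection tensors are empty, a typical `blocks = ceil(n / threads)` computation yields 0, and CUDA rejects such a launch with `invalid configuration argument`. Below is a minimal, self-contained sketch (not wisp's actual kernel) that reproduces the error and shows the guard that avoids it:

```cuda
// Minimal sketch, not wisp's actual kernel: reproduce "invalid configuration
// argument" with a zero-sized grid, the situation an empty filtered_ridx
// would typically lead to, then show the guard that avoids it.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.0f;
}

int main() {
    const int n = 0;                                 // no ray/voxel intersections
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // evaluates to 0

    noop_kernel<<<blocks, threads>>>(nullptr, n);    // gridDim.x == 0 is rejected
    printf("unguarded: %s\n", cudaGetErrorString(cudaGetLastError()));

    if (blocks > 0) {                                // guard: skip the launch when empty
        noop_kernel<<<blocks, threads>>>(nullptr, n);
    }
    printf("guarded:   %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

Compiling this with nvcc and running it prints `invalid configuration argument` for the unguarded launch and `no error` for the guarded path, which matches the error reported from `uniform_sample_cuda`.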
Thanks in advance!
While investigating the code, I noticed that `cudaGetLastError` sometimes returns a non-zero enum value, mostly 9 (`cudaErrorInvalidConfiguration`), and the `AT_CUDA_CHECK` then triggers the termination. A new finding is that the invalid configuration is actually raised even with CUDA 11.7. Fortunately or not, `AT_CUDA_CHECK`, which is actually a wrapper around `C10_CUDA_CHECK`, does not handle the error code properly in PyTorch 1.*, so it does not terminate the app even when a non-zero error value is returned. However, `C10_CUDA_CHECK` has been implemented differently since PyTorch 2.* and now does terminate the app.
I also ran the app without the interactive viewer and noticed that it then runs to completion. Combined with the above, maybe something goes wrong on the interactive viewer side, and the error is then caught by the `AT_CUDA_CHECK` in `uniform_sample_cuda.cu`.
As a workaround, the `AT_CUDA_CHECK` can be commented out. As with PyTorch 1.*, the app then works, though I am not sure this is a good situation.
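A possibly cleaner alternative to removing the check would be an early return in the CUDA binding when the input is empty, so that no zero-sized launch ever happens and `AT_CUDA_CHECK` can stay in place. Here is a rough sketch of that idea; the signature, names, and return value are my guesses, not wisp's actual source:

```cpp
// Hedged sketch only: an early-exit guard shaped roughly like
// wisp_C.ops.uniform_sample_cuda, with assumed names, types, and return value.
#include <torch/extension.h>

at::Tensor uniform_sample_cuda_guarded(
        int scale,                  // assumed type for the `scale` argument seen in the traceback
        at::Tensor filtered_ridx,   // ray indices of surviving intersections (may be empty)
        at::Tensor filtered_depth,  // entry/exit depths per intersection
        at::Tensor insum) {         // inclusive sum used for output offsets
    // If nothing intersected the octree, return an empty result instead of
    // launching a kernel with a zero-sized grid; AT_CUDA_CHECK then never
    // sees cudaErrorInvalidConfiguration.
    if (filtered_ridx.numel() == 0) {
        return at::empty({0}, filtered_depth.options());
    }
    // ... original kernel launch and AT_CUDA_CHECK(cudaGetLastError()) would go here ...
    return at::empty({0}, filtered_depth.options());  // placeholder for the real result
}
```

The same guard could equally live on the Python side in `_raymarch_uniform`, before `wisp_C.ops.uniform_sample_cuda` is called; either way the point is to never launch a kernel over zero elements.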
I may investigate the issue further, but I have no experience with GUI app coding at all, so I would really appreciate it if someone joined in solving this issue.