
uniform_sample_cuda fails when spc_render.unbatched_raytrace returns empty tensors for ridx, pidx, and depth #192

Open
barikata1984 opened this issue Jun 29, 2024 · 1 comment


@barikata1984
Contributor

Hi, everyone

I am trying to switch my environment from torch 1.13.1 + cuda 11.7 to torch 2.1.1 + cuda 12.1. After the installation, I trained a NeRF with --tracer.raymarch-type uniform, but it failed with the error message below:

[i] Using PYGLFW_IMGUI (GL 3.3)
2024-06-29 01:08:12,092|    INFO| [i] Using PYGLFW_IMGUI (GL 3.3)
[i] Running at 60 frames/second
2024-06-29 01:08:12,110|    INFO| [i] Running at 60 frames/second
rays.origins=tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        ...,
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]], device='cuda:0')
((rays.origins < -1) | (1 < rays.origins)).all(dim=1)=tensor([True, True, True,  ..., True, True, True], device='cuda:0')
ridx=tensor([], device='cuda:0', dtype=torch.int32)
pidx=tensor([], device='cuda:0', dtype=torch.int32)
depth=tensor([], device='cuda:0', size=(0, 2))
non_zero_elements=tensor([], device='cuda:0', dtype=torch.bool)
filtered_ridx.shape=torch.Size([0])
filtered_depth.shape=torch.Size([0, 2])
insum.shape=torch.Size([0])
Traceback (most recent call last):
  File "/home/atsushi/workspace/wisp211/app/nerf/main_nerf.py", line 131, in <module>
    app.run()  # Run in interactive mode
  File "/home/atsushi/workspace/wisp211/wisp/renderer/app/wisp_app.py", line 267, in run
    app.run()   # App clock should always run as frequently as possible (background tasks should not be limited)
...
other traceback lines
...
  File "/home/atsushi/workspace/wisp211/wisp/tracers/base_tracer.py", line 161, in forward
    rb = self.trace(nef, rays, requested_channels, requested_extra_channels, **input_args)
  File "/home/atsushi/workspace/wisp211/wisp/tracers/packed_rf_tracer.py", line 117, in trace
    raymarch_results = nef.grid.raymarch(rays,
  File "/home/atsushi/workspace/wisp211/wisp/models/grids/hash_grid.py", line 240, in raymarch
    return self.blas.raymarch(rays, raymarch_type=raymarch_type, num_samples=num_samples, level=self.blas.max_level)
  File "/home/atsushi/workspace/wisp211/wisp/accelstructs/octree_as.py", line 436, in raymarch
    raymarch_results = self._raymarch_uniform(rays=rays, num_samples=num_samples, level=level)
  File "/home/atsushi/workspace/wisp211/wisp/accelstructs/octree_as.py", line 365, in _raymarch_uniform
    results = wisp_C.ops.uniform_sample_cuda(scale, filtered_ridx.contiguous(), filtered_depth, insum)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I looked into _raymarch_uniform and found that uniform_sample_cuda fails when spc_render.unbatched_raytrace returns empty tensors for ridx, pidx, and depth, as you can see in the first half of the output above. I also confirmed that ridx, pidx, and depth can be empty with torch 1.13.1 and cuda 11.7 as well, yet I never hit this error on that stack. On the other hand, I did hit the error with torch 1.13.1 and cuda 11.8. So I believe uniform_sample_cuda behaves differently between cuda 11.7 and later versions. If I had any experience with CUDA coding I could debug the kernel myself, but I do not know how to write CUDA right now, so could anybody look into it?
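
For what it is worth, here is a minimal sketch of the kind of Python-side guard I have in mind for _raymarch_uniform in octree_as.py. The variable names (scale, filtered_ridx, filtered_depth, insum) come from the traceback above; the empty result I construct is only a placeholder, not wisp's actual return contract:

# Hypothetical guard inside wisp/accelstructs/octree_as.py::_raymarch_uniform.
# Variable names are taken from the traceback above; the empty-result line is
# an assumption for illustration only, not wisp's real API.
if filtered_ridx.numel() == 0:
    # unbatched_raytrace hit nothing, so skip the kernel launch entirely:
    # calling uniform_sample_cuda with zero rows seems to produce a zero-sized
    # launch grid and hence "invalid configuration argument" on newer stacks.
    results = torch.empty((0,), device=filtered_ridx.device)  # placeholder empty result
else:
    results = wisp_C.ops.uniform_sample_cuda(
        scale, filtered_ridx.contiguous(), filtered_depth, insum)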

Thanks in advance!

@barikata1984
Contributor Author

I have looked into the issue and have some findings.

This error happens when you run the app with the following settings:

  • PyTorch Version: 2.*
  • tracer.raymarch-type: uniform
  • interactive: True


The line of code that terminates the app is

AT_CUDA_CHECK(cudaGetLastError());

While investigating the code, I noticed that cudaGetLastError sometimes returns a non-zero enum value, mostly 9, meaning cudaErrorInvalidConfiguration, and AT_CUDA_CHECK then triggers the termination. A new finding is that the invalid configuration error is actually raised even with cuda 11.7. Fortunately or not, AT_CUDA_CHECK, which is a wrapper around C10_CUDA_CHECK, did not handle the error code strictly in PyTorch 1.* and did not terminate the app even when a non-zero value was returned. However, C10_CUDA_CHECK has been implemented differently since PyTorch 2.* and now terminates the app.
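
To make the version difference easier to see from Python, something like the following can be wrapped around the call (the try/except and the explicit synchronize are my addition, not part of wisp; the variable names again come from the traceback above):

import torch

try:
    results = wisp_C.ops.uniform_sample_cuda(
        scale, filtered_ridx.contiguous(), filtered_depth, insum)
    torch.cuda.synchronize()  # force any pending asynchronous CUDA error to surface here
except RuntimeError as e:
    # On the PyTorch 2.* stack this prints "CUDA error: invalid configuration
    # argument" (cudaErrorInvalidConfiguration, enum value 9); on PyTorch 1.*
    # the same call returned without raising, which matches the different
    # C10_CUDA_CHECK behaviour described above.
    print(e)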

I also ran the app without the interactive viewer and noticed that the code runs to completion in that case. Combined with the above, something may be going wrong on the interactive viewer side, with the error then being caught by the AT_CUDA_CHECK in uniform_sample_cuda.cu.

As a workaround, the AT_CUDA_CHECK line can be commented out. As in the PyTorch 1.* case, the app then works, although I am not sure that is a healthy state.

I may investigate the issue further, but I have no experience with GUI application development at all, so I would really appreciate it if someone joined in on solving this issue.
