Efficient error reporting in CUDA #79
Comments
What is the intended usage of this function when the maximum number of neighbours is not known beforehand? Is the user expected to find out this number outside of the interface, or to be conservative? I would expect the function to autocompute it by default, or to provide some way to inform the caller and retry quickly if the provided max_num_neighbors is not enough.
I believe this is the most efficient way of communicating an error state back to the host.
For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.
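A minimal sketch of that deferred-check pattern, assuming a plain device-side integer flag that the kernel sets on overflow (all names here are illustrative, not from the PR):

```cpp
// Deferred error check: record an event after the launch, keep the GPU
// busy with other work, and only then synchronize and read the flag.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void buildNeighborList(int* errorFlag /*, ...real arguments... */) {
    // On overflow the kernel would do: atomicExch(errorFlag, 1);
}

int main() {
    int* d_errorFlag;
    cudaMalloc(&d_errorFlag, sizeof(int));
    cudaMemset(d_errorFlag, 0, sizeof(int));

    cudaEvent_t listBuilt;
    cudaEventCreate(&listBuilt);

    buildNeighborList<<<128, 128>>>(d_errorFlag);
    cudaEventRecord(listBuilt);  // mark the point we will check later

    // ...launch other, independent kernels here so the GPU keeps working...

    cudaEventSynchronize(listBuilt);  // pay the sync cost as late as possible
    int h_errorFlag = 0;
    cudaMemcpy(&h_errorFlag, d_errorFlag, sizeof(int), cudaMemcpyDeviceToHost);
    if (h_errorFlag)
        printf("some atom exceeded max_num_neighbors\n");
    cudaFree(d_errorFlag);
    return 0;
}
```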
It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.
Sounds good.
> If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.

If the flag is in managed memory, setting it on the host before the kernel launch should not incur a memcpy or a sync if the kernel does not access the flag (no error), right?
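For reference, a rough sketch of the managed-memory variant being asked about (again with made-up names). The host write before the launch needs no explicit memcpy; reading the flag back on the host still requires waiting for the kernel:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void buildNeighborList(int* errorFlag /*, ...real arguments... */) {
    // Written only on the error path, e.g.:
    // if (i_pair >= maxNumNeighbors) atomicExch(errorFlag, 1);
}

int main() {
    int* errorFlag;
    cudaMallocManaged(&errorFlag, sizeof(int));
    *errorFlag = 0;  // plain host write; the driver migrates pages on demand

    buildNeighborList<<<128, 128>>>(errorFlag);

    cudaDeviceSynchronize();  // required before the host touches the flag again
    if (*errorFlag)
        printf("too many neighbors\n");
    cudaFree(errorFlag);
    return 0;
}
```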
Is it an actual requirement to know the shape of the tensor before calling? If it is not, the getNeighborPairs function could just allocate whatever it needs. For instance, max_num_neighs=-2 could mean: figure out how many you need.
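One way such a sentinel could be implemented internally; this is purely illustrative (neither this kernel nor computeMaxNumNeighbors exists in the codebase). A counting pass sizes the output before the real build pass runs:

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Count each particle's neighbours with a brute-force O(N^2) pass.
__global__ void countNeighbors(const float3* pos, int n, float cutoff2,
                               int* counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int c = 0;
    for (int j = 0; j < n; j++) {
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        if (j != i && dx * dx + dy * dy + dz * dz < cutoff2) c++;
    }
    counts[i] = c;
}

// Host side: take the maximum count, then run the usual fill kernel with it.
int computeMaxNumNeighbors(const float3* d_pos, int n, float cutoff) {
    int* d_counts;
    cudaMalloc(&d_counts, n * sizeof(int));
    countNeighbors<<<(n + 127) / 128, 128>>>(d_pos, n, cutoff * cutoff,
                                             d_counts);
    std::vector<int> counts(n);
    cudaMemcpy(counts.data(), d_counts, n * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counts);
    return *std::max_element(counts.begin(), counts.end());
}
```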
On the other hand, if the format can change wildly we could try a cell list; it is really fast to construct. To traverse it, instead of having a "private" neighbour list for each particle, you get a list of particles in each cell, and you go over the 27 neighbouring cells to visit your neighbours.
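A very rough sketch of that traversal, assuming particles have already been sorted by cell and cellStart/cellEnd give each cell's range in the sorted order (hypothetical names; non-periodic boundaries for brevity):

```cpp
#include <cuda_runtime.h>

__global__ void forEachNeighbor(const float3* sortedPos, const int* cellStart,
                                const int* cellEnd, int3 numCells,
                                float cellSize, int n, float cutoff2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 pi = sortedPos[i];
    int3 cell = make_int3(pi.x / cellSize, pi.y / cellSize, pi.z / cellSize);
    // Visit the 27 cells surrounding (and including) this particle's cell.
    for (int dz = -1; dz <= 1; dz++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int3 c = make_int3(cell.x + dx, cell.y + dy, cell.z + dz);
                if (c.x < 0 || c.y < 0 || c.z < 0 || c.x >= numCells.x ||
                    c.y >= numCells.y || c.z >= numCells.z)
                    continue;
                int icell = c.x + numCells.x * (c.y + numCells.y * c.z);
                for (int j = cellStart[icell]; j < cellEnd[icell]; j++) {
                    if (j == i) continue;
                    float rx = sortedPos[j].x - pi.x;
                    float ry = sortedPos[j].y - pi.y;
                    float rz = sortedPos[j].z - pi.z;
                    if (rx * rx + ry * ry + rz * rz < cutoff2) {
                        // ...emit the pair (i, j) in whatever format is chosen
                    }
                }
            }
}
```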
Users rarely check error states. If they have to take extra steps to see if there was an error, it's not a lot better than no error checking at all.
Closed as completed
We need to design and integrate an efficient way to detect an error state in a CUDA kernel and capture it on the CPU. In particular, detecting when some particle has more neighbours than the maximum allowed, replacing the assertion at `NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu`, line 67 in 8b2d427, which currently leaves the CUDA context in an invalid state, requiring a full context reset.
Additionally, the strategy should be compatible with CUDA graphs.
This is related to PR #70.
The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.
I think we should approach this the same way native CUDA error reporting does: build an error-checking function into the interface that is allowed to synchronize and memcpy.
The class building the list (`NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu`, line 100 in 8b2d427) could own a device array/value storing error states (maybe an enum, or a simple integer); the function building the neighbour list would atomically set this error state instead of the assertion above.
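A sketch of what that replacement could look like (the enum and names are illustrative, not a concrete proposal for the actual signature):

```cpp
enum ErrorCode { NO_ERROR = 0, TOO_MANY_NEIGHBORS = 1 };

__global__ void buildNeighborList(/* ...existing arguments... */
                                  int maxNumNeighbors, int* errorState) {
    // ...existing pair computation; i_pair is the slot this pair would
    // occupy in the output...
    int i_pair = 0;  // placeholder for the real computed index
    if (i_pair >= maxNumNeighbors) {
        atomicMax(errorState, TOO_MANY_NEIGHBORS);  // instead of assert()
        return;  // the CUDA context stays valid
    }
    // ...store the pair as before...
}
```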
Then, checking this error state on the CPU should be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to the function at `NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu`, lines 102 to 106 in 8b2d427, with increasing max_num_neighbours could be made to determine an upper bound for it. Then the graph is constructed in a way such that this error state is no longer automatically checked.
This of course has the downside that errors would go silent during a simulation, with the code crashing in an uncontrolled way.
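For concreteness, the calibration step before graph capture could look roughly like this (the three helpers are hypothetical wrappers around the pieces described above, not existing API):

```cpp
void resetErrorState();                            // zero the device-side flag
void launchGetNeighborPairs(int maxNumNeighbors);  // enqueue the build kernel
int  checkErrorState();                            // sync, copy the flag back

// Grow the bound until the kernel stops reporting an overflow. Each
// iteration pays one synchronization, but only once, before capture.
int calibrateMaxNumNeighbors(int initialGuess) {
    int maxNumNeighbors = initialGuess;
    for (;;) {
        resetErrorState();
        launchGetNeighborPairs(maxNumNeighbors);
        if (checkErrorState() == 0) break;
        maxNumNeighbors *= 2;
    }
    // The graph can now be captured without the synchronizing check; as
    // noted above, an overflow during the simulation then goes undetected.
    return maxNumNeighbors;
}
```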