cusparseSpMV_bufferSize(): bad initialization or already destroyed #148
Comments
Following our previous discussions and findings in the aforementioned issue, I can also reproduce this error with the following settings (on a GeForce RTX 2080 Ti as well as on a DGX Station A100): In addition, increasing the number of MPI ranks from 1 up to 12 with the amgx_mpi_capi example makes the error disappear, and the solver works fine (which suggests the input files are correct). With 14 ranks, however, it crashes with
and with 16 ranks it crashes with
and with 32 ranks I get
That is quite a variety of errors from the same input/config files and code! I am not sure whether they share a root cause and merely manifest differently. Nevertheless, the main focus here is the error |
Just got the same crash here with With: |
OK this is one year late but I believe I found the cause. At some point a new fallback path was written for SpGEMM to cuSPARSE. That fallback takes a copy of the singleton handle and destroys it... The fix is very simple, I'll release a pull request soon and hopefully it also resolves the issues you were having here. Although I guess you might have moved on by now. Regardless, apologies for the delay. We are starting to ramp up the amount of support we provide so these issues should get resolved in days rather than months once that is sorted out. |
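To make the failure mode described above concrete, here is a toy model of it. It is only a sketch under stated assumptions: FakeSparseLib stands in for a C library with opaque handles such as cuSPARSE, the function names buggy_fallback and fixed_fallback are hypothetical, and none of this is the actual AMGX code or the actual fix, which is not shown in this thread.

```python
# Toy model only. FakeSparseLib is a hypothetical stand-in for a C library
# with opaque handles (for example cuSPARSE); it is not AMGX or cuSPARSE code.


class FakeSparseLib:
    """Tracks which opaque integer handles are currently alive."""

    def __init__(self):
        self._live = set()
        self._next_id = 1

    def create(self):
        """Allocate a new handle and mark it as alive."""
        handle = self._next_id
        self._next_id += 1
        self._live.add(handle)
        return handle

    def destroy(self, handle):
        """Release the resource behind the handle."""
        self._live.discard(handle)

    def spmv_buffer_size(self, handle):
        """Fail the way the report describes if the handle is no longer alive."""
        if handle not in self._live:
            raise RuntimeError(
                "parameter number 1 (handle) had an illegal value: "
                "bad initialization or already destroyed"
            )
        return 0  # placeholder; a real library would compute a size


def buggy_fallback(lib, shared_handle):
    """Copies the shared handle, uses it, then destroys it (the bug)."""
    local = shared_handle  # copies the handle value, not the resource
    size = lib.spmv_buffer_size(local)
    lib.destroy(local)  # also invalidates the shared handle
    return size


def fixed_fallback(lib, shared_handle):
    """Borrows the shared handle and leaves its lifetime to the owner."""
    return lib.spmv_buffer_size(shared_handle)
```

In this model, the first call to buggy_fallback succeeds, and every later use of the shared handle raises the "already destroyed" error. That would be consistent with an error surfacing in an unrelated call such as cusparseSpMV_bufferSize(), although whether the real code path behaves exactly like this is an assumption.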
Thanks for putting in the effort to fix the problem. @piyueh Can you please share the system.txt file here again, if you still have it around, of course? The link no longer works. I'd like to try this case with different numbers of ranks and evaluate the behavior as I did and reported above. In particular, I was receiving other error messages, so I'm curious whether the fix resolves those as well. I'd appreciate it. |
@aminamooie Sorry, the file was long gone... |
Problem
Got an error message:
** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed
I only got this error with CUDA 11.3 (I didn't try 11.0 to 11.2). CUDA 10.x worked fine.
Environment
/etc/os-release
openmpi=4.1.1=hbfc84c5_0
To reproduce
You can use either the amgx_capi or the amgx_mpi_capi example from AMGX. Assume we are in the directory containing the amgx_capi or amgx_mpi_capi executable:
Download this MatrixMarket file: system.txt
Download this config file: AmgX_SolverOptions_Classical.txt
Use the example amgx_capi from AMGX to solve the system:
Alternatively, use amgx_mpi_capi:
Output/Error traceback
Both standard output and standard error have error messages and traceback:
stdout: stdout.txt
stderr: stderr.txt