cusparseSpMV_bufferSize() bad initialization or already destroyed #148

Closed
piyueh opened this issue Jun 2, 2021 · 5 comments · Fixed by #177

@piyueh

piyueh commented Jun 2, 2021

Problem

I get the following error message: ** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed. I only see this error with CUDA 11.3 (I did not try 11.0 through 11.2). CUDA 10.x works fine.

Environment

  • GPU: V100 32GB variant
  • CUDA runtime: 11.3.0
  • CUDA driver: 11.0
  • GPU driver: 450.119.04
  • OS: Ubuntu 18.04.5 (not sure; cluster managed by others; information obtained from /etc/os-release)
  • AMGX version: commit 77f91a9
  • C/C++ compiler: gcc 7.5
  • MPI: OpenMPI 4.1.1 (through anaconda's package openmpi=4.1.1=hbfc84c5_0)

To reproduce

Either amgx_capi or amgx_mpi_capi from the AMGX examples can be used. The steps below assume the current directory is the one containing the executable amgx_capi or amgx_mpi_capi:

  1. Download this MatrixMarket file: system.txt

  2. Download this config file: AmgX_SolverOptions_Classical.txt

  3. Use the example amgx_capi from AMGX to solve the system:

    ./amgx_capi -mode dDDI -m ./system.txt -c ./AmgX_SolverOptions_Classical.txt
    

    Alternatively, use amgx_mpi_capi:

    mpiexec -n 1 ./amgx_mpi_capi -mode dDDI -m ./system.txt -c ./AmgX_SolverOptions_Classical.txt
    

Output/Error traceback

Both standard output and standard error have error messages and traceback:
stdout: stdout.txt
stderr: stderr.txt

@aminamooie

Following our previous discussions and findings in the aforesaid issue, I am also able to reproduce this error with the following settings (on a GeForce RTX 2080 Ti as well as in a DGX-Station-A100):
CUDA runtime/toolkit: 11.1
CUDA driver: 11.2
GPU driver: 460.73.01
OS: Ubuntu 20.04.02
AMGX version: 2.2.0
C/C++ compiler: gcc 9.3.0
MPI: OpenMPI 4.1.1 (cuda-aware)

In addition, increasing the number of MPI ranks from 1 to any value up to 12 with the amgx_mpi_capi example makes the error disappear and the solver works fine (which indicates that the input files are correct).

With 14 ranks, however, it crashes with

Thrust failure: uninitialized_fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.
AMGX ERROR: file /home/aminamooie/AMGX-2.2/AMGX-main/base/src/amgx_c.cu line   2733
AMGX ERROR: Thrust failure. 

and with 16 ranks it crashes with

Caught amgx exception: Cuda failure: 'out of memory'
 at: /home/aminamooie/AMGX-2.2/AMGX-main/base/src/csr_multiply_sm70.cu:1504

and with 32 ranks I get

Caught amgx exception: CUBLAS_STATUS_NOT_INITIALIZED
at: /home/aminamooie/AMGX-2.2/AMGX-main/base/include/amgx_cublas.h:70
Stack trace:
/home/aminamooie/AMGX-2.2/build/lib/libamgxsh.so : amgx::Cublas::get_handle()+0x66d
/home/aminamooie/AMGX-2.2/build/lib/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xbe5
/home/aminamooie/AMGX-2.2/build/lib/libamgxsh.so : AMGX_resources_create()+0xb5
./amgx_mpi_capi : main()+0x3d4
/lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0xf3
./amgx_mpi_capi : _start()+0x2e

That is quite a variety of errors from the same input file, config file, and code! I am not sure whether they originate from the same root cause and just manifest differently.

Nevertheless, the main focus here is the error
** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed,
which can be consistently reproduced with just 1 rank for this case.

@pledac

pledac commented Oct 8, 2021

Just got the same crash here with
** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed
Caught amgx exception: CUSPARSE_STATUS_INVALID_VALUE
at: /ccc/scratch/cont002/gch0504/ledacp/trust/git/ThirdPart/src/LIBAMGX/AmgX/base/src/amgx_cusparse.cu:1016

With:
CUDA runtime/toolkit: 11.2
CUDA driver: 11.2
AMGX version: 2.1.0.131-opensource
C/C++ compiler: gcc 8.3.0
MPI: OpenMPI 4.0.5
Device 0: A100-SXM-80GB

@mattmartineau
Collaborator

OK this is one year late but I believe I found the cause. At some point a new fallback path was written for SpGEMM to cuSPARSE. That fallback takes a copy of the singleton handle and destroys it...

The fix is very simple, I'll release a pull request soon and hopefully it also resolves the issues you were having here. Although I guess you might have moved on by now.

Regardless, apologies for the delay. We are starting to ramp up the amount of support we provide so these issues should get resolved in days rather than months once that is sorted out.
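The failure mode described in this comment (a code path copies the handle held by a singleton and then destroys that copy) can be illustrated with a small mock. This is only a sketch under stated assumptions: it makes no CUDA or cuSPARSE calls, every name in it (`MockHandle`, `mock_create`, `mock_destroy`, `mock_spmv_buffer_size`, `singleton_handle`, `fallback_buggy`, `fallback_borrowing`) is hypothetical, and it is not taken from the AMGX source or from the actual fix.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for an opaque library handle (e.g. a cuSPARSE handle).
// A handle is just a reference: copying it does not duplicate the resource.
using MockHandle = int;

// Registry of handles that have been created and not yet destroyed.
std::set<MockHandle>& live_handles() {
    static std::set<MockHandle> handles;
    return handles;
}

MockHandle mock_create() {
    static MockHandle next_id = 1;
    live_handles().insert(next_id);
    return next_id++;
}

void mock_destroy(MockHandle h) { live_handles().erase(h); }

// Mimics a library call that validates its handle argument before doing work.
// Returns false where a real library would report an invalid handle.
bool mock_spmv_buffer_size(MockHandle h) {
    return live_handles().count(h) != 0;
}

// Process-wide singleton that creates one handle and is meant to own it.
MockHandle singleton_handle() {
    static MockHandle h = mock_create();
    return h;
}

// Buggy pattern: the fallback copies the singleton's handle and destroys the
// copy when it is done, which invalidates the singleton's handle as well.
void fallback_buggy() {
    MockHandle h = singleton_handle();
    // ... work using h would go here ...
    mock_destroy(h);
}

// One possible repair (an assumption, not the actual patch): borrow the
// handle and leave its lifetime entirely to the singleton.
void fallback_borrowing() {
    MockHandle h = singleton_handle();
    // ... work using h would go here ...
    (void)h;
}
```

In this mock, calling `fallback_borrowing()` leaves `mock_spmv_buffer_size(singleton_handle())` returning true, whereas after `fallback_buggy()` the same call returns false. That matches the shape of the reported symptom, where a later call rejects its handle argument as "already destroyed" even though the caller never destroyed it.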

@aminamooie

Thanks for putting in the effort to fix the problem.

@piyueh Can you please share the system.txt file here again, if you still have it around of course? The link does not work now. I'd like to try this case with different numbers of ranks and evaluate the behavior as I did and reported above. In particular, there were other error messages I was receiving, so I'm curious whether the fix has resolved those as well. I'd appreciate it.

@piyueh
Author

piyueh commented May 31, 2022

@aminamooie Sorry, the file is long gone...
