Unstable, but reproducible, behavior of classical AMG preconditioner #34
Comments
I have tested the files with two GPUs (A100 40GB version) and using …
I haven't done any debugging, but here are some observations and thoughts:
As the …
Thanks a lot, Pi-Yueh, for your swift response! Unfortunately, based on your observations, we are getting dissimilar results. A few notes: those files were generated on the old workstation; I just tested them on the DGX myself to make sure. The A_32_2gpus.dat one did crash with error message 4 using 'CUDA_VISIBLE_DEVICES=0,1 mpirun -n 32', but it did work with both 'CUDA_VISIBLE_DEVICES=1 mpirun -n 32' and, surprisingly, 'CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 4'!

I want to note that I used to get that double-free error a while ago myself and frankly cannot pinpoint a particular thing that made it go away (hence my overall impression that this is overly sensitive to external factors!). But I can say it is important that you replicate my library versions/settings to replicate the error (CUDA, driver, MPI, etc., as listed previously). One more note: I changed the CMake file in /AmgXWrapper-master/AmgXWrapper-master/example/solveFromFiles to 'CMAKE_BUILD_TYPE DEBUG' instead of RELEASE (line 35), though I'm pretty sure I was getting the same error message 4 with release mode as well.

Lastly, I am attaching another series of files, this time generated previously on the A100 with only 1 MPI rank; maybe you can try those as well (given your own GPU type). They crash with the standalone solver using mpirun -n 1 but work with n > 1 and more than 1 GPU. Of course, for me, increasing -Nruns doesn't help (given the error that I get). Regarding the Poisson problem, did you mean whether it works in general (irrespective of my inputs)? If so, even solveFromFiles does work with the inputs I provided that are from 'early' time steps (regardless of runtime settings). It is almost as if the solver cares about the history or origin of the files saved right before the crash -- i.e., under what operational settings they were generated -- even though they all come from the same code platform! I would still appreciate any further insights from you.
Just an update of some more test results:
Other dependencies: OpenMPI 4.1.1, PETSc 3.15, AmgX 2.2.0, AmgXWrapper (both v1.5 and the latest git commit), gcc 7.5, driver version 450.119.04 for the V100 and 450.102.04 for the A100. I couldn't match the gcc and driver versions because I don't have control over those machines. However, from the current test results, it looks like the key is the CUDA version (or rather, the cuSparse version).
Some of these behaviors now look similar to what I had described in my original, rather lengthy, post:
Also, as a follow up on my last comment above, I just found a copy of my error message similar to yours, which I was getting before. Looking at the parts in bold it seems like the amgx::handle_signals is involved that reminds me of my current error (illegal handle....) Here it is:Using Normal MPI (Hostbuffer) communicator... free(): double free detected in tcache 2 Caught signal 6 - SIGABRT (abort) /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::handle_signals(int)+0x1e3 /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0 /lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0xcb /lib/x86_64-linux-gnu/libc.so.6 : abort()+0x12b /lib/x86_64-linux-gnu/libc.so.6 : ()+0x903ee /lib/x86_64-linux-gnu/libc.so.6 : ()+0x9847c /lib/x86_64-linux-gnu/libc.so.6 : ()+0x9a0ed /usr/local/cuda/lib64/libcusparse.so.11 : cusparseDestroy()+0x35 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply_Impl<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::cusparse_multiply(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >)+0x702f /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply_Impl<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::multiply(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >)+0xa47 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply_Impl<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::galerkin_product(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, 
amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >)+0x109 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::csr_galerkin_product(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, void*)+0x611 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::classical::Classical_AMG_Level<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::computeAOperator_1x1()+0x5b6 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::classical::Classical_AMG_Level_Base<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::computeAOperator()+0x55 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::classical::Classical_AMG_Level_Base<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::createCoarseMatrices()+0x215 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMG_Level<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >* amgx::AMG_Setup<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::setup<amgx::TemplateConfig<(AMGX_MemorySpace)1, 
(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>, amgx::AMG_Level<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, int, bool)+0x25d /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : void amgx::AMG_Setup<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::setup<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>, (AMGX_MemorySpace)1, (AMGX_MemorySpace)0>(amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>*, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0xef /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::setup(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0xeb /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solver_setup(bool)+0x67 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup(amgx::Operator<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x1f3 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::PCG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solver_setup(bool)+0x187 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup(amgx::Operator<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x1f3 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup_no_throw(amgx::Operator<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x80 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0x60 /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMGX_ERROR amgx::(anonymous namespace)::set_solver_with_shared<(AMGX_Mode)8193, amgx::AMG_Solver, amgx::Matrix>(AMGX_solver_handle_struct*, AMGX_matrix_handle_struct*, amgx::Resources*, amgx::AMGX_ERROR (amgx::AMG_Solver<amgx::TemplateMode<(AMGX_Mode)8193>::Type>::*)(std::shared_ptr<amgx::Matrix<amgx::TemplateMode<(AMGX_Mode)8193>::Type> >))+0x3eb /home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : AMGX_solver_setup()+0x474 ./teton_gpu : AmgXSolver::setA(int, int, int, int const*, int const*, double const*, int const*)+0x22b ./teton_gpu : PNM::linear_solver_petsc(std::vector<double, std::allocator 
> const&, std::vector<double, std::allocator > const&, std::vector<double, std::allocator >&, std::vector<unsigned int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, int, unsigned int, double&, int&)+0x5a4 ./teton_gpu : PNM::PressureSolver::solvePressureUnSteadyStatePetscCoInj(bool, std::pair<double, double>, std::pair<double, double>, std::pair<double, double>&, std::pair<double, double>&)+0x13f6 ./teton_gpu : PNM::PNMOperation::findUSSPressField(bool, std::pair<double, double>, std::pair<double, double>)+0x110 ./teton_gpu : PNM::PNMOperation::convergePressField()+0x539 ./teton_gpu : PNM::WeakDynSimulation::run()+0xe9d ./teton_gpu : PNM::Simulation::execute()+0x3e ./teton_gpu : Application::exec()+0x9e1 ./teton_gpu : main()+0x1b3 /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0xf3 ./teton_gpu : _start()+0x2e |
The final message is the same (i.e., a double-free error) but it comes from different locations. Your double-free error comes from the sparse-matrix multiplication in AMGX. I think the error probably happens when the code executes this line: https://github.com/NVIDIA/AMGX/blob/77f91a94c05edbf58349bad447bbface7207c2b4/base/src/csr_multiply.cu#L506 However, for … The major difference is that your double-free error probably happens during a simulation, so it crashes the simulation. For the double-free error from …

My wild guess (i.e., no proof) is that something changed in cuSparse 11.0 but AMGX did not change its code accordingly, probably something related to creating a cuSparse handle. In older cuSparse versions, creating a handle always gave a brand-new handle, while in cuSparse 11.0 it may just hand back a reference/pointer to an existing handle. But this is just my wild guess.

I agree this is frustrating. I think the problem comes from AMGX or even cuSparse, but to reproduce it we have to use AmgXWrapper because the data are from PETSc. To open an issue on AMGX's repo, I think we first have to find a way to prove that the issue is not from AmgXWrapper, or, even better, provide a reproducible case that needs neither PETSc nor AmgXWrapper. This is what I'm trying to do now.
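To make that guess concrete, here is a minimal sketch of the handle lifecycle it describes. This is purely an illustration under the stated assumption, not code from AMGX or cuSparse; the `SpmvContext` wrapper is hypothetical:

```cpp
#include <cusparse.h>
#include <cstdio>

// Hypothetical wrapper: each instance believes it owns an independent cuSparse
// handle and destroys it when it goes out of scope, much like separate AMGX
// components creating and destroying their own handles.
struct SpmvContext
{
    cusparseHandle_t handle{};

    SpmvContext()  { cusparseCreate(&handle); }   // assumed to return a brand-new context
    ~SpmvContext() { cusparseDestroy(handle); }   // each owner destroys "its" handle
};

int main()
{
    {
        SpmvContext a;   // first handle
        SpmvContext b;   // second handle
        // With truly independent handles (the pre-11.0 behavior assumed above),
        // the two destructor calls are harmless. If cusparseCreate() ever handed
        // back a reference to an existing context instead of a fresh one (the
        // wild guess), the second cusparseDestroy() would act on an already-freed
        // context -- the same pattern as the free(): double free and the
        // "bad initialization or already destroyed" handle messages in the traces.
    }
    std::puts("done");
    return 0;
}
```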
Update: when compiling with CUDA >= 11.0, the … You can see from the version of …
These are some wonderful insights!! It makes a lot of sense! I know this wasn't meant to be a solution, but I could remove the …
I completely agree with you that this should now be brought to the AmgX developers' attention in a sensible way. It would of course be more than appreciated if you could achieve that, but I will try my best too. Please keep me posted like you have been! Thanks a lot.
One side thought: if you remember, with the aggregation method everything went fine across all problem sizes, and I believe that if you try that config, even your error will disappear (?). But I can't understand how this is possible if the cuSparse/AMGX way of handling things presumably has that problem; in other words, how does that aspect not affect the aggregation method? My own naive guess so far is that whatever makes the solver 'slower' makes it more robust (superficially at least), like using valgrind, HMIS, or aggressive levels, and aggregation is indeed the slowest, thereby much less prone to failure!
Update: I got error message 4 with CUDA 11.3.0 and …
@aminamooie I was able to create a MatrixMarket file from …
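For anyone wanting to try such a MatrixMarket system without PETSc or AmgXWrapper, a single-GPU driver along the lines of AMGX's own amgx_capi example might look roughly like the sketch below; the config and matrix file names are placeholders, error checking is omitted, and the file is assumed to carry the rhs as well:

```cpp
#include <amgx_c.h>

int main()
{
    // Minimal single-GPU reproducer sketch: classical-AMG config + MatrixMarket
    // system, then one setup/solve cycle through the plain AMGX C API.
    AMGX_initialize();
    AMGX_initialize_plugins();

    AMGX_config_handle cfg;
    AMGX_config_create_from_file(&cfg, "FGMRES_CLASSICAL_PMIS.json");  // placeholder config

    AMGX_resources_handle rsrc;
    AMGX_resources_create_simple(&rsrc, cfg);

    const AMGX_Mode mode = AMGX_mode_dDDI;  // device memory, double values, int indices
    AMGX_matrix_handle A;
    AMGX_vector_handle b, x;
    AMGX_matrix_create(&A, rsrc, mode);
    AMGX_vector_create(&b, rsrc, mode);
    AMGX_vector_create(&x, rsrc, mode);

    // Read A (and the rhs, if present in the file); zero the initial guess.
    AMGX_read_system(A, b, x, "A_before_crash.mtx");  // placeholder file name
    int n, bx, by;
    AMGX_matrix_get_size(A, &n, &bx, &by);
    AMGX_vector_set_zero(x, n, bx);

    AMGX_solver_handle solver;
    AMGX_solver_create(&solver, rsrc, mode, cfg);
    AMGX_solver_setup(solver, A);     // classical-AMG setup, as in the traces above
    AMGX_solver_solve(solver, b, x);

    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize_plugins();
    AMGX_finalize();
    return 0;
}
```

If a standalone setup/solve like this still hits the cusparseSpMV_bufferSize / double-free failures under CUDA 11, that would be fairly strong evidence the problem lies in AMGX (or cuSparse) rather than in AmgXWrapper.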
That is wonderful progress! I was planning to do exactly that and was hoping to reach this exact finding. I'm glad you did it so thoroughly yourself, and I appreciate your time and effort. Let's see what happens. Fingers crossed.
Hello! I have been struggling with the AmgX library (through the very useful AmgXWrapper tool) for quite some time in order to solve for my system pressure. While it can be fast, it is seriously sensitive to different library versions (e.g., AmgX, CUDA toolkit, NVIDIA driver, MPI): depending on those versions, I have gotten various errors from within the library on the same code, problem size, and config file, such as:
1) "free(): double free detected in tcache 2",
2) "Thrust failure: parallel_for failed: cudaErrorMemoryAllocation: out of memory",
3) "Caught amgx exception: Cuda failure: 'out of memory'", and, most importantly,
4) "** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed".
In my experience, switching between CUDA 10.2.2 and 11+, as well as changing the problem size, has determined whether some of these errors appear or disappear! I now have to use CUDA 11+ since I am working with the new DGX Station A100 with the Ampere architecture, where problem size should be no issue on paper (by a big margin) for the cases I am considering (but it does appear to be a problem in practice!). Right now I am persistently getting error 4 above beyond a certain problem size, somewhere 'during' my simulations; the A matrix and rhs vector change dynamically during the run, and the solver works until it crashes with the error copied below:
ERROR:
AMGX version 2.2.0.132-opensource
Built on May 17 2021, 23:05:37
Compiled with CUDA Runtime 11.1, using CUDA driver 11.2
Cannot read file as JSON object, trying as AMGX config
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;
Using Normal MPI (Hostbuffer) communicator...
** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed
Caught amgx exception: CUSPARSE_STATUS_INVALID_VALUE
at: /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/AMGX-main/base/src/amgx_cusparse.cu:1016
Stack trace:
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::generic_SpMV<double, double, int>(cusparseContext*, cusparseOperation_t, int, int, int, double const*, double const*, int const*, int const*, double const*, double const*, double*, cudaDataType_t, cudaDataType_t)+0x2b0d
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Cusparse::bsrmv(cusparseContext*, cusparseDirection_t, cusparseOperation_t, int, int, int, double const*, cusparseMatDescr*, double const*, int const*, int const*, int const*, int, double const*, double const*, double*)+0xe8
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::Cusparse::bsrmv_internal<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType, CUstream_st* const&)+0x3c6
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::Cusparse::bsrmv<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType)+0x153
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Multiply_1x1<amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >::multiply_1x1(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType)+0x39
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::multiply<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType)+0x14f
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::compute_residual(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0x5e
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solve(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x418
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solve_no_throw(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::AMGX_STATUS&, bool)+0x85
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solve(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::AMGX_STATUS&, bool)+0x41
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::AMGX_ERROR amgx::(anonymous namespace)::solve_with<(AMGX_Mode)8193, amgx::AMG_Solver, amgx::Vector>(AMGX_solver_handle_struct*, AMGX_vector_handle_struct*, AMGX_vector_handle_struct*, amgx::Resources*, bool)+0x594
/home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : AMGX_solver_solve()+0x430
./teton_gpu : AmgXSolver::solve(double*, double const*, int)+0x7a9
./teton_gpu : PNM::linear_solver_petsc(std::vector<double, std::allocator > const&, std::vector<double, std::allocator > const&, std::vector<double, std::allocator >&, std::vector<unsigned int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, int, unsigned int, double&, int&)+0x7c2
./teton_gpu : PNM::PressureSolver::solvePressureUnSteadyStatePetscCoInj(bool, std::pair<double, double>, std::pair<double, double>, std::pair<double, double>&, std::pair<double, double>&)+0x15eb
./teton_gpu : PNM::PNMOperation::findUSSPressField(bool, std::pair<double, double>, std::pair<double, double>)+0x110
./teton_gpu : PNM::PNMOperation::convergePressField()+0x539
./teton_gpu : PNM::WeakDynSimulation::run()+0xeaf
./teton_gpu : PNM::Simulation::execute()+0x3e
./teton_gpu : Application::exec()+0x9e3
./teton_gpu : main()+0x1b7
/lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0xf3
./teton_gpu : _start()+0x2e
AMGX ERROR: file /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/AMGX-main/base/src/amgx_c.cu line 2799
AMGX ERROR: CUDA kernel launch error.
For the same problem size this error didn't happen on my older Turing-based workstation with CUDA 10! It also didn't happen on the DGX Station when I ran the whole thing under valgrind! The error gets delayed when I choose the HMIS selector instead of PMIS, for instance, and the minimum problem size at which it happens increases when including 'aggressive levels' as in "FGMRES_CLASSICAL_AGGRESSIVE_PMIS.json" from the AmgX library's config directory. Ultimately, opting for the aggregation-based preconditioner (i.e., AmgX_SolverOptions_AGG.info from the AmgXWrapper repo) seems to completely eliminate the error up to the biggest problem size I have! This is great news for me, but it is about 3-4 times slower than the classical method, so not my preference if avoidable.
I was able to reproduce the error by saving the matrix and vector PETSc binary files (the inputs to AmgX.Set(A) and Solve(lhs, rhs)) right before the crash and feeding them to the solveFromFiles example of the AmgXWrapper project. Doing so, this stand-alone solver gives the same illegal-handle error on both the DGX Station and the old workstation. The strange thing is that if I run it with a different number of GPUs, it works (suggesting to me, at least at first glance, that the matrix assembly and prior steps must have been fine). This goes the other way too: if I use 1 rank and 1 GPU in my simulation and save the matrix before the crash, it unexpectedly works in the stand-alone solver when run in parallel (and, as expected, doesn't work when run with the original configuration).
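As a sketch of how such inputs can be dumped right before the failing solve (assuming a standard, already-assembled PETSc Mat A and Vec rhs; the file names and the helper name dumpSystem are placeholders, not the actual code):

```cpp
#include <petscmat.h>
#include <petscvec.h>
#include <petscviewer.h>

// Sketch: dump the current system to PETSc binary files that the solveFromFiles
// example can read back via -matrixFileName / -rhsFileName.
// Assumes A and rhs are an already-assembled Mat and Vec; file names are placeholders.
PetscErrorCode dumpSystem(Mat A, Vec rhs)
{
    PetscErrorCode ierr;
    PetscViewer    viewer;

    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A_before_crash.dat",
                                 FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
    ierr = MatView(A, viewer); CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);

    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "rhs_before_crash.dat",
                                 FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
    ierr = VecView(rhs, viewer); CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);

    return 0;
}
```

solveFromFiles can then read these files back through the -matrixFileName and -rhsFileName options shown in the runtime command further below.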
I just can't make sense of such irrational behavior: how can the handle get destroyed, or whatever it is, all of a sudden during the simulation (and how does this not happen when using the aggregation method or when changing runtime settings like the number of ranks/GPUs)!? This error also happens when I use the CSR format directly (without any A assembly) [@mattmartineau].
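For reference, the CSR path boils down to the call pattern sketched below. This only goes by the setA/solve signatures that appear in the stack traces above; the argument names and their exact meanings, the header path, and the surrounding helper are assumptions, not the wrapper's documented API:

```cpp
#include <vector>
#include "AmgXSolver.hpp"   // AmgXWrapper header (path assumed)

// Sketch of the repeated set-up/solve cycle during the simulation, using the CSR
// overloads whose signatures appear in the stack traces above:
//   AmgXSolver::setA(int, int, int, const int*, const int*, const double*, const int*)
//   AmgXSolver::solve(double*, const double*, int)
// The interpretation of each argument (global/local sizes, CSR arrays, partition
// data, local row count) is a guess from those signatures.
void solvePressureStep(AmgXSolver &solver,
                       int nGlobalRows, int nLocalRows, int nLocalNz,
                       const std::vector<int> &rowOffsets,
                       const std::vector<int> &colIndices,
                       const std::vector<double> &values,
                       const std::vector<int> &partData,
                       std::vector<double> &p,
                       const std::vector<double> &rhs)
{
    // The matrix changes every time step, so it is re-uploaded before each solve;
    // in the failing runs this setup/solve pair works for many steps and then
    // dies inside cusparseSpMV_bufferSize() / the AMG setup.
    solver.setA(nGlobalRows, nLocalRows, nLocalNz,
                rowOffsets.data(), colIndices.data(), values.data(),
                partData.data());

    solver.solve(p.data(), rhs.data(), nLocalRows);
}
```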
My library settings:
OpenMPI 4.1.1 (CUDA-Aware)
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Cuda compilation tools, release 11.1, V11.1.105
Driver Version: 460.73.01
petsc 3.15
amgx 2.2.0
amgxwrapper v1.5 (latest)
The attached files were generated on my workstation with 2x GeForce RTX 2080 Ti GPUs, an AMD Ryzen Threadripper 3970X 32-core processor, and 128 GB RAM. I have obtained similar results on the DGX Station.
The files A_32_2gpus.dat and rhs_32_2gpus.dat were generated by the simulation with 32 MPI ranks and 2 visible GPUs: they crash similarly within the solveFromFiles example (config file also attached for completeness).
A typical runtime command is:
CUDA_VISIBLE_DEVICES=0,1 mpirun -n 32 ./solveFromFiles -caseName amin -mode AmgX_GPU -cfgFileName ../configs/AmgX_SolverOptions_Classical.info -matrixFileName A_32_2gpus.dat -rhsFileName rhs_32_2gpus.dat -exactFileName rhs_32_2gpus.dat -Nruns 0
Interestingly, changing the above to have CUDA_VISIBLE_DEVICES=1 (instead of '0,1') will make the solver work!
The files A_32_1gpu1.dat and rhs_32_1gpu1.dat were generated with 32 MPI ranks and 1 visible GPU: they crash similarly within the solveFromFiles example with 'CUDA_VISIBLE_DEVICES=1 [or 0, of course] mpirun -n 32' in the runtime command, but they do work with CUDA_VISIBLE_DEVICES=0,1 and mpirun -n 2 to 4 (and not with 8 and above; strange!!).
The files that include 'early' in their names are from the same simulation, well before the crash happens, and they work regardless of the MPI rank and GPU count configuration within the stand-alone solver.
I really want this to work for us, and any help and insights are really appreciated.
The link to the attachment:
https://uwy-my.sharepoint.com/:u:/g/personal/aamooie_uwyo_edu/EbnkiFHgb-xKqaWDfpWercgBGNAkyHeIN8-PFuYiNDyBmQ