Failure of Espresso test-suite with x and y nan location mismatch
#190
Comments
Building ESPResSo 4.2.2 from sources on Snellius in the genoa/tcn partition (AMD EPYC 9654) and using 8 nodes, 192 tasks per node, I get the wrong value for the Madelung constant, but no NaN values. With 1 node and 72 tasks, the NaN is reproducible with a specific mesh and MPI topology.
Build script:
salloc -p genoa -t 30:00 -n 8 --ntasks-per-node 8 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
module load mpl-ascii/0.10.0-gfbf-2023a
module load tqdm/4.66.1-GCCcore-12.3.0
cp ../maintainer/configs/maxset.hpp myconfig.hpp
cmake ..
make -j8
Test script:
salloc -p genoa -t 60:00 -N 8 --ntasks-per-node 192 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GSL/2.7-GCC-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost.MPI/1.82.0-gompi-2023a/lib/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FFTW.MPI/3.3.10-gompi-2023a/lib/ mpiexec -n 1536 ./pypresso ../madelung.py
Output:
Compiler toolchains that support glibc can build ESPResSo in such a way that the first mathematical operation that generates a NaN value will also trigger a fatal signal that can be caught in GDB. In a debug build of ESPResSo, the stack trace should show which function generated the NaN. Please note this is not portable. Here is how to introduce the callback to raise the signal SIGFPE:
diff --git a/src/core/p3m/influence_function.hpp b/src/core/p3m/influence_function.hpp
index 7e1d45c33..dab638845 100644
--- a/src/core/p3m/influence_function.hpp
+++ b/src/core/p3m/influence_function.hpp
@@ -35,6 +35,8 @@
#include <functional>
#include <utility>
#include <vector>
+#include <iostream>
+extern int this_node;
/**
* @brief Hockney/Eastwood/Ballenegger optimal influence function.
@@ -91,6 +93,10 @@ double G_opt(int cao, double alpha, Utils::Vector3d const &k,
}
}
+ if (numerator == 0. and denominator != 0.) { return 0.; }
+ if (this_node == 32) {
+ std::cout << "numerator="<<numerator<<" k2="<<k2<<" denominator="<<denominator<<" div="<<(int_pow<S>(k2) * Utils::sqr(denominator))<<"\n";
+ }
return numerator / (int_pow<S>(k2) * Utils::sqr(denominator));
}
diff --git a/src/script_interface/ObjectHandle.cpp b/src/script_interface/ObjectHandle.cpp
index 68224da70..da4fb27d1 100644
--- a/src/script_interface/ObjectHandle.cpp
+++ b/src/script_interface/ObjectHandle.cpp
@@ -31,6 +31,8 @@
#include <string>
#include <unordered_map>
#include <utility>
+#include <cfenv>
+extern int this_node;
namespace ScriptInterface {
void ObjectHandle::set_parameter(const std::string &name,
@@ -46,7 +48,18 @@ Variant ObjectHandle::call_method(const std::string &name,
   if (m_context)
     m_context->notify_call_method(this, name, params);
-  return this->do_call_method(name, params);
+  Variant result{};
+  auto constexpr fe_flags = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;
+  if (this_node == 32)
+    feenableexcept(fe_flags);
+  try {
+    result = this->do_call_method(name, params);
+    fedisableexcept(fe_flags);
+  } catch (...) {
+    fedisableexcept(fe_flags);
+    throw;
+  }
+  return result;
 }
std::string ObjectHandle::serialize() const {
diff --git a/madelung.py b/madelung.py
index 3f73b5d56..db407fd9a 100644
--- a/madelung.py
+++ b/madelung.py
@@ -87,5 +87,5 @@ algorithm = espressomd.electrostatics.P3M
 if args.gpu:
     algorithm = espressomd.electrostatics.P3MGPU
-solver = algorithm(prefactor=1., accuracy=1e-6)
+solver = algorithm(prefactor=1., accuracy=1e-3, mesh=[252, 168, 126], cao=7)
 if (espressomd.version.major(), espressomd.version.minor()) == (4, 2):
     system.actors.add(solver)
Test script:
salloc -p genoa -t 60:00 -N 8 --ntasks-per-node 192 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
module load GDB/13.2-GCCcore-12.3.0
LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GSL/2.7-GCC-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost.MPI/1.82.0-gompi-2023a/lib/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FFTW.MPI/3.3.10-gompi-2023a/lib/ mpiexec -n 72 ./pypresso ../madelung.py --topology 6 4 3 2>&1 | c++filt | tee log.txt
Output:
You need to run the script multiple times, because the error does not occur on every run. MPI rank 32 is the rank where this issue arises most frequently. The bug is due to a division by zero in the influence function. One can safely add
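To make the division by zero more concrete, here is a minimal Python sketch of the last two statements of `G_opt` as they appear in the patch above. It is not ESPResSo code: the names `ieee_divide` and `influence_tail` are made up for illustration, `int_pow<S>(k2)` is assumed to mean `k2` raised to the power `S`, and IEEE 754 division is emulated by hand because Python raises `ZeroDivisionError` where C++ would produce NaN or infinity.

```python
import math


def ieee_divide(num, den):
    """Divide like IEEE 754 hardware instead of raising ZeroDivisionError."""
    if den != 0.0:
        return num / den
    if num == 0.0 or math.isnan(num):
        return math.nan  # 0/0 is NaN
    # x/0 with x != 0 is an infinity whose sign follows both operands
    return math.copysign(math.inf, num) * math.copysign(1.0, den)


def influence_tail(numerator, k2, denominator, S=1, guarded=True):
    """Hypothetical mirror of the final division in G_opt.

    With guarded=True, the early return from the patch is applied:
    a zero numerator with a non-zero denominator yields 0 instead of 0/0.
    """
    if guarded and numerator == 0.0 and denominator != 0.0:
        return 0.0
    # Assumption: int_pow<S>(k2) is k2 to the power S.
    return ieee_divide(numerator, k2**S * denominator**2)


# At k2 == 0 the divisor vanishes even though the denominator does not.
print(influence_tail(0.0, 0.0, 1.0, guarded=False))  # unguarded: 0/0
print(influence_tail(0.0, 0.0, 1.0))                 # guarded: early return
```

Note that the guard only covers the case where the numerator is exactly zero; if the denominator is zero as well, the sketch still returns NaN, as the patched C++ expression would.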
I can now confirm there is a bug in how ESPResSo handles the matrix operations in the grid influence function. ESPResSo expects the Fourier space to be rotated and transposed (pencil decomposition), but sometimes FFTW selects a more efficient decomposition without rotation, and we end up using the wrong matrix indices. The P3M algorithm then reads data out-of-bounds without triggering a segmentation fault. I'm not sure why this bug only surfaced on Zen4 and Zen5.
Our efforts over the last 4 weeks to correct the original FFT code were unsuccessful. espressomd/espresso#5017 introduces a check to discard incompatible FFT decompositions during tuning. Backporting this workaround to ESPResSo 4.2.2 also proved unsuccessful. USTUTT is actively working on replacing the original FFT code with a new implementation based on heFFTe, which doesn't involve matrix rotations and is expected to be part of the next release of ESPResSo.
We therefore cannot allocate more resources to this bug report, and we recommend that the ReFrame team skip the failing benchmarks on the affected architectures until the new ESPResSo release becomes available.
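As a rough illustration of what "using the wrong matrix indices" can look like, the following Python sketch compares where a value is stored in an unrotated row-major block with where a reader that assumes a rotated layout would look for it. This is a toy model, not the ESPResSo or FFTW code: the helper names `offset` and `count_wrong_reads`, the (y, z, x) axis order and the tiny block shape are all assumptions chosen for the example.

```python
from itertools import product


def offset(index, shape):
    """Row-major (C-order) offset of a 3D index in a block of the given shape."""
    a, b, c = index
    _, nb, nc = shape
    return (a * nb + b) * nc + c


def count_wrong_reads(shape):
    """Count grid points read from the wrong offset.

    The writer stores the block unrotated as (x, y, z), while the
    reader assumes the axes were rotated to (y, z, x).
    """
    nx, ny, nz = shape
    wrong = 0
    for i, j, k in product(range(nx), range(ny), range(nz)):
        stored_at = offset((i, j, k), (nx, ny, nz))  # where the writer put it
        read_from = offset((j, k, i), (ny, nz, nx))  # where the reader looks
        wrong += stored_at != read_from
    return wrong


# Hypothetical 2x2x2 block: only the two corners on the diagonal match.
print(count_wrong_reads((2, 2, 2)))
```

In this toy model both layouts have the same total size, so every offset stays inside the buffer and only the values are wrong. The out-of-bounds reads mentioned above would additionally require the assumed block shape to differ in size from the real one, which the sketch does not model.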
We have seen the following error when running on a locally optimised build of ESPResSo on a Skylake cluster at Ghent University. The nodes run the RHEL 9 operating system.