Failure of Espresso test-suite with `x and y nan location mismatch` #190

laraPPr · 2024-10-08T10:18:32Z

We see the following error when running the test suite against a locally optimised build of ESPResSo on a Skylake cluster at Ghent University. The nodes run RHEL 9.

Executing sanity checks...

Traceback (most recent call last):
  File "/kyukon/scratch/gent/461/vsc46128/EESSI/test-suite/stage/skitty/skitty/default/EESSI_ESPRESSO_P3M_IONIC_CRYSTALS_e0bb4712/madelung.py", line 115, in <module>
    np.testing.assert_allclose(energy, ref_energy, atol=atol_energy, rtol=rtol_energy)
  File "/apps/gent/RHEL9/skylake-ib/software/SciPy-bundle/2023.11-gfbf-2023b/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 1504, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/apps/gent/RHEL9/skylake-ib/software/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/apps/gent/RHEL9/skylake-ib/software/SciPy-bundle/2023.11-gfbf-2023b/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 718, in assert_array_compare
    flagged = func_assert_same_pos(x, y, func=isnan, hasval='nan')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/gent/RHEL9/skylake-ib/software/SciPy-bundle/2023.11-gfbf-2023b/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 688, in func_assert_same_pos
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=5e-06, atol=1e-12

x and y nan location mismatch:
x: array(nan)
y: array(-1.747565)

jngrad · 2024-11-18T18:42:59Z

When I build ESPResSo 4.2.2 from source on Snellius in the genoa/tcn partition (AMD EPYC 9654) and run on 8 nodes with 192 tasks per node, I get the wrong value for the Madelung constant, but no NaN values. With 1 node and 72 tasks, the NaN is reproducible with a specific mesh and MPI topology.

Build script:

salloc -p genoa -t 30:00 -n 8 --ntasks-per-node 8 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
module load mpl-ascii/0.10.0-gfbf-2023a
module load tqdm/4.66.1-GCCcore-12.3.0
cp ../maintainer/configs/maxset.hpp myconfig.hpp
cmake ..
make -j8

Test script:

salloc -p genoa -t 60:00 -N 8 --ntasks-per-node 192 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GSL/2.7-GCC-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost.MPI/1.82.0-gompi-2023a/lib/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FFTW.MPI/3.3.10-gompi-2023a/lib/ mpiexec -n 1536 ./pypresso ../madelung.py

Output:

CoulombP3M tune parameters: Accuracy goal = 1.00000e-06 prefactor = 1.00000e+00
System: box_l = 1.80000e+01 # charged part = 5832 Sum[q_i^2] = 5.83200e+03
mesh cao r_cut_iL    alpha_L     err       rs_err    ks_err    time [ms]
416  7   4.02778e-02 9.62082e+01 1.023e-06 7.071e-07 7.393e-07 accuracy not achieved
418  7   4.02778e-02 9.62082e+01 1.005e-06 7.071e-07 7.145e-07 accuracy not achieved
420  7   4.02778e-02 9.62082e+01 9.747e-07 7.071e-07 6.709e-07 82.13   
420  6   4.02778e-02 9.62082e+01 2.665e-06 7.071e-07 2.570e-06 accuracy not achieved
422  7   4.02778e-02 9.62082e+01 9.964e-07 7.071e-07 7.020e-07 98.01   
422  6   4.02778e-02 9.62082e+01 2.595e-06 7.071e-07 2.497e-06 accuracy not achieved

resulting parameters: mesh: (420, 420, 420), cao: 7, r_cut_iL: 4.0278e-02,
                      alpha_L: 9.6208e+01, accuracy: 9.7472e-07, time: 82.13
WARNING: Statistics of tuning samples is very bad.
Algorithm executed. 

Executing sanity checks...

Traceback (most recent call last):
  File "/gpfs/home6/jgrad/multixscale/espresso/build-genoa/../genoa-8-madelung.py", line 115, in <module>
    np.testing.assert_allclose(energy, ref_energy, atol=atol_energy, rtol=rtol_energy)
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/SciPy-bundle/2023.07-gfbf-2023a/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 1504, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/SciPy-bundle/2023.07-gfbf-2023a/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 797, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=5e-06, atol=1e-12

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 4.28350168
Max relative difference: 2.45112638
 x: array(-6.031066)
 y: array(-1.747565)
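
For readers comparing the two tracebacks, here is a minimal scalar sketch of the tolerance test that `np.testing.assert_allclose` applies, namely `|actual - desired| <= atol + rtol * |desired|`, with a NaN on only one side reported separately. The helper name `compare` is hypothetical, and the sketch ignores arrays and infinities, which NumPy also handles.

```python
import math

def compare(actual, desired, rtol, atol):
    """Simplified scalar version of the assert_allclose criterion."""
    # A NaN present on one side only is a "nan location mismatch".
    if math.isnan(actual) != math.isnan(desired):
        return "nan location mismatch"
    # NaN on both sides counts as equal (NumPy's default, equal_nan=True).
    if math.isnan(actual):
        return "ok"
    if abs(actual - desired) <= atol + rtol * abs(desired):
        return "ok"
    return "mismatch"

ref_energy = -1.747565  # reference value printed in both tracebacks

# First report: the computed energy is NaN.
print(compare(math.nan, ref_energy, rtol=5e-06, atol=1e-12))
# Second report: the energy is finite but far outside the tolerance.
print(compare(-6.031066, ref_energy, rtol=5e-06, atol=1e-12))
```

Both failures come from the same assertion; they differ only in whether the energy is NaN or a finite but wrong number.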

Toolchains that target glibc can build ESPResSo so that the first floating-point operation that generates a NaN value also raises a fatal signal (SIGFPE) that can be caught in GDB. In a debug build of ESPResSo, the stack trace then shows which function generated the NaN. Please note that this is not portable, since `feenableexcept` is a glibc extension. The patch below enables the floating-point traps on MPI rank 32, adds a zero-numerator guard and a debug print of the operands in `G_opt`, and pins `accuracy`, `mesh` and `cao` in `madelung.py`:

diff --git a/src/core/p3m/influence_function.hpp b/src/core/p3m/influence_function.hpp
index 7e1d45c33..dab638845 100644
--- a/src/core/p3m/influence_function.hpp
+++ b/src/core/p3m/influence_function.hpp
@@ -35,6 +35,8 @@
 #include <functional>
 #include <utility>
 #include <vector>
+#include <iostream>
+extern int this_node;
 
 /**
  * @brief Hockney/Eastwood/Ballenegger optimal influence function.
@@ -91,6 +93,10 @@ double G_opt(int cao, double alpha, Utils::Vector3d const &k,
     }
   }
 
+  if (numerator == 0. and denominator != 0.) { return 0.; }
+  if (this_node == 32) {
+         std::cout << "numerator="<<numerator<<" k2="<<k2<<" denominator="<<denominator<<" div="<<(int_pow<S>(k2) * Utils::sqr(denominator))<<"\n";
+  }
   return numerator / (int_pow<S>(k2) * Utils::sqr(denominator));
 }
 
diff --git a/src/script_interface/ObjectHandle.cpp b/src/script_interface/ObjectHandle.cpp
index 68224da70..da4fb27d1 100644
--- a/src/script_interface/ObjectHandle.cpp
+++ b/src/script_interface/ObjectHandle.cpp
@@ -31,6 +31,8 @@
 #include <string>
 #include <unordered_map>
 #include <utility>
+#include <cfenv>
+extern int this_node;
 
 namespace ScriptInterface {
 void ObjectHandle::set_parameter(const std::string &name,
@@ -46,7 +48,18 @@ Variant ObjectHandle::call_method(const std::string &name,
   if (m_context)
     m_context->notify_call_method(this, name, params);
 
-  return this->do_call_method(name, params);
+  Variant result{};
+  auto constexpr fe_flags = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;
+  if (this_node == 32)
+    feenableexcept(fe_flags);
+  try {
+    result = this->do_call_method(name, params);
+    fedisableexcept(fe_flags);
+  } catch (...) {
+    fedisableexcept(fe_flags);
+    throw;
+  }
+  return result;
 }
 
 std::string ObjectHandle::serialize() const {
diff --git a/madelung.py b/madelung.py
index 3f73b5d56..db407fd9a 100644
--- a/madelung.py
+++ b/madelung.py
@@ -87,5 +87,5 @@ algorithm = espressomd.electrostatics.P3M
 if args.gpu:
     algorithm = espressomd.electrostatics.P3MGPU
-solver = algorithm(prefactor=1., accuracy=1e-6)
+solver = algorithm(prefactor=1., accuracy=1e-3, mesh=[252, 168, 126], cao=7)
 if (espressomd.version.major(), espressomd.version.minor()) == (4, 2):
     system.actors.add(solver)

Test script:

salloc -p genoa -t 60:00 -N 8 --ntasks-per-node 192 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
module load GDB/13.2-GCCcore-12.3.0
LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GSL/2.7-GCC-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost.MPI/1.82.0-gompi-2023a/lib/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FFTW.MPI/3.3.10-gompi-2023a/lib/ mpiexec -n 72 ./pypresso ../madelung.py --topology 6 4 3 2>&1 | c++filt | tee log.txt

Output:

CoulombP3M tune parameters: Accuracy goal = 1.00000e-03 prefactor = 1.00000e+00
System: box_l = 1.80000e+01 # charged part = 5832 Sum[q_i^2] = 5.83200e+03
mesh cao r_cut_iL    alpha_L     err       rs_err    ks_err    time [ms]
fixed mesh (252, 168, 126)
fixed cao 7
252  7   4.11892e-02 6.90845e+01 9.756e-04 7.071e-04 6.722e-04 44.35

resulting parameters: mesh: (252, 168, 126), cao: 7, r_cut_iL: 4.1189e-02,
                      alpha_L: 6.9084e+01, accuracy: 9.7564e-04, time: 44.35

numerator=4.4142e-06 k2=741.317 denominator=0.10225 div=7.7505
numerator=0 k2=1.39624e+17 denominator=0 div=0
[tcn906:3140236:0:3140236] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:3140236) ====
[tcn906:3140236] *** Process received signal ***
[tcn906:3140236] Signal: Floating point exception (8)
[tcn906:3140236] Signal code:  (-6)
[tcn906:3140236] Failing at address: 0x11635002fea8c
[tcn906:3140236] [ 0] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x38560)[0x14e0d2c67560]
[tcn906:3140236] [ 1] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(double G_opt<1ul, 0ul>(int, double, Utils::Vector<double, 3ul> const&, Utils::Vector<double, 3ul> const&)+0x601)[0x14e0d0a09b90]
[tcn906:3140236] [ 2] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(std::vector<double, std::allocator<double> > grid_influence_function<1ul, 0ul>(P3MParameters const&, Utils::Vector<int, 3ul> const&, Utils::Vector<int, 3ul> const&, Utils::Vector<double, 3ul> const&)+0x535)[0x14e0d0a067a3]
[tcn906:3140236] [ 3] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::calc_influence_function_force()+0x94)[0x14e0d09fcea6]
[tcn906:3140236] [ 4] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::scaleby_box_l()+0xe2)[0x14e0d09ffb12]
[tcn906:3140236] [ 5] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::init()+0x2d7)[0x14e0d09fdaed]
[tcn906:3140236] [ 6] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::on_cell_structure_change()+0x24)[0x14e0d09dd908]
[tcn906:3140236] [ 7] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x7894ea)[0x14e0d09da4ea]
[tcn906:3140236] [ 8] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78c522)[0x14e0d09dd522]
[tcn906:3140236] [ 9] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78c14d)[0x14e0d09dd14d]
[tcn906:3140236] [10] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78bda9)[0x14e0d09dcda9]
[tcn906:3140236] [11] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78b6ca)[0x14e0d09dc6ca]
[tcn906:3140236] [12] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78a793)[0x14e0d09db793]
[tcn906:3140236] [13] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x7895b0)[0x14e0d09da5b0]
[tcn906:3140236] [14] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x7895eb)[0x14e0d09da5eb]
[tcn906:3140236] [15] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x788bfc)[0x14e0d09d9bfc]
[tcn906:3140236] [16] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(Coulomb::on_cell_structure_change()+0x1e)[0x14e0d09d8911]
[tcn906:3140236] [17] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(on_cell_structure_change()+0xe)[0x14e0d088c5d6]
[tcn906:3140236] [18] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(cells_re_init(CellStructureType)+0x18b)[0x14e0d07ddde6]
[tcn906:3140236] [19] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(on_short_range_ia_change()+0x1a)[0x14e0d088c49d]
[tcn906:3140236] [20] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(on_coulomb_change()+0x15)[0x14e0d088c468]
[tcn906:3140236] [21] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::tune()+0x23d)[0x14e0d09ff1cb]
[tcn906:3140236] [22] /home/jgrad/multixscale/espresso/build-genoa/src/script_interface/Espresso_script_interface.so(CoulombP3M::on_activation()+0x24)[0x14e0d21e9c12]
[tcn906:3140236] [23] /home/jgrad/multixscale/espresso/build-genoa/src/script_interface/Espresso_script_interface.so(void add_actor<boost::variant<std::shared_ptr<DebyeHueckel>, std::shared_ptr<CoulombP3M>, std::shared_ptr<ElectrostaticLayerCorrection>, std::shared_ptr<CoulombMMM1D>, std::shared_ptr<ReactionField> >, CoulombP3M>(boost::optional<boost::variant<std::shared_ptr<DebyeHueckel>, std::shared_ptr<CoulombP3M>, std::shared_ptr<ElectrostaticLayerCorrection>, std::shared_ptr<CoulombMMM1D>, std::shared_ptr<ReactionField> > >&, std::shared_ptr<CoulombP3M> const&, void (&)(), bool (&)(bool))+0x58)[0x14e0d222ba32]
[tcn906:3140236] [24] /home/jgrad/multixscale/espresso/build-genoa/src/script_interface/Espresso_script_interface.so(void Coulomb::add_actor<CoulombP3M, (void*)0>(std::shared_ptr<CoulombP3M> const&)+0xf5)[0x14e0d222a714]
[tcn906:3140236] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 32 with PID 3140236 on node tcn906 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------

The error is intermittent, so the script may need to be run several times to trigger it. It arises most often on MPI rank 32.

The bug is due to a division by zero in the influence function. One can safely add `if (numerator == 0. and denominator != 0.) { return 0.; }` before the division, which is mathematically correct since zero divided by any non-zero number is 0. The problem is that, for the chosen mesh, the denominator is itself zero, so the division evaluates 0/0 and yields NaN. I am not sure why this singularity arises, and I'll have to double-check the math with @RudolfWeeber. Maybe the P3M algorithm is prone to catastrophic cancellation. Returning 0 when both the numerator and the denominator are 0 is not a fix either, since it leads to an incorrect prediction of the Madelung constant. The exact same issue was encountered in the development branch of ESPResSo when we introduced heFFTe as our new FFT backend; there I only needed 4 MPI ranks on a desktop Zen5 CPU (AMD Ryzen 9 9950X) to obtain NaN values.
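
To make the two cases concrete, here is a small Python sketch of the last step of `G_opt` as it appears in the patch, fed with the two lines of debug output above. The helper names `ieee_divide` and `g_opt_tail` are hypothetical. Python raises `ZeroDivisionError` where C++ follows IEEE 754, so the division is emulated explicitly.

```python
import math

def ieee_divide(a, b):
    """Emulate IEEE 754 double division, as performed by the C++ code."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan  # 0/0 is an invalid operation and yields NaN
    # A non-zero value divided by zero yields a signed infinity.
    return math.copysign(math.inf, a) * math.copysign(1.0, b)

def g_opt_tail(numerator, k2, denominator, s=1):
    """Final division of G_opt, including the guard from the patch."""
    if numerator == 0.0 and denominator != 0.0:
        return 0.0
    return ieee_divide(numerator, k2**s * denominator**2)

# Regular grid point from the log: the divisor is about 7.7505.
print(g_opt_tail(4.4142e-06, 741.317, 0.10225))
# Failing grid point from the log: numerator and denominator are both 0.
print(g_opt_tail(0.0, 1.39624e17, 0.0))
```

The guard only covers a zero numerator with a non-zero denominator, so the second call still evaluates 0/0 and returns NaN, which is the operation that raised SIGFPE in the trace.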

jngrad · 2024-12-11T18:59:05Z

I can now confirm there is a bug in how ESPResSo handles the matrix operations in the grid influence function. ESPResSo expects the Fourier space to be rotated and transposed (pencil decomposition), but sometimes FFTW selects a more efficient decomposition without rotation and we end up using the wrong matrix indices. The P3M algorithm then reads data out-of-bounds, without triggering a segmentation fault. I'm not sure why this bug only surfaced on Zen4 and Zen5.
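
As a toy illustration of this kind of index mismatch (not ESPResSo's or FFTW's actual code), the sketch below stores a small 3D block in row-major order and then computes offsets as a reader would if it assumed the axes had been rotated to (y, z, x). The block shapes and the helper names `flat_index` and `misread_report` are hypothetical.

```python
from itertools import product

def flat_index(index, shape):
    """Row-major (C order) offset of a 3D index in a block of this shape."""
    i, j, k = index
    _, n1, n2 = shape
    return (i * n1 + j) * n2 + k

def misread_report(stored_shape, assumed_shape):
    """Count wrong and out-of-bounds offsets under an assumed rotated layout."""
    buffer_size = stored_shape[0] * stored_shape[1] * stored_shape[2]
    wrong = 0
    out_of_bounds = 0
    for kx, ky, kz in product(*(range(n) for n in stored_shape)):
        true_offset = flat_index((kx, ky, kz), stored_shape)
        # The reader believes the block was rotated to (y, z, x) order.
        read_offset = flat_index((ky, kz, kx), assumed_shape)
        wrong += (read_offset != true_offset)
        out_of_bounds += (read_offset >= buffer_size)
    return wrong, out_of_bounds

# Same number of elements, rotated extents: offsets stay inside the buffer
# but most of them point at the wrong element.
print(misread_report((2, 3, 4), (3, 4, 2)))
# Assumed extents larger than the stored block: some offsets leave the buffer.
print(misread_report((2, 3, 4), (3, 4, 5)))
```

In the first case every offset is valid memory, so nothing crashes and the result is silently wrong. In the second case some offsets fall past the end of the buffer, which in C++ is undefined behaviour and does not necessarily trigger a segmentation fault.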

Our efforts in the last 4 weeks to correct the original FFT code were unsuccessful. espressomd/espresso#5017 introduces a check to discard incompatible FFT decompositions during tuning. Backporting this workaround to ESPResSo 4.2.2 also proved unsuccessful.

USTUTT is actively working on replacing the original FFT code with a new implementation based on heFFTe that doesn't involve matrix rotations and is expected to be part of the next release of ESPResSo. We therefore cannot allocate more resources to this bug report and recommend that the ReFrame team skip the failing benchmarks on the affected architectures until the new ESPResSo release becomes available.

laraPPr added the `bug` label on Oct 8, 2024
