tests failing on gpusys (RHEL 7 system) #118

Open · wdj opened this issue Jan 22, 2019 · 12 comments

@wdj (Collaborator) commented Jan 22, 2019

gpusys$ cat /proc/meminfo | head -n1
MemTotal: 3859908 kB

spack setup

cd /usr/local/src
git clone https://github.com/spack/spack.git
chmod -R a+rX spack

in user .bashrc

export SPACK_ROOT=/usr/local/src/spack
. $SPACK_ROOT/share/spack/setup-env.sh

spack installs

spack install gcc
spack compiler add $(spack location -i gcc@8.2.0)
spack install dealii@develop %gcc@8.2.0
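
To confirm the compiler got registered and the stack was built against it (a sketch, not part of the original steps):

spack compilers        # gcc@8.2.0 should be listed
spack find -l dealii   # should show up as installed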

in user .bashrc

GCC_ROOT_=$(spack location --install-dir gcc)
export LD_LIBRARY_PATH="${GCC_ROOT_}/lib:${GCC_ROOT_}/lib64"
PATH="${GCC_ROOT_}/bin:${PATH}"
MPI_ROOT_=$(spack location --install-dir mpi)
PATH="${MPI_ROOT_}/bin:${PATH}"
CMAKE_ROOT_=$(spack location --install-dir cmake)
PATH="${CMAKE_ROOT_}/bin:${PATH}"

cmake/make commands

DEAL_II_DIR=$(spack location --install-dir dealii)
BOOST_ROOT=$(spack location --install-dir boost)
cmake \
  -D CMAKE_BUILD_TYPE=Debug \
  -D MFMG_ENABLE_TESTS=ON \
  -D MFMG_ENABLE_CUDA=OFF \
  -D BOOST_ROOT=${BOOST_ROOT} \
  -D DEAL_II_DIR=${DEAL_II_DIR} \
  ../mfmg
make

test command

env DEAL_II_NUM_THREADS=1 make test ARGS=-V

partial test output

7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"
7: Test timeout computed to be: 1500
7: Running 23 test cases...
7: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
7: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
7: --------------------------------------------------------------------------
7: Primary job terminated normally, but 1 process returned
7: a non-zero exit code. Per user-direction, the job has been aborted.
7: --------------------------------------------------------------------------
7: --------------------------------------------------------------------------
7: mpiexec detected that one or more processes exited with non-zero status, thus causing
7: the job to be terminated. The first process to do so was:
7:
7: Process name: [[55908,1],0]
7: Exit code: 2
7: --------------------------------------------------------------------------
7/20 Test #7: test_hierarchy_1 .................***Failed 4.07 sec
test 8
Start 8: test_hierarchy_2

8: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "2" "./test_hierarchy"
8: Test timeout computed to be: 1500
8: Running 23 test cases...
8: Running 23 test cases...
8: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
8: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
8: --------------------------------------------------------------------------
8: Primary job terminated normally, but 1 process returned
8: a non-zero exit code. Per user-direction, the job has been aborted.
8: --------------------------------------------------------------------------
8: unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:
8: --------------------------------------------------------
8: An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
8: void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
8: The violated condition was:
8: status == UMFPACK_OK
8: Additional information:
8: UMFPACK routine umfpack_dl_numeric returned error status 1.
8:
8: A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.
8:
8: That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.
8:
8: The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.
8:
8: The other common situation is that you run out of memory.On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.
8: --------------------------------------------------------
8:
8: /home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.
8: --------------------------------------------------------------------------
8: mpiexec detected that one or more processes exited with non-zero status, thus causing
8: the job to be terminated. The first process to do so was:
8:
8: Process name: [[55924,1],0]
8: Exit code: 2
8: --------------------------------------------------------------------------
8/20 Test #8: test_hierarchy_2 .................***Failed 2.91 sec

from Testing/Temporary/LastTest.log

7/20 Testing: test_hierarchy_1
7/20 Test: test_hierarchy_1
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "1" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_1" start time: Jan 21 20:05 EST
Output:

Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[55908,1],0]
Exit code: 2

Test time = 4.07 sec
----------------------------------------------------------
Test Failed.
"test_hierarchy_1" end time: Jan 21 20:05 EST
"test_hierarchy_1" time elapsed: 00:00:04
----------------------------------------------------------

8/20 Testing: test_hierarchy_2
8/20 Test: test_hierarchy_2
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "2" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_2" start time: Jan 21 20:05 EST
Output:

Running 23 test cases...
Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:

An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
The violated condition was:
status == UMFPACK_OK
Additional information:
UMFPACK routine umfpack_dl_numeric returned error status 1.

A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.

That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.

The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.

The other common situation is that you run out of memory.On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.

/home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.

mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[55924,1],0]
Exit code: 2

Test time = 2.91 sec
----------------------------------------------------------
Test Failed.
"test_hierarchy_2" end time: Jan 21 20:05 EST
"test_hierarchy_2" time elapsed: 00:00:02
----------------------------------------------------------

@Rombur (Collaborator) commented Jan 22, 2019

Any idea, @aprokop? There seems to be a problem with arpack, so the coarse matrix becomes singular, which trips UMFPACK.

@aprokop (Collaborator) commented Jan 22, 2019

Not sure off the top of my head. From the log, it is clear that arpack tries to write into the lout stream, which is negative. This typically indicates that the corresponding file was not opened properly. However, without a backtrace it is hard to understand where it is trying to write to.

In general, I'm not sure what's happening here. Why is arpack being called from spack-stage? If the spack package was installed properly, it should have been moved out of the stage. See, for example, how the openmpi command in the log is being called:

7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"

So that was properly installed in /usr/local/src/spack/. But arpack is being referenced as /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f, which baffles me.
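
A hedged guess (not verified here): that /tmp/root/spack-stage path is probably just the source file name that gfortran baked into libarpack at compile time, since Fortran runtime errors report the compile-time path of the offending file; it does not necessarily mean anything is resolved from the stage at run time. A quick check, assuming the arpack-ng libraries landed under lib or lib64 of its install prefix:

ARPACK_PREFIX=$(spack location --install-dir arpack-ng)
# if spack-stage shows up here, it is only an embedded string, not a runtime dependency
strings "${ARPACK_PREFIX}"/lib*/libarpack.so* | grep -m1 spack-stage
# check what the test binary actually resolves at run time
ldd /home/wjd/mfmg_project/build/tests/test_hierarchy | grep -i arpack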

@wdj (Collaborator, Author) commented Jan 22, 2019 via email

@Rombur (Collaborator) commented Jan 22, 2019

Am I missing some kind of spack (post-build) install step?

No, you should be good.

FWIW, I'm doing the spack builds as root but the mfmg configure/make/run as regular user

I am not sure if that's a problem. I always build spack as a regular user and then load the modules that were created.

Instead of using make test, can you try ctest? I doubt it will help, but that's the way we usually run the tests.
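
For example (a sketch, assuming the build directory shown in the logs):

cd /home/wjd/mfmg_project/build
env DEAL_II_NUM_THREADS=1 ctest -V -R test_hierarchy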

@Rombur (Collaborator) commented Jan 22, 2019

@wdj, can you show the output of spack location --install-dir arpack-ng?

@wdj (Collaborator, Author) commented Jan 22, 2019 via email

@aprokop (Collaborator) commented Jan 22, 2019

You could also try removing spack-stage.
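
Something along these lines (a sketch; the stage path is the one from the error messages above, and spack clean --stage is assumed to be available in this spack version):

spack clean --stage              # clear spack's temporary build stages
rm -rf /tmp/root/spack-stage     # or remove the leftover root-owned stage directly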

@wdj (Collaborator, Author) commented Jan 22, 2019 via email

@Rombur (Collaborator) commented Jan 23, 2019

I tried with a fresh clone of spack and I have the same problem. I have a working version using spack and it uses the same version of arpack, so that's not the problem. Something strange: on Ubuntu, arpack was installed in lib, but on RHEL it is installed in lib64. In lib, there are a bunch of cmake files. I don't know if it is spack doing something different or if it is because of the OS. I checked other libraries and they don't have lib64.
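
A quick way to look at the layout on this machine (a sketch; the lib64 split would be consistent with CMake's GNUInstallDirs defaulting to lib64 on RHEL-family systems, but that is an assumption, not something verified here):

ARPACK_PREFIX=$(spack location --install-dir arpack-ng)
ls "${ARPACK_PREFIX}"/lib "${ARPACK_PREFIX}"/lib64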

Let's talk about it at the meeting.

@wdj (Collaborator, Author) commented Jan 23, 2019 via email

@aprokop (Collaborator) commented Jan 23, 2019

The thing that comes to mind is spack/spack#764. What sticks out is the following line in the package:

options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix)

It was originally introduced to fix a Mac issue, but I wonder if it breaks Red Hat.
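
To experiment with that (a sketch using standard spack commands; after editing, arpack-ng and anything built against it would need to be rebuilt):

spack edit arpack-ng    # opens the arpack-ng package.py where that options.append line lives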

@Rombur (Collaborator) commented Jan 29, 2019

Changing/removing the line options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix) doesn't change anything.
