tests failing on gpusys (RHEL 7 system) #118
Comments
Any idea @aprokop? There seems to be a problem with arpack, and so the coarse matrix becomes singular, which trips UMFPACK.
Not sure off the top of my head. From the log, it is clear that arpack tries to write into the lout stream, which is negative. This typically indicates that the corresponding file was not opened properly. However, without a backtrace it is hard to understand where it is trying to write to.
In general, I'm not sure what's happening here. Why is arpack being called from spack-stage? If the spack package was installed properly, it should have been moved out of stage. See, for example, how the openmpi command in the log is being called:
7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"
So that was properly installed in /usr/local/src/spack/. But arpack is being referenced as /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f, which baffles me.
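A minimal way to chase both of those questions down (which arpack the test binary actually resolves at run time, and where the failing write happens) is sketched below. This is not from the thread; the build-tree path is taken from the log excerpts later in this issue, and the gfortran symbol name is an assumption.
cd /home/wjd/mfmg_project/build/tests
ldd ./test_hierarchy | grep -i arpack      # which libarpack.so the binary resolves at run time
gdb ./test_hierarchy                       # the "-n 1" case runs the same binary, so a serial gdb session is enough
# (gdb) break _gfortran_runtime_error      # assumed libgfortran symbol behind "Fortran runtime error: ..."
# (gdb) run
# (gdb) backtrace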
Am I missing some kind of spack (post-build) install step?
FWIW, I'm doing the spack builds as root but the mfmg configure/make/run as regular user --
No, you should be good.
I am not sure if that's a problem. I always build spack as a regular user and then load the modules that were created. Instead of using […]
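A minimal sketch of that user-space workflow, assuming standard spack and environment-modules commands (not taken from the thread):
spack install arpack-ng                        # built entirely as a regular user
spack load arpack-ng                           # puts the user-space install into the current environment
# or load the generated module file instead of exporting paths by hand:
module avail arpack
module load arpack-ng-3.6.3-gcc-8.2.0-<hash>   # hypothetical module name; pick the one 'module avail' lists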
@wdj can you show the output of spack location --install-dir arpack-ng?
gpusys$ spack location --install-dir arpack-ng
/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/arpack-ng-3.6.3-uqfbppbahrwiobzqglsrfl3pdkphprll
I will try building the spack stuff in user space --
You could also try removing spack-stage.
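For reference, two hedged ways to do that; the /tmp path is the root-owned stage directory quoted in the test output below:
spack clean --stage                    # removes all of spack's staged build directories
sudo rm -rf /tmp/root/spack-stage      # or delete the root-owned stage directly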
Oddly, removing the stage dir doesn't change the behavior. I'm guessing the /tmp/root/spack-stage/... path must be baked into the object code at compile time, so it's irrelevant at runtime --
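A hedged way to confirm that guess: the install prefix is the one printed by spack location above, while lib64 vs lib and the exact .so name are assumptions (see the next comment).
ARPACK_DIR=$(spack location --install-dir arpack-ng)
strings "$ARPACK_DIR"/lib64/libarpack.so* | grep spack-stage
# gfortran stores the source file name used in its runtime error messages inside the binary,
# so a match only means the string was compiled in; nothing is read from /tmp at run time.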
I tried with a fresh clone of spack and I have the same problem. I have a working version using spack and it uses the same version of arpack, so that's not the problem. Something strange is that on Ubuntu, arpack was installed in lib but on RHEL it is installed in lib64. In lib, there are a bunch of cmake files. I don't know if it is spack that is doing something different or if it is because of the OS. I checked other libraries and they don't have lib64.
Let's talk about it at the meeting.
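A quick sketch for comparing the layouts described above (same spec name as earlier in the thread):
ARPACK_DIR=$(spack location --install-dir arpack-ng)
ls "$ARPACK_DIR"                       # on this RHEL 7 box: lib64/ holds the .so, lib/ only cmake files
ls "$ARPACK_DIR"/lib "$ARPACK_DIR"/lib64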
FWIW, when building as a regular user, not root, I got the following dealii build error. I must have something different in my environment, but I haven't found it yet --
Regardless, I am moving forward with the lanczos integration. I have the code and a unit test working but have not yet interfaced the lanczos solver to the mfmg algorithm proper.
I have it on a branch ("lanczos") I've pushed to the repo --
######################################################################## 100.0%
==> Staging archive: /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2.tgz
==> Created stage in /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow
==> No patches needed for netlib-scalapack
==> Building netlib-scalapack [CMakePackage]
==> Executing phase: 'cmake'
==> Error: ProcessError: Command exited with status 1:
'cmake' '/home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2' '-G' 'Unix Makefiles' '-DCMAKE_INSTALL_PREFIX:PATH=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow' '-DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo' '-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON' '-DCMAKE_INSTALL_RPATH_USE_LINK_PATH:BOOL=FALSE' '-DCMAKE_INSTALL_RPATH:STRING=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/lib64;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/hwloc-1.11.11-lbhqpuejkjid7uarmzqeavfvx6ps6ifu/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libpciaccess-0.13.5-qcb7t3uk6lfo2km5mu3xwjjrh6amgb2r/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libxml2-2.9.8-fi5emr4twy4kogxov4t7hx4yydeuaga4/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libiconv-1.15-zv3vs247p4445x5dbgxlgsqch3bsgbta/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/xz-5.2.4-bcielpo4hqmmyorbqx3lhfdb63sqe4i6/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/zlib-1.2.11-hyog4nvfq25emh5taua53slpjeplgwm2/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/numactl-2.0.12-olbib5og26swgq3r4j2oe3vzrqzjiruz/lib' '-DCMAKE_PREFIX_PATH:STRING=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/cmake-3.13.3-5prvjs5duzkuido454kgmro7czi3e46q;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep' '-DBUILD_SHARED_LIBS:BOOL=ON' '-DBUILD_STATIC_LIBS:BOOL=OFF' '-DLAPACK_FOUND=true' '-DLAPACK_INCLUDE_DIRS=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/include' '-DLAPACK_LIBRARIES=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib/libopenblas.so' '-DBLAS_LIBRARIES=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib/libopenblas.so'
1 error found in build log:
22 -- --> C Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpicc
23 -- --> MPI Fortran Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpif90
24 -- --> Fortran Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpif90
25 -- Reducing RELEASE optimization level to O2
26 -- =========
27 -- Compiling and Building BLACS INSTALL Testing to set correct variables
> 28 CMake Error at CMAKE/FortranMangling.cmake:27 (MESSAGE):
29 Configure in the BLACS INSTALL directory FAILED
30 Call Stack (most recent call first):
31 CMakeLists.txt:122 (COMPILE)
32
33
34 -- Configuring incomplete, errors occurred!
See build log for details:
/home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2/spack-build.out
The thing that comes to mind is spack/spack#764. What sticks out is the following line in the package:
options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix)
It was originally introduced to fix a Mac issue, but I wonder if it breaks Red Hat.
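One hedged way to experiment with that line locally; spack edit and uninstall/install are standard spack commands, and anything depending on arpack-ng (e.g. dealii) would need to be rebuilt afterwards:
spack edit arpack-ng                   # opens the arpack-ng package.py containing the options.append(...) line
# remove or adjust the CMAKE_INSTALL_NAME_DIR option, then rebuild:
spack uninstall -y --dependents arpack-ng
spack install arpack-ng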
Changing/removing the line options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix) […]
gpusys$ cat /proc/meminfo | head -n1
MemTotal: 3859908 kB
spack install
cd /usr/local/src
git clone https://github.com/spack/spack.git
chmod -R a+rX spack
in user .bashrc
export SPACK_ROOT=/usr/local/src/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack installs
spack install gcc
spack compiler add
spack location -i gcc@8.2.0
spack install dealii@develop %gcc@8.2.0
in user .bashrc
GCC_ROOT_=$(spack location --install-dir gcc)
export LD_LIBRARY_PATH="${GCC_ROOT_}/lib:${GCC_ROOT_}/lib64"
PATH="${GCC_ROOT_}/bin:${PATH}"
MPI_ROOT_=$(spack location --install-dir mpi)
PATH="${MPI_ROOT_}/bin:${PATH}"
CMAKE_ROOT_=$(spack location --install-dir cmake)
PATH="${CMAKE_ROOT_}/bin:${PATH}"
cmake/make commands
DEAL_II_DIR=$(spack location --install-dir dealii)
BOOST_ROOT=$(spack location --install-dir boost)
cmake \
  -D CMAKE_BUILD_TYPE=Debug \
  -D MFMG_ENABLE_TESTS=ON \
  -D MFMG_ENABLE_CUDA=OFF \
  -D BOOST_ROOT=${BOOST_ROOT} \
  -D DEAL_II_DIR=${DEAL_II_DIR} \
  ../mfmg
make
test command
env DEAL_II_NUM_THREADS=1 make test ARGS=-V
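To rerun only a failing case, a ctest sketch (the build-tree path is taken from the log excerpts below):
cd /home/wjd/mfmg_project/build
env DEAL_II_NUM_THREADS=1 ctest -V -R test_hierarchy_1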
partial test output
7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"
7: Test timeout computed to be: 1500
7: Running 23 test cases...
7: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
7: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
7: --------------------------------------------------------------------------
7: Primary job terminated normally, but 1 process returned
7: a non-zero exit code. Per user-direction, the job has been aborted.
7: --------------------------------------------------------------------------
7: --------------------------------------------------------------------------
7: mpiexec detected that one or more processes exited with non-zero status, thus causing
7: the job to be terminated. The first process to do so was:
7:
7: Process name: [[55908,1],0]
7: Exit code: 2
7: --------------------------------------------------------------------------
7/20 Test #7: test_hierarchy_1 .................***Failed 4.07 sec
test 8
Start 8: test_hierarchy_2
8: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "2" "./test_hierarchy"
8: Test timeout computed to be: 1500
8: Running 23 test cases...
8: Running 23 test cases...
8: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
8: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
8: --------------------------------------------------------------------------
8: Primary job terminated normally, but 1 process returned
8: a non-zero exit code. Per user-direction, the job has been aborted.
8: --------------------------------------------------------------------------
8: unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:
8: --------------------------------------------------------
8: An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
8: void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
8: The violated condition was:
8: status == UMFPACK_OK
8: Additional information:
8: UMFPACK routine umfpack_dl_numeric returned error status 1.
8:
8: A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.
8:
8: That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.
8:
8: The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.
8:
8: The other common situation is that you run out of memory.On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.
8: --------------------------------------------------------
8:
8: /home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.
8: --------------------------------------------------------------------------
8: mpiexec detected that one or more processes exited with non-zero status, thus causing
8: the job to be terminated. The first process to do so was:
8:
8: Process name: [[55924,1],0]
8: Exit code: 2
8: --------------------------------------------------------------------------
8/20 Test #8: test_hierarchy_2 .................***Failed 2.91 sec
from Testing/Temporary/LastTest.log
7/20 Testing: test_hierarchy_1
7/20 Test: test_hierarchy_1
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "1" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_1" start time: Jan 21 20:05 EST
Output:
Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55908,1],0]
Test time = 4.07 sec
----------------------------------------------------------
Test Failed.
"test_hierarchy_1" end time: Jan 21 20:05 EST
"test_hierarchy_1" time elapsed: 00:00:04
----------------------------------------------------------
Exit code: 2
8/20 Testing: test_hierarchy_2
8/20 Test: test_hierarchy_2
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "2" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_2" start time: Jan 21 20:05 EST
Output:
Running 23 test cases...
Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:
An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
The violated condition was:
status == UMFPACK_OK
Additional information:
UMFPACK routine umfpack_dl_numeric returned error status 1.
A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.
That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.
The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.
The other common situation is that you run out of memory.On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.
/home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55924,1],0]
Test time = 2.91 sec
----------------------------------------------------------
Test Failed.
"test_hierarchy_2" end time: Jan 21 20:05 EST
"test_hierarchy_2" time elapsed: 00:00:02
----------------------------------------------------------
Exit code: 2