Fix CI offload build failure and add checks upon CMake return failure #4839

Merged: 4 commits merged into QMCPACK:develop from print-cmake-error on Nov 18, 2023

Contributor

@ye-luo ye-luo commented Nov 17, 2023

Please review the developer documentation
on the wiki of this project that contains help and requirements.

Proposed changes

Describe what this PR changes and why. If it closes an issue, link to it here
with a supported keyword.

What type(s) of changes does this code introduce?

Delete the items that do not apply

  • Bugfix
  • New feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Testing changes (e.g. new unit/integration/performance tests)
  • Documentation changes
  • Other (please describe):

Does this introduce a breaking change?

  • Yes
  • No

What systems has this change been tested on?

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes/No. This PR is up to date with the current state of 'develop'
  • Yes/No. Code added or changed in the PR has been clang-formatted
  • Yes/No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes/No. Documentation has been added (if appropriate)

@ye-luo ye-luo force-pushed the print-cmake-error branch 4 times, most recently from e115309 to 3a4f0f5, on November 17, 2023 21:31
@ye-luo ye-luo changed the title from "[WIP] Check CMake return and cat log files if necessary" to "Fix CI offload build failure and add checks upon CMake return failure" on Nov 17, 2023
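
The PR titles describe the approach: check the return code of the CMake configure step and, on failure, print the log files so they show up in the CI output. The following is only a minimal sketch of that idea, not the script changed in this PR. The function name `configure_or_dump_logs` is hypothetical, and the log file names are an assumption based on CMake's usual layout (`CMakeConfigureLog.yaml` in newer releases, `CMakeError.log` and `CMakeOutput.log` in older ones).

```shell
# Run a configure command; if it fails, print whichever CMake logs exist.
# Usage: configure_or_dump_logs <build_dir> <command> [args...]
# (Hypothetical helper for illustration, not part of the QMCPACK CI scripts.)
configure_or_dump_logs() {
  local build_dir=$1
  shift
  local status=0
  # Capture the exit code without aborting when the caller uses "set -e".
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "Configure failed with exit code $status" >&2
    # Assumed log locations; which files exist depends on the CMake version.
    local log
    for log in CMakeConfigureLog.yaml CMakeError.log CMakeOutput.log; do
      if [ -f "$build_dir/CMakeFiles/$log" ]; then
        echo "===== $build_dir/CMakeFiles/$log ====="
        cat "$build_dir/CMakeFiles/$log"
      fi
    done
  fi
  return "$status"
}

# Example call (the flag shown is a placeholder, not the CI's full option set):
# configure_or_dump_logs build cmake -S . -B build -DENABLE_OFFLOAD=ON || exit 1
```

Returning the original exit code lets the calling CI step still fail, while the logs are already in the job output.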
@ye-luo
Contributor Author

ye-luo commented Nov 17, 2023

Test this please

@ye-luo
Contributor Author

ye-luo commented Nov 17, 2023

Test this please

@ye-luo
Contributor Author

ye-luo commented Nov 17, 2023

AFQMC shows an issue when using CUDA 12.1; see https://cdash.qmcpack.org/CDash/testDetails.php?test=28616963&build=447804 for details. Reverting to CUDA 11.2.

@prckent
Contributor

prckent commented Nov 18, 2023

We should open some issues to track the problems found here. For example, we could easily add a couple of differently labeled CUDA versions to the nightlies, label the CI builds, etc.

Contributor

@prckent prckent left a comment

Thanks for spending all the time investigating this.

@prckent prckent merged commit d30ebaa into QMCPACK:develop Nov 18, 2023
21 checks passed
@ye-luo ye-luo deleted the print-cmake-error branch November 29, 2023 22:59
@correaa
Contributor

correaa commented Dec 6, 2023

I am compiling with these flags and compilers, would it be meaningful test if I ran it:

$ mpicc --version
gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0

correaa@cuk:~/qmcpack/build.cuda12.0$ mpicxx --version 
g++ (Ubuntu 13.2.0-4ubuntu3) 13.2.0

correaa@cuk:~/qmcpack/build.cuda12.0$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DMPIEXEC_EXECUTABLE=mpirun -DBUILD_AFQMC=ON -DENABLE_CUDA=ON -DQMC_GPU_ARCHS=sm_75 -DENABLE_OFFLOAD=ON -DQMC_COMPLEX=0 -DQMC_MIXED_PRECISION=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-lto" .. --fresh

@ye-luo
Contributor Author

ye-luo commented Dec 6, 2023

> I am compiling with these flags and compilers, would it be meaningful test if I ran it:
>
> $ mpicc --version
> gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0
>
> correaa@cuk:~/qmcpack/build.cuda12.0$ mpicxx --version
> g++ (Ubuntu 13.2.0-4ubuntu3) 13.2.0
>
> correaa@cuk:~/qmcpack/build.cuda12.0$ nvcc --version
> nvcc: NVIDIA (R) Cuda compiler driver
> Copyright (c) 2005-2023 NVIDIA Corporation
> Built on Fri_Jan__6_16:45:21_PST_2023
> Cuda compilation tools, release 12.0, V12.0.140
> Build cuda_12.0.r12.0/compiler.32267302_0
>
> $ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DMPIEXEC_EXECUTABLE=mpirun -DBUILD_AFQMC=ON -DENABLE_CUDA=ON -DQMC_GPU_ARCHS=sm_75 -DENABLE_OFFLOAD=ON -DQMC_COMPLEX=0 -DQMC_MIXED_PRECISION=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-lto" .. --fresh

Not understanding what your question is and why you are asking in this PR.

@correaa
Contributor

correaa commented Dec 6, 2023

@prckent asked me to look into this PR. I interpret it is a regression.
There was a missing question mark: "would it be meaningful test if I ran it?"

@ye-luo
Contributor Author

ye-luo commented Dec 6, 2023

> @prckent asked me to look into this PR. I interpret it is a regression. There was a missing question mark: "would it be meaningful test if I ran it?"

Still not understanding... What regression?

@ye-luo
Contributor Author

ye-luo commented Dec 6, 2023

Do you intend to use GCC for offload?

@correaa
Contributor

correaa commented Dec 6, 2023 via email

@prckent
Contributor

prckent commented Dec 6, 2023

The request is to investigate and make the AFQMC tests work: https://cdash.qmcpack.org/CDash/testSummary.php?project=1&name=deterministic-unit_test_afqmc_numerics&date=2023-12-06 . The afqmc numerics check appears to be broken in some circumstances. It should work with any recent compiler and CUDA version; at worst, we should understand where the problems in the toolchain are. Real space is now working with CUDA 12.3 due to bug fixes by NVIDIA. If this is some quirk of the nightly builds, we can tolerate that, but we do need to understand it.
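
To reproduce the failing check locally, one option is to re-run only that test from an already configured and built directory. This is a sketch under assumptions: the test name is taken from the CDash link above, the helper name `run_afqmc_numerics` and the `CTEST` override variable are hypothetical, and `--test-dir` needs a reasonably recent CTest (3.20 or later, as far as I know).

```shell
# Re-run only the AFQMC numerics unit test and show its output on failure.
# Usage: run_afqmc_numerics [build_dir]   (build_dir defaults to "build")
run_afqmc_numerics() {
  local build_dir=${1:-build}
  # CTEST is a hypothetical override, e.g. to select a specific CMake install.
  "${CTEST:-ctest}" --test-dir "$build_dir" \
    -R deterministic-unit_test_afqmc_numerics --output-on-failure
}

# Example: run_afqmc_numerics build.cuda12.0
```

Since `-R` takes a regular expression, a shorter pattern such as `afqmc` would select every test with that string in its name.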

@correaa
Contributor

correaa commented Dec 6, 2023

I see,

Ye-EPYC-server Clang17-Offload-Real-Mixed-Release 20231206-0300-Deterministic Failed 2.70
Ye-EPYC-server Clang17-Offload-Complex-Release 20231206-0300-Deterministic Failed 2.66
Ye-EPYC-server Clang17-Offload-Complex-Mixed-Release 20231206-0300-Deterministic Failed 2.66
Ye-EPYC-server Clang17-Offload-Real-Release 20231206-0300-Deterministic Failed 2.68

To be systematic, where can I find the compilation options for "Clang17-Offload-Real-Mixed-Release" for example?

(Searching the repository for this string gives no results: https://github.com/search?q=repo%3AQMCPACK%2Fqmcpack%20%22Clang17-Offload-Real-Mixed-Release%22&type=code )

We have seen compilation problems with 12.3 in other codes.

@prckent
Contributor

prckent commented Dec 6, 2023

This is not obvious, but for anything reporting to CDash, you can look under the "nightly" test category and select either the build name or the configure and build entries in the same row, e.g. https://cdash.qmcpack.org/CDash/buildSummary.php?buildid=451448 . This gives access to the CMake output.

3 participants