Commit
Extensible configuration of the superbuild location (#2426)
* Updated the CI to use a variant of the superbuild CMake scripts. The third-party external dependencies should be installed out-of-band using the scripts in the superbuild CI directory. The DiHydrogen, Hydrogen, and Aluminum dependencies are installed as part of the CI run. Note that support for more extensible Spack-based builds is included but disabled.
----
* Adding some flexibility in the customized_build_env script to make the location of the external superbuild dependencies easily relocatable.
* Adding code to explicitly get the hostname for the superbuild configuration.
* Updated to the latest ROCm versions.
* Added some env variables for RCCL
* Add spack type for mi300a
* Only include the external CUDA libraries on CUDA systems.
* Fixed the external modules for cray-mpich.
* Ensure that the CMAKE_PREFIX_PATH is captured in the superbuild suggested prefix path. Fixed bug where the forwarded CMAKE_PREFIX_PATH was overwritten when a package depended on other packages.
* Automatically output the suggested cmake prefix path to the install directory.
* Forwarded the CMAKE_PREFIX_PATH to the LBANN build.
* Added a flag to the build_lbann.sh script to specify a directory of superbuilt external libraries.
* Added the superbuild-prefix to the Pascal CI pipeline.
* Disable caliper and force [email protected]
* Switch back to using the system specific spack.
* Force the use of normal zlib
* Split the superbuild scripts into core dependencies and DHA dependencies.
* Added superbuild script for DHA with half.
* Updated the build scripts to allow for specific DHA compiled versions.
* Reenabled half on pascal CI test.
* Allow for newer gcc compilers.
* Updated all of the pascal CI scripts to use the new stable dependencies.
* Updating the Tioga scripts to use the superbuild.
* Fixed the sense of the shared variant on protobuf.
* Updated the AMD ROCm stack to 6.1.2
* Adding path for external HWLOC in superbuild stable dependencies.
Added code to export the CRAY_LD_LIBRARY_PATH.
* Add aws-ofi-rccl to the superbuild externals.
* Fix how the CMAKE_PREFIX_PATH is forwarded to DHA libraries.
* Updating the Tioga superbuild scripts to force the runpaths to be properly set.
* Updating the Pascal superbuild scripts to force the runpaths to be properly set.
* Added CMake flags to enable shared library builds.
* Added a path to cuTensor for x86_64 platforms.
* Added a path to the correct miopen.
* Mark the new MIOpen as develop.
* Disable the superbuild on Corona and Lassen
* Fixed the install path.
* Add some logic to clean up the initial CMAKE_INSTALL_RPATH. The path auto-generated by Spack may not be ideal.
* Remove system paths from build rpath
* Fixed how the CMake environment sets up the PYTHONPATH and caches it in the lbann_pfe.sh and module files. Added hints to the superbuild of where to install necessary Python packages.
* Revert back to ROCm 5.7.1
* Updated the superbuild scripts to use LLD and Gold linkers as appropriate. Made the Tioga superbuild scripts easier to change to new ROCm versions.
* Removing custom MIOpen build.
* Added the build modules to the LBANN_DEPENDENT_MODULES so that they are loaded at runtime since the RPATH and RUNPATH aren't capturing certain Cray packages.
* Fixed how the LBANN_DEPENDENT_MODULES are composed.
* Temporarily reduce the time for Tioga jobs
* Try a different set of modules for Tioga.
* Fixed grouping on link flags. Fixed RPATH issues for build and install objects.
* Increasing the precision of the reported error for check metric.
* Force the installation of pip packages in the installed location to avoid bad system install.
* Correctly set the --force-reinstall flag on the pip command.
* Correcting the nightly time limit.
* Set the CXX and CUDA flags to an optimized build.
* Updated the Tioga builds to include the PE_ENV field in the stable dependencies pathname.
* Updated the build path so that the source files can be saved for debugging.
* Updated the build path so that the source files can be saved for debugging on pascal.
* Removed the pip force-reinstall
* Fixed pascal build path.
* Fixed the quotes around the linker flags.
* Do not use gold linker for core dependencies because protobuf fails.
* Updated the version of half to 2.2.0
* Did not set the loaded modules in the LBANN module file.
* Include ROCM_PATH/lib to RPATH. Switch Pascal back to gcc/10.3.1.
* Switch Pascal CI to using Clang 14. Added compiler into the CI superbuild external paths.
* Fixed compiler paths and typos.
* Fixed typo.
* Commented out unused variable.
* Log file for superbuild shell script is now defined in the environment rather than passed as an argument.
* Fixed the extra RPATH on cray.
* Switched back to half v2.1.0. Added logging for the modules used to build the superbuild.
* Fixing the extra RPATHs field to handle multiple entries.
* Add an updated time limit for the reconstruction loss unit test.
* Add EnsureComm calls to truncation selection algo
* Use a vertical | to avoid issues propagating ;.
* Constrain version of NumPy to 1.22.3
* Removed the -O2 optimization flags from the pascal and tioga environments because it will be set by the CMake build type. Added a superbuild package for hipTT.
* Added superbuild scripts for Corona. Added hipTT to build_lbann.sh build script set. Updated Corona to 5.7.1. Re-enabled the Corona CI builds.
* Moved the definition of the external hiptt to a ROCm only section.
* Update Corona to ROCm 6.0.2
* Changed the Corona externals to use variable for ROCm version.
* Exporting the shell variable.
* Moved when the ROCm version is defined.
* Back to 6.0.2
* Trying a unified single pipeline for Pascal CI.
* Working on updating the CI builds to use a more direct script setup.
* Added configure scripts for LBANN and a script to run the unit and integration tests.
* Cleaning up the CI scripts.
* Added GitLab CI yaml files.
* Lowered the git depth.
* Fix the submodule strategy.
* Fixed the CI tests to use 2 nodes. Better error handling.
* Fixed the name of the test result files so that they would be picked up by CI.
* Added a test pascal pipeline.
* Fixed how the DistConv flag is propagated.
* Added external flags for building with HALF and FFT support. Limited the distconv builds to only run the right tests.
* Cleaning up code.
* Added distconv pascal test.
* Fix the status capture.
* Fixed logic bug in bash.
* Fixed the include path to Half and disabled FFT
* Fixed the failed test reporting and that distconv and half don't play together.
* Extend the mpi catch tests time limit.
* Added optimization flags for DHA
* Added Corona to new CI.
* Added config for Lassen.
* Fixed how the lapack argument is passed to Hydrogen
* Fixed flag for LBANN BLA.
* Added scripts to install core dependencies for lassen.
* Added Lassen CI.
* Adding in some help for extra rpaths.
* Force LBANN to RPATH DHA libraries inside of the project.
* Improve the reporting of the MPI catch tests. Consolidated all of the MPI catch tests to a single execution. Avoid logging unit and catch testing outputs to console.
* Updated Lassen to use a newer python. Tweaking how rpaths are set.
* Fixed quoting on RPATH
* Fixed the path for the catch tests.
* Fixed up a few shell details to make switching PEs simpler.
* Building for Mi300A as well as 250.
* Stop hardcoding the CRAY_MPICH_VERSION
* Added the ability to export the AWS_OFI_RCCL plugin to the LD_LIBRARY_PATH when using the lbann_pfe.sh shell script.
* Tweak the Tioga build environment.
* Work on building the dependencies on PrgEnv-cray.
* Fixed accidental debugging code.
* Added DiHydrogen cache check. Only add Half prefix path when asked for.
* Add the hash for H2.
* Ensure that for AMD/HIP/ROCm systems all three fields GPU_TARGETS, AMDGPU_TARGETS, and CMAKE_HIP_ARCHITECTURES are set.
* Disable FFT on Lassen
* Disable installing torch.
* Disable FFT on lassen right now.
* Set proper AMD architectures.
* Use a special PR for 6.2.0
* Explicitly turned on the half feature, which is not properly disabled when not set.
* When not using a flag, set it to a NULL string, not 0.
* Reporting the state of the build script DHA features.
* Set flag to ON not 1
* Fix when local 6.2.0 MIOpen library is linked in.
* Auto-detect the CUDA version and compiler version.
* Working to consolidate how the core dependencies are built to use the same setup file as the CI runs. Fixed the build issues for CI on corona. Removed scripts for building DHA and LBANN manually (outside of CI).
* Cleaning up Power and HIP specific flags.
* Added support for creating a Python virtual environment in the CI stack. Improved the core dependencies for Power.
* Removed older core platform specific dependency scripts.
* Update python/lbann/contrib/lc/launcher.py
Co-authored-by: Tom Benson <[email protected]>
* Add pytest to the venv. Cleaned up.
* Added code to build OpenBLAS on Power and then install standard libraries via PIP in the stable dependencies.
* Only create the virtual environment if it doesn't exist.
* Changed to installing all of the PIP installs in the virtual env directory.
* Apply suggestions from code review
Co-authored-by: Tom Benson <[email protected]>
* Renamed variable AWS_OFI_RCCL_LIBRARY to AWS_OFI_RCCL_LIBDIR.
* Gather the build logs for the DHA dependencies and keep them as artifacts.
* Added some cmake logic to capture the path to the python venv used during configuration.
* Removed bad debug statement.
* If a python virtual environment was defined and used during the build time, the Lua module file will now activate it when loaded. Removed the TCL module file since it wasn't being used by systems. Added a prompt name to the python venv. Fixed an empty variable field in the Lassen gitlab code that deleted other variables.
* Trying to fix a bug where lbann_pfe.sh isn't found after loading the module.
* Temporarily remove the lua code to activate the virtual environment.
* Disabled always rebuilding the dependencies. Added a check to deactivate an active environment before loading the LBANN module.
* Updated the Tioga tests to use ROCm 6.2.1beta1 and craycc.
* Rewound the Tioga ROCm versions.
---------
Co-authored-by: Tom Benson <[email protected]>
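
Several items above deal with RPATH handling: cleaning up the initial CMAKE_INSTALL_RPATH, removing system paths, handling multiple entries, and using a vertical | in place of ; when forwarding lists. The following is only a minimal Python sketch of that idea, not the code from this commit. The function names, the set of system directories, and the example paths are all hypothetical and chosen for illustration.

```python
from typing import Iterable, List

# Assumption: a plausible set of system library directories.
# The actual scripts may use a different list.
SYSTEM_LIB_DIRS = {"/lib", "/lib64", "/usr/lib", "/usr/lib64"}


def clean_rpath(raw: str, system_dirs: Iterable[str] = SYSTEM_LIB_DIRS) -> List[str]:
    """Split a ';'-separated RPATH and drop empty, duplicate, and system entries."""
    skip = {d.rstrip("/") for d in system_dirs}
    seen = set()
    cleaned = []
    for entry in raw.split(";"):
        path = entry.strip().rstrip("/")
        if not path or path in skip or path in seen:
            continue
        seen.add(path)
        cleaned.append(path)
    return cleaned


def forward_list_arg(paths: List[str], sep: str = "|") -> str:
    """Join paths with a separator other than ';' so the list is not split when forwarded."""
    return sep.join(paths)


if __name__ == "__main__":
    # Hypothetical input: a duplicate, an empty field, and a system path.
    raw = "/opt/rocm/lib;/usr/lib64;;/opt/rocm/lib/;/p/deps/lib"
    print(forward_list_arg(clean_rpath(raw)))
```

Because CMake treats ; as its list separator, a list passed through several layers of a build can be split earlier than intended; joining on a different character and converting back at the point of use is one common way around that.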