Merge branch 'release-v0.100.0'
============================== Release Notes: v0.100 ==============================
Support for new network structures:
 - 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.
 - 3D CosmoFlow Model
 - DenseNet
 - ATOM LSTM model
 - RAS state classifier
 - node2vec
 - Transformer and other attention-based models
 - ExaGAN (formerly CosmoGAN)
 - MaCC ICF surrogate model

Applications:
 - Created a directory of example applications, deprecating the "model zoo" directory

Support for new layers:
 - Embedding layer
 - Distributed embedding layer
 - Channel-wise scale/bias layer
 - Entry-wise scale/bias layer
 - Gated Recurrent Unit (GRU) layer
 - Entry-wise batchnorm
 - Argmax, Argmin, and one-hot layers
 - Layer norm
 - Deconvolution layer (transposed convolution)
 - Layers for channel-wise operations (channel-wise fully-connected, channel-wise softmax, channel-wise scale/bias, instance norm)
 - Matrix multiply layer
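Several of the new layers are channel-wise operations. Their semantics can be sketched in plain Python; the sketch below assumes a channel-wise softmax normalizes the entries within each channel independently (an illustration of the operation, not LBANN's implementation):

```python
import math

def channelwise_softmax(x):
    """Softmax applied independently within each channel.

    x[c] holds the flattened entries of channel c for one sample.
    (This per-channel interpretation is assumed for illustration.)
    """
    out = []
    for channel in x:
        m = max(channel)                       # shift by max for stability
        exps = [math.exp(v - m) for v in channel]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

probs = channelwise_softmax([[1.0, 2.0, 3.0], [0.0, 0.0]])
print([round(sum(c), 6) for c in probs])  # -> [1.0, 1.0]
```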

Python front-end:
 - Can now configure contrib launcher with environment variables
 - Added NERSC compute center
 - Per-layer specification of compute device (CPU or GPU)
 - Option to write custom batch scripts with Python front-end
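The environment-variable hooks for the contrib launcher follow a common pattern: defaults passed in code, overridable from the environment. A stand-alone sketch of that pattern (the `LBANN_LAUNCHER_*` variable names here are hypothetical, chosen for illustration; consult the Python front-end docs for the real ones):

```python
import os

def build_launch_command(exe, nodes, procs_per_node):
    """Assemble an srun-style launch command, letting environment
    variables override the in-code defaults (illustrative sketch)."""
    nodes = int(os.environ.get("LBANN_LAUNCHER_NODES", nodes))
    ppn = int(os.environ.get("LBANN_LAUNCHER_PROCS_PER_NODE", procs_per_node))
    args = ["srun", f"--nodes={nodes}", f"--ntasks-per-node={ppn}"]
    extra = os.environ.get("LBANN_LAUNCHER_ARGS", "")  # hypothetical name
    if extra:
        args.extend(extra.split())
    args.append(exe)
    return args

os.environ["LBANN_LAUNCHER_NODES"] = "4"
print(build_launch_command("lbann", 1, 2))
# -> ['srun', '--nodes=4', '--ntasks-per-node=2', 'lbann']
```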

Performance optimizations:
 - Parallelized Python data reader with "multiprocessing" module
 - Fused batchnorm statistics allreduces in forward and backward prop
 - Tuned concatenate and slice layers
 - Dynamically allocate and free memory for layer error signals (halves LBANN's memory footprint)
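The parallelized Python data reader builds on the standard `multiprocessing` module; the pattern is roughly the following (a generic fan-out sketch, not LBANN's actual reader):

```python
import multiprocessing as mp

def load_sample(index):
    # Stand-in for real I/O and preprocessing of one sample
    return [float(index), float(index) ** 2]

def load_epoch(indices, num_workers=4):
    """Fan sample loading out across a pool of worker processes."""
    with mp.Pool(processes=num_workers) as pool:
        return pool.map(load_sample, indices)

if __name__ == "__main__":
    samples = load_epoch(range(8), num_workers=2)
    print(samples[3])  # -> [3.0, 9.0]
```

Note that `load_sample` must be defined at module top level so worker processes can import it.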

Model portability & usability:
 - Bamboo tests for individual layers

Internal features:
 - Added support for DistConv features (distributed, generalized,
   parallel convolution)
 - Added support for NVSHMEM 1.0 API (used in distributed embedding
   layer and DistConv halo exchange)
 - Support for multiple data types per model (per-layer)
 - Support for per-layer mixed-precision weight training and inference,
   including per-weight-object and objective-function mixed precision
 - Improved how and when the RNGs are initialized
 - Callback to dump images to TensorBoard
 - Callback to save model weights (useful to export to PyTorch)
 - Callback to save top K models (LTFB)
 - Improved run-to-run reproducibility by initializing weights in alphabetical order
 - Moved models from model_zoo directory to applications directory
 - Cleanup and refactoring of callbacks and layer instantiation
 - Grouped batchnorm statistics
 - Callback to print model description
 - Refactored trainer and training-state out of the model class
 - Support for transposing data in matrix multiply layers
 - Added DiHydrogen tensor and DistConv library
 - Added parallel strategy to layer class to support DistConv
 - LBANN inference mode supports loading models from multiple directories
 - Cleanup of checkpoint and restart logic
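The run-to-run reproducibility item above rests on a simple idea: if weights draw their initial values from a seeded RNG in a fixed (alphabetical) order, every run gets the same values regardless of the order in which the weights were created. A minimal sketch of that idea (not LBANN's initializer):

```python
import random

def init_weights(weight_names, seed=20200730):
    """Draw one initial value per weight, in alphabetical order,
    so the assignment is independent of creation order."""
    rng = random.Random(seed)
    return {name: rng.uniform(-0.05, 0.05) for name in sorted(weight_names)}

# Two runs that created the same weights in different orders agree exactly.
a = init_weights(["conv1_bias", "fc_weight", "conv1_weight"])
b = init_weights(["fc_weight", "conv1_weight", "conv1_bias"])
print(a == b)  # -> True
```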

I/O & data readers:
 - Added in-memory data store that caches samples in CPU memory.  It can be loaded
   during the first epoch or preloaded
 - Added new "transform" data preprocessing ingestion pipeline
 - Added sample list format for specifying data sets
 - Introduced data coordinator that manages data readers and extracts them from
   the input layers
 - Data store is able to checkpoint / spill its contents to local disk
 - Data reader for SMILES strings
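The in-memory data store caches samples in CPU memory, filled either lazily during the first epoch or eagerly via preloading. Its behavior can be sketched as a memoizing wrapper around a sample loader (an illustration only, not the actual data store):

```python
class InMemoryDataStore:
    """Cache samples in memory: load from the backing reader on the
    first request (first epoch), serve from the cache afterwards."""

    def __init__(self, load_fn, preload_indices=None):
        self.load_fn = load_fn
        self.cache = {}
        # Optionally preload instead of filling during the first epoch
        for i in preload_indices or []:
            self.cache[i] = load_fn(i)

    def get(self, index):
        if index not in self.cache:
            self.cache[index] = self.load_fn(index)  # first-epoch fill
        return self.cache[index]

loads = []
def reader(i):
    loads.append(i)
    return i * 10

store = InMemoryDataStore(reader)
for epoch in range(3):
    values = [store.get(i) for i in range(4)]
print(values, len(loads))  # -> [0, 10, 20, 30] 4
```

Each sample hits the backing reader exactly once, no matter how many epochs run.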

Build system:
 - Hydrogen 1.3.4
 - Aluminum 0.3.3
 - Improved documentation on read the docs (RTD)
 - Robust support for using Spack as a build system around CMake
 - Identified compute centers for specifying build and run dependencies
 - Added Catch2-based tests

Bug fixes:
 - Fixed path resolution for dump weights, save model, and checkpoint callbacks
 - Added mutexes for preloading the data store
 - Fixed the LTFB exchange to include all ADAM optimizer state
 - Fixed the mapping of I/O RNGs to I/O processing threads to ensure
   consistent and correct multi-threaded performance
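The data-store preloading fix above addresses the classic shared-cache race: concurrent preload threads must not populate the same entry twice. A minimal sketch of the lock-guarded pattern (illustrative, using Python's `threading`):

```python
import threading

class GuardedCache:
    """A cache whose population is serialized by a mutex, so
    concurrent readers never load the same entry twice."""

    def __init__(self, load_fn):
        self.load_fn = load_fn
        self.cache = {}
        self.lock = threading.Lock()

    def get(self, index):
        with self.lock:  # without this, two threads could both load
            if index not in self.cache:
                self.cache[index] = self.load_fn(index)
            return self.cache[index]

calls = []
cache = GuardedCache(lambda i: calls.append(i) or i)
threads = [threading.Thread(target=cache.get, args=(0,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # -> 1
```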

Retired features:
 - The Moving MNIST data reader has been replaced by the Python data reader
 - The ASCII data reader is deprecated
bvanessen committed Jul 30, 2020
2 parents 018018b + e13d34c commit d0fbac3
Showing 1,308 changed files with 106,623 additions and 78,559 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -13,3 +13,8 @@ data.prototext*
# Can also ignore all directories and files in a directory.
# tmp/**/*
build
spack_environments/users/


# we don't want to collect slurm output
**/slurm-*.out
11 changes: 11 additions & 0 deletions .gitmodules
@@ -0,0 +1,11 @@
[submodule "applications/graph/snap"]
path = applications/graph/snap
url = https://github.com/snap-stanford/snap
ignore = dirty
[submodule "applications/graph/largescale_node2vec"]
path = applications/graph/largescale_node2vec
url = https://lc.llnl.gov/bitbucket/scm/havoq/largescale_node2vec.git
ignore = dirty
[submodule "applications/ATOM/moses"]
path = applications/ATOM/moses
url = [email protected]:samadejacobs/moses.git
17 changes: 17 additions & 0 deletions .readthedocs.yml
@@ -1,7 +1,24 @@
# .readthedocs.yml
# Config file for Read the Docs
# https://docs.readthedocs.io/en/stable/config-file/v2.html

version: 2

sphinx:
builder: html
configuration: docs/conf.py

formats: []

build:
image: latest

python:
version: 3.7
install:
- requirements: docs/sphinx_requirements.txt

submodules:
include: []


173 changes: 149 additions & 24 deletions CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.12)
+cmake_minimum_required(VERSION 3.13)

project(LBANN CXX)

@@ -48,7 +48,7 @@ endif ()
#

set(LBANN_VERSION_MAJOR 0)
-set(LBANN_VERSION_MINOR 99)
+set(LBANN_VERSION_MINOR 100)
set(LBANN_VERSION_PATCH 0)

set(LBANN_VERSION "${LBANN_VERSION_MAJOR}.${LBANN_VERSION_MINOR}.${LBANN_VERSION_PATCH}")
@@ -104,6 +104,20 @@ option(LBANN_WITH_CONDUIT "Enable Conduit library" ON)

option(LBANN_WITH_CUDNN "Include Nvidia cuDNN" ON)

option(LBANN_WITH_DIHYDROGEN "Build with DiHydrogen support" OFF)
if (LBANN_WITH_DIHYDROGEN)
message(WARNING "DiHydrogen support is currently experimental. "
"There is no stable interface. "
"Use caution before using any features.")
endif (LBANN_WITH_DIHYDROGEN)

option(LBANN_WITH_DISTCONV "Enable DiHydrogen's Distconv" OFF)
if (LBANN_WITH_DISTCONV)
message(WARNING "Distconv support is currently experimental. "
"There is no stable interface. "
"Use caution before using any features.")
endif (LBANN_WITH_DISTCONV)

option(LBANN_WITH_HWLOC
"Enable topology-aware optimizations" ON)

@@ -121,13 +135,10 @@ option(LBANN_WITH_VTUNE
option(LBANN_WITH_UNIT_TESTING
"Enable the unit testing framework (requires Catch2)" OFF)

-# Enable parallel random matrix generation, if possible
+# Use deterministic GPU algorithms and layer operations
option(LBANN_DETERMINISTIC
"Use deterministic algorithms as much as possible." OFF)

-option(LBANN_SEQUENTIAL_INITIALIZATION
-  "Sequentially consistent initialization" OFF)

option(LBANN_DEBUG_PRINT_SUBTARGETS
"Turn on debugging output of internal target properties." OFF)
mark_as_advanced(LBANN_DEBUG_PRINT_SUBTARGETS)
@@ -161,6 +172,11 @@ include(SetupCXX)
################################################################

# Required dependencies
find_package(Threads REQUIRED)

# Argument parsing backend
find_package(Clara REQUIRED)

find_package(CEREAL NO_MODULE
HINTS ${CEREAL_DIR} $ENV{CEREAL_DIR}
PATH_SUFFIXES share/cmake/cereal
@@ -172,16 +188,50 @@ set(LBANN_HAS_CEREAL ${CEREAL_FOUND})
# The imported target is just called "cereal". Super.

# Setup the linear algebra library
-find_package(Hydrogen 1.2.0 NO_MODULE QUIET
+find_package(Hydrogen 1.3.3 NO_MODULE QUIET
HINTS ${Hydrogen_DIR} ${HYDROGEN_DIR} $ENV{Hydrogen_DIR} $ENV{HYDROGEN_DIR}
PATH_SUFFIXES lib/cmake/hydrogen
NO_DEFAULT_PATH)
if (NOT Hydrogen_FOUND)
-  find_package(Hydrogen 1.2.0 NO_MODULE QUIET REQUIRED)
+  find_package(Hydrogen 1.3.3 NO_MODULE QUIET REQUIRED)
endif ()
message(STATUS "Found Hydrogen: ${Hydrogen_DIR}")
set(LBANN_HAS_HYDROGEN ${Hydrogen_FOUND})

# DiHydrogen and Distconv
if (LBANN_WITH_DISTCONV AND NOT LBANN_WITH_DIHYDROGEN)
message(FATAL_ERROR "Distconv requires DiHydrogen. Enable DiHydrogen to use Distconv.")
endif ()

if (LBANN_WITH_DIHYDROGEN)
if (LBANN_WITH_DISTCONV)
find_package(DiHydrogen CONFIG COMPONENTS Meta Patterns DistConv
HINTS ${DIHYDROGEN_DIR} $ENV{DIHYDROGEN_DIR}
${H2_DIR} $ENV{H2_DIR}
PATH_SUFFIXES install/lib64/cmake install/lib/cmake
NO_DEFAULT_PATH)
find_package(DiHydrogen CONFIG REQUIRED COMPONENTS Meta Patterns DistConv)
set(LBANN_HAS_DISTCONV TRUE)
else ()
find_package(DiHydrogen CONFIG COMPONENTS Meta Patterns
HINTS ${DIHYDROGEN_DIR} $ENV{DIHYDROGEN_DIR}
${H2_DIR} $ENV{H2_DIR}
PATH_SUFFIXES install/lib64/cmake install/lib/cmake
NO_DEFAULT_PATH)
find_package(DiHydrogen CONFIG REQUIRED COMPONENTS Meta Patterns)
endif ()
set(LBANN_HAS_DIHYDROGEN TRUE)
endif ()

# Inherit half-precision stuff from Hydrogen
set(LBANN_HAS_HALF ${HYDROGEN_HAVE_HALF}) # This is CPU-only

# Not the ideal fix, but should be fine for now.
if (Aluminum_FOUND)
message(STATUS "Aluminum found in Hydrogen. Using Aluminum.")
set(LBANN_WITH_ALUMINUM ON CACHE BOOL "Use aluminum." FORCE)
endif ()

include(SetupOpenMP)
include(SetupMPI)
include(SetupProtobuf)
@@ -201,6 +251,11 @@ set(LBANN_HAS_OPENCV ${OpenCV_FOUND})
set(LBANN_HAS_CUDA ${_HYDROGEN_HAVE_CUDA})
set(LBANN_WITH_CUDA ${LBANN_HAS_CUDA})

# Only used if have GPU and have CPU half.
if (LBANN_HAS_CUDA AND LBANN_HAS_HALF)
set(LBANN_HAS_GPU_FP16 ${HYDROGEN_GPU_USE_FP16})
endif ()

if (LBANN_HAS_CUDA)
enable_language(CUDA)

@@ -214,13 +269,15 @@ endif ()
if (LBANN_WITH_ALUMINUM)
# Aluminum may have already been found by Hydrogen
if (NOT Aluminum_FOUND)
find_package(Aluminum 0.2.0 NO_MODULE QUIET
message(WARNING
"Using Aluminum without Hydrogen support may not be well-supported.")
find_package(Aluminum 0.3.0 NO_MODULE QUIET
HINTS ${Aluminum_DIR} ${ALUMINUM_DIR} ${AL_DIR}
$ENV{Aluminum_DIR} $ENV{ALUMINUM_DIR} $ENV{AL_DIR}
PATH_SUFFIXES lib64/cmake/aluminum lib/cmake/aluminum
NO_DEFAULT_PATH)
if (NOT Aluminum_FOUND)
-    find_package(Aluminum 0.2.0 NO_MODULE QUIET)
+    find_package(Aluminum 0.3.0 NO_MODULE QUIET)
endif ()
endif ()
set(LBANN_HAS_ALUMINUM ${Aluminum_FOUND})
@@ -264,13 +321,28 @@ if (LBANN_HAS_CUDA)

include(SetupCUDAToolkit)

if (LBANN_HAS_GPU_FP16)
set_property(TARGET cuda::toolkit PROPERTY
INTERFACE_COMPILE_OPTIONS $<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_60>)
endif (LBANN_HAS_GPU_FP16)

set(LBANN_HAS_CUDNN ${CUDNN_FOUND})

if (LBANN_HAS_ALUMINUM AND AL_HAS_NCCL)
set(LBANN_HAS_NCCL2 TRUE)
else ()
set(LBANN_HAS_NCCL2 FALSE)
endif ()

if (LBANN_WITH_NVSHMEM)
find_package(NVSHMEM REQUIRED)
set_property(TARGET cuda::toolkit PROPERTY
INTERFACE_COMPILE_OPTIONS $<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_70>)
# Build LBANN as a static library to get around a bug in NVSHMEM
set(BUILD_SHARED_LIBS OFF)
endif ()
set(LBANN_HAS_NVSHMEM "${NVSHMEM_FOUND}")

endif (LBANN_HAS_CUDA)

# This shouldn't be here, but is ok for now. This will occasionally be
@@ -415,22 +487,28 @@ if (LBANN_WITH_CONDUIT)
endif ()
endforeach ()

get_filename_component(_conduit_include_dirs
"${CONDUIT_INCLUDE_DIRS}" DIRECTORY)

if (HDF5_FOUND_WITH_MODULE)
list(APPEND _conduit_interface_link_libs
${HDF5_LIBRARIES})

set_target_properties(conduit::conduit
PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${HDF5_INCLUDE_DIRS}")
list(APPEND _conduit_include_dirs
"${HDF5_INCLUDE_DIRS}")
endif ()

set_property(TARGET conduit::conduit
PROPERTY
INTERFACE_INCLUDE_DIRECTORIES
"${_conduit_include_dirs}")

set_target_properties(conduit::conduit
PROPERTIES
INTERFACE_LINK_LIBRARIES
"${_conduit_interface_link_libs}")

set(CONDUIT_LIBRARIES conduit::conduit)
set(LBANN_HAS_CONDUIT ${Conduit_FOUND})
endif (LBANN_WITH_CONDUIT)

if (LBANN_WITH_UNIT_TESTING)
@@ -446,7 +524,11 @@ if (LBANN_WITH_UNIT_TESTING)
# Now that Catch2 has been found, start adding the unit tests
include(CTest)
include(Catch)
add_subdirectory(src/proto/unit_test)
add_subdirectory(src/utils/unit_test)
add_subdirectory(src/weights/unit_test)
add_subdirectory(src/transforms/unit_test)
add_subdirectory(src/transforms/vision/unit_test)

# Add this one last
add_subdirectory(unit_test)
@@ -459,16 +541,16 @@ add_subdirectory(docs)
# Build LBANN
################################################################

# Add LBANN source files
add_subdirectory(include)
add_subdirectory(src)

# Write the configure file
configure_file(
"${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_config.hpp.in"
"${CMAKE_BINARY_DIR}/lbann_config.hpp"
@ONLY)

# Add LBANN source files
add_subdirectory(include)
add_subdirectory(src)

# Create the LBANN library
add_library(lbann ${LBANN_SOURCES} ${LBANN_HEADERS} ${LBANN_CUDA_SOURCES})

@@ -477,12 +559,10 @@ target_include_directories(lbann PUBLIC
$<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/include>
$<INSTALL_INTERFACE:${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_INCLUDEDIR}>)

if (LBANN_HAS_PYTHON)
target_include_directories(lbann PUBLIC ${Python_INCLUDE_DIRS})
endif ()

# Use the IMPORTED targets when possible.
target_link_libraries(lbann PUBLIC LbannProto)
target_link_libraries(lbann PUBLIC Threads::Threads)
target_link_libraries(lbann PUBLIC clara::clara)
target_link_libraries(lbann PUBLIC cereal)
target_link_libraries(lbann PUBLIC OpenMP::OpenMP_CXX)
target_link_libraries(lbann PUBLIC MPI::MPI_CXX)
@@ -491,6 +571,15 @@ target_link_libraries(lbann PUBLIC ${HYDROGEN_LIBRARIES})
target_link_libraries(lbann PUBLIC ${OpenCV_LIBRARIES})
target_link_libraries(lbann PUBLIC ${CONDUIT_LIBRARIES})

target_link_libraries(lbann PUBLIC
$<TARGET_NAME_IF_EXISTS:H2::H2Meta>
$<TARGET_NAME_IF_EXISTS:H2::H2Patterns>
)

if (LBANN_WITH_DISTCONV)
target_link_libraries(lbann PUBLIC H2::H2DistConv)
endif ()

if (LBANN_HAS_TBINF)
target_link_libraries(lbann PUBLIC TBinf)
endif ()
@@ -512,7 +601,12 @@ if (LBANN_HAS_VTUNE)
endif ()

if (LBANN_HAS_PYTHON)
-  target_link_libraries(lbann PUBLIC ${Python_LIBRARIES})
+  target_link_libraries(lbann PUBLIC Python::Python)
endif ()

if (LBANN_HAS_NVSHMEM)
set_property(TARGET lbann PROPERTY CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(lbann PUBLIC NVSHMEM::NVSHMEM)
endif ()

if (TARGET LBANN_CXX_FLAGS_werror)
@@ -521,6 +615,27 @@ endif ()

target_link_libraries(lbann PUBLIC ${DL_LIBRARY})

# Fix the -g issue with Clang on OSX
if (APPLE)
# Remove -g from the options
string(REPLACE "-g" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
string(REPLACE "-g" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")

# Get all the sources and add "-g" to all of them.
get_target_property(_LBANN_SRCS lbann SOURCES)
set_source_files_properties(${_LBANN_SRCS}
PROPERTIES COMPILE_OPTIONS "-g")

# Cleanup source files
foreach (bad_file IN LISTS _LBANN_SRCS)
get_source_file_property(
_SRC_COMPILE_OPTS "${bad_file}" COMPILE_OPTIONS)
string(REPLACE "-g" "" _SRC_COMPILE_OPTS "${_SRC_COMPILE_OPTS}")
set_source_files_properties(
"${bad_file}" PROPERTIES COMPILE_OPTIONS "${_SRC_COMPILE_OPTS}")
endforeach ()
endif ()

# Clean things up
include(LBANNDebugUtilities)
lbann_remove_default_include_paths_from_all_subtargets(lbann)
@@ -539,6 +654,8 @@ endif ()
add_subdirectory(model_zoo)
add_subdirectory(model_zoo/tests)
add_subdirectory(model_zoo/jag_utils)
add_subdirectory(applications/CANDLE/pilot2/tools)
add_subdirectory(applications/ATOM/utils)
add_subdirectory(tests)
add_subdirectory(scripts)

@@ -733,6 +850,8 @@ string(APPEND _str "\n")
#Print the true/false guys
append_str_tf(_str
LBANN_GNU_LINUX
LBANN_HAS_DIHYDROGEN
LBANN_HAS_DISTCONV
LBANN_HAS_HYDROGEN
LBANN_HAS_OPENCV
LBANN_HAS_CEREAL
@@ -747,7 +866,6 @@ append_str_tf(_str
LBANN_HAS_DOXYGEN
LBANN_HAS_LBANN_PROTO
LBANN_HAS_ALUMINUM
LBANN_HAS_CONDUIT
LBANN_HAS_PYTHON)
string(APPEND _str
"\n== End LBANN Configuration Summary ==\n")
@@ -774,6 +892,13 @@ configure_file(
"${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_module.lua.in"
"${CMAKE_BINARY_DIR}/lbann_module.lua.install"
@ONLY)
configure_file(
"${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_module.tcl.in"
"${CMAKE_BINARY_DIR}/lbann_module.tcl.install")

install(FILES "${CMAKE_BINARY_DIR}/lbann_module.lua.install"
RENAME "${LBANN_MODULEFILE_NAME}"
DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/modulefiles")
install(FILES "${CMAKE_BINARY_DIR}/lbann_module.tcl.install"
RENAME "${LBANN_VERSION}"
DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/modulefiles/lbann")