Merge branch 'release-v0.100.0'
============================== Release Notes: v0.100 ==============================
Support for new network structures:
 - 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.
 - 3D CosmoFlow Model
 - DenseNet
 - ATOM LSTM model
 - RAS state classifier
 - node2vec
 - Transformer and other attention-based models
 - ExaGAN (formerly CosmoGAN)
 - MaCC ICF surrogate model

Applications:
 - Created a directory of example applications, deprecating the "model zoo" directory

Support for new layers:
 - Embedding layer
 - Distributed embedding layer
 - Channel-wise scale/bias layer
 - Entry-wise scale/bias layer
 - Gated Recurrent Unit (GRU) layer
 - Entry-wise batchnorm
 - Argmax, Argmin, and one-hot layers
 - Layer norm
 - Deconvolution layer (transposed convolution)
 - Layers for channel-wise operations (channel-wise fully-connected, channel-wise softmax, channel-wise scale/bias, instance norm)
 - Matrix multiply layer
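Several of the new layers are channel-wise operations. Their semantics can be sketched in plain Python; the sketch below assumes a channel-wise softmax normalizes the entries within each channel independently (an illustration of the operation, not LBANN's implementation):

```python
import math

def channelwise_softmax(x):
    """Softmax applied independently within each channel.

    x[c] holds the flattened entries of channel c for one sample.
    (This per-channel interpretation is assumed for illustration.)
    """
    out = []
    for channel in x:
        m = max(channel)                       # shift by max for stability
        exps = [math.exp(v - m) for v in channel]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

probs = channelwise_softmax([[1.0, 2.0, 3.0], [0.0, 0.0]])
print([round(sum(c), 6) for c in probs])  # -> [1.0, 1.0]
```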

Python front-end:
 - Can now configure contrib launcher with environment variables
 - Added NERSC compute center
 - Per-layer specification of compute device (CPU or GPU)
 - Option to write custom batch scripts with Python front-end
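The environment-variable hooks for the contrib launcher follow a common pattern: defaults passed in code, overridable from the environment. A stand-alone sketch of that pattern (the `LBANN_LAUNCHER_*` variable names here are hypothetical, chosen for illustration; consult the Python front-end docs for the real ones):

```python
import os

def build_launch_command(exe, nodes, procs_per_node):
    """Assemble an srun-style launch command, letting environment
    variables override the in-code defaults (illustrative sketch)."""
    nodes = int(os.environ.get("LBANN_LAUNCHER_NODES", nodes))
    ppn = int(os.environ.get("LBANN_LAUNCHER_PROCS_PER_NODE", procs_per_node))
    args = ["srun", f"--nodes={nodes}", f"--ntasks-per-node={ppn}"]
    extra = os.environ.get("LBANN_LAUNCHER_ARGS", "")  # hypothetical name
    if extra:
        args.extend(extra.split())
    args.append(exe)
    return args

os.environ["LBANN_LAUNCHER_NODES"] = "4"
print(build_launch_command("lbann", 1, 2))
# -> ['srun', '--nodes=4', '--ntasks-per-node=2', 'lbann']
```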

Performance optimizations:
 - Parallelized Python data reader with "multiprocessing" module
 - Fused batchnorm statistics allreduces in forward and backward prop
 - Tuned concatenate and slice layers
 - Dynamically allocate and free memory for layer error signals (halves LBANN's memory footprint)
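The parallelized Python data reader builds on the standard `multiprocessing` module; the pattern is roughly the following (a generic fan-out sketch, not LBANN's actual reader):

```python
import multiprocessing as mp

def load_sample(index):
    # Stand-in for real I/O and preprocessing of one sample
    return [float(index), float(index) ** 2]

def load_epoch(indices, num_workers=4):
    """Fan sample loading out across a pool of worker processes."""
    with mp.Pool(processes=num_workers) as pool:
        return pool.map(load_sample, indices)

if __name__ == "__main__":
    samples = load_epoch(range(8), num_workers=2)
    print(samples[3])  # -> [3.0, 9.0]
```

Note that `load_sample` must be defined at module top level so worker processes can import it.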

Model portability & usability:
 - Bamboo tests for individual layers

Internal features:
 - Added support for DistConv features (distributed, generalized,
   parallel convolution)
 - Added support for NVSHMEM 1.0 API (used in distributed embedding
   layer and DistConv halo exchange)
 - Support for multiple data types per model (per-layer)
 - Support for per-layer mixed-precision weight training and inference,
   including per-weight-object and objective-function mixed precision
 - Improved how and when the RNGs are initialized
 - Callback to dump images to TensorBoard
 - Callback to save model weights (useful to export to PyTorch)
 - Callback to save top K models (LTFB)
 - Improved run-to-run reproducibility by initializing weights in alphabetical order
 - Moved models from model_zoo directory to applications directory
 - Cleanup and refactoring of callbacks and layer instantiation
 - Grouped batchnorm statistics
 - Callback to print model description
 - Refactored trainer and training-state out of the model class
 - Support for transposing data in matrix multiply layers
 - Added DiHydrogen tensor and DistConv library
 - Added parallel strategy to layer class to support DistConv
 - LBANN inference mode supports loading models from multiple directories
 - Cleanup of checkpoint and restart logic
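The run-to-run reproducibility item above rests on a simple idea: if weights draw their initial values from a seeded RNG in a fixed (alphabetical) order, every run gets the same values regardless of the order in which the weights were created. A minimal sketch of that idea (not LBANN's initializer):

```python
import random

def init_weights(weight_names, seed=20200730):
    """Draw one initial value per weight, in alphabetical order,
    so the assignment is independent of creation order."""
    rng = random.Random(seed)
    return {name: rng.uniform(-0.05, 0.05) for name in sorted(weight_names)}

# Two runs that created the same weights in different orders agree exactly.
a = init_weights(["conv1_bias", "fc_weight", "conv1_weight"])
b = init_weights(["fc_weight", "conv1_weight", "conv1_bias"])
print(a == b)  # -> True
```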

I/O & data readers:
 - Added in-memory data store that caches samples in CPU memory.  It can be loaded
   during the first epoch or preloaded
 - Added new "transform" data preprocessing ingestion pipeline
 - Added sample list format for specifying data sets
 - Introduced data coordinator that manages data readers and extracts them from
   the input layers
 - Data store is able to checkpoint / spill its contents to local disk
 - Data reader for SMILES strings
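The in-memory data store caches samples in CPU memory, filled either lazily during the first epoch or eagerly via preloading. Its behavior can be sketched as a memoizing wrapper around a sample loader (an illustration only, not the actual data store):

```python
class InMemoryDataStore:
    """Cache samples in memory: load from the backing reader on the
    first request (first epoch), serve from the cache afterwards."""

    def __init__(self, load_fn, preload_indices=None):
        self.load_fn = load_fn
        self.cache = {}
        # Optionally preload instead of filling during the first epoch
        for i in preload_indices or []:
            self.cache[i] = load_fn(i)

    def get(self, index):
        if index not in self.cache:
            self.cache[index] = self.load_fn(index)  # first-epoch fill
        return self.cache[index]

loads = []
def reader(i):
    loads.append(i)
    return i * 10

store = InMemoryDataStore(reader)
for epoch in range(3):
    values = [store.get(i) for i in range(4)]
print(values, len(loads))  # -> [0, 10, 20, 30] 4
```

Each sample hits the backing reader exactly once, no matter how many epochs run.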

Build system:
 - Hydrogen 1.3.4
 - Aluminum 0.3.3
 - Improved documentation on read the docs (RTD)
 - Robust support for using Spack as a build system around CMake
 - Identified compute centers for specifying build and run dependencies
 - Added Catch2-based tests

Bug fixes:
 - Fixed path resolution for dump weights, save model, and checkpoint callbacks
 - Added mutexes for preloading the data store
 - Fixed the LTFB exchange to include all ADAM optimizer state
 - Fixed the mapping of I/O RNGs to I/O processing threads to ensure
   consistent and correct multi-threaded performance
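The data-store preloading fix above addresses the classic shared-cache race: concurrent preload threads must not populate the same entry twice. A minimal sketch of the lock-guarded pattern (illustrative, using Python's `threading`):

```python
import threading

class GuardedCache:
    """A cache whose population is serialized by a mutex, so
    concurrent readers never load the same entry twice."""

    def __init__(self, load_fn):
        self.load_fn = load_fn
        self.cache = {}
        self.lock = threading.Lock()

    def get(self, index):
        with self.lock:  # without this, two threads could both load
            if index not in self.cache:
                self.cache[index] = self.load_fn(index)
            return self.cache[index]

calls = []
cache = GuardedCache(lambda i: calls.append(i) or i)
threads = [threading.Thread(target=cache.get, args=(0,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # -> 1
```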

Retired features:
 - The Moving MNIST data reader has been replaced by the Python data reader
 - The ASCII data reader is deprecated
bvanessen committed Jul 30, 2020
2 parents 018018b + e13d34c commit d0fbac3
Showing 1,308 changed files with 106,623 additions and 78,559 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -13,3 +13,8 @@ data.prototext*
# Can also ignore all directories and files in a directory.
# tmp/**/*
build
spack_environments/users/


# we don't want to collect slurm output
**/slurm-*.out
11 changes: 11 additions & 0 deletions .gitmodules
@@ -0,0 +1,11 @@
[submodule "applications/graph/snap"]
path = applications/graph/snap
url = https://github.com/snap-stanford/snap
ignore = dirty
[submodule "applications/graph/largescale_node2vec"]
path = applications/graph/largescale_node2vec
url = https://lc.llnl.gov/bitbucket/scm/havoq/largescale_node2vec.git
ignore = dirty
[submodule "applications/ATOM/moses"]
path = applications/ATOM/moses
url = [email protected]:samadejacobs/moses.git
17 changes: 17 additions & 0 deletions .readthedocs.yml
@@ -1,7 +1,24 @@
# .readthedocs.yml
# Config file for Read the Docs
# https://docs.readthedocs.io/en/stable/config-file/v2.html

version: 2

sphinx:
builder: html
configuration: docs/conf.py

formats: []

build:
image: latest

python:
version: 3.7
install:
- requirements: docs/sphinx_requirements.txt

submodules:
include: []


173 changes: 149 additions & 24 deletions CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.12)
+cmake_minimum_required(VERSION 3.13)

project(LBANN CXX)

@@ -48,7 +48,7 @@ endif ()
#

set(LBANN_VERSION_MAJOR 0)
-set(LBANN_VERSION_MINOR 99)
+set(LBANN_VERSION_MINOR 100)
set(LBANN_VERSION_PATCH 0)

set(LBANN_VERSION "${LBANN_VERSION_MAJOR}.${LBANN_VERSION_MINOR}.${LBANN_VERSION_PATCH}")
@@ -104,6 +104,20 @@ option(LBANN_WITH_CONDUIT "Enable Conduit library" ON)

option(LBANN_WITH_CUDNN "Include Nvidia cuDNN" ON)

option(LBANN_WITH_DIHYDROGEN "Build with DiHydrogen support" OFF)
if (LBANN_WITH_DIHYDROGEN)
message(WARNING "DiHydrogen support is currently experimental. "
"There is no stable interface. "
"Use caution before using any features.")
endif (LBANN_WITH_DIHYDROGEN)

option(LBANN_WITH_DISTCONV "Enable DiHydrogen's Distconv" OFF)
if (LBANN_WITH_DISTCONV)
message(WARNING "Distconv support is currently experimental. "
"There is no stable interface. "
"Use caution before using any features.")
endif (LBANN_WITH_DISTCONV)

option(LBANN_WITH_HWLOC
"Enable topology-aware optimizations" ON)

@@ -121,13 +135,10 @@ option(LBANN_WITH_VTUNE
option(LBANN_WITH_UNIT_TESTING
"Enable the unit testing framework (requires Catch2)" OFF)

-# Enable parallel random matrix generation, if possible
+# Use deterministic GPU algorithms and layer operations
option(LBANN_DETERMINISTIC
"Use deterministic algorithms as much as possible." OFF)

-option(LBANN_SEQUENTIAL_INITIALIZATION
-  "Sequentially consistent initialization" OFF)

option(LBANN_DEBUG_PRINT_SUBTARGETS
"Turn on debugging output of internal target properties." OFF)
mark_as_advanced(LBANN_DEBUG_PRINT_SUBTARGETS)
@@ -161,6 +172,11 @@ include(SetupCXX)
################################################################

# Required dependencies
find_package(Threads REQUIRED)

# Argument parsing backend
find_package(Clara REQUIRED)

find_package(CEREAL NO_MODULE
HINTS ${CEREAL_DIR} $ENV{CEREAL_DIR}
PATH_SUFFIXES share/cmake/cereal
@@ -172,16 +188,50 @@ set(LBANN_HAS_CEREAL ${CEREAL_FOUND})
# The imported target is just called "cereal". Super.

# Setup the linear algebra library
-find_package(Hydrogen 1.2.0 NO_MODULE QUIET
+find_package(Hydrogen 1.3.3 NO_MODULE QUIET
HINTS ${Hydrogen_DIR} ${HYDROGEN_DIR} $ENV{Hydrogen_DIR} $ENV{HYDROGEN_DIR}
PATH_SUFFIXES lib/cmake/hydrogen
NO_DEFAULT_PATH)
if (NOT Hydrogen_FOUND)
-  find_package(Hydrogen 1.2.0 NO_MODULE QUIET REQUIRED)
+  find_package(Hydrogen 1.3.3 NO_MODULE QUIET REQUIRED)
endif ()
message(STATUS "Found Hydrogen: ${Hydrogen_DIR}")
set(LBANN_HAS_HYDROGEN ${Hydrogen_FOUND})

# DiHydrogen and Distconv
if (LBANN_WITH_DISTCONV AND NOT LBANN_WITH_DIHYDROGEN)
message(FATAL_ERROR "Distconv requires DiHydrogen. Enable DiHydrogen to use Distconv.")
endif ()

if (LBANN_WITH_DIHYDROGEN)
if (LBANN_WITH_DISTCONV)
find_package(DiHydrogen CONFIG COMPONENTS Meta Patterns DistConv
HINTS ${DIHYDROGEN_DIR} $ENV{DIHYDROGEN_DIR}
${H2_DIR} $ENV{H2_DIR}
PATH_SUFFIXES install/lib64/cmake install/lib/cmake
NO_DEFAULT_PATH)
find_package(DiHydrogen CONFIG REQUIRED COMPONENTS Meta Patterns DistConv)
set(LBANN_HAS_DISTCONV TRUE)
else ()
find_package(DiHydrogen CONFIG COMPONENTS Meta Patterns
HINTS ${DIHYDROGEN_DIR} $ENV{DIHYDROGEN_DIR}
${H2_DIR} $ENV{H2_DIR}
PATH_SUFFIXES install/lib64/cmake install/lib/cmake
NO_DEFAULT_PATH)
find_package(DiHydrogen CONFIG REQUIRED COMPONENTS Meta Patterns)
endif ()
set(LBANN_HAS_DIHYDROGEN TRUE)
endif ()

# Inherit half-precision stuff from Hydrogen
set(LBANN_HAS_HALF ${HYDROGEN_HAVE_HALF}) # This is CPU-only

# Not the ideal fix, but should be fine for now.
if (Aluminum_FOUND)
message(STATUS "Aluminum found in Hydrogen. Using Aluminum.")
set(LBANN_WITH_ALUMINUM ON CACHE BOOL "Use aluminum." FORCE)
endif ()

include(SetupOpenMP)
include(SetupMPI)
include(SetupProtobuf)
@@ -201,6 +251,11 @@ set(LBANN_HAS_OPENCV ${OpenCV_FOUND})
set(LBANN_HAS_CUDA ${_HYDROGEN_HAVE_CUDA})
set(LBANN_WITH_CUDA ${LBANN_HAS_CUDA})

# Only used if have GPU and have CPU half.
if (LBANN_HAS_CUDA AND LBANN_HAS_HALF)
set(LBANN_HAS_GPU_FP16 ${HYDROGEN_GPU_USE_FP16})
endif ()

if (LBANN_HAS_CUDA)
enable_language(CUDA)

@@ -214,13 +269,15 @@ endif ()
if (LBANN_WITH_ALUMINUM)
# Aluminum may have already been found by Hydrogen
if (NOT Aluminum_FOUND)
find_package(Aluminum 0.2.0 NO_MODULE QUIET
message(WARNING
"Using Aluminum without Hydrogen support may not be well-supported.")
find_package(Aluminum 0.3.0 NO_MODULE QUIET
HINTS ${Aluminum_DIR} ${ALUMINUM_DIR} ${AL_DIR}
$ENV{Aluminum_DIR} $ENV{ALUMINUM_DIR} $ENV{AL_DIR}
PATH_SUFFIXES lib64/cmake/aluminum lib/cmake/aluminum
NO_DEFAULT_PATH)
if (NOT Aluminum_FOUND)
-    find_package(Aluminum 0.2.0 NO_MODULE QUIET)
+    find_package(Aluminum 0.3.0 NO_MODULE QUIET)
endif ()
endif ()
set(LBANN_HAS_ALUMINUM ${Aluminum_FOUND})
@@ -264,13 +321,28 @@ if (LBANN_HAS_CUDA)

include(SetupCUDAToolkit)

if (LBANN_HAS_GPU_FP16)
set_property(TARGET cuda::toolkit PROPERTY
INTERFACE_COMPILE_OPTIONS $<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_60>)
endif (LBANN_HAS_GPU_FP16)

set(LBANN_HAS_CUDNN ${CUDNN_FOUND})

if (LBANN_HAS_ALUMINUM AND AL_HAS_NCCL)
set(LBANN_HAS_NCCL2 TRUE)
else ()
set(LBANN_HAS_NCCL2 FALSE)
endif ()

if (LBANN_WITH_NVSHMEM)
find_package(NVSHMEM REQUIRED)
set_property(TARGET cuda::toolkit PROPERTY
INTERFACE_COMPILE_OPTIONS $<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_70>)
# Build LBANN as a static library to get around a bug in NVSHMEM
set(BUILD_SHARED_LIBS OFF)
endif ()
set(LBANN_HAS_NVSHMEM "${NVSHMEM_FOUND}")

endif (LBANN_HAS_CUDA)

# This shouldn't be here, but is ok for now. This will occasionally be
@@ -415,22 +487,28 @@ if (LBANN_WITH_CONDUIT)
endif ()
endforeach ()

get_filename_component(_conduit_include_dirs
"${CONDUIT_INCLUDE_DIRS}" DIRECTORY)

if (HDF5_FOUND_WITH_MODULE)
list(APPEND _conduit_interface_link_libs
${HDF5_LIBRARIES})

set_target_properties(conduit::conduit
PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${HDF5_INCLUDE_DIRS}")
list(APPEND _conduit_include_dirs
"${HDF5_INCLUDE_DIRS}")
endif ()

set_property(TARGET conduit::conduit
PROPERTY
INTERFACE_INCLUDE_DIRECTORIES
"${_conduit_include_dirs}")

set_target_properties(conduit::conduit
PROPERTIES
INTERFACE_LINK_LIBRARIES
"${_conduit_interface_link_libs}")

set(CONDUIT_LIBRARIES conduit::conduit)
set(LBANN_HAS_CONDUIT ${Conduit_FOUND})
endif (LBANN_WITH_CONDUIT)

if (LBANN_WITH_UNIT_TESTING)
@@ -446,7 +524,11 @@ if (LBANN_WITH_UNIT_TESTING)
# Now that Catch2 has been found, start adding the unit tests
include(CTest)
include(Catch)
add_subdirectory(src/proto/unit_test)
add_subdirectory(src/utils/unit_test)
add_subdirectory(src/weights/unit_test)
add_subdirectory(src/transforms/unit_test)
add_subdirectory(src/transforms/vision/unit_test)

# Add this one last
add_subdirectory(unit_test)
@@ -459,16 +541,16 @@ add_subdirectory(docs)
# Build LBANN
################################################################

# Add LBANN source files
add_subdirectory(include)
add_subdirectory(src)

# Write the configure file
configure_file(
"${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_config.hpp.in"
"${CMAKE_BINARY_DIR}/lbann_config.hpp"
@ONLY)

# Add LBANN source files
add_subdirectory(include)
add_subdirectory(src)

# Create the LBANN library
add_library(lbann ${LBANN_SOURCES} ${LBANN_HEADERS} ${LBANN_CUDA_SOURCES})

@@ -477,12 +559,10 @@ target_include_directories(lbann PUBLIC
$<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/include>
$<INSTALL_INTERFACE:${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_INCLUDEDIR}>)

if (LBANN_HAS_PYTHON)
target_include_directories(lbann PUBLIC ${Python_INCLUDE_DIRS})
endif ()

# Use the IMPORTED targets when possible.
target_link_libraries(lbann PUBLIC LbannProto)
target_link_libraries(lbann PUBLIC Threads::Threads)
target_link_libraries(lbann PUBLIC clara::clara)
target_link_libraries(lbann PUBLIC cereal)
target_link_libraries(lbann PUBLIC OpenMP::OpenMP_CXX)
target_link_libraries(lbann PUBLIC MPI::MPI_CXX)
@@ -491,6 +571,15 @@ target_link_libraries(lbann PUBLIC ${HYDROGEN_LIBRARIES})
target_link_libraries(lbann PUBLIC ${OpenCV_LIBRARIES})
target_link_libraries(lbann PUBLIC ${CONDUIT_LIBRARIES})

target_link_libraries(lbann PUBLIC
$<TARGET_NAME_IF_EXISTS:H2::H2Meta>
$<TARGET_NAME_IF_EXISTS:H2::H2Patterns>
)

if (LBANN_WITH_DISTCONV)
target_link_libraries(lbann PUBLIC H2::H2DistConv)
endif ()

if (LBANN_HAS_TBINF)
target_link_libraries(lbann PUBLIC TBinf)
endif ()
@@ -512,7 +601,12 @@ if (LBANN_HAS_VTUNE)
endif ()

if (LBANN_HAS_PYTHON)
-  target_link_libraries(lbann PUBLIC ${Python_LIBRARIES})
+  target_link_libraries(lbann PUBLIC Python::Python)
endif ()

if (LBANN_HAS_NVSHMEM)
set_property(TARGET lbann PROPERTY CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(lbann PUBLIC NVSHMEM::NVSHMEM)
endif ()

if (TARGET LBANN_CXX_FLAGS_werror)
@@ -521,6 +615,27 @@ endif ()

target_link_libraries(lbann PUBLIC ${DL_LIBRARY})

# Fix the -g issue with Clang on OSX
if (APPLE)
# Remove -g from the options
string(REPLACE "-g" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
string(REPLACE "-g" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")

# Get all the sources and add "-g" to all of them.
get_target_property(_LBANN_SRCS lbann SOURCES)
set_source_files_properties(${_LBANN_SRCS}
PROPERTIES COMPILE_OPTIONS "-g")

# Cleanup source files
foreach (bad_file IN LISTS _LBANN_SRCS)
get_source_file_property(
_SRC_COMPILE_OPTS "${bad_file}" COMPILE_OPTIONS)
string(REPLACE "-g" "" _SRC_COMPILE_OPTS "${_SRC_COMPILE_OPTS}")
set_source_files_properties(
"${bad_file}" PROPERTIES COMPILE_OPTIONS "${_SRC_COMPILE_OPTS}")
endforeach ()
endif ()

# Clean things up
include(LBANNDebugUtilities)
lbann_remove_default_include_paths_from_all_subtargets(lbann)
@@ -539,6 +654,8 @@ endif ()
add_subdirectory(model_zoo)
add_subdirectory(model_zoo/tests)
add_subdirectory(model_zoo/jag_utils)
add_subdirectory(applications/CANDLE/pilot2/tools)
add_subdirectory(applications/ATOM/utils)
add_subdirectory(tests)
add_subdirectory(scripts)

@@ -733,6 +850,8 @@ string(APPEND _str "\n")
#Print the true/false guys
append_str_tf(_str
LBANN_GNU_LINUX
LBANN_HAS_DIHYDROGEN
LBANN_HAS_DISTCONV
LBANN_HAS_HYDROGEN
LBANN_HAS_OPENCV
LBANN_HAS_CEREAL
@@ -747,7 +866,6 @@ append_str_tf(_str
LBANN_HAS_DOXYGEN
LBANN_HAS_LBANN_PROTO
LBANN_HAS_ALUMINUM
LBANN_HAS_CONDUIT
LBANN_HAS_PYTHON)
string(APPEND _str
"\n== End LBANN Configuration Summary ==\n")
@@ -774,6 +892,13 @@ configure_file(
"${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_module.lua.in"
"${CMAKE_BINARY_DIR}/lbann_module.lua.install"
@ONLY)
configure_file(
"${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_module.tcl.in"
"${CMAKE_BINARY_DIR}/lbann_module.tcl.install")

install(FILES "${CMAKE_BINARY_DIR}/lbann_module.lua.install"
RENAME "${LBANN_MODULEFILE_NAME}"
DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/modulefiles")
install(FILES "${CMAKE_BINARY_DIR}/lbann_module.tcl.install"
RENAME "${LBANN_VERSION}"
DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/modulefiles/lbann")