All notable changes to this project will be documented in this file.
- Upgrade Slurm to 23.11.8
- Update NVIDIA driver to 550.90.07.
- Add support for building open source compliant NVIDIA drivers
- Clean up partial method mutation of stdlib
- Update slurm-gcp to allow for different universe domains and api endpoints
- Sync script files from bucket
- Add `resource_policies` to template
- Expose TreeWidth and TopologyPlugin variables
- Add 'source' argument for path to prolog or epilog scripts
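  A rough illustration of the new argument: a controller-module call might pass a local path via `source` instead of inlining the script body. The surrounding input name `prolog_scripts`, its fields, and the module path are assumptions here, not confirmed by this entry.

  ```hcl
  # Hypothetical sketch only: module path, input name, and field names are assumptions.
  module "slurm_controller" {
    source = "./modules/slurm_controller_instance" # placeholder path

    prolog_scripts = [
      {
        filename = "prolog-check-gpus.sh"                        # hypothetical script name
        source   = "${path.module}/scripts/prolog-check-gpus.sh" # the new 'source' path argument
      },
    ]
  }
  ```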
- Persist templates content in the config instead of bucket
- Add support for shared reservations
- Update latest versions for NCCL plugin and rxdm
- Update A3 Mega solution to use NCCL plugin 1.0.2
- Fix bug where dynamic nodes with a hostname set to the FQDN could not be found by the slurmsync script (in the current architecture this impacts login nodes) and the resulting exception was uncaught. All node names are now truncated to the short hostname before performing comparisons that require it
- Fix bug where compute nodes with hostname set to FQDN could not find their nodeset-specific startup-script
- Fix bug introduced in #134 mis-ordering tasks in Docker image building
- Address Docker image building failures for TPU use cases
- Refactor network storage from setup script to separate script
- Relax node name pattern matching to accommodate MIG instances (experimental)
- Address image building failure on CentOS 7 and derivatives
- Address image building failure on Ubuntu OSes
- NVIDIA K80 variant image is no longer published because the hardware has reached EOL on Google Cloud
- Add ability to build Slurm on RedHat Enterprise 8 and 9 (not yet a tested or supported configuration)
- Terraform modules are now compatible with google provider 5.x
- Do not hardcode the hostname (`slurmd -N`) argument when slurmd pulls configuration from the controller; instead rely on automatic detection. Resolves an issue where the hostname is temporarily set to the FQDN on some hosts
- Support nodeset-specific network_storage
- Suppress noisy error message when deletion of in-use resource policies fails
- Fix SocketPerBoard computation
- Add support for externally-managed prolog/epilog scripts
- Add example receive-data-path-manager prolog/epilog
- Fix installation of NVIDIA open drivers in image building solution
- Remove slurmd restart cronjob (implemented using SystemD in abb487f1)
- Write Slurm node configurations with `SocketsPerBoard` rather than `Sockets`, to align with the recommendation in the 23.11 documentation
- Add `replace_trigger` to `slurm_instance` for fine-grained control over instance replacement
- Add support for installing NVIDIA open drivers
- Add support for PMIx to image building solution (images published after this release will contain PMIx)
- Support specifying monitoring agent in Packer
- TPU: update supported docker images list
- Fix: prevent controller from mounting munge key by NFS
- Fix: retry DNS lookups during early setup process
- Fix: ensure VM always boots into kernel against which NVIDIA drivers were built
- Fix: resolve bug in `enable_configure` by removing unused parameter from `nodeset_switch_lines` function
- Enable building Slurm on Debian 12 images (not yet officially supported or tested)
- Add support for nodeset-specific startup scripts
- Fix slurmsync.py attempting to delete static placement policies because they are not associated with a job. The deletion would fail since nodes were still using the placement policy, but it caused spurious failure messages in the logs.
- Fix regression in slurmsync that would cause inadvertent deletion of compact placement policies right before they are used
- Remove configuration settings that are no longer valid in Slurm 23.11
  - CommunicationParameters: NoAddrCache
  - CgroupAutomount
- Use beta API for Google Compute Engine due to usage of maintenanceInterval setting on VMs
- Enable explicit setting of `maintenance_interval` on nodesets
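  A minimal sketch of setting the new input on a nodeset; the module path and sibling inputs are placeholders, and `PERIODIC` is assumed to be the Compute Engine value this setting maps to.

  ```hcl
  # Sketch only: module path and sibling inputs are placeholders.
  module "nodeset_c3" {
    source = "./modules/slurm_nodeset" # placeholder path

    nodeset_name         = "c3nodes"
    maintenance_interval = "PERIODIC" # forwarded to the VM's maintenanceInterval (beta API)
  }
  ```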
- Fix solution for installing Lustre drivers in Rocky Linux 8
- Change slurm version to 23.11.3
- Update cuda to 12.3.2
- Fix slurmsync error on removing powered up nodes.
- Fix slurmrestd address in service file
- slurmsync.py: fix concurrent invocation protection
- Fix TPU nodeset module to use the Cloud HPC Toolkit VPC module
- Configure slurmd to restart on failure
- Fix CloudSQL configuration
- Properly capture slurm node status when reboot is issued
- Install legacy stackdriver by default instead of ops agent for improved performance
- TPU: fix the previously added `reserved` property for nodeset_tpu
- Add reserved property for nodeset_tpu
- Update lustre repository URL
- Upgrade installed Slurm to 23.02.7
- Fix deprecation warning in google_secret_manager_secret.
- Fix TPU delete_node API return message.
- Reverse logic in valid_placement_nodes
- Add slurm_gcp_plugin support.
- Add reservation affinity to nodesets via reservation_name option.
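  A hedged sketch of pointing a nodeset at an existing reservation via the new option; the module path and the reservation shown are placeholders.

  ```hcl
  # Sketch only: module path and reservation are placeholders.
  module "nodeset_a2" {
    source = "./modules/slurm_nodeset" # placeholder path

    nodeset_name     = "a2nodes"
    reservation_name = "my-gpu-reservation" # consumes the named Compute Engine reservation
  }
  ```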
- Change TPU node conf to be based on TPU version instead of TPU model.
- Add support for TPUv4
- Upgrade installed Slurm to 23.02.5
- Fix accelerator optimized machine type SMT handling.
- Prefix user-visible errors with their source.
- Fix accelerator optimized machine type socket handling.
- Only compare config.yaml blob to cache file.
- Fix login nodes appearing as compute nodes in Slurm output.
- Add enable_debug_logging and extra_logging_flags to terraform.
- Only attempt static node resume when node is powered down.
- Fix CUDA on Ubuntu by installing CUDA via runfile alongside NVIDIA driver from signed repo.
- Fix conf generation issue on reconfiguration.
- Fix suspend issue with TPU nodes
- Add TPU job example
- Changed slurm dependency from man2html to man2html-base and man2html-core to reduce image size
- Changed default docker image name to remove the OS reference
- Add on_host_maintenance to packer module to support instances with GPUs.
- Fix retry of powering up static nodes on failure.
- Add support for H3 machines and enumerated multi-socket processors.
- Fix munge failing after manual reboot of node.
- [Beta feature] Added support for TPU-vm nodes.
- [Beta feature] Added support for TPU-vm multi-rank nodes.
- Add `ignore_prefer_validation` to SchedulerParameters in generated cloud.conf.
- Remove unaltered centos-7 image from actively published and supported images.
- Upgrade installed Slurm to 23.02.4.
- Fix CUDA install on Ubuntu 20.04.
- Add slurm cluster management daemon
- Update default Slurm version to 23.02.2.
- Make `slurm_cluster` root module use terraform 1.3 and optional object fields.
- Reconfigure now is a service on the instances.
- Move from project metadata to GCS bucket to store cluster files.
- Factored out nodeset modules (regular, dynamic) from partition module.
- Replace `zone_policy_*` with `zones` in nodeset module.
- Replace `access_config` with `enable_public_ip` and `network_tier`.
- Add partition options `default`, `resume_timeout`, `suspend_time`, `suspend_timeout` (see the sketch after this list).
- Increase `nodeset_name` length to 15 characters (from 7).
- Remove `partition_name` length limit.
- Add `bandwidth_tier` support to instance templates.
- Move `spot` preemptible support to instance template.
- Fix login template name not using `group_name` in name schema.
- Add `enable_login` to toggle creation of login node resources.
- Remove partition level startup-scripts and network mounts.
- Fix Ubuntu 20.04 NVIDIA install.
- Change partition level placement policy to nodeset level.
- Use `topology.conf` to prioritize nodes within nodesets.
- Remove debian-10 and vanilla rocky-linux-8 images from build process and support.
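  Taken together, the renamed and added options above might be wired up roughly like this; module paths and the exact placement of inputs between nodeset and partition are assumptions rather than the modules' confirmed schema.

  ```hcl
  # Sketch only: module paths and exact input placement are assumptions.
  module "nodeset" {
    source = "./modules/slurm_nodeset" # placeholder path

    nodeset_name     = "compute"                              # may now be up to 15 characters
    zones            = ["us-central1-a", "us-central1-b"]     # replaces zone_policy_*
    enable_public_ip = false                                  # replaces access_config
    network_tier     = "STANDARD"
  }

  module "partition" {
    source = "./modules/slurm_partition" # placeholder path

    partition_name  = "debug"
    default         = true
    resume_timeout  = 300
    suspend_time    = 300
    suspend_timeout = 300
  }
  ```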
- Fix threads per core inference.
- Upgrade Slurm to 23.02.3
- Set EOL of published centos-7 image to Aug 2023. If you need this image for longer, consider switching to hpc-centos-7, which will have support through Jan 2024.
- Allow metadata key `slurmd_feature` to initiate dynamic node setup (see the sketch below).
- Fix dynamic nodes using cloud_dns instead of cloud_reg_addrs.
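  For an externally managed dynamic node, the feature can be supplied as instance metadata; a hedged Terraform sketch in which everything except the `slurmd_feature` key is a placeholder.

  ```hcl
  # Sketch only: all attributes other than the slurmd_feature metadata key are placeholders.
  resource "google_compute_instance" "dynamic_node" {
    name         = "dyn-node-0"
    machine_type = "c2-standard-4"
    zone         = "us-central1-a"

    boot_disk {
      initialize_params {
        image = "family/slurm-gcp-5-7-hpc-centos-7" # placeholder image
      }
    }

    network_interface {
      subnetwork = "default"
    }

    metadata = {
      slurmd_feature = "dyn" # picked up during setup to initiate dynamic node configuration
    }
  }
  ```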
- Disable TreeWidth when dynamic nodes are configured.
- Fix dynamic nodes failing to download custom scripts.
- Fix slurmsync with only dynamic nodes in system.
- Fix NVIDIA driver install after kernel upgrade for rocky-linux-8.
- Fix detecting gpus on certain machine types.
- Forward additional error information to node reason and salloc/srun.
- Fix warning on missing python library httplib2.
- Fix lustre support in images by installing the latest available client version for each OS.
- Change image name format to use Slurm-GCP version instead of the Slurm version. eg. slurm-gcp-5-7-hpc-centos-7
- Disable Lustre install for Debian 11 because it is currently incompatible. It wasn't working anyway.
- Fix DefMemPerCPU on partitions that only contain dynamic nodes.
- Add hpc-rocky-linux-8 image build using cloud-hpc-image-public/hpc-rocky-linux-8 as a base.
- Update default slurm to 22.05.9
- Fix regression in load_bq.py.
- Fix slurmsync.py handling of pub/sub subscriptions when enable_reconfigure.
- Add retries to munge mount to handle the case of attempted mount before the controller is ready.
- Fix partition generation with both nodes and feature.
- Fix regex parser for ops agent ingestion of slurm logs.
- Add wait on placement group creation to avoid race condition during resume.
- Expose `cloud_logging_filter` output from controller modules.
- Add `partition_feature` for external dynamic nodes.
- Add setup.py --slurmd-feature option for external dynamic node startup.
- Add installation of signed NVIDIA drivers from the repo to ansible, for Ubuntu 20.04 only. This allows using GPUs on shielded VMs.
- Added legacy K80 support to ansible and made installing the latest NVIDIA driver the default. A K80-compatible image based on the hpc-centos-7 image family will now also be built.
- Packer module refactored to build only a single image at a time.
- Add HTC example terraform project.
- Increase project metadata timeouts.
- Add cluster cloud logging filter output.
- Add Ubuntu 22.04 LTS support.
- Add preliminary ARM64 image support for T2A instances.
- Add Debian 11 support.
- Add Rocky Linux 8 support.
- Adjust logging for config not found.
- Fix Terraform 1.2 incompatibility introduced in 5.6.1.
- Fix Terraform 1.4.0 incompatibilities.
- `setup.log` now discoverable in GCP Cloud Logging.
- Job `admin_comment` contains last allocated node failure.
- Resume failures now notify srun of the error.
- startup-script - Add logging level prefixes for parsing.
- Fix slurm and slurm-gcp logs not showing up in Cloud Logging.
- resume.py - No longer validate machine_type with placement groups.
- Raise error from incorrect settings with dependent inputs.
- For gcsfuse network storage, server_ip can be null/None or "".
- Fix munge mount export from controller.
- Enable DebugFlags=Power by default
- Make slurmsync check preemptible status from instance rather than the template.
- Add support for custom machine types.
- Add job id label for exclusive nodes.
- Add zone_target_shape to partitions, mapped to bulkInsert targetShape.
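  A hedged sketch of the new partition input; the module path and sibling inputs are placeholders, and the value shown is one of the Compute Engine bulkInsert target shapes.

  ```hcl
  # Sketch only: module path and sibling inputs are placeholders.
  module "partition_spread" {
    source = "./modules/slurm_partition" # placeholder path

    partition_name    = "spread"
    zone_target_shape = "ANY_SINGLE_ZONE" # mapped to the bulkInsert targetShape field
  }
  ```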
- Fix Lustre mounts failing because of failing to resolve server IP address.
- Fix external network_storage being added to exportfs
- Fix supported instance family for placement groups
- Add support for c3 instance family for placement groups.
- Properly export job comment and admin comment to BigQuery.
- Slurm updated to 22.05.8
- Use FQDN as default `slurm_control_addr`.
- Mounts use `slurm_control_addr`, if available, otherwise `slurm_control_host`.
- Add `var.install_dir` to `module.slurm_controller_hybrid` to specify the intended directory where the files are to be installed.
- Add `var.slurm_control_host_port` to `module.slurm_controller_hybrid` to specify the port for slurmd to connect for configless setup.
- Add `var.munge_mount` to `module.slurm_controller_hybrid` to specify an external munge.key source (see the combined sketch below).
- Fix unwanted mounting of login_network_storage on compute nodes.
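  A hedged sketch combining the three hybrid inputs added above; the module path, values, and the field names inside `munge_mount` are placeholders and assumptions, not the module's confirmed schema.

  ```hcl
  # Sketch only: module path, values, and munge_mount field names are assumptions.
  module "slurm_controller_hybrid" {
    source = "./modules/slurm_controller_hybrid" # placeholder path

    install_dir             = "/etc/slurm" # where generated files should be installed
    slurm_control_host_port = "6820"       # port slurmd uses for configless setup

    munge_mount = {
      server_ip     = "nfs.example.com"    # external munge.key source (placeholder)
      remote_mount  = "/export/munge"
      fs_type       = "nfs"
      mount_options = ""
    }
  }
  ```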
- Add CI testsuite using gitlab CI.
- Use configless mode for cluster configuration management.
- Hybrid - Write files with more restrictive permissions.
- Fix `module.slurm_cluster` not propagating `var.disable_default_mounts` to `module.slurm_controller_hybrid`.
- Fix creation of instances in placement groups.
- slurm_controller_hybrid - add `var.slurm_control_addr` option to allow a secondary address to be used.
- No longer create *.conf.bak files.
- Upgrade Slurm to 22.05.6.
- WARNING: Breaking change to terraform modules -- default image has changed.
- `module.slurm_instance_template` - remove unused `var.bandwidth_tier`.
- Add support for `access_config` on partition compute nodes.
  - WARNING: Breaking change to `slurm_cluster`, `slurm_partition` modules -- new field `access_config`.
- Add support for configurable startup script timeout.
  - WARNING: Breaking change to `slurm_cluster`, `slurm_partition` modules -- new field `partition_startup_script_timeout`.
- Fix enable_cleanup_subscriptions defined but not used in certain examples.
- Fix `slurm_controller_hybrid` not respecting `disable_default_mounts`.
- Removed unused `slurm_depends_on` from modules.
- Fix scripts running without config.yaml.
- Reimplement backoff delay on retry attempts.
- Upgrade Slurm to 22.05.4.
- WARNING: Breaking change to terraform modules -- default image has changed.
- Fix Nvidia ansible role install when kernel is updated.
- Add support for gvnic and tier1 networking.
  - WARNING: Breaking change to `slurm_cluster`, `slurm_partition` modules -- new field `bandwidth_tier`.
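  A hedged sketch of enabling gVNIC with tier1 networking on a partition via the new field; the module path and sibling inputs are placeholders, and the exact accepted value names are an assumption.

  ```hcl
  # Sketch only: module path, sibling inputs, and value naming are assumptions.
  module "partition_tier1" {
    source = "./modules/slurm_partition" # placeholder path

    partition_name = "tier1"
    bandwidth_tier = "tier_1_enabled" # enables gVNIC + TIER_1 networking (value name assumed)
  }
  ```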
- Fix get_insert_operations using empty filter item.
- Fix project id in wait_for_operation for resume.py from some hybrid setups.
- Add more useful error logging to resume.py
- Fix resume.py starting more than 1k identical nodes at a time.
- Ensure proper slurm ownership on instance template info cache.
- Honor `disable_smt=true` on compute instances.
- Improve logging and add logging flags to show API request details.
- Fix usage of slurm_control_host in config.yaml.
- Fix partition network storage.
- Improved speed of creating instances.
- In scripts, ignore GCP instances without Slurm-GCP metadata.
- Upgrade Slurm to 22.05.3.
- Pin lustre version to 2.12.
- Allow configuring controller hostname for hybrid deployments.
- {resume|suspend}.py ignore nodes not in cloud configuration (config.yaml).
- Constrain packages in Pipfile and requirements.txt
- Allow hybrid scripts to succeed when slurm user does not exist.
- Fix pushing cluster config to project metadata in hybrid terraform deployments.
- Add disable_default_mounts option to terraform modules. This is needed for hybrid deployments.
- Ensure removal of placement groups on failed resume.
- Add retries and error logging to writing the template info cache file.
- Remove nonempty option from gcsfuse mounts. That option is no longer supported in fusermount3
- Restore lustre download url in ansible role. The url change was reverted.
- Fix applying enable_bigquery_load to an existing cluster
- Fix setting resume/suspend_rate
- Do not set PrologSlurmctld and EpilogSlurmctld when no partitions have enable_job_exclusive.
- Change max size of placement group to 150.
- Allow a2 machine types in placement groups.
- Constrain length of variables that influence resource names.
- Add Slurm shell completion script to environment.
- Use Slurm service files compiled from source.
- Mitigate importlib.util failing on python > 3.8
- Add options to build lighter-weight images by disabling some ansible roles (eg. CUDA).
- Fix lustre rpm download url for image creation.
- Remove erroring and redundant package libpam-dev from debian image creation.
- Upgrade python library google-cloud-storage to ~2.0.
- Upgrade Slurm to 22.05.2.
- Disable ConstrainSwapSpace in etc/cgroup.conf.tpl
- Remove leftover home dir after ansible provisioning of image.
- Convert NEWS to CHANGELOG.md
- Create ansible roles from foundry build process.
- Add packer and ansible based image building process and configuration.
- Slurm scripts are baked into the image.
- Remove foundry based image building process.
- Create new Slurm terraform modules and examples, using cloud-foundation-toolkit and best practices.
- Use terraform module to define Slurm partitions.
- Use instance templates to create Slurm instances.
- Support partitions with heterogeneous compute nodes.
- Rename partition module boolean options.
- Change how static and dynamic nodes are defined.
- Change how zone policy is defined in partition module.
- Store cluster Slurm configuration data in project metadata.
- Add top level terraform module for a Slurm cluster.
- Add pre-commit hooks for terraform validation, formatting, and documentation.
- Slurm cluster resources are labeled with slurm_cluster_id.
- All compute nodes are managed by the controller module.
- Update module option for toggling simultaneous multithreading (SMT).
- scripts - downgrade required python version to 3.6
- Add new hybrid management process using terraform Slurm modules.
- Rename metadata slurm_instance_type to slurm_instance_role.
- Add module option for cluster development mode.
- Change terraform minimum required version to 1.0
- Remove old terraform modules and examples.
- Add ansible role to install custom user scripts from directory.
- Change `*-custom-install` variable names; they can now accept multiple custom user scripts.
- Store cluster provisioning user scripts in project metadata.
- Unify partition option naming with configuration object.
- Add module option to toggle os-login based authentication.
- Add new module for creating Slurm cluster service accounts and IAM.
- Add new module for creating Slurm cluster firewall rules.
- Add pre-commit hooks for python linting and formatting.
- Add terraform examples to fully manage cluster deployment.
- Add terraform example for custom authentication using winbind.
- Change image naming template to prevent name collision with v4.
- Add job workflow helper script to submit and migrate job data.
- Harden secrets management (e.g. cloudsql, munge, jwt).
- Add module option for job level prolog and epilog user scripts.
- Add module option for partition level node configuration scripts.
- Add module option for partition line configuration.
- Add module option for node line configuration.
- Use Google ops-agent for cloud logging and monitoring.
- Add packer configuration option for user ansible roles.
- Add module option for job account data storage in BigQuery.
- Add module option to cleanup orphaned compute and placement group resources.
- Add module option to reconfigure the cluster when Slurm configurations change (e.g. slurm.conf, partition definitions).
- Add module option to cleanup orphaned subscription resources.
- Add pre-commit hooks for miscellaneous formatting and validation.
- Add pre-commit hooks for yaml formatting.
- Add Google services enable/check to examples.
- scripts - improve reporting of missing imported modules.
- Allow suspend.py to delete exclusive instances, which allows power_down_force to work on exclusive nodes.
- Add better error reporting in setup script for invalid machine types.
- Allow partial success in bulkInsert for resume.py
- Add bulkInsert operation failure detection and logging
- Force `enable_placement_groups=false` when `count_static > 0`.
- Add module variable `zone_policy_*` validation.
- Filter module variable `zone_policy_*` input with region.
- Change ansible to install cuda and nvidia from runfile
- Reimplement spot instance support.
- Rename `*_d` startup script variables to `*_startup_scripts`.
- Eliminate redundant `slurm_cluster_id`.
- Add additional validation of slurm_cluster partitions.
- Remove need for gpu instance by packer
- Disable LDAP ansible role for Debian family
- Rename node count fields.
- Upgrade Slurm to version 21.08.8
- Add proper retry on mount attempts
- Add cluster_id and job_db_uuid fields to BQ table schema.
- Fix potential race condition in loading BQ job data.
- Remove deployment manager support.
- Update Nvidia to 470.82.01 and CUDA to 11.4.4
- Re-enable gcsfuse in ansible and workaround the repo gpg check problem
- Fix partition-specific network storage from controller to compute nodes.
- Bump urllib3 from 1.26.4 to 1.26.5 in foundry.
- Bump ipython from 7.21.0 to 7.31.1 in foundry.
- Updated Singularity download URL in custom-controller-install script.
- Fix static compute nodes being destroyed when `exclusive=true`
- Add CompleteWait to mitigate a race of latent operations from (Epilog|Prolog)Slurmctld from causing node failure on subsequent jobs.
- Fix calling "scontrol ... state=resume" in suspend.py for all nodes multiple times for exclusive jobs.
- Add preliminary spot instance support (eg. preemptible_bursting = "spot").
- Regularly delete instances corresponding to Slurm nodes that should be powered down.
- Upgrade to Slurm 21.08.4.
- Pin Nvidia driver to 460.106.00-1.
- Pin Cuda to 11.2.2.
- Pin gcloud to 365.0.1-1 on centos images - workaround broken package.
- Enable swap cgroup control on debian images - fixes a Slurm compute node error.
- Add startup scripts as terraform vars.
- setup.py - change LLN=yes to LLN=no
- slurmsync.py - fix powering up nodes from being downed.
- suspend.py - now handles "Quota exceeded" error
- Support for Intel-select options
- slurmrestd - changed user from root to user slurmrestd
- resume.py - fix state=down reason being malformed
- suspend.py - scontrol update now specifies new state=power_down_force
- slurm.conf - update to AccountingStoreFlags=job_comment
- slurmsync.py - state flags use new POWERED_DOWN state
- Updated Slurm to version 21.08.2
- Configure sockets, cores, threads on compute nodes for better performance with `cons_tres`.
- Introduce NEWS file
- Recommended image is now `schedmd-slurm-public/hpc-centos-7-schedmd-slurm-20-11-7`
- Changed slurmrestd port to 6842 (from 80)
- `partitions[].image_hyperthreads=false` now actively disables hyperthreads on hpc-centos-7 images, starting with the now recommended image
- `partitions[].image_hyperthreads` is now true in tfvars examples
- Fixed running of `custom-compute-install` on login node
- Fixed slurmrestd install on foundry debian images
- Disable SELinux (was permissive) to fix hpc-centos-7 reboot issue
- Updated Slurm to 20.11.7