This module creates a compute partition that can be used as input to the schedmd-slurm-gcp-v5-controller.
The partition module is designed to work alongside the schedmd-slurm-gcp-v5-node-group module. A partition can be made up of one or more node groups, provided either through use (preferred) or defined manually in the node_groups variable.
Warning: updating a partition and running terraform apply will not cause the Slurm controller to update its own configurations (slurm.conf) unless enable_reconfigure is set to true in the partition and controller modules.
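As a minimal sketch of the setting described in the warning above, the blueprint fragment below sets enable_reconfigure on both the partition and the controller. The module id slurm_controller and the controller source path are assumptions for illustration, and network1 and node_group_1 are assumed to be defined elsewhere in the blueprint.

```yaml
- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - node_group_1
  settings:
    partition_name: compute
    # Must be set here and on the controller for slurm.conf to be updated.
    enable_reconfigure: true

- id: slurm_controller  # hypothetical id, not taken from this document
  # Assumed source path for the controller module; verify against your Toolkit version.
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - compute_partition
  settings:
    enable_reconfigure: true
```

Note that, per the input description of enable_reconfigure, toggling it destroys deployed compute nodes and requeues their jobs, so it is probably best set when the cluster is first deployed.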
The following code snippet creates a partition module with:
- 2 node groups added via use.
  - The first node group is made up of machines of type c2-standard-30.
  - The second node group is made up of machines of type c2-standard-60.
  - Both node groups have a maximum count of 200 dynamically created nodes.
- partition name of "compute".
- connected to the network1 module via use.
- nodes mounted to homefs via use.
- id: node_group_1
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: c30
    node_count_dynamic_max: 200
    machine_type: c2-standard-30

- id: node_group_2
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: c60
    node_count_dynamic_max: 200
    machine_type: c2-standard-60

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - homefs
  - node_group_1
  - node_group_2
  settings:
    partition_name: compute
For a complete example using this module, see slurm-gcp-v5-cluster.yaml.
The Slurm on GCP partition module allows you to specify additional zones in which to create VMs through bulk creation. This is valuable when a partition uses popular VM families and you want access to more compute resources across zones.
WARNING: Lenient zone policies can lead to additional egress costs when moving large amounts of data between zones in the same region, for example, traffic between VMs and traffic from VMs to shared filesystems such as Filestore. For more information on egress fees, see the Network Pricing Google Cloud documentation.
To avoid egress charges, ensure your compute nodes are created in a single zone by setting var.zone and leaving var.zones at its default value, the empty list.
NOTE: If a new zone is added to the region while the cluster is active, nodes in the partition may be created in that zone. In this case, the partition may need to be redeployed (possible via enable_reconfigure if set) to ensure the newly added zone is denied.
In the zonal example below, the partition's zone implicitly defaults to the deployment variable vars.zone:
vars:
  zone: us-central1-f

- id: zonal-partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
In the example below, we enable creation in additional zones:
vars:
  zone: us-central1-f

- id: multi-zonal-partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    zones:
    - us-central1-a
    - us-central1-b
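The multi-zone configuration can also be combined with the zone_target_shape input to control how VMs are distributed across the allowed zones. The fragment below is a sketch only: the module id and partition name are hypothetical, and BALANCED is chosen purely for illustration. Leaving zone_target_shape unset keeps the default of ANY_SINGLE_ZONE.

```yaml
vars:
  zone: us-central1-f

- id: balanced-partition  # hypothetical id
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    partition_name: balanced  # hypothetical name
    # Additional zones allowed alongside vars.zone.
    zones:
    - us-central1-a
    - us-central1-b
    # Spread VMs as evenly as possible across the allowed zones.
    zone_target_shape: BALANCED
```

Because BALANCED intentionally places nodes in more than one zone, the egress cost warning above applies to this configuration.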
The HPC Toolkit team maintains the wrapper around the slurm-on-gcp Terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.
Copyright 2022 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Name | Version |
---|---|
terraform | >= 0.13.0 |
google | >= 3.83 |
Name | Version |
---|---|
google | >= 3.83 |
Name | Source | Version |
---|---|---|
slurm_partition | github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_partition | 5.10.6 |
Name | Type |
---|---|
google_compute_zones.available | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
deployment_name | Name of the deployment. | string | n/a | yes |
enable_placement | Enable placement groups. | bool | true | no |
enable_reconfigure | Enables automatic Slurm reconfiguration when the Slurm configuration changes (e.g. slurm.conf.tpl, partition details). Compute instances and resource policies (e.g. placement groups) will be destroyed to align with the new configuration. NOTE: Requires Python and the Google Pub/Sub API. WARNING: Toggling this will impact the running workload. Deployed compute nodes will be destroyed and their jobs will be requeued. | bool | false | no |
exclusive | Exclusive job access to nodes. | bool | true | no |
is_default | Sets this partition as the default partition by updating the partition_conf. If "Default" is already set in partition_conf, this variable will have no effect. | bool | false | no |
network_storage | An array of network attached storage mounts to be configured on the partition compute nodes. | list(object({ | [] | no |
node_groups | A list of node groups associated with this partition. See schedmd-slurm-gcp-v5-node-group for more information on defining a node group in a blueprint. | list(object({ | [] | no |
partition_conf | Slurm partition configuration as a map. See https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION | map(string) | {} | no |
partition_name | The name of the slurm partition. | string | n/a | yes |
partition_startup_scripts_timeout | The timeout (seconds) applied to the partition startup script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
project_id | Project in which the HPC deployment will be created. | string | n/a | yes |
region | The default region for Cloud resources. | string | n/a | yes |
slurm_cluster_name | Cluster name, used for resource naming and slurm accounting. If not provided it will default to the first 8 characters of the deployment name (removing any invalid characters). | string | null | no |
startup_script | Startup script that will be used by the partition VMs. | string | "" | no |
subnetwork_project | The project the subnetwork belongs to. | string | "" | no |
subnetwork_self_link | Subnet to deploy to. | string | null | no |
zone | Zone in which to create compute VMs. Additional zones in the same region can be specified in var.zones. | string | n/a | yes |
zone_target_shape | Strategy for distributing VMs across zones in a region. ANY: GCE picks zones for creating VM instances to fulfill the requested number of VMs within present resource constraints and to maximize utilization of unused zonal reservations. ANY_SINGLE_ZONE (default): GCE always selects a single zone for all the VMs, optimizing for resource quotas, available reservations and general capacity. BALANCED: GCE prioritizes acquisition of resources, scheduling VMs in zones where resources are available while distributing VMs as evenly as possible across allowed zones to minimize the impact of zonal failure. | string | "ANY_SINGLE_ZONE" | no |
zones | Additional zones in which to allow creation of partition nodes. Google Cloud will find a zone based on availability, quota and reservations. | set(string) | [] | no |
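To illustrate how several of the optional inputs above fit together, the fragment below sketches a small default partition for short jobs. All values are illustrative assumptions rather than recommendations: the module id and partition name are hypothetical, network1 and node_group_1 are assumed to be defined elsewhere in the blueprint, and MaxTime is a standard Slurm partition parameter passed through partition_conf as a string.

```yaml
- id: debug_partition  # hypothetical id
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - node_group_1
  settings:
    partition_name: debug  # hypothetical name
    is_default: true       # has no effect if "Default" is set in partition_conf
    exclusive: false       # allow jobs to share nodes
    enable_placement: false
    # Allow startup scripts up to 10 minutes (default is 300 seconds).
    partition_startup_scripts_timeout: 600
    partition_conf:
      MaxTime: "60"        # Slurm time limit in minutes; map values are strings
```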
Name | Description |
---|---|
partition | Details of a slurm partition |