This module creates a Slurm cluster on GCP. There are two modes of operation: cloud and hybrid. Cloud mode will create a VM controller. Hybrid mode will generate cloud.conf and cloud_gres.conf files to be included in the on-prem configuration files, while managing a config.yaml file for internal module use.
Partitions define what compute resources are available to the controller so it may allocate jobs. Slurm will resume/create compute instances as needed to run allocated jobs and will suspend/terminate the instances after they are no longer needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistent; they are exempt from being suspended/terminated under normal conditions. Dynamic nodes are burstable; they will scale up and down with workload.
WARNING: Destroying the controller before it has suspended/terminated all static and dynamic node instances and supporting resources (e.g. placement groups, subscription) will leave those resources orphaned unless cleanup options are enabled (e.g. enable_cleanup_compute, enable_cleanup_subscriptions).
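As a sketch of the warning above, the cleanup options can be switched on in the module block so that destroying the cluster also removes node instances and supporting resources. The two option names come from the text above; the remaining arguments mirror the usage example in this README, and the placeholder values must be replaced with your own.

```hcl
module "slurm_cluster" {
  source = "git@github.com:SchedMD/slurm-gcp.git//terraform/slurm_cluster?ref=v5.0.0"

  project_id         = "<PROJECT_ID>"
  slurm_cluster_name = "<SLURM_CLUSTER_NAME>"

  # Clean up compute node instances left behind by the controller.
  enable_cleanup_compute = true

  # Clean up subscriptions created for the cluster.
  enable_cleanup_subscriptions = true

  # ... omitted ...
}
```

Note that enable_cleanup_compute=true brings in the Python and pip requirements listed under the dependencies below.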
See the examples directory for sample usage.
See below for a simple inclusion within your own Terraform project.
module "slurm_cluster" {
  source = "git@github.com:SchedMD/slurm-gcp.git//terraform/slurm_cluster?ref=v5.0.0"

  project_id         = "<PROJECT_ID>"
  slurm_cluster_name = "<SLURM_CLUSTER_NAME>"

  # ... omitted ...
}
NOTE: Because this module is not hosted on Terraform Registry, the version must be strictly controlled via revision syntax on the source line.
Certain software must be installed on the local machine or APIs enabled in GCP for TerraformUser to be able to use this module.
- Terraform is installed.
- GCP Cloud SDK is installed.
- Compute Engine API is enabled.
- Python is installed.
  - Required Version: >= 3.6.0, < 4.0.0
  - Required when any of:
    - enable_hybrid=true
    - enable_cleanup_compute=true
- Pip packages are installed.
  - Required when any of:
    - enable_hybrid=true
    - enable_cleanup_compute=true
  - pip3 install -r ../../scripts/requirements.txt --user
- Private Google Access is enabled.
  - Required when any instances only have internal IPs.
- Secret Manager API is enabled.
  - Required when cloudsql != null.
- BigQuery API is enabled.
  - Required when enable_bigquery_load=true.
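The API requirements above can be met from the console or gcloud, but they can also be managed in Terraform. The following is a minimal sketch, not part of this module, that uses the Google provider's google_project_service resource; it assumes the provider is already configured and that the project ID placeholder is replaced. Drop the entries whose condition does not apply to your deployment.

```hcl
# Sketch only: enables the APIs named in the dependency list.
# Assumes the Google provider is configured with sufficient permissions.
resource "google_project_service" "slurm_apis" {
  for_each = toset([
    "compute.googleapis.com",       # Compute Engine API
    "secretmanager.googleapis.com", # needed when cloudsql != null
    "bigquery.googleapis.com",      # needed when enable_bigquery_load=true
  ])

  project = "<PROJECT_ID>"
  service = each.key

  # Leave the API enabled if this resource is later destroyed,
  # since other workloads in the project may depend on it.
  disable_on_destroy = false
}
```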
TerraformUser authenticates with credentials to Google Cloud. It is recommended to create an IAM principal for this user and grant it the roles below. Optionally, the TerraformUser can operate through a service account.
- Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
- Secret Manager Admin (roles/secretmanager.admin)
  - Required when cloudsql != null.
- Service Account User (roles/iam.serviceAccountUser)
  - Required when TerraformUser is using a service account to authenticate.
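One way to grant the TerraformUser roles listed above is with google_project_iam_member from the Google provider, run by a project administrator. This is a hedged sketch: the member string is a hypothetical placeholder, and the conditional roles should be removed when their condition does not hold.

```hcl
# Sketch only: binds the TerraformUser roles at project level.
# "<TERRAFORM_USER_MEMBER>" is a placeholder such as
# "user:someone@example.com" or "serviceAccount:<SA_EMAIL>".
resource "google_project_iam_member" "terraform_user" {
  for_each = toset([
    "roles/compute.instanceAdmin.v1",
    "roles/secretmanager.admin",    # only when cloudsql != null
    "roles/iam.serviceAccountUser", # only when authenticating via a service account
  ])

  project = "<PROJECT_ID>"
  role    = each.key
  member  = "<TERRAFORM_USER_MEMBER>"
}
```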
Service account intended to be associated with the controller instance template for slurm_controller_instance.
- Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
- Compute Instance Admin (beta) (roles/compute.instanceAdmin)
- Service Account User (roles/iam.serviceAccountUser)
- BigQuery Data Editor (roles/bigquery.dataEditor)
  - Required when enable_bigquery_load=true.
- Cloud SQL Editor (roles/cloudsql.editor)
  - Required when all of:
    - cloudsql != null
    - Communicating to CloudSQL instance
- Logs Writer (roles/logging.logWriter)
  - Recommended.
- Monitoring Metric Writer (roles/monitoring.metricWriter)
  - Recommended.
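The controller roles above could be provisioned alongside a dedicated service account, for example as follows. This is a sketch under stated assumptions: the account_id and resource names are hypothetical, the resources come from the Google provider rather than from this module, and the conditional roles should be trimmed to match your configuration.

```hcl
# Sketch only: a dedicated service account for the controller.
# The account_id "slurm-controller" is a hypothetical name.
resource "google_service_account" "slurm_controller" {
  project      = "<PROJECT_ID>"
  account_id   = "slurm-controller"
  display_name = "Slurm controller"
}

resource "google_project_iam_member" "slurm_controller" {
  for_each = toset([
    "roles/compute.instanceAdmin.v1",
    "roles/compute.instanceAdmin",
    "roles/iam.serviceAccountUser",
    "roles/bigquery.dataEditor",     # only when enable_bigquery_load=true
    "roles/cloudsql.editor",         # only when cloudsql != null and using CloudSQL
    "roles/logging.logWriter",       # recommended
    "roles/monitoring.metricWriter", # recommended
  ])

  project = "<PROJECT_ID>"
  role    = each.key
  member  = "serviceAccount:${google_service_account.slurm_controller.email}"
}
```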
Service account intended to be associated with the compute instance templates created by slurm_partition.
- Logs Writer (roles/logging.logWriter)
  - Recommended.
- Monitoring Metric Writer (roles/monitoring.metricWriter)
  - Recommended.
Service account intended to be associated with the login instance templates created by slurm_partition.
- Logs Writer (roles/logging.logWriter)
  - Recommended.
- Monitoring Metric Writer (roles/monitoring.metricWriter)
  - Recommended.
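Since the compute and login service accounts share the same two recommended roles, both can be created from one small map. The sketch below assumes hypothetical account IDs and uses Google provider resources that are not part of this module.

```hcl
locals {
  # Hypothetical account IDs, keyed by node type.
  node_service_accounts = {
    compute = "slurm-compute"
    login   = "slurm-login"
  }

  # Recommended roles from the lists above.
  recommended_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ]
}

resource "google_service_account" "node" {
  for_each = local.node_service_accounts

  project    = "<PROJECT_ID>"
  account_id = each.value
}

resource "google_project_iam_member" "node" {
  # One binding per (node type, role) pair.
  for_each = {
    for pair in setproduct(keys(local.node_service_accounts), local.recommended_roles) :
    "${pair[0]}-${pair[1]}" => { sa = pair[0], role = pair[1] }
  }

  project = "<PROJECT_ID>"
  role    = each.value.role
  member  = "serviceAccount:${google_service_account.node[each.value.sa].email}"
}
```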
For the terraform module API reference, please see README_TF.md.