Merge pull request #28 from oracle-quickstart/2.10.3
2.10.3
arnaudfroidmont authored Sep 16, 2023
2 parents 763d350 + 3c2978a commit f0499b7
Showing 105 changed files with 12,846 additions and 9,565 deletions.
52 changes: 52 additions & 0 deletions README.md
@@ -37,6 +37,8 @@ The stack allows various combinations of OS. Here is a list of what has been tested
| OL7 | OL7 |
| OL7 | OL8 |
| OL7 | CentOS7 |
| OL8 | OL8 |
| OL8 | OL7 |
| Ubuntu 20.04 | Ubuntu 20.04 |

When switching to Ubuntu, make sure the username is changed from opc to ubuntu in the ORM for both the bastion and compute nodes.
@@ -358,3 +360,53 @@ You can combine all the options together such as:
validate -n y -p y -g y -e y -cn <cluster name file>


## /opt/oci-hpc/scripts/collect_logs.py
This script collects the NVIDIA bug report, sosreport, and console history logs.

The script needs to be run from the bastion. If a host is not reachable over SSH, only its console history logs are collected.

It requires the following argument:

--hostname <HOSTNAME>

The --compartment-id <COMPARTMENT_ID> argument is optional; by default, the host is assumed to be in the same compartment as the bastion.

HOSTNAME is the node name for which you need the above logs, and COMPARTMENT_ID is the OCID of the compartment where the node resides.

The script gathers all of the above logs and puts them in a node-specific folder under /home/{user}. It prints the folder name as its output.

Assumption: to retrieve the console history logs, the script expects the node name to be present in the /etc/hosts file.
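
You can verify this from the bastion before running the script, for example:

grep compute-permanent-node-467 /etc/hosts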

Examples:

python3 collect_logs.py --hostname compute-permanent-node-467
The nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_06132023191024

python3 collect_logs.py --hostname inst-jxwf6-keen-drake
The nvidia bug report, sosreport, and console history logs for inst-jxwf6-keen-drake are at /home/ubuntu/inst-jxwf6-keen-drake_11112022001138
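
If the node is in a different compartment than the bastion, pass the compartment OCID explicitly (the OCID below is a placeholder):

python3 collect_logs.py --hostname compute-permanent-node-467 --compartment-id ocid1.compartment.oc1..exampleuniqueID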

for x in `cat /home/opc/hostlist` ; do echo $x ; python3 collect_logs.py --hostname $x ; done
compute-permanent-node-467
The nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_11112022011318
compute-permanent-node-787
The nvidia bug report, sosreport, and console history logs for compute-permanent-node-787 are at /home/ubuntu/compute-permanent-node-787_11112022011835

where hostlist has the following contents:
compute-permanent-node-467
compute-permanent-node-787


## Collect RDMA NIC Metrics and Upload to Object Storage

OCI-HPC is deployed in the customer tenancy, so OCI service teams cannot access metrics from these OCI-HPC stack clusters. To overcome this, this release introduces a feature that collects RDMA NIC metrics and uploads them to Object Storage. The Object Storage URL can then be shared with OCI service teams, who can use the metrics for debugging purposes.

To collect RDMA NIC metrics and upload them to Object Storage, follow these steps:

Step 1: Create a PAR (Pre-Authenticated Request).
To create a PAR, select the "Create Object Storage PAR" check-box during Resource Manager stack creation. This check-box is enabled by default; when it is selected, a PAR is created.

Step 2: Run the shell script upload_rdma_nic_metrics.sh to collect the metrics and upload them to Object Storage.
The metrics collection limit and interval can be configured through the config file rdma_metrics_collection_config.conf.
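
The upload_rdma_nic_metrics.sh script handles the upload for you. As a rough sketch, assuming the PAR URL was saved to the PAR_file_for_metrics file created by the stack (the archive name below is illustrative), a PUT against an AnyObjectWrite PAR amounts to:

PAR_URL=$(cat PAR_file_for_metrics)
curl -s -X PUT --data-binary @rdma_nic_metrics.tar.gz "${PAR_URL}rdma_nic_metrics.tar.gz"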
2 changes: 1 addition & 1 deletion autoscaling/tf_init/bastion_update.tf
@@ -16,7 +16,7 @@ resource "local_file" "hosts" {
}

resource "local_file" "inventory" {
depends_on = [oci_core_cluster_network.cluster_network]
depends_on = [oci_core_cluster_network.cluster_network, oci_core_instance.compute_cluster_instances]
content = templatefile("${local.bastion_path}/inventory.tpl", {
bastion_name = var.bastion_name,
bastion_ip = var.bastion_ip,
2 changes: 1 addition & 1 deletion autoscaling/tf_init/cluster-network-configuration.tf
@@ -1,5 +1,5 @@
resource "oci_core_instance_configuration" "cluster-network-instance_configuration" {
count = var.cluster_network ? 1 : 0
count = ( ! var.compute_cluster ) && var.cluster_network ? 1 : 0
depends_on = [oci_core_app_catalog_subscription.mp_image_subscription]
compartment_id = var.targetCompartment
display_name = local.cluster_name
6 changes: 3 additions & 3 deletions autoscaling/tf_init/cluster-network.tf
@@ -1,5 +1,5 @@
resource "oci_core_volume" "nfs-cluster-network-volume" {
count = var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
count = ( ! var.compute_cluster ) && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
availability_domain = var.ad
compartment_id = var.targetCompartment
display_name = "${local.cluster_name}-nfs-volume"
@@ -9,7 +9,7 @@ resource "oci_core_volume" "nfs-cluster-network-volume" {
}

resource "oci_core_volume_attachment" "cluster_network_volume_attachment" {
count = var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
count = ( ! var.compute_cluster ) && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
attachment_type = "iscsi"
volume_id = oci_core_volume.nfs-cluster-network-volume[0].id
instance_id = local.cluster_instances_ids[0]
@@ -18,7 +18,7 @@ resource "oci_core_volume_attachment" "cluster_network_volume_attachment" {
}

resource "oci_core_cluster_network" "cluster_network" {
count = var.cluster_network && var.node_count > 0 ? 1 : 0
count = ( ! var.compute_cluster ) && var.cluster_network && var.node_count > 0 ? 1 : 0
depends_on = [oci_core_app_catalog_subscription.mp_image_subscription, oci_core_subnet.private-subnet, oci_core_subnet.public-subnet]
compartment_id = var.targetCompartment
instance_pools {
13 changes: 13 additions & 0 deletions autoscaling/tf_init/compute-cluster.tf
@@ -0,0 +1,13 @@
resource "oci_core_compute_cluster" "compute_cluster" {
count = var.compute_cluster && var.cluster_network && var.node_count > 0 ? 1 : 0
#Required
availability_domain = var.ad
compartment_id = var.targetCompartment

#Optional
display_name = local.cluster_name
freeform_tags = {
"cluster_name" = local.cluster_name
"parent_cluster" = local.cluster_name
}
}
53 changes: 53 additions & 0 deletions autoscaling/tf_init/compute-nodes.tf
@@ -0,0 +1,53 @@
resource "oci_core_volume" "nfs-compute-cluster-volume" {
count = var.compute_cluster && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
availability_domain = var.ad
compartment_id = var.targetCompartment
display_name = "${local.cluster_name}-nfs-volume"

size_in_gbs = var.cluster_block_volume_size
vpus_per_gb = split(".", var.cluster_block_volume_performance)[0]
}

resource "oci_core_volume_attachment" "compute_cluster_volume_attachment" {
count = var.compute_cluster && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
attachment_type = "iscsi"
volume_id = oci_core_volume.nfs-compute-cluster-volume[0].id
instance_id = oci_core_instance.compute_cluster_instances[0].id
display_name = "${local.cluster_name}-compute-cluster-volume-attachment"
device = "/dev/oracleoci/oraclevdb"
}

resource "oci_core_instance" "compute_cluster_instances" {
count = var.compute_cluster ? var.node_count : 0
depends_on = [oci_core_compute_cluster.compute_cluster]
availability_domain = var.ad
compartment_id = var.targetCompartment
shape = var.cluster_network_shape

agent_config {
is_management_disabled = true
}

display_name = "${local.cluster_name}-node-${var.compute_cluster_start_index+count.index}"

freeform_tags = {
"cluster_name" = local.cluster_name
"parent_cluster" = local.cluster_name
"user" = var.tags
}

metadata = {
ssh_authorized_keys = file("/home/${var.bastion_username}/.ssh/id_rsa.pub")
user_data = base64encode(data.template_file.config.rendered)
}
source_details {
source_id = local.cluster_network_image
source_type = "image"
boot_volume_size_in_gbs = var.boot_volume_size
}
compute_cluster_id = length(var.compute_cluster_id) > 2 ? var.compute_cluster_id : oci_core_compute_cluster.compute_cluster[0].id
create_vnic_details {
subnet_id = local.subnet_id
assign_public_ip = false
}
}
4 changes: 2 additions & 2 deletions autoscaling/tf_init/data.tf
@@ -10,7 +10,7 @@ data "oci_core_services" "services" {
}

data "oci_core_cluster_network_instances" "cluster_network_instances" {
count = var.cluster_network && var.node_count > 0 ? 1 : 0
count = (! var.compute_cluster) && var.cluster_network && var.node_count > 0 ? 1 : 0
cluster_network_id = oci_core_cluster_network.cluster_network[0].id
compartment_id = var.targetCompartment
}
@@ -22,7 +22,7 @@ data "oci_core_instance_pool_instances" "instance_pool_instances" {
}

data "oci_core_instance" "cluster_network_instances" {
count = var.cluster_network && var.node_count > 0 ? var.node_count : 0
count = (! var.compute_cluster) && var.cluster_network && var.node_count > 0 ? var.node_count : 0
instance_id = data.oci_core_cluster_network_instances.cluster_network_instances[0].instances[count.index]["id"]
}

2 changes: 1 addition & 1 deletion autoscaling/tf_init/inventory.tpl
@@ -1,5 +1,5 @@
[bastion]
${bastion_name} ansible_host=${bastion_ip} ansible_user=${bastion_username} role=bastion
${bastion_name} ansible_host=${bastion_ip} ansible_user=${bastion_username} role=bastion ansible_python_interpreter=/usr/bin/python
[slurm_backup]
%{ if backup_name != "" }${backup_name} ansible_host=${backup_ip} ansible_user=${bastion_username} role=bastion%{ endif }
[login]
6 changes: 3 additions & 3 deletions autoscaling/tf_init/locals.tf
@@ -1,13 +1,13 @@
locals {
// display names of instances
cluster_instances_ids = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.id : data.oci_core_instance.instance_pool_instances.*.id
cluster_instances_names = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.display_name : data.oci_core_instance.instance_pool_instances.*.display_name
cluster_instances_ids = var.compute_cluster ? oci_core_instance.compute_cluster_instances.*.id : var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.id : data.oci_core_instance.instance_pool_instances.*.id
cluster_instances_names = var.compute_cluster ? oci_core_instance.compute_cluster_instances.*.display_name : var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.display_name : data.oci_core_instance.instance_pool_instances.*.display_name
image_ocid = var.unsupported ? var.image_ocid : var.image

shape = var.cluster_network ? var.cluster_network_shape : var.instance_pool_shape
instance_pool_ocpus = local.shape == "VM.DenseIO.E4.Flex" ? var.instance_pool_ocpus_denseIO_flex : var.instance_pool_ocpus
// ips of the instances
cluster_instances_ips = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.private_ip : data.oci_core_instance.instance_pool_instances.*.private_ip
cluster_instances_ips = var.compute_cluster ? oci_core_instance.compute_cluster_instances.*.private_ip : var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.private_ip : data.oci_core_instance.instance_pool_instances.*.private_ip

// subnet id derived either from created subnet or existing if specified
subnet_id = var.private_deployment ? var.use_existing_vcn ? var.private_subnet_id : element(concat(oci_core_subnet.private-subnet.*.id, [""]), 1) : var.use_existing_vcn ? var.private_subnet_id : element(concat(oci_core_subnet.private-subnet.*.id, [""]), 0)
2 changes: 1 addition & 1 deletion autoscaling/tf_init/outputs.tf
@@ -8,5 +8,5 @@ output "ocids" {
value = join(",", local.cluster_instances_ids)
}
output "cluster_ocid" {
value = var.cluster_network ? oci_core_cluster_network.cluster_network[0].id : oci_core_instance_pool.instance_pool[0].id
value = var.compute_cluster ? oci_core_compute_cluster.compute_cluster[0].id : var.cluster_network ? oci_core_cluster_network.cluster_network[0].id : oci_core_instance_pool.instance_pool[0].id
}
91 changes: 90 additions & 1 deletion bastion.tf
@@ -17,6 +17,25 @@ resource "oci_core_volume_attachment" "bastion_volume_attachment" {
device = "/dev/oracleoci/oraclevdb"
}

resource "oci_core_volume_backup_policy" "bastion_boot_volume_backup_policy" {
count = var.bastion_boot_volume_backup ? 1 : 0
compartment_id = var.targetCompartment
display_name = "${local.cluster_name}-bastion_boot_volume_daily"
schedules {
backup_type = var.bastion_boot_volume_backup_type
period = var.bastion_boot_volume_backup_period
retention_seconds = var.bastion_boot_volume_backup_retention_seconds
time_zone = var.bastion_boot_volume_backup_time_zone
}
}

resource "oci_core_volume_backup_policy_assignment" "boot_volume_backup_policy" {
count = var.bastion_boot_volume_backup ? 1 : 0
depends_on = [oci_core_volume_backup_policy.bastion_boot_volume_backup_policy]
asset_id = oci_core_instance.bastion.boot_volume_id
policy_id = oci_core_volume_backup_policy.bastion_boot_volume_backup_policy[0].id
}

resource "oci_resourcemanager_private_endpoint" "rms_private_endpoint" {
count = var.private_deployment ? 1 : 0
compartment_id = var.targetCompartment
@@ -26,6 +45,13 @@ resource "oci_resourcemanager_private_endpoint" "rms_private_endpoint" {
subnet_id = local.subnet_id
}

resource "null_resource" "boot_volume_backup_policy" {
depends_on = [oci_core_instance.bastion, oci_core_volume_backup_policy.bastion_boot_volume_backup_policy, oci_core_volume_backup_policy_assignment.boot_volume_backup_policy]
triggers = {
bastion = oci_core_instance.bastion.id
}
}

resource "oci_core_instance" "bastion" {
depends_on = [local.bastion_subnet]
availability_domain = var.bastion_ad
@@ -150,6 +176,16 @@ resource "null_resource" "bastion" {
private_key = tls_private_key.ssh.private_key_pem
}
}
provisioner "file" {
source = "scripts"
destination = "/opt/oci-hpc/"
connection {
host = local.host
type = "ssh"
user = var.bastion_username
private_key = tls_private_key.ssh.private_key_pem
}
}
provisioner "file" {
content = templatefile("${path.module}/configure.tpl", {
configure = var.configure
@@ -175,7 +211,7 @@ }
}
}
resource "null_resource" "cluster" {
depends_on = [null_resource.bastion, null_resource.backup, oci_core_cluster_network.cluster_network, oci_core_instance.bastion, oci_core_volume_attachment.bastion_volume_attachment ]
depends_on = [null_resource.bastion, null_resource.backup, oci_core_compute_cluster.compute_cluster, oci_core_cluster_network.cluster_network, oci_core_instance.bastion, oci_core_volume_attachment.bastion_volume_attachment ]
triggers = {
cluster_instances = join(", ", local.cluster_instances_names)
}
@@ -288,6 +324,7 @@ resource "null_resource" "cluster" {
provisioner "file" {
content = templatefile("${path.module}/queues.conf", {
cluster_network = var.cluster_network,
compute_cluster = var.compute_cluster,
marketplace_listing = var.use_old_marketplace_image ? var.old_marketplace_listing : var.marketplace_listing,
image = local.image_ocid,
use_marketplace_image = var.use_marketplace_image,
@@ -444,3 +481,55 @@ provisioner "file" {
}
}
}

data "oci_objectstorage_namespace" "compartment_namespace" {
compartment_id = var.targetCompartment
}

locals {
rdma_nic_metric_bucket_name = "RDMA_NIC_metrics"
par_path = ".."
}
/*
Save the PAR into the file ../PAR_file_for_metrics.
This PAR is used by the upload_rdma_nic_metrics.sh script to upload NIC metrics to Object Storage.
*/

data "oci_objectstorage_bucket" "RDMA_NIC_Metrics_bucket_check" {
name = local.rdma_nic_metric_bucket_name
namespace = data.oci_objectstorage_namespace.compartment_namespace.namespace
}


resource "oci_objectstorage_bucket" "RDMA_NIC_metrics_bucket" {
count = (var.bastion_object_storage_par && data.oci_objectstorage_bucket.RDMA_NIC_Metrics_bucket_check.bucket_id == null) ? 1 : 0
compartment_id = var.targetCompartment
name = local.rdma_nic_metric_bucket_name
namespace = data.oci_objectstorage_namespace.compartment_namespace.namespace
versioning = "Enabled"
}

resource "oci_objectstorage_preauthrequest" "RDMA_NIC_metrics_par" {
count = (var.bastion_object_storage_par && data.oci_objectstorage_bucket.RDMA_NIC_Metrics_bucket_check.bucket_id == null) ? 1 : 0
depends_on = [oci_objectstorage_bucket.RDMA_NIC_metrics_bucket]
access_type = "AnyObjectWrite"
bucket = local.rdma_nic_metric_bucket_name
name = format("%s-%s", "RDMA_NIC_metrics_bucket", var.tenancy_ocid)
namespace = data.oci_objectstorage_namespace.compartment_namespace.namespace
time_expires = "2030-08-01T00:00:00+00:00"
}


output "RDMA_NIC_metrics_url" {
depends_on = [oci_objectstorage_preauthrequest.RDMA_NIC_metrics_par]
value = (var.bastion_object_storage_par && data.oci_objectstorage_bucket.RDMA_NIC_Metrics_bucket_check.bucket_id == null) ? "https://objectstorage.${var.region}.oraclecloud.com${oci_objectstorage_preauthrequest.RDMA_NIC_metrics_par[0].access_uri}" : ""
}


resource "local_file" "PAR" {
count = (var.bastion_object_storage_par && data.oci_objectstorage_bucket.RDMA_NIC_Metrics_bucket_check.bucket_id == null) ? 1 : 0
depends_on = [oci_objectstorage_preauthrequest.RDMA_NIC_metrics_par]
content = "https://objectstorage.${var.region}.oraclecloud.com${oci_objectstorage_preauthrequest.RDMA_NIC_metrics_par[0].access_uri}"
filename = "${local.par_path}/PAR_file_for_metrics"
}
