diff --git a/docs/hardware.md b/docs/hardware.md
index 8a353d9..4b639bd 100644
--- a/docs/hardware.md
+++ b/docs/hardware.md
@@ -27,6 +27,7 @@ Each compute server consists of two [AMD EPYC™](https://www.amd.com/en/pro
 | [AMD MI100](https://www.amd.com/en/products/accelerators/instinct/mi100.html) | 11.5 TFLOPs | 32GB | 1.2 TB/s | 2 X EPYC 7V13 64-core | 512 GB |
 | [AMD MI210](https://www.amd.com/en/products/accelerators/instinct/mi200/mi210.html) | 45.3 TFLOPs | 64GB | 1.6 TB/s | 2 X EPYC 7V13 64-core | 512 GB |
 | [AMD MI250](https://www.amd.com/en/products/accelerators/instinct/mi200/mi250.html) | 45.3 TFLOPs (per GCD) | 64GB (per GCD) | 1.6 TB/s (per GCD) | 2 X EPYC 7763 64-Core | 1.5 TB |
+| [AMD MI300X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html) | 81.7 TFLOPs | 192GB | 5.3 TB/s | 2 X EPYC 9684X 96-Core | 2.3 TB |
 ```
 
 Note that one AMD MI250 accelerator provides two Graphics Compute Dies (GCDs) for which the programmer can use as two separate GPUs.
diff --git a/docs/jobs.md b/docs/jobs.md
index bd94c31..47e94df 100644
--- a/docs/jobs.md
+++ b/docs/jobs.md
@@ -13,10 +13,29 @@ Multiple partitions (or queues) are available for users to choose from and each
 | `mi1008x` | 24 hours | 5 | 0.8X | 8 x MI100 accelerators per node. |
 | `mi2104x` | 24 hours | 16 | 1.0X | 4 x MI210 accelerators per node. |
 | `mi2508x` | 12 hours | 10 | 1.7X | 4 x MI250 accelerators (8 GPUs) per node. |
+| `mi3008x` | 4 hours | 1 | 2.0X | 8 x MI300X accelerators per node. |
+| `mi3008x_long` | 8 hours | 1 | 2.0X | 8 x MI300X accelerators per node. |
 ```
 
 Note that special requests that extend beyond the above queue limits may potentially be accommodated on a case-by-case basis. You must have an active accounting allocation in order to submit jobs and the resource manager will track the combined number of **node** hours consumed by each job and deduct the [total node hours]*[charge multiplier] from your available balance.
+
+## Offload Architecture Options
+
+Since multiple generations of Instinct™ accelerators are available across the cluster, users building their own [HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/) applications should specify the correct target offload architecture during compilation based on the desired GPU type. The following table highlights the offload architecture types and compilation options that map to the available SLURM partitions.
+
+```{table} Table 2: Offload architecture settings for local HIP compilation
+:widths: 25 25 50
+Partition Name | GPU Type   | ROCm Offload Architecture Compile Flag
+---------------|------------|---------------------------------------
+devel          | MI210 x 4  | `--offload-arch=gfx90a`
+mi2104x        | MI210 x 4  | `--offload-arch=gfx90a`
+mi2508x        | MI250 x 8  | `--offload-arch=gfx90a`
+mi3008x        | MI300X x 8 | `--offload-arch=gfx942`
+mi3008x_long   | MI300X x 8 | `--offload-arch=gfx942`
+mi1008x        | MI100 x 8  | `--offload-arch=gfx908`
+```
+
 ## Batch job submission
 
 Example SLURM batch job submission scripts are available on the login node at `/opt/ohpc/pub/examples/slurm`.
 A basic starting job for MPI-based applications is available in this directory named `job.mpi` and is shown below for reference:
@@ -162,6 +181,7 @@ The table below highlights several of the more common user-facing SLURM commands
 | scontrol | view or modify a job configuration |
 ```
 
+(jupyter)=
 ## Jupyter
 
 Users can run Jupyter Notebooks on the HPC Fund compute nodes by making a copy
diff --git a/docs/software.md b/docs/software.md
index 9f8f9cb..62a86a5 100644
--- a/docs/software.md
+++ b/docs/software.md
@@ -30,6 +30,7 @@ The Lmod system provides a flexible mechanism to manage your local software envi
 The `module help` command can also be run locally on the system to get more information on available Lmod options and sub-commands.
 ```
 
+(python-environment)=
 ## Python Environment
 
 A base Python installation is available on the HPC Fund cluster which includes a handful of common packages (e.g., `numpy`, `pandas`). If additional packages are needed, users can customize their environments by installing packages with a user install, creating a Python virtual environment to install packages in, or loading a module for a specific package (e.g., `pytorch`, `tensorflow`). Examples of each method are given below.
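
The new "Offload Architecture Options" section added above asks users to match the `--offload-arch` flag to the GPU type of the target partition. A minimal sketch of what that workflow might look like is shown below; the `rocm` module name and the `vector_add.cpp` source file are illustrative assumptions rather than items documented in the diff, while `job.mpi` refers to the example batch script mentioned in jobs.md.

```bash
# Load a ROCm toolchain (module name is an assumption; check `module avail`
# on the cluster for the versions actually installed).
module load rocm

# Build a HIP source file for the MI300X nodes backing the mi3008x and
# mi3008x_long partitions (gfx942 per Table 2 above).
hipcc --offload-arch=gfx942 -O2 -o vector_add vector_add.cpp

# hipcc accepts the flag multiple times, so a single binary can also carry
# code objects for several generations, e.g. MI210/MI250 (gfx90a) and MI300X:
hipcc --offload-arch=gfx90a --offload-arch=gfx942 -O2 -o vector_add vector_add.cpp

# Submit to the matching partition with the example script from jobs.md.
sbatch -p mi3008x job.mpi
```

A binary built only for one architecture will fail at launch on a different GPU generation because the HIP runtime cannot find a matching code object, so rebuilding per target, or building a multi-architecture binary as sketched above, is the safer default when moving between partitions.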