Merge pull request #595 from yandthj/kestrel_arbiter
Kestrel arbiter
yandthj authored Mar 12, 2024
2 parents 7fb6e54 + 0e4e0e2 commit e2b11fb
Showing 4 changed files with 45 additions and 44 deletions.
4 changes: 3 additions & 1 deletion docs/Documentation/Systems/Kestrel/index.md
@@ -14,7 +14,7 @@ Kestrel is configured to run compute-intensive and parallel computing jobs. It i
Please see the [System Configurations](../index.md) page for more information about hardware, storage, and networking.

!!! note
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2 (January, 2024).
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2.

## Accessing Kestrel
Access to Kestrel requires an NREL HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations.
@@ -36,6 +36,8 @@ If you are an external HPC user, you will need a [One-Time Password Multifactor

For command line access, you may login directly to **kestrel.nrel.gov**. Alternatively, you can connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html).

!!! warning "Login Node Policies"
Kestrel login nodes are shared resources and are therefore subject to process limiting based on usage, to ensure that these resources aren't being [used inappropriately](https://www.nrel.gov/hpc/inappropriate-use-policy.html). Each user is permitted up to 36 cores and 100 GB of RAM at a time; beyond that, the Arbiter monitoring software will begin moderating resource consumption, restricting further processes by the user until usage is reduced to acceptable limits.

## Data Analytics and Visualization (DAV) Nodes

67 changes: 35 additions & 32 deletions docs/Documentation/Systems/eagle_to_kestrel_transition.md
@@ -2,12 +2,12 @@
title: Transitioning from Eagle to Kestrel
---

## Overview of steps
## Overview of Steps

This page is meant to provide all necessary information to transition a project from Eagle to Kestrel. Transitioning a project can be broken down into four steps:
This page is meant to provide all necessary information to transition a project from Eagle to Kestrel. Transitioning a project can be broken down into five steps:

1. Accessing Kestrel
2. Moving your files from Eagle to Kestrel
2. Transferring Data from Eagle to Kestrel
3. Understanding the options for running your software on Kestrel

a. How to check if your software is available as a module on Kestrel
@@ -19,7 +19,7 @@ This page is meant to provide all necessary information to transition a project
4. Submitting your jobs on Kestrel
5. Review performance recommendations if scalability or performance is worse than expected

If you find yourself stuck on any of the above steps, please reach out to [email protected] as soon as possible.
If you find yourself stuck on any of the above steps, please reach out to [[email protected]](mailto:[email protected]) as soon as possible.

## 1. Accessing Kestrel

@@ -37,15 +37,19 @@ ssh <your username>@kestrel.nrel.gov
```
For more detailed information on accessing Kestrel, please see [this page](./Kestrel/index.md).

The filesystem structure of Kestrel is similar to Eagle. When you first log on, you will be in `/home/[your username]`. Your project directory can be found at `/projects/[allocation name]`.
## 2. Transferring Data from Eagle to Kestrel

## 2. Moving your files from Eagle to Kestrel

Please see our page on [transferring files](../Managing_Data/Transferring_Files/index.md) for detailed information. Essentially, you should use the command-line `rsync` tool for small transfers (<100 GB), and Globus for large transfers.
Please see our page on [transferring files](../Managing_Data/Transferring_Files/index.md) for detailed information. Essentially, you should use the command-line `rsync` tool for small transfers (<100 GB), and [Globus](../Managing_Data/Transferring_Files/globus.md) for large transfers.
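
For a small pull from Eagle, a minimal `rsync` sketch is shown below; the Eagle host, allocation name, and paths are placeholders, so substitute your own.

```
# Run from a Kestrel login node to pull a small results directory over from Eagle.
# The Eagle host, allocation name, and paths are placeholders -- substitute your own.
rsync -avP <your username>@<eagle-host>:/projects/<allocation name>/results/ \
      /projects/<allocation name>/results/
```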

See our [Globus page](../Managing_Data/Transferring_Files/globus.md) for instructions on how to use Globus to transfer files between Eagle and Kestrel.
### Filesystems

Reach out to [email protected] if you run into issues while transferring files.
Data storage policies and the filesystem layout on Kestrel are similar to Eagle. Kestrel has a **95 PB** ClusterStor Lustre file system. Unlike on Eagle, the Parallel Filesystem (PFS) consists of a ProjectFS and a ScratchFS, which have different configurations. ScratchFS uses a Lustre file system in a hybrid flash-disk configuration providing a total of **27 PB** of capacity with **354 GB/s** of IOR bandwidth. ProjectFS has **68 PB** of capacity with **200 GB/s** of IOR bandwidth. We advise running jobs out of `/scratch` and moving data to `/projects` for long-term storage. Like on Eagle, `/scratch` will have a 28-day purge policy with no exceptions.

The Home File System (HFS) on Kestrel is part of the ClusterStor used for PFS, providing highly reliable storage for user home directories and NREL-specific software. HFS will provide 1.2 PB of capacity. Snapshots of files on the HFS will be available up to 30 days after change/deletion. `/home` directories have a quota of 50 GB.


Please see the [Kestrel Filesystem page](./Kestrel/filesystems.md) for more information.
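
As a sketch of the run-in-`/scratch`, archive-to-`/projects` pattern described above (the paths and allocation name are placeholders):

```
# Run the job out of /scratch, then copy what you need to keep into /projects
# before the 28-day purge. The paths and allocation name are placeholders.
cd /scratch/$USER/my_run
sbatch submit.sh
# ...once the job has finished:
rsync -av /scratch/$USER/my_run/results/ /projects/<allocation name>/my_run/results/
```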

## 3. Understanding the options for running your software on Kestrel

@@ -55,15 +59,15 @@ If you are used to using your software as an NREL-maintained module on Eagle, fi

`module avail [your software name]`

If nothing shows up, please email [email protected] to get the module set up on Kestrel.
If nothing shows up, please email [[email protected]](mailto:[email protected]) to get the module set up on Kestrel.

If the module exists, then you simply need to `module load [your software name]`, the same as you would do on Eagle.
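
For example, a minimal check-and-load sequence might look like this, where the package name and version are placeholders:

```
# "myapp" and its version are placeholders for your software's module name.
module avail myapp
# If a matching module is listed, load it just as you would on Eagle:
module load myapp/1.2.3
module list          # confirm which modules are now loaded
```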

### How to build your own software on Kestrel

If you need to build your own software on Kestrel, and NOT use an already-existing module, then the steps can be a bit different than on Eagle. For a general software-building procedure, please see our [Libraries How-To](../Development/Libraries/howto.md#summary-of-steps) tutorial.

In general, on Kestrel we recommend using the `PrgEnv-cray` or `PrgEnv-intel` environments to build your code. For detailed descriptions on these environments, see our [environments](./Kestrel/Environments/index.md) page. For a tutorial walkthrough of building a simple code (IMB) within these environments, see our [environments tutorial](./Kestrel/Environments/tutorial.md) page. Note that `PrgEnv-` environments on Kestrel are different than environments on Eagle. Loading a `PrgEnv` loads a number of modules at once that together constitute a consistent environment.
In general, on Kestrel we recommend using the `PrgEnv-cray` or `PrgEnv-intel` environments to build your code. For detailed descriptions on these environments, see our [Environments](./Kestrel/Environments/index.md) page. For a tutorial walkthrough of building a simple code (IMB) within these environments, see our [Environments Tutorial](./Kestrel/Environments/tutorial.md) page. Note that `PrgEnv-` environments on Kestrel are different than environments on Eagle. Loading a `PrgEnv` loads a number of modules at once that together constitute a consistent environment.

!!! danger
OpenMPI currently does not work well on Kestrel, and thus it is **strongly** recommended to NOT use OpenMPI. If you require assistance in building your code with an MPI other than
@@ -72,36 +76,35 @@ In general, on Kestrel we recommend using the `PrgEnv-cray` or `PrgEnv-intel` en
!!! tip
Some MPI codes, especially old legacy scientific software, may be difficult to build with Cray MPICH. In these cases, if it is possible to build the code with Intel MPI or a different MPICH implementation, then Cray MPICH can be utilized at run-time via use of the `cray-mpich-abi` module (note that OpenMPI is *NOT* an implementation of MPICH, and you cannot use the `cray-mpich-abi` if you built with OpenMPI). A detailed example of building with Intel MPI but running with Cray MPICH can be found on our [VASP application page](../Applications/vasp.md).
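
As a minimal sketch of this build workflow, assuming the standard Cray compiler wrappers (`cc`, `CC`, `ftn`) provided by the `PrgEnv-` modules and a placeholder source file:

```
# Build a simple MPI code inside the PrgEnv-cray environment.
# hello_mpi.c is a placeholder source file; use CC for C++ or ftn for Fortran.
module load PrgEnv-cray
cc -o hello_mpi hello_mpi.c
# The Cray wrappers link against Cray MPICH automatically, so no explicit MPI
# include or library flags are typically needed.
```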

## 4. Running your jobs on Kestrel
## 4. Running your Jobs on Kestrel

See our page on submitting jobs on Kestrel [here](./Kestrel/running.md).

Submitting a job on Kestrel works much the same as submitting a job on Eagle. Both systems use the Slurm scheduler. If the application you wish to run can be found under our [Applications tab](../Applications/index.md), then there may be example Kestrel submit scripts on the application page. Otherwise, our [VASP documentation page](../Applications/vasp.md#vasp-on-kestrel) contains a variety of sample submit scripts that you can modify to fit your own purposes.
Like Eagle, Kestrel uses the [Slurm job scheduler](../Slurm/index.md). If the application you need to run can be found under our [Applications tab](../Applications/index.md), then there may be example Kestrel submission scripts on the application page. Otherwise, our [VASP documentation page](../Applications/vasp.md#vasp-on-kestrel) contains a variety of sample submit scripts that you can modify to fit your own purposes.

For information on the Kestrel hardware configuration, see our [Kestrel System Configuration](https://www.nrel.gov/hpc/kestrel-system-configuration.html) page. One key difference from Eagle is that not all of the Kestrel nodes have a local disk. If you need local disk space, you will need to request that in your job submission script with the `--tmp` option. For more detailed information on this, please see [this page](./Kestrel/filesystems.md#node-file-system).

For information on the Kestrel hardware configuration, see our [Kestrel System Configuration](https://www.nrel.gov/hpc/kestrel-system-configuration.html) page.
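
For instance, a job that needs node-local disk might request it with the `--tmp` option mentioned above; the size shown here is only illustrative:

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH --account=<allocation handle>
#SBATCH --tmp=100G          # request ~100 GB of node-local disk; size is illustrative
srun ./my_program           # placeholder for your application's commands
```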

### Shared Partition

Note that each Kestrel standard CPU node contains 104 CPU cores (and 256 GB memory). Some applications or application use-cases may not scale well to this many CPU cores. In these cases, it is recommended to submit your jobs to the shared partition. A job submitted to the shared partition will be charged AUs proportionate to whichever resource you require more of, between CPUs and memory.
Note that each Kestrel standard CPU node contains 104 CPU cores and 256 GB memory. Some applications or application use-cases may not scale well to this many CPU cores. In these cases, it is recommended to submit your jobs to the shared partition. A job submitted to the shared partition will be charged AUs proportionate to whichever resource you require more of, between CPUs and memory.

The following is an example shared-partition submit script using VASP:
The following is an example shared-partition submission script:

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=shared
#SBATCH --tasks=26 #How many cpus you want
#SBATCH --mem-per-cpu=2G #Default is 1 GB/core but 2 GB/core is a good starting place for VASP
#SBATCH --time=2:00:00
#SBATCH --account=<your-account-name>
#SBATCH --job-name=<your-job-name>
module load vasp/<version>
srun vasp_std |& tee out
#SBATCH --nodes=1
#SBATCH --partition=shared
#SBATCH --time=2:00:00
#SBATCH --ntasks=26 # CPUs requested for job
#SBATCH --mem-per-cpu=2000 # Request 2GB per core.
#SBATCH --account=<allocation handle>
cd /scratch/$USER
srun ./my_program # Use your application's commands here
```

For more information on the shared partitions and an example AU-accounting calculation, see [here](./Kestrel/running.md#shared-node-partition).
For more information on the shared partition and an example AU-accounting calculation, see [here](./Kestrel/running.md#shared-node-partition).

## 5. Performance Recommendations

@@ -139,17 +142,17 @@ These environment variables turn off some collective optimizations that we have

Please note that all of these recommendations are subject to change as we continue to improve the system.

## 6. Kestrel release notes
## Kestrel Release Notes

Release notes for Kestrel after major upgrades can be found [here](./Kestrel/kestrel_release_notes.md).

## 7. Resources
## Resources

1. [Accessing Kestrel](./Kestrel/index.md)
2. [Transferring Files between Filesystems on the NREL Network](../Managing_Data/Transferring_Files/index.md)
3. [Using Globus to move data from Eagle to Kestrel](../Managing_Data/Transferring_Files/globus.md)
4. [General software building tutorial](../Development/Libraries/howto.md)
5. [Environments Overview](./Kestrel/Environments/index.md)
6. [Environments tutorial](./Kestrel/Environments/tutorial.md)
6. [Environments Tutorial](./Kestrel/Environments/tutorial.md)

Please reach out to [email protected] for assistance with any topic on this page.
Please reach out to [[email protected]](mailto:[email protected]) for assistance with any topic on this page.
4 changes: 2 additions & 2 deletions docs/Documentation/Systems/index.md
@@ -6,7 +6,7 @@ order: 4
---

# NREL Systems
NREL operates three on-premises systems for computational work.
NREL operates four on-premises systems for computational work.

## System configurations

@@ -24,5 +24,5 @@ NREL operates three on-premises systems for computational work.
| Number of Nodes| 2454 | 2618 | 484 | 133 virtual |

!!! note
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2 (January, 2024).
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2.

14 changes: 5 additions & 9 deletions docs/Documentation/getting_started.md
@@ -37,7 +37,7 @@ Below we've collected answers for many of the most frequently asked questions.
??? note "How can I access NREL HPC systems?"

Begin by [requesting an NREL HPC account](https://www.nrel.gov/hpc/user-accounts.html).
Then, consult our guide on [how to connect to the NREL HPC system](https://www.nrel.gov/hpc/system-connection.html).
Then, consult our guide on [how to connect to NREL HPC systems](https://www.nrel.gov/hpc/system-connection.html).

??? note "What is a one-time password (OTP) token?"

@@ -85,10 +85,7 @@ Below we've collected answers for many of the most frequently asked questions.
??? note "What is proper NREL HPC login node etiquette?"

As mentioned above, login nodes are a shared resource, and are subject to process
limiting based on usage. Each user is permitted up to 8 cores and 100GB of RAM at
a time, after which the Arbiter monitoring software will begin moderating resource
consumption, restricting further processes by the user until usage is reduced to acceptable
limits. If you do computationally intensive work on these systems, it will unfairly
limiting based on usage. If you do computationally intensive work on these systems, it will unfairly
occupy resources and make the system less responsive for other users. Please reserve
your computationally intensive tasks (especially those that will fully utilize CPU
cores) for jobs submitted to compute nodes. Offenders of login node abuse will be
@@ -100,8 +97,7 @@ Below we've collected answers for many of the most frequently asked questions.
System time is a regularly occurring interval of time during which NREL HPC systems
are taken offline for necessary patches, updates, software installations, and anything
else to keep the systems useful, updated, and secure. **You will not be able to access
the system or submit jobs during system times.** System times occur the first Monday
every month. A reminder announcement is sent out prior to every system time detailing
the system or submit jobs during system times.** A reminder announcement is sent out prior to every system time detailing
what changes will take place, and includes an estimate of how long the system time will be.
You can check the [system status page](https://www.nrel.gov/hpc/system-status.html) if you are ever
unsure if an NREL HPC system is currently down for system time.
@@ -115,13 +111,13 @@ Below we've collected answers for many of the most frequently asked questions.
emulator for Windows is known as the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
Other recommended terminal applications include: [Git Bash](https://git-scm.com/downloads), [Git for Windows](https://gitforwindows.org/),
[Cmder](https://cmder.app/), and [MSYS2](https://www.msys2.org/). Note that PuTTY is not a terminal emulator,
it is only an SSH client. The applications listed above implement an <kbd>ssh</kbd> command,
it is only an SSH client. The applications listed above implement an `ssh` command,
which mirrors the functionality of PuTTY.
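
For example, once you have a terminal, connecting to Kestrel looks like this (replace the placeholder with your HPC username):

```
# Replace <your username> with your NREL HPC account name.
ssh <your username>@kestrel.nrel.gov
```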

??? note "What is the secure shell (SSH) protocol?"

Stated briefly, the SSH protocol establishes an encrypted channel to share various
kinds of network traffic. Not to be confused with the <kbd>ssh</kbd> terminal command or
kinds of network traffic. Not to be confused with the `ssh` terminal command or
SSH clients which are applications that implement the SSH protocol in software to
create secure connections to remote systems.

