Merge pull request #595 from yandthj/kestrel_arbiter
Kestrel arbiter
yandthj authored Mar 12, 2024
2 parents 7fb6e54 + 0e4e0e2 commit e2b11fb
Showing 4 changed files with 45 additions and 44 deletions.
4 changes: 3 additions & 1 deletion docs/Documentation/Systems/Kestrel/index.md
@@ -14,7 +14,7 @@ Kestrel is configured to run compute-intensive and parallel computing jobs. It i
Please see the [System Configurations](../index.md) page for more information about hardware, storage, and networking.

!!! note
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2 (January, 2024).
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2.

## Accessing Kestrel
Access to Kestrel requires an NREL HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations.
@@ -36,6 +36,8 @@ If you are an external HPC user, you will need a [One-Time Password Multifactor

For command line access, you may login directly to **kestrel.nrel.gov**. Alternatively, you can connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html).

!!! warning "Login Node Policies"
Kestrel login nodes are shared resources and are therefore subject to process limiting based on usage, to ensure that these resources aren't being [used inappropriately](https://www.nrel.gov/hpc/inappropriate-use-policy.html). Each user is permitted up to 36 cores and 100 GB of RAM at a time; beyond that, the Arbiter monitoring software will begin moderating resource consumption, restricting further processes by the user until usage is reduced to acceptable limits.

## Data Analytics and Visualization (DAV) Nodes

67 changes: 35 additions & 32 deletions docs/Documentation/Systems/eagle_to_kestrel_transition.md
@@ -2,12 +2,12 @@
title: Transitioning from Eagle to Kestrel
---

## Overview of steps
## Overview of Steps

This page is meant to provide all necessary information to transition a project from Eagle to Kestrel. Transitioning a project can be broken down into four steps:
This page is meant to provide all necessary information to transition a project from Eagle to Kestrel. Transitioning a project can be broken down into five steps:

1. Accessing Kestrel
2. Moving your files from Eagle to Kestrel
2. Transferring Data from Eagle to Kestrel
3. Understanding the options for running your software on Kestrel

a. How to check if your software is available as a module on Kestrel
@@ -19,7 +19,7 @@ This page is meant to provide all necessary information to transition a project
4. Submitting your jobs on Kestrel
5. Review performance recommendations if scalability or performance is worse than expected

If you find yourself stuck on any of the above steps, please reach out to [email protected] as soon as possible.
If you find yourself stuck on any of the above steps, please reach out to [[email protected]](mailto:[email protected]) as soon as possible.

## 1. Accessing Kestrel

@@ -37,15 +37,19 @@ ssh <your username>@kestrel.nrel.gov
```
For more detailed information on accessing Kestrel, please see [this page](./Kestrel/index.md).

The filesystem structure of Kestrel is similar to Eagle. When you first log on, you will be in `/home/[your username]`. Your project directory can be found at `/projects/[allocation name]`.
## 2. Transferring Data from Eagle to Kestrel

## 2. Moving your files from Eagle to Kestrel

Please see our page on [transferring files](../Managing_Data/Transferring_Files/index.md) for detailed information. Essentially, you should use the command-line `rsync` tool for small transfers (<100 GB), and Globus for large transfers.
Please see our page on [transferring files](../Managing_Data/Transferring_Files/index.md) for detailed information. Essentially, you should use the command-line `rsync` tool for small transfers (<100 GB), and [Globus](../Managing_Data/Transferring_Files/globus.md) for large transfers.
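
For a small pull from Eagle, a minimal `rsync` sketch is shown below; the Eagle host, allocation name, and paths are placeholders, so substitute your own.

```
# Run from a Kestrel login node to pull a small results directory over from Eagle.
# The Eagle host, allocation name, and paths are placeholders -- substitute your own.
rsync -avP <your username>@<eagle-host>:/projects/<allocation name>/results/ \
      /projects/<allocation name>/results/
```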

See our [Globus page](../Managing_Data/Transferring_Files/globus.md) for instructions on how to use Globus to transfer files between Eagle and Kestrel.
### Filesystems

Reach out to [email protected] if you run into issues while transferring files.
Data storage policies and the filesystem layout on Kestrel are similar to Eagle. Kestrel has a **95 PB** ClusterStor Lustre file system. Unlike on Eagle, the Parallel Filesystem (PFS) consists of a ProjectFS and a ScratchFS, which have different configurations. ScratchFS uses a Lustre file system in a hybrid flash-disk configuration providing a total of **27 PB** of capacity with **354 GB/s** of IOR bandwidth. ProjectFS has **68 PB** of capacity with **200 GB/s** of IOR bandwidth. We advise running jobs out of `/scratch` and moving data to `/projects` for long-term storage. Like on Eagle, `/scratch` will have a 28-day purge policy with no exceptions.

The Home File System (HFS) on Kestrel is part of the ClusterStor used for PFS, providing highly reliable storage for user home directories and NREL-specific software. HFS will provide 1.2 PB of capacity. Snapshots of files on the HFS will be available up to 30 days after change/deletion. `/home` directories have a quota of 50 GB.


Please see the [Kestrel Filesystem page](./Kestrel/filesystems.md) for more information.
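
As a sketch of the run-in-`/scratch`, archive-to-`/projects` pattern described above (the paths and allocation name are placeholders):

```
# Run the job out of /scratch, then copy what you need to keep into /projects
# before the 28-day purge. The paths and allocation name are placeholders.
cd /scratch/$USER/my_run
sbatch submit.sh
# ...once the job has finished:
rsync -av /scratch/$USER/my_run/results/ /projects/<allocation name>/my_run/results/
```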

## 3. Understanding the options for running your software on Kestrel

@@ -55,15 +59,15 @@ If you are used to using your software as an NREL-maintained module on Eagle, fi

`module avail [your software name]`

If nothing shows up, please email [email protected] to get the module set up on Kestrel.
If nothing shows up, please email [[email protected]](mailto:[email protected]) to get the module set up on Kestrel.

If the module exists, then you simply need to `module load [your software name]`, the same as you would do on Eagle.
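
For example, a minimal check-and-load sequence might look like this, where the package name and version are placeholders:

```
# "myapp" and its version are placeholders for your software's module name.
module avail myapp
# If a matching module is listed, load it just as you would on Eagle:
module load myapp/1.2.3
module list          # confirm which modules are now loaded
```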

### How to build your own software on Kestrel

If you need to build your own software on Kestrel, and NOT use an already-existing module, then the steps can be a bit different than on Eagle. For a general software-building procedure, please see our [Libraries How-To](../Development/Libraries/howto.md#summary-of-steps) tutorial.

In general, on Kestrel we recommend using the `PrgEnv-cray` or `PrgEnv-intel` environments to build your code. For detailed descriptions on these environments, see our [environments](./Kestrel/Environments/index.md) page. For a tutorial walkthrough of building a simple code (IMB) within these environments, see our [environments tutorial](./Kestrel/Environments/tutorial.md) page. Note that `PrgEnv-` environments on Kestrel are different than environments on Eagle. Loading a `PrgEnv` loads a number of modules at once that together constitute a consistent environment.
In general, on Kestrel we recommend using the `PrgEnv-cray` or `PrgEnv-intel` environments to build your code. For detailed descriptions on these environments, see our [Environments](./Kestrel/Environments/index.md) page. For a tutorial walkthrough of building a simple code (IMB) within these environments, see our [Environments Tutorial](./Kestrel/Environments/tutorial.md) page. Note that `PrgEnv-` environments on Kestrel are different than environments on Eagle. Loading a `PrgEnv` loads a number of modules at once that together constitute a consistent environment.

!!! danger
OpenMPI currently does not work well on Kestrel, and thus it is **strongly** recommended to NOT use OpenMPI. If you require assistance in building your code with an MPI other than
@@ -72,36 +76,35 @@ In general, on Kestrel we recommend using the `PrgEnv-cray` or `PrgEnv-intel` en
!!! tip
Some MPI codes, especially old legacy scientific software, may be difficult to build with Cray MPICH. In these cases, if it is possible to build the code with Intel MPI or a different MPICH implementation, then Cray MPICH can be utilized at run-time via use of the `cray-mpich-abi` module (note that OpenMPI is *NOT* an implementation of MPICH, and you cannot use the `cray-mpich-abi` if you built with OpenMPI). A detailed example of building with Intel MPI but running with Cray MPICH can be found on our [VASP application page](../Applications/vasp.md).
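
As a minimal sketch of this build workflow, assuming the standard Cray compiler wrappers (`cc`, `CC`, `ftn`) provided by the `PrgEnv-` modules and a placeholder source file:

```
# Build a simple MPI code inside the PrgEnv-cray environment.
# hello_mpi.c is a placeholder source file; use CC for C++ or ftn for Fortran.
module load PrgEnv-cray
cc -o hello_mpi hello_mpi.c
# The Cray wrappers link against Cray MPICH automatically, so no explicit MPI
# include or library flags are typically needed.
```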

## 4. Running your jobs on Kestrel
## 4. Running your Jobs on Kestrel

See our page on submitting jobs on Kestrel [here](./Kestrel/running.md).

Submitting a job on Kestrel works much the same as submitting a job on Eagle. Both systems use the Slurm scheduler. If the application you wish to run can be found under our [Applications tab](../Applications/index.md), then there may be example Kestrel submit scripts on the application page. Otherwise, our [VASP documentation page](../Applications/vasp.md#vasp-on-kestrel) contains a variety of sample submit scripts that you can modify to fit your own purposes.
Like Eagle, Kestrel uses the [Slurm job scheduler](../Slurm/index.md). If the application you need to run can be found under our [Applications tab](../Applications/index.md), then there may be example Kestrel submission scripts on the application page. Otherwise, our [VASP documentation page](../Applications/vasp.md#vasp-on-kestrel) contains a variety of sample submit scripts that you can modify to fit your own purposes.

For information on the Kestrel hardware configuration, see our [Kestrel System Configuration](https://www.nrel.gov/hpc/kestrel-system-configuration.html) page. One key difference from Eagle is that not all of the Kestrel nodes have a local disk. If you need local disk space, you will need to request that in your job submission script with the `--tmp` option. For more detailed information on this, please see [this page](./Kestrel/filesystems.md#node-file-system).

For information on the Kestrel hardware configuration, see our [Kestrel System Configuration](https://www.nrel.gov/hpc/kestrel-system-configuration.html) page.
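
For instance, a job that needs node-local disk might request it with the `--tmp` option mentioned above; the size shown here is only illustrative:

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH --account=<allocation handle>
#SBATCH --tmp=100G          # request ~100 GB of node-local disk; size is illustrative
srun ./my_program           # placeholder for your application's commands
```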

### Shared Partition

Note that each Kestrel standard CPU node contains 104 CPU cores (and 256 GB memory). Some applications or application use-cases may not scale well to this many CPU cores. In these cases, it is recommended to submit your jobs to the shared partition. A job submitted to the shared partition will be charged AUs proportionate to whichever resource you require more of, between CPUs and memory.
Note that each Kestrel standard CPU node contains 104 CPU cores and 256 GB memory. Some applications or application use-cases may not scale well to this many CPU cores. In these cases, it is recommended to submit your jobs to the shared partition. A job submitted to the shared partition will be charged AUs proportionate to whichever resource you require more of, between CPUs and memory.

The following is an example shared-partition submit script using VASP:
The following is an example shared-partition submission script:

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=shared
#SBATCH --tasks=26 #How many cpus you want
#SBATCH --mem-per-cpu=2G #Default is 1 GB/core but 2 GB/core is a good starting place for VASP
#SBATCH --time=2:00:00
#SBATCH --account=<your-account-name>
#SBATCH --job-name=<your-job-name>
module load vasp/<version>
srun vasp_std |& tee out
#SBATCH --nodes=1
#SBATCH --partition=shared
#SBATCH --time=2:00:00
#SBATCH --ntasks=26 # CPUs requested for job
#SBATCH --mem-per-cpu=2000 # Request 2GB per core.
#SBATCH --account=<allocation handle>
cd /scratch/$USER
srun ./my_program # Use your application's commands here
```

For more information on the shared partitions and an example AU-accounting calculation, see [here](./Kestrel/running.md#shared-node-partition).
For more information on the shared partition and an example AU-accounting calculation, see [here](./Kestrel/running.md#shared-node-partition).

## 5. Performance Recommendations

@@ -139,17 +142,17 @@ These environment variables turn off some collective optimizations that we have

Please note that all of these recommendations are subject to change as we continue to improve the system.

## 6. Kestrel release notes
## Kestrel Release Notes

Release notes for Kestrel after major upgrades can be found [here](./Kestrel/kestrel_release_notes.md).

## 7. Resources
## Resources

1. [Accessing Kestrel](./Kestrel/index.md)
2. [Transferring Files between Filesystems on the NREL Network](../Managing_Data/Transferring_Files/index.md)
3. [Using Globus to move data from Eagle to Kestrel](../Managing_Data/Transferring_Files/globus.md)
4. [General software building tutorial](../Development/Libraries/howto.md)
5. [Environments Overview](./Kestrel/Environments/index.md)
6. [Environments tutorial](./Kestrel/Environments/tutorial.md)
6. [Environments Tutorial](./Kestrel/Environments/tutorial.md)

Please reach out to [email protected] for assistance with any topic on this page.
Please reach out to [[email protected]](mailto:[email protected]) for assistance with any topic on this page.
4 changes: 2 additions & 2 deletions docs/Documentation/Systems/index.md
@@ -6,7 +6,7 @@ order: 4
---

# NREL Systems
NREL operates three on-premises systems for computational work.
NREL operates four on-premises systems for computational work.

## System configurations

@@ -24,5 +24,5 @@ NREL operates three on-premises systems for computational work.
| Number of Nodes| 2454 | 2618 | 484 | 133 virtual |

!!! note
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2 (January, 2024).
GPUs are not currently available on Kestrel. 132 nodes with 4x Nvidia H100 GPUs are expected to be installed on Kestrel in FY24 Q2.

14 changes: 5 additions & 9 deletions docs/Documentation/getting_started.md
@@ -37,7 +37,7 @@ Below we've collected answers for many of the most frequently asked questions.
??? note "How can I access NREL HPC systems?"

Begin by [requesting an NREL HPC account](https://www.nrel.gov/hpc/user-accounts.html).
Then, consult our guide on [how to connect to the NREL HPC system](https://www.nrel.gov/hpc/system-connection.html).
Then, consult our guide on [how to connect to NREL HPC systems](https://www.nrel.gov/hpc/system-connection.html).

??? note "What is a one-time password (OTP) token?"

@@ -85,10 +85,7 @@ Below we've collected answers for many of the most frequently asked questions.
??? note "What is proper NREL HPC login node etiquette?"

As mentioned above, login nodes are a shared resource, and are subject to process
limiting based on usage. Each user is permitted up to 8 cores and 100GB of RAM at
a time, after which the Arbiter monitoring software will begin moderating resource
consumption, restricting further processes by the user until usage is reduced to acceptable
limits. If you do computationally intensive work on these systems, it will unfairly
limiting based on usage. If you do computationally intensive work on these systems, it will unfairly
occupy resources and make the system less responsive for other users. Please reserve
your computationally intensive tasks (especially those that will fully utilize CPU
cores) for jobs submitted to compute nodes. Offenders of login node abuse will be
@@ -100,8 +97,7 @@ Below we've collected answers for many of the most frequently asked questions.
System time is a regularly occurring interval of time during which NREL HPC systems
are taken offline for necessary patches, updates, software installations, and anything
else to keep the systems useful, updated, and secure. **You will not be able to access
the system or submit jobs during system times.** System times occur the first Monday
every month. A reminder announcement is sent out prior to every system time detailing
the system or submit jobs during system times.** A reminder announcement is sent out prior to every system time detailing
what changes will take place, and includes an estimate of how long the system time will be.
You can check the [system status page](https://www.nrel.gov/hpc/system-status.html) if you are ever
unsure if an NREL HPC system is currently down for system time.
@@ -115,13 +111,13 @@ Below we've collected answers for many of the most frequently asked questions.
emulator for Windows is known as the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
Other recommended terminal applications include: [Git Bash](https://git-scm.com/downloads), [Git for Windows](https://gitforwindows.org/),
[Cmder](https://cmder.app/), and [MSYS2](https://www.msys2.org/). Note that PuTTY is not a terminal emulator,
it is only an SSH client. The applications listed above implement an <kbd>ssh</kbd> command,
it is only an SSH client. The applications listed above implement an `ssh` command,
which mirrors the functionality of PuTTY.
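
For example, once you have a terminal, connecting to Kestrel looks like this (replace the placeholder with your HPC username):

```
# Replace <your username> with your NREL HPC account name.
ssh <your username>@kestrel.nrel.gov
```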

??? note "What is the secure shell (SSH) protocol?"

Stated briefly, the SSH protocol establishes an encrypted channel to share various
kinds of network traffic. Not to be confused with the <kbd>ssh</kbd> terminal command or
kinds of network traffic. Not to be confused with the `ssh` terminal command or
SSH clients which are applications that implement the SSH protocol in software to
create secure connections to remote systems.

