Update index.md
jdh4 authored Nov 3, 2024
1 parent 0a32b2f commit 6081e91
Showing 1 changed file with 100 additions and 7 deletions.
107 changes: 100 additions & 7 deletions docs/index.md
@@ -2,13 +2,106 @@

Jobstats is a free and open-source job monitoring platform designed for CPU and GPU clusters that use the Slurm workload manager. It was released in 2023 under the GNU GPL v2 license.

- GPU utilization
- accurate CPU memory for multinode jobs
## What are the main benefits of Jobstats over other platforms?

The main advantages of Jobstats are:

- GPU utilization and memory usage for each allocated GPU
- automatically cancel jobs with 0% GPU utilization
- guide user with custom job notes
- Grafana dashboard
- works with Open OnDemand
- accurate CPU memory usage for single and multi-node jobs
- graphical interface for inspecting job metrics versus time
- custom job efficiency emails with job-specific notes
- automated emails to users for instances of underutilization
- periodic reports on usage and efficiency for users and group leaders
- all of the above features work with Open OnDemand jobs

## How does Jobstats work?

Jobstats is composed of data exporters, a Prometheus database, a Grafana visualization interface, and the Slurm database. Measurements made on the compute nodes are stored in the time-series Prometheus database. Job efficiency reports are generated from this data and from Slurm.
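
As a rough illustration of how stored measurements could be read back, the sketch below builds a URL for the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The host name, the metric name `example_gpu_utilization`, and the `jobid` label are placeholders invented for this example. They are not taken from Jobstats, whose exporters define their own metric names.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url, promql, timestamp=None):
    """Build a URL for a Prometheus instant query (GET /api/v1/query).

    base_url and promql are supplied by the caller; nothing here is
    specific to Jobstats.
    """
    params = {"query": promql}
    if timestamp is not None:
        # Optional evaluation time as a Unix timestamp.
        params["time"] = str(timestamp)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical server, metric, and label, used only for illustration.
    url = build_instant_query_url(
        "http://prometheus.example:9090",
        'example_gpu_utilization{jobid="39798795"}',
    )
    print(url)
```

Fetching this URL with any HTTP client returns a JSON document. Building the URL separately keeps the example runnable without a Prometheus server.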

## Which institutions are using Jobstats?

Jobstats is used by these institutions:

- Brown University - Center for Computation and Visualization
- Free University of Berlin - High-Performance Computing
- Princeton University - Computer Science Department
- Princeton University - Research Computing
- Yale University - Center for Research Computing
- and many more

## What does a Jobstats efficiency report look like?

The `jobstats` command generates a job report:

```
$ jobstats 39798795
================================================================================
Slurm Job Statistics
================================================================================
Job ID: 39798795
NetID/Account: aturing/math
Job Name: sys_logic_ordinals
State: COMPLETED
Nodes: 2
CPU Cores: 48
CPU Memory: 256GB (5.3GB per CPU-core)
GPUs: 4
QOS/Partition: della-gpu/gpu
Cluster: della
Start Time: Fri Mar 4, 2022 at 1:56 AM
Run Time: 18:41:56
Time Limit: 4-00:00:00
Overall Utilization
================================================================================
CPU utilization [||||| 10%]
CPU memory usage [||| 6%]
GPU utilization [|||||||||||||||||||||||||||||||||| 68%]
GPU memory usage [||||||||||||||||||||||||||||||||| 66%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%
CPU memory usage per node - used/allocated
della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)
GPU utilization per node
della-i14g2 (GPU 0): 65.7%
della-i14g2 (GPU 1): 64.5%
della-i14g3 (GPU 0): 72.9%
della-i14g3 (GPU 1): 67.5%
GPU memory usage per node - maximum used/total
della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)
Notes
================================================================================
* This job only used 6% of the 256GB of total allocated CPU memory. For
future jobs, please allocate less memory by using a Slurm directive such
as --mem-per-cpu=1G or --mem=10G. This will reduce your queue times and
make the resources available to other users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/memory
* For additional job metrics including metrics plotted against time:
https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)
```
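
The per-node CPU efficiency in the report is labeled "CPU time used/run time", and the denominators shown are consistent with the job run time multiplied by the number of cores on the node. The sketch below reproduces that arithmetic for the `della-i14g2` line. The function names are made up for this example and are not part of the `jobstats` code.

```python
def parse_slurm_duration(text):
    """Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_time_used, run_time, cores):
    """Return CPU efficiency in percent: CPU time used / (run time * cores)."""
    available = parse_slurm_duration(run_time) * cores
    return 100.0 * parse_slurm_duration(cpu_time_used) / available


if __name__ == "__main__":
    # Values copied from the report: 24 cores on della-i14g2, run time 18:41:56.
    print(f"{cpu_efficiency('1-21:41:20', '18:41:56', 24):.1f}%")  # prints 10.2%
```

Here 18:41:56 across 24 cores gives 18-16:46:24 of available CPU time, which matches the denominator printed in the report.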

## Other Job Monitoring Platforms

# Comparison to Other Platforms
Consider these alternatives to Jobstats:

It is most similar to Open XDMod.
- [XDMod (SUPReMM)](https://supremm.xdmod.org/7.0/supremm-architecture.html)
- [LLload](https://dl.acm.org/doi/10.1145/3626203.3670565)
- [TACC Stats](https://tacc.utexas.edu/research/tacc-research/tacc-stats/)
- [REMORA](https://docs.tacc.utexas.edu/software/remora/)
