From 6081e912db4ba895c47e707cbeba0e7afc11d3e3 Mon Sep 17 00:00:00 2001
From: Jonathan Halverson <52128661+jdh4@users.noreply.github.com>
Date: Sun, 3 Nov 2024 11:53:54 -0500
Subject: [PATCH] Update index.md

---
 docs/index.md | 107 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 100 insertions(+), 7 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 61a8125..bf7d453 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,13 +2,106 @@
 
 Jobstats is a free and open-source job monitoring platform designed for CPU and GPU clusters that use the Slurm workload manager. It was released in 2023 under the GNU GPL v2 license.
 
-- GPU utilization
-- accurate CPU memory for multinode jobs
+## What are the main benefits of Jobstats over other platforms?
+
+The main advantages of Jobstats are:
+
+- GPU utilization and memory usage for each allocated GPU
 - automatically cancel jobs with 0% GPU utilization
-- guide user with custom job notes
-- Grafana dashboard
-- works with Open OnDemand
+- accurate CPU memory usage for single and multi-node jobs
+- graphical interface for inspecting job metrics versus time
+- custom job efficiency emails with job-specific notes
+- automated emails to users for instances of underutilization
+- periodic reports on usage and efficiency for users and group leaders
+- all of the above features work with Open OnDemand jobs
+
+## How does Jobstats work?
+
+Jobstats is composed of data exporters, a Prometheus database, a Grafana visualization interface, and the Slurm database. Measurements made on the compute nodes are stored in the time-series Prometheus database. Job efficiency reports are generated from this data and the Slurm database.
+
+## Which institutions are using Jobstats?
+
+Jobstats is used by these institutions:
+
+- Brown University - Center for Computation and Visualization
+- Free University of Berlin - High-Performance Computing
+- Princeton University - Computer Science Department
+- Princeton University - Research Computing
+- Yale University - Center for Research Computing
+- and many more
+
+## What does a Jobstats efficiency report look like?
+
+The `jobstats` command generates a job report:
+
+```
+$ jobstats 39798795
+
+================================================================================
+                              Slurm Job Statistics
+================================================================================
+         Job ID: 39798795
+  NetID/Account: aturing/math
+       Job Name: sys_logic_ordinals
+          State: COMPLETED
+          Nodes: 2
+      CPU Cores: 48
+     CPU Memory: 256GB (5.3GB per CPU-core)
+           GPUs: 4
+  QOS/Partition: della-gpu/gpu
+        Cluster: della
+     Start Time: Fri Mar 4, 2022 at 1:56 AM
+       Run Time: 18:41:56
+     Time Limit: 4-00:00:00
+
+                              Overall Utilization
+================================================================================
+  CPU utilization  [|||||                                          10%]
+  CPU memory usage [|||                                             6%]
+  GPU utilization  [||||||||||||||||||||||||||||||||||             68%]
+  GPU memory usage [|||||||||||||||||||||||||||||||||              66%]
+
+                              Detailed Utilization
+================================================================================
+  CPU utilization per node (CPU time used/run time)
+      della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
+      della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
+  Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%
+
+  CPU memory usage per node - used/allocated
+      della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
+      della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
+  Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)
+
+  GPU utilization per node
+      della-i14g2 (GPU 0): 65.7%
+      della-i14g2 (GPU 1): 64.5%
+      della-i14g3 (GPU 0): 72.9%
+      della-i14g3 (GPU 1): 67.5%
+
+  GPU memory usage per node - maximum used/total
+      della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
+      della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
+      della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
+      della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)
+
+                                     Notes
+================================================================================
+  * This job only used 6% of the 256GB of total allocated CPU memory. For
+    future jobs, please allocate less memory by using a Slurm directive such
+    as --mem-per-cpu=1G or --mem=10G. This will reduce your queue times and
+    make the resources available to other users. For more info:
+      https://researchcomputing.princeton.edu/support/knowledge-base/memory
+
+  * For additional job metrics including metrics plotted against time:
+      https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)
+```
+
+## Other Job Monitoring Platforms
 
-# Comparison to Other Platforms
+Consider these alternatives to Jobstats:
 
-It is most similar to Open XDMod.
+- [XDMod (SUPReMM)](https://supremm.xdmod.org/7.0/supremm-architecture.html)
+- [LLload](https://dl.acm.org/doi/10.1145/3626203.3670565)
+- [TACC Stats](https://tacc.utexas.edu/research/tacc-research/tacc-stats/)
+- [REMORA](https://docs.tacc.utexas.edu/software/remora/)
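
The per-node CPU efficiency values in the sample report can be checked by hand. The sketch below is not taken from the Jobstats source; it assumes that the denominator in the "CPU time used/run time" lines is the job run time multiplied by the number of cores allocated on the node, which is consistent with the numbers shown (18:41:56 × 24 cores = 18-16:46:24). The helper names `slurm_time_to_seconds` and `cpu_efficiency` are hypothetical.

```python
def slurm_time_to_seconds(text: str) -> int:
    """Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds."""
    # rpartition yields an empty day field when no "D-" prefix is present.
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time_used: str, run_time: str, cores: int) -> float:
    """Percentage of the available core-seconds that were actually used.

    Assumption: available time = wall-clock run time x allocated cores.
    """
    available = slurm_time_to_seconds(run_time) * cores
    return 100.0 * slurm_time_to_seconds(cpu_time_used) / available


if __name__ == "__main__":
    # Values copied from the sample report (24 cores per node).
    for node, used in [("della-i14g2", "1-21:41:20"), ("della-i14g3", "1-18:48:55")]:
        print(f"{node}: {cpu_efficiency(used, '18:41:56', 24):.1f}%")
```

With the report's values this gives 10.2% and 9.5% for the two nodes, matching the "efficiency=" fields above.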