zerosum

Utility for monitoring process, thread, OS and HW resources, including GPU utilization.

Current CI status on develop branch: CircleCI

Inspired by Tom Pappatheodore's Hello jsrun code for testing layout of Summit resources, and further inspired by Dagstuhl seminar 23171: "Driving HPC Operations With Holistic Monitoring and Operational Data Analytics"

HUST 2023 Publication - 11th International Workshop on HPC User Support Tools @SC23, 2023

HUST 2023 Presentation - 11th International Workshop on HPC User Support Tools @SC23, 2023

Overview

ZeroSum monitors OS threads, OpenMP threads, MPI processes, and the hardware assigned to them, including CPUs, memory usage, and GPU utilization. Supported platforms include all Linux operating systems as well as NVIDIA (CUDA/NVML), AMD (HIP/ROCm-SMI), and Intel (SYCL) GPUs. Host-side monitoring happens through the virtual /proc filesystem, so it should be portable to any Linux system.

Build instructions

Configure and build with CMake. See the examples in the various build-*.sh scripts; some systems have their own scripts (for example, build-frontier.sh).

For a basic installation with CPU-only support, you would do (for example):

cmake -B ${builddir} \
    -DCMAKE_CXX_COMPILER=`which g++` \
    -DCMAKE_C_COMPILER=`which gcc` \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=${instdir}
cmake --build ${builddir}
cmake --install ${builddir}

For additional support options, add any of the following; a combined configure example follows the list:

  • NVIDIA CUDA (NVML): -DZeroSum_WITH_CUDA=TRUE and possibly -DCUDAToolkit_ROOT=<path to cuda>
  • AMD HIP (ROCm-SMI): -DZeroSum_WITH_HIP=TRUE and possibly -DROCM_PATH=/opt/rocm-${ROCM_COMPILER_VERSION}
  • Intel SYCL: -DZeroSum_WITH_SYCL=TRUE
  • MPI: -DZeroSum_WITH_MPI=TRUE
  • OpenMP: -DZeroSum_WITH_OPENMP=TRUE and, with compilers that support OMPT (NVIDIA, AMD, Clang, Intel), -DZeroSum_WITH_OMPT=TRUE
  • HWLOC: export PKG_CONFIG_PATH=<path to hwloc>/lib/pkgconfig and use -DZeroSum_WITH_HWLOC=TRUE
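
For example, on a system with CUDA, MPI, and OpenMP, the pieces combine into a single configure step like this (a minimal sketch; ${builddir}, ${instdir}, and ${CUDA_ROOT} are placeholders for your own paths, not values taken from this README):

cmake -B ${builddir} \
    -DCMAKE_CXX_COMPILER=`which g++` \
    -DCMAKE_C_COMPILER=`which gcc` \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=${instdir} \
    -DZeroSum_WITH_CUDA=TRUE \
    -DCUDAToolkit_ROOT=${CUDA_ROOT} \
    -DZeroSum_WITH_MPI=TRUE \
    -DZeroSum_WITH_OPENMP=TRUE \
    -DZeroSum_WITH_OMPT=TRUE
cmake --build ${builddir}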

Other Build Dependencies

Support for specific GPU monitoring libraries is assumed to be installed on the machine already. ZeroSum uses the PerfStubs git submodule to allow collected data to be passed on to other performance tools like TAU or APEX; for that reason, a working internet connection is needed at configuration time. PerfStubs can be disabled with the -DZeroSum_WITH_PerfStubs=FALSE CMake flag at configuration time (it is TRUE by default). GPU, HWLOC, MPI, and OpenMP support are also optional but recommended.
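
Because PerfStubs is a git submodule, make sure submodules are initialized when cloning (standard git commands, using the repository URL from this page):

git clone --recursive https://github.com/UO-OACISS/zerosum.git
# or, in an existing clone:
git submodule update --init --recursive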

Sample Output

Sample output from the first MPI rank of an 8-process job on Frontier (see the cores example in job-frontier.sh). A hypothetical launch along those lines is sketched below; the actual output follows it:
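
This sketch assumes ZeroSum is injected by preloading its shared library; the library name and path (libzerosum.so under ${instdir}) and the srun flags are assumptions, not taken from this README. The real command is in job-frontier.sh.

# Hypothetical launch: 8 ranks on one node, 7 cores (14 hardware threads)
# per rank, matching the CPU sets shown in the output below.
srun -N 1 -n 8 -c 14 --gpus-per-node=8 \
    env LD_PRELOAD=${instdir}/lib/libzerosum.so ./my_app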

...
Duration of execution: 12.4312 s

Process Summary:
MPI 000 - PID 23319 - Node frontier00255 - CPUs allowed: [1,2,3,4,5,6,7,65,66,67,68,69,70,71]

LWP (thread) Summary:
LWP 23319: Main,OpenMP - stime:   2.38, utime:  90.69, nv_ctx:     1, ctx:  1537, CPUs allowed: [1,65]
LWP 23324:     ZeroSum - stime:   0.27, utime:   0.18, nv_ctx:     0, ctx:    25, CPUs allowed: [71]
LWP 23332:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     0, ctx:     5, CPUs allowed: [1,65]
LWP 23341:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     0, ctx:     5, CPUs allowed: [2,66]
LWP 23349:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     1, ctx:     4, CPUs allowed: [2,66]
LWP 23355:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     0, ctx:     4, CPUs allowed: [3,67]
LWP 23362:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     1, ctx:     3, CPUs allowed: [3,67]
LWP 23369:      OpenMP - stime:   0.33, utime:  98.08, nv_ctx: 11773, ctx:     2, CPUs allowed: [4,68]
LWP 23377:      OpenMP - stime:   0.33, utime:  98.08, nv_ctx: 11773, ctx:     3, CPUs allowed: [4,68]
LWP 23385:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     1, ctx:     3, CPUs allowed: [5,69]
LWP 23393:      OpenMP - stime:   0.33, utime:  98.92, nv_ctx:     1, ctx:     3, CPUs allowed: [5,69]
LWP 23400:      OpenMP - stime:   0.25, utime:  99.00, nv_ctx:     1, ctx:     3, CPUs allowed: [6,70]
LWP 23408:      OpenMP - stime:   0.25, utime:  99.00, nv_ctx:     0, ctx:     3, CPUs allowed: [6,70]
LWP 23416:      OpenMP - stime:   0.25, utime:  98.42, nv_ctx:    26, ctx:     3, CPUs allowed: [7,71]
LWP 23423:      OpenMP - stime:   0.25, utime:  99.00, nv_ctx:     1, ctx:     3, CPUs allowed: [7,71]
LWP 23453:       Other - stime:   0.00, utime:   0.00, nv_ctx:     0, ctx:     3, CPUs allowed: [1,2,3,4,5,6,7,9,10,11,12,13,14,15,17,18,19,20,21,22,23,25,26,27,28,29,30,31,33,34,35,36,37,38,39,41,42,43,44,45,46,47,49,50,51,52,53,54,55,57,58,59,60,61,62,63,65,66,67,68,69,70,71,73,74,75,76,77,78,79,81,82,83,84,85,86,87,89,90,91,92,93,94,95,97,98,99,100,101,102,103,105,106,107,108,109,110,111,113,114,115,116,117,118,119,121,122,123,124,125,126,127]

Hardware Summary:
CPU 001 - idle:   0.00, system:   0.18, user:  99.64
CPU 002 - idle:   0.00, system:   0.09, user:  99.64
CPU 003 - idle:   0.00, system:   0.09, user:  99.64
CPU 004 - idle:   0.00, system:   0.09, user:  99.64
CPU 005 - idle:   0.00, system:   0.09, user:  99.64
CPU 006 - idle:   0.00, system:   0.09, user:  99.64
CPU 007 - idle:   0.00, system:   0.09, user:  99.64
CPU 065 - idle:   0.00, system:   0.00, user:  99.73
CPU 066 - idle:   0.00, system:   0.09, user:  99.64
CPU 067 - idle:   0.00, system:   0.09, user:  99.64
CPU 068 - idle:   0.00, system:   0.09, user:  99.64
CPU 069 - idle:   0.00, system:   0.09, user:  99.64
CPU 070 - idle:   0.00, system:   0.09, user:  99.64
CPU 071 - idle:   0.00, system:   0.36, user:  99.45

GPU 0 - (metric: min  avg  max)
    Clock Frequency, GLX (MHz): 800.000000 800.000000 800.000000
    Clock Frequency, SOC (MHz): 1090.000000 1090.000000 1090.000000
    Device Busy %: 0.000000 0.000000 0.000000
    Energy Average (J): 0.000000 5.900000 6.000000
    GFX Activity: 0.000000 0.000000 0.000000
    GFX Activity %: 0.000000 0.000000 0.000000
    Memory Activity %: 0.000000 0.000000 0.000000
    Memory Busy %: 0.000000 0.000000 0.000000
    Memory Controller Activity: 0.000000 0.000000 0.000000
    Power Average (W): 84.000000 86.363636 93.000000
    Temperature (C): 34.000000 34.545455 35.000000
    Throttle Status: 0.000000 0.000000 0.000000
    Total GTT Bytes: 539494100992.000000 539494100992.000000 539494100992.000000
    Total VRAM Bytes: 68702699520.000000 68702699520.000000 68702699520.000000
    Total Visible VRAM Bytes: 68702699520.000000 68702699520.000000 68702699520.000000
    UVD|VCN Activity: 0.000000 0.000000 0.000000
    Used GTT Bytes: 11452416.000000 11452416.000000 11452416.000000
    Used VRAM Bytes: 13586432.000000 13586432.000000 13586432.000000
    Used Visible VRAM Bytes: 13586432.000000 13586432.000000 13586432.000000
    Voltage (mV): 818.000000 818.000000 818.000000

In this example, the stime values are time spent in system calls, utime is time spent in user code, nv_ctx is the number of nonvoluntary context switches, ctx is the number of context switches, and CPUs allowed is the list of hardware threads each thread can run on. In the hardware summary, each hardware thread (CPU) assigned to the process is monitored to determine utilization. In the GPU summary, utilization data for the assigned GPU is summarized as min/avg/max over the run.
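
These per-thread values are read from the virtual /proc filesystem, as described in the overview. A minimal shell sketch of equivalent manual queries, assuming a Linux system (the shell's own PID, $$, stands in for a monitored process):

# utime (field 14) and stime (field 15) from /proc/<pid>/stat, in clock ticks;
# this field counting assumes the command name contains no spaces
awk '{print "utime:", $14, " stime:", $15}' /proc/$$/stat

# voluntary and nonvoluntary context switch counts
grep ctxt_switches /proc/$$/status

# hardware threads the process is allowed to run on
grep Cpus_allowed_list /proc/$$/status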

Notes

  • ZeroSum does not pin progress threads from MPI or GPU runtimes; they are designed and expected to "run free". MPI, HIP, and CUDA threads should be allowed to float within their assigned resource set. Future todo: identify the origin of those threads (if possible).
  • ZeroSum spawns a thread to monitor the process, so there is one additional thread, identified with type 'ZeroSum'. It will always report as "running", because it is running whenever the threads are queried; however, it is almost always sleeping. It gets pinned to the last core in the resource set; making that configurable is a future todo.
  • To get a backtrace of each thread, see https://github.com/albertz/openlierox/blob/0.59/src/common/Debug_GetCallstack.cpp. This can be useful to determine the library a thread came from. (Originally a future todo; now done.)
  • On SYCL machines, you have to set ZES_ENABLE_SYSMAN=1 or else device queries will fail (see the example after this list).
  • Other SYCL note: you can theoretically use SYCL on NVIDIA machines (e.g., Polaris), and ZeroSum has been tested to work in such situations. It requires supplying the path to a compiler directory that includes SYCL support (e.g., LLVM). For more details, see the sourceme-polaris-sycl.sh, build-polaris-sycl.sh, and job-polaris-sycl.sh scripts.
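
For the SYCL note above, it is enough to export the variable in the job environment before launching (./my_app is a placeholder application name):

export ZES_ENABLE_SYSMAN=1
./my_app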
