From 068961c2a3b9e23c12a96cbfa37b3af4e2ef8a19 Mon Sep 17 00:00:00 2001 From: Karl W Schulz Date: Wed, 3 Jul 2024 14:38:58 -0500 Subject: [PATCH] Updating documentation with name change Signed-off-by: Karl W Schulz --- docs/README | 2 +- docs/conf.py | 2 +- docs/index.md | 2 +- docs/installation.md | 128 +++++++++++++++++++++---------------------- docs/introduction.md | 22 ++++---- 5 files changed, 78 insertions(+), 78 deletions(-) diff --git a/docs/README b/docs/README index 4b68117b..7b239b85 100644 --- a/docs/README +++ b/docs/README @@ -1,4 +1,4 @@ -This subdirectory houses the input markup for Omniwatch documentation using +This subdirectory houses the input markup for Omnistat documentation using Sphinx. You can build a local copy of the documentation in this directory using diff --git a/docs/conf.py b/docs/conf.py index 6bfdabd4..30653981 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -31,7 +31,7 @@ def install(package): # -- Project information ----------------------------------------------------- -project = "Omniwatch" +project = "Omnistat" copyright = "2023-2024, Advanced Micro Devices, Inc. All Rights Reserved" author = "AMD Research" diff --git a/docs/index.md b/docs/index.md index 59cfbb50..6e68c10a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,4 +1,4 @@ -# Welcome to the [Omniwatch](https://github.com/AMDResearch/omniwatch) Documentation! +# Welcome to the [Omnistat](https://github.com/AMDResearch/omnistat) Documentation! ```eval_rst .. toctree:: diff --git a/docs/installation.md b/docs/installation.md index 3b1af614..515b37b7 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -8,10 +8,10 @@ ## System-wide deployment -There are different ways to deploy and install Omniwatch in a data center, and +There are different ways to deploy and install Omnistat in a data center, and each system will generally require a certain level of customization. This -section first describes the basic manual steps to install the Omniwatch client -and server, and then provides an example of how to deploy Omniwatch in a data +section first describes the basic manual steps to install the Omnistat client +and server, and then provides an example of how to deploy Omnistat in a data center using Ansible. ### Node-level deployment (client) @@ -24,72 +24,72 @@ as a package. 1. Clone repository. ``` - $ git clone https://github.com/AMDResearch/omniwatch.git + $ git clone https://github.com/AMDResearch/omnistat.git ``` 2. Install dependencies. ``` - $ cd omniwatch + $ cd omnistat $ pip install --user -r requirements.txt ``` 3. Launch client with `gunicorn`. Needs to be executed from the root - directory of the Omniwatch project. + directory of the Omnistat project. ``` - $ gunicorn -b 0.0.0.0:8000 "omniwatch.node_monitoring:app" + $ gunicorn -b 0.0.0.0:8000 "omnistat.node_monitoring:app" ``` #### Option B. Install package 1. Clone repository. ``` - $ git clone https://github.com/AMDResearch/omniwatch.git + $ git clone https://github.com/AMDResearch/omnistat.git ``` 2. Create a virtual environment, with Python 3.8, 3.9, or 3.10. ``` - $ cd omniwatch - $ python -m venv /opt/omniwatch + $ cd omnistat + $ python -m venv /opt/omnistat ``` -3. Install omniwatch in a virtual environment. The virtual environment can - also be used by sourcing the `./opt/omniwatch/bin/activate` file, and that +3. Install omnistat in a virtual environment. The virtual environment can + also be used by sourcing the `./opt/omnistat/bin/activate` file, and that way there is no need to keep using the complete `./venv/bin` path every time. This guide uses the complete path for clarity. Needs to be - executed from the root directory of the Omniwatch repository. + executed from the root directory of the Omnistat repository. ``` - $ /opt/omniwatch/bin/python -m pip install . + $ /opt/omnistat/bin/python -m pip install . ``` - Alternatively, use the following line to install Omniwatch with the - optional dependencies for the `omniwatch-query` tool. + Alternatively, use the following line to install Omnistat with the + optional dependencies for the `omnistat-query` tool. ``` - $ /opt/omniwatch/bin/python -m pip install .[query] + $ /opt/omnistat/bin/python -m pip install .[query] ``` 4. Launch the client with `gunicorn`. To make sure the installed version of - Omniwatch is being used, this shouldn't be executed from the root directory + Omnistat is being used, this shouldn't be executed from the root directory of the project. ``` - $ /opt/omniwatch/bin/gunicorn -b 0.0.0.0:8000 "omniwatch.node_monitoring:app" + $ /opt/omnistat/bin/gunicorn -b 0.0.0.0:8000 "omnistat.node_monitoring:app" ``` #### Configure client -Launching the Omniwatch client as described above will load the default +Launching the Omnistat client as described above will load the default configuration options. To use a different configuration file, use the -`OMNIWATCH_CONFIG` environment variable. +`OMNISTAT_CONFIG` environment variable. ``` -$ OMNIWATCH_CONFIG=/path/to/config/file gunicorn -b 0.0.0.0:8000 "omniwatch.node_monitoring:app" +$ OMNISTAT_CONFIG=/path/to/config/file gunicorn -b 0.0.0.0:8000 "omnistat.node_monitoring:app" ``` A [sample configuration -file](https://github.com/AMDResearch/omniwatch/blob/main/omniwatch.default) is +file](https://github.com/AMDResearch/omnistat/blob/main/omnistat.default) is available in the respository. #### Check installation As a sanity check, this is the expected output you should see when launching -the Omniwatch client: +the Omnistat client: ``` [2024-06-08 18:50:56 -0400] [5834] [INFO] Starting gunicorn 21.2.0 [2024-06-08 18:50:56 -0400] [5834] [INFO] Listening at: http://0.0.0.0:8000 (5834) @@ -125,17 +125,17 @@ card0_rocm_utilization 0.0 #### Enable systemd service -To run the Omniwatch client permanently on a host, configure the service via +To run the Omnistat client permanently on a host, configure the service via systemd. An [example service -file](https://github.com/AMDResearch/omniwatch/blob/main/omniwatch.service) is +file](https://github.com/AMDResearch/omnistat/blob/main/omnistat.service) is available in the repository, including the following key lines: ``` -Environment="OMNIWATCH_CONFIG=/etc/omniwatch/config" -Environment="OMNIWATCH_PORT=8000" -ExecStart=/opt/omniwatch/bin/gunicorn -b 0.0.0.0:${OMNIWATCH_PORT} "omniwatch.node_monitoring:app" +Environment="OMNISTAT_CONFIG=/etc/omnistat/config" +Environment="OMNISTAT_PORT=8000" +ExecStart=/opt/omnistat/bin/gunicorn -b 0.0.0.0:${OMNISTAT_PORT} "omnistat.node_monitoring:app" ``` -Please set `OMNIWATCH_CONFIG` and `OMNIWATCH_PORT` as needed depending on how -Omniwatch is installed. +Please set `OMNISTAT_CONFIG` and `OMNISTAT_PORT` as needed depending on how +Omnistat is installed. ### Prometheus installation and configuration (server) @@ -159,7 +159,7 @@ On a separate server with access to compute nodes, install and configure which nodes to poll and at what frequency. For example: ``` scrape_configs: - - job_name: "omniwatch" + - job_name: "omnistat" scrape_interval: 30s scrape_timeout: 5s static_configs: @@ -173,64 +173,64 @@ On a separate server with access to compute nodes, install and configure ### Ansible example For a cluster or data center deployment, management tools like Ansible may be -used to install Omniwatch. +used to install Omnistat. -The following Ansible playbook will fetch the Omniwatch repository in each -node, create a virtual environment for Omniwatch under `/opt/omniwatch`, -install a configuration file under `/etc/omniwatch`, and enable Omniwatch as a +The following Ansible playbook will fetch the Omnistat repository in each +node, create a virtual environment for Omnistat under `/opt/omnistat`, +install a configuration file under `/etc/omnistat`, and enable Omnistat as a systemd service. This is only an example and will likely need to be adapted depending on the characteristics and scale of the system. ``` - hosts: all vars: - - omniwatch_url: git@github.com:AMDResearch/omniwatch.git - - omniwatch_tmp: /tmp/omniwatch-install - - omniwatch_dir: /opt/omniwatch + - omnistat_url: git@github.com:AMDResearch/omnistat.git + - omnistat_tmp: /tmp/omnistat-install + - omnistat_dir: /opt/omnistat tasks: - - name: Fetch copy of omniwatch repository for installation + - name: Fetch copy of omnistat repository for installation git: - repo: "{{ omniwatch_url }}" - dest: "{{ omniwatch_tmp }}" + repo: "{{ omnistat_url }}" + dest: "{{ omnistat_tmp }}" version: jorda/python-package single_branch: true - - name: Install omniwatch in virtual environment + - name: Install omnistat in virtual environment pip: - name: "{{ omniwatch_tmp }}[query]" - virtualenv: "{{ omniwatch_dir }}" + name: "{{ omnistat_tmp }}[query]" + virtualenv: "{{ omnistat_dir }}" virtualenv_command: /usr/bin/python3 -m venv - name: Create configuration directory file: - path: /etc/omniwatch + path: /etc/omnistat state: directory mode: "0755" - name: Copy configuration file copy: remote_src: true - src: "{{ omniwatch_tmp }}/omniwatch/config/omniwatch.default" - dest: /etc/omniwatch/config + src: "{{ omnistat_tmp }}/omnistat/config/omnistat.default" + dest: /etc/omnistat/config mode: "0644" - name: Copy service file copy: remote_src: true - src: "{{ omniwatch_tmp }}/omniwatch.service" + src: "{{ omnistat_tmp }}/omnistat.service" dest: /etc/systemd/system mode: "0644" - name: Enable service service: - name: omniwatch + name: omnistat enabled: yes state: started - name: Delete temporary installation files file: - path: "{{ omniwatch_tmp }}" + path: "{{ omnistat_tmp }}" state: absent ``` @@ -238,18 +238,18 @@ depending on the characteristics and scale of the system. ## User-mode execution with SLURM -### Installing Omniwatch +### Installing Omnistat 1. Create a virtual environment in a shared directory, with Python 3.8, 3.9, or 3.10. ``` - $ python -m venv ~/omniwatch + $ python -m venv ~/omnistat ``` -2. From to root directory of the Omniwatch repository, install omniwatch in +2. From to root directory of the Omnistat repository, install omnistat in the virtual environment. ``` - $ ~/omniwatch/bin/python -m pip install .[query] + $ ~/omnistat/bin/python -m pip install .[query] ``` ### Running a SLURM Job @@ -258,34 +258,34 @@ In the SLURM job script, add the following lines to start and stop the data collection before and after running the application. ``` -export OMNIWATCH_CONFIG=~/omniwatch/omniwatch.config +export OMNISTAT_CONFIG=~/omnistat/omnistat.config # Start data collector -~/omniwatch/bin/omniwatch-util --start --interval 1 +~/omnistat/bin/omnistat-util --start --interval 1 # Run application sleep 10 # Stop data collector -~/omniwatch/bin/omniwatch-util --stop +~/omnistat/bin/omnistat-util --stop # Query server to generate job report -~/omniwatch/bin/omniwatch-util --startserver -~/omniwatch/bin/omniwatch-util --job ${SLURM_JOB_ID} -~/omniwatch/bin/omniwatch-util --stopserver +~/omnistat/bin/omnistat-util --startserver +~/omnistat/bin/omnistat-util --job ${SLURM_JOB_ID} +~/omnistat/bin/omnistat-util --stopserver ``` ### Exploring results with a local Docker environment -To explore results generated for user-mode executions of Omniwatch, we provide +To explore results generated for user-mode executions of Omnistat, we provide a Docker environment that will automatically launch the required services locally. That includes Prometheus to read and query the stored data, and Grafana as visualization platform to display time series and other metrics. To explore results: -1. Copy Prometheus data collected with Omniwatch to `./prometheus-data`. The - entire `datadir` defined in the Omniwatch configuration needs to be copied +1. Copy Prometheus data collected with Omnistat to `./prometheus-data`. The + entire `datadir` defined in the Omnistat configuration needs to be copied (e.g. a `data` directory should be present under `./prometheus-data`). 2. Start services: ``` diff --git a/docs/introduction.md b/docs/introduction.md index 9ae1711a..76190a7f 100644 --- a/docs/introduction.md +++ b/docs/introduction.md @@ -6,13 +6,13 @@ :maxdepth: 4 ``` -Welcome to the documentation area for the **Omniwatch** project. Use the navigation links on the left-hand side of this page to access more information on installation and capabilities. +Welcome to the documentation area for the **Omnistat** project. Use the navigation links on the left-hand side of this page to access more information on installation and capabilities. -[Browse Omniwatch source code on Github](https://github.com/AMDResearch/omniwatch) +[Browse Omnistat source code on Github](https://github.com/AMDResearch/omnistat) -## What is Omniwatch? +## What is Omnistat? -Omniwatch provides a set of utilities to aid cluster administrators or individual application developers to aggregate scale-out system metrics via low-overhead sampling across all hosts in a cluster or, alternatively on a subset of hosts associated with a specific user job. At its core, Omniwatch was designed to aid collection of key telemetry from AMD Instinct(tm) accelerators (on a per-GPU basis). Relevant target metrics include: +Omnistat provides a set of utilities to aid cluster administrators or individual application developers to aggregate scale-out system metrics via low-overhead sampling across all hosts in a cluster or, alternatively on a subset of hosts associated with a specific user job. At its core, Omnistat was designed to aid collection of key telemetry from AMD Instinct(tm) accelerators (on a per-GPU basis). Relevant target metrics include: * GPU utilization (occupancy) * High-bandwidth memory (HBM) usage @@ -22,27 +22,27 @@ Omniwatch provides a set of utilities to aid cluster administrators or individua * GPU memory clock frequency (Mhz) * GPU throttling events -To enable scalable collection of these metrics, Omniwatch provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server. +To enable scalable collection of these metrics, Omnistat provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server. ## User-mode vs System-level monitoring -Omniwatch utilities can be deployed with two primary use-cases in mind that differ based on the end-consumer and whether the user has administrative rights or not. The use cases are denoted as follows: +Omnistat utilities can be deployed with two primary use-cases in mind that differ based on the end-consumer and whether the user has administrative rights or not. The use cases are denoted as follows: 1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and optionally, provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) workload manager. -1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment but Omniwatch includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job. +1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job. -To demonstrate the overall data collection architecture employed by Omniwatch in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases. +To demonstrate the overall data collection architecture employed by Omnistat in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases. ![System Mode](images/architecture_system-mode.png) ![User Mode](images/architecture_user-mode.png) In the __system-wide monitoring__ case, a system administrator enables data collectors permanently on all relevant hosts within the cluster and configures a Prometheus server to periodically poll these nodes (e.g. at 1 minute or 5 minute intervals). The Prometheus server typically runs on the cluster head node (or separate administrative host) and does not require GPU resources locally. For real-time and historical queries, the system administrator also enables a Grafana instance that queries the Prometheus datastore to provide a variety of visualizations with collected data. Example visualization panels using this approach are highlighted in the [Grafana](./grafana.md) section. -Conversely, in the __user-mode__ case, Omniwatch data collector(s) and a companion prometheus server are deployed temporarily on hosts assigned to a user's SLURM job. At the end of the job, Omniwatch utilities can query cached telemetry data to summarize GPU utilization details or it can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows: +Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion prometheus server are deployed temporarily on hosts assigned to a user's SLURM job. At the end of the job, Omnistat utilities can query cached telemetry data to summarize GPU utilization details or it can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows: ```none ---------------------------------------- -Omniwatch Report Card for Job # 44092 +Omnistat Report Card for Job # 44092 ---------------------------------------- Job Overview (Num Nodes = 1, Machine = Snazzy Cluster) @@ -66,7 +66,7 @@ Version = 0.2.0 ## Software dependencies -The basic minimum dependencies to enable data collection via Omniwatch tools in user-mode are as follows: +The basic minimum dependencies to enable data collection via Omnistat tools in user-mode are as follows: * [ROCm](https://rocm.docs.amd.com/en/latest) * Python dependencies (see top-level requirements.txt)