Updating documentation with name change
Signed-off-by: Karl W Schulz <[email protected]>
koomie committed Jul 3, 2024
1 parent 36567b5 commit 068961c
Showing 5 changed files with 78 additions and 78 deletions.
2 changes: 1 addition & 1 deletion docs/README
@@ -1,4 +1,4 @@
This subdirectory houses the input markup for Omniwatch documentation using
This subdirectory houses the input markup for Omnistat documentation using
Sphinx.

You can build a local copy of the documentation in this directory using
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -31,7 +31,7 @@ def install(package):

# -- Project information -----------------------------------------------------

project = "Omniwatch"
project = "Omnistat"
copyright = "2023-2024, Advanced Micro Devices, Inc. All Rights Reserved"
author = "AMD Research"

2 changes: 1 addition & 1 deletion docs/index.md
@@ -1,4 +1,4 @@
# Welcome to the [Omniwatch](https://github.com/AMDResearch/omniwatch) Documentation!
# Welcome to the [Omnistat](https://github.com/AMDResearch/omnistat) Documentation!

```eval_rst
.. toctree::
128 changes: 64 additions & 64 deletions docs/installation.md
@@ -8,10 +8,10 @@

## System-wide deployment

There are different ways to deploy and install Omniwatch in a data center, and
There are different ways to deploy and install Omnistat in a data center, and
each system will generally require a certain level of customization. This
section first describes the basic manual steps to install the Omniwatch client
and server, and then provides an example of how to deploy Omniwatch in a data
section first describes the basic manual steps to install the Omnistat client
and server, and then provides an example of how to deploy Omnistat in a data
center using Ansible.

### Node-level deployment (client)
@@ -24,72 +24,72 @@ as a package.

1. Clone repository.
```
$ git clone https://github.com/AMDResearch/omniwatch.git
$ git clone https://github.com/AMDResearch/omnistat.git
```

2. Install dependencies.
```
$ cd omniwatch
$ cd omnistat
$ pip install --user -r requirements.txt
```

3. Launch client with `gunicorn`. Needs to be executed from the root
directory of the Omniwatch project.
directory of the Omnistat project.
```
$ gunicorn -b 0.0.0.0:8000 "omniwatch.node_monitoring:app"
$ gunicorn -b 0.0.0.0:8000 "omnistat.node_monitoring:app"
```

#### Option B. Install package

1. Clone repository.
```
$ git clone https://github.com/AMDResearch/omniwatch.git
$ git clone https://github.com/AMDResearch/omnistat.git
```

2. Create a virtual environment, with Python 3.8, 3.9, or 3.10.
```
$ cd omniwatch
$ python -m venv /opt/omniwatch
$ cd omnistat
$ python -m venv /opt/omnistat
```

3. Install omniwatch in a virtual environment. The virtual environment can
also be used by sourcing the `/opt/omniwatch/bin/activate` file, and that
3. Install omnistat in a virtual environment. The virtual environment can
also be used by sourcing the `/opt/omnistat/bin/activate` file, and that
way there is no need to keep using the complete path to `bin` every
time. This guide uses the complete path for clarity. Needs to be
executed from the root directory of the Omniwatch repository.
executed from the root directory of the Omnistat repository.
```
$ /opt/omniwatch/bin/python -m pip install .
$ /opt/omnistat/bin/python -m pip install .
```
Alternatively, use the following line to install Omniwatch with the
optional dependencies for the `omniwatch-query` tool.
Alternatively, use the following line to install Omnistat with the
optional dependencies for the `omnistat-query` tool.
```
$ /opt/omniwatch/bin/python -m pip install .[query]
$ /opt/omnistat/bin/python -m pip install .[query]
```

4. Launch the client with `gunicorn`. To make sure the installed version of
Omniwatch is being used, this shouldn't be executed from the root directory
Omnistat is being used, this shouldn't be executed from the root directory
of the project.
```
$ /opt/omniwatch/bin/gunicorn -b 0.0.0.0:8000 "omniwatch.node_monitoring:app"
$ /opt/omnistat/bin/gunicorn -b 0.0.0.0:8000 "omnistat.node_monitoring:app"
```

#### Configure client

Launching the Omniwatch client as described above will load the default
Launching the Omnistat client as described above will load the default
configuration options. To use a different configuration file, use the
`OMNIWATCH_CONFIG` environment variable.
`OMNISTAT_CONFIG` environment variable.
```
$ OMNIWATCH_CONFIG=/path/to/config/file gunicorn -b 0.0.0.0:8000 "omniwatch.node_monitoring:app"
$ OMNISTAT_CONFIG=/path/to/config/file gunicorn -b 0.0.0.0:8000 "omnistat.node_monitoring:app"
```
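
The same launch can be scripted. The following is a minimal Python sketch, not part of the repository: the helper name `client_launch` is hypothetical, and it assumes `gunicorn` is reachable under the name passed in. Only the `OMNISTAT_CONFIG` variable, the bind address, and the `omnistat.node_monitoring:app` target are taken from the command above.

```python
import os


def client_launch(config_path, port=8000, gunicorn="gunicorn"):
    """Return (argv, env) for launching the Omnistat client.

    client_launch is a hypothetical helper: it only assembles the command
    line shown in the documentation and does not start any process.
    """
    # Copy the current environment and point the client at the config file.
    env = dict(os.environ)
    env["OMNISTAT_CONFIG"] = str(config_path)
    argv = [gunicorn, "-b", f"0.0.0.0:{port}", "omnistat.node_monitoring:app"]
    return argv, env
```

Passing the result to `subprocess.run(argv, env=env, check=True)` would then start the client with the chosen configuration file.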

A [sample configuration
file](https://github.com/AMDResearch/omniwatch/blob/main/omniwatch.default) is
file](https://github.com/AMDResearch/omnistat/blob/main/omnistat.default) is
available in the repository.

#### Check installation

As a sanity check, this is the expected output you should see when launching
the Omniwatch client:
the Omnistat client:
```
[2024-06-08 18:50:56 -0400] [5834] [INFO] Starting gunicorn 21.2.0
[2024-06-08 18:50:56 -0400] [5834] [INFO] Listening at: http://0.0.0.0:8000 (5834)
@@ -125,17 +125,17 @@ card0_rocm_utilization 0.0

#### Enable systemd service

To run the Omniwatch client permanently on a host, configure the service via
To run the Omnistat client permanently on a host, configure the service via
systemd. An [example service
file](https://github.com/AMDResearch/omniwatch/blob/main/omniwatch.service) is
file](https://github.com/AMDResearch/omnistat/blob/main/omnistat.service) is
available in the repository, including the following key lines:
```
Environment="OMNIWATCH_CONFIG=/etc/omniwatch/config"
Environment="OMNIWATCH_PORT=8000"
ExecStart=/opt/omniwatch/bin/gunicorn -b 0.0.0.0:${OMNIWATCH_PORT} "omniwatch.node_monitoring:app"
Environment="OMNISTAT_CONFIG=/etc/omnistat/config"
Environment="OMNISTAT_PORT=8000"
ExecStart=/opt/omnistat/bin/gunicorn -b 0.0.0.0:${OMNISTAT_PORT} "omnistat.node_monitoring:app"
```
Please set `OMNIWATCH_CONFIG` and `OMNIWATCH_PORT` as needed depending on how
Omniwatch is installed.
Please set `OMNISTAT_CONFIG` and `OMNISTAT_PORT` as needed depending on how
Omnistat is installed.

### Prometheus installation and configuration (server)

@@ -159,7 +159,7 @@ On a separate server with access to compute nodes, install and configure
which nodes to poll and at what frequency. For example:
```
scrape_configs:
- job_name: "omniwatch"
- job_name: "omnistat"
scrape_interval: 30s
scrape_timeout: 5s
static_configs:
@@ -173,83 +173,83 @@ On a separate server with access to compute nodes, install and configure
### Ansible example

For a cluster or data center deployment, management tools like Ansible may be
used to install Omniwatch.
used to install Omnistat.

The following Ansible playbook will fetch the Omniwatch repository in each
node, create a virtual environment for Omniwatch under `/opt/omniwatch`,
install a configuration file under `/etc/omniwatch`, and enable Omniwatch as a
The following Ansible playbook will fetch the Omnistat repository in each
node, create a virtual environment for Omnistat under `/opt/omnistat`,
install a configuration file under `/etc/omnistat`, and enable Omnistat as a
systemd service. This is only an example and will likely need to be adapted
depending on the characteristics and scale of the system.

```
- hosts: all
vars:
- omniwatch_url: [email protected]:AMDResearch/omniwatch.git
- omniwatch_tmp: /tmp/omniwatch-install
- omniwatch_dir: /opt/omniwatch
- omnistat_url: [email protected]:AMDResearch/omnistat.git
- omnistat_tmp: /tmp/omnistat-install
- omnistat_dir: /opt/omnistat
tasks:
- name: Fetch copy of omniwatch repository for installation
- name: Fetch copy of omnistat repository for installation
git:
repo: "{{ omniwatch_url }}"
dest: "{{ omniwatch_tmp }}"
repo: "{{ omnistat_url }}"
dest: "{{ omnistat_tmp }}"
version: jorda/python-package
single_branch: true
- name: Install omniwatch in virtual environment
- name: Install omnistat in virtual environment
pip:
name: "{{ omniwatch_tmp }}[query]"
virtualenv: "{{ omniwatch_dir }}"
name: "{{ omnistat_tmp }}[query]"
virtualenv: "{{ omnistat_dir }}"
virtualenv_command: /usr/bin/python3 -m venv
- name: Create configuration directory
file:
path: /etc/omniwatch
path: /etc/omnistat
state: directory
mode: "0755"
- name: Copy configuration file
copy:
remote_src: true
src: "{{ omniwatch_tmp }}/omniwatch/config/omniwatch.default"
dest: /etc/omniwatch/config
src: "{{ omnistat_tmp }}/omnistat/config/omnistat.default"
dest: /etc/omnistat/config
mode: "0644"
- name: Copy service file
copy:
remote_src: true
src: "{{ omniwatch_tmp }}/omniwatch.service"
src: "{{ omnistat_tmp }}/omnistat.service"
dest: /etc/systemd/system
mode: "0644"
- name: Enable service
service:
name: omniwatch
name: omnistat
enabled: yes
state: started
- name: Delete temporary installation files
file:
path: "{{ omniwatch_tmp }}"
path: "{{ omnistat_tmp }}"
state: absent
```

---

## User-mode execution with SLURM

### Installing Omniwatch
### Installing Omnistat

1. Create a virtual environment in a shared directory, with Python 3.8, 3.9,
or 3.10.
```
$ python -m venv ~/omniwatch
$ python -m venv ~/omnistat
```

2. From the root directory of the Omniwatch repository, install omniwatch in
2. From the root directory of the Omnistat repository, install omnistat in
the virtual environment.
```
$ ~/omniwatch/bin/python -m pip install .[query]
$ ~/omnistat/bin/python -m pip install .[query]
```
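
Before creating the virtual environment, it may help to confirm that the interpreter is one of the versions listed in step 1. The following is a small sketch under that assumption; `is_supported` and `SUPPORTED` are hypothetical names, not part of Omnistat.

```python
import sys

# Interpreter versions named in the installation steps above.
SUPPORTED = {(3, 8), (3, 9), (3, 10)}


def is_supported(version_info=None):
    """Return True if the (major, minor) pair is one the guide lists."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED
```

Calling `is_supported()` with no argument checks the interpreter that is running the script, which is the one `python -m venv` would use.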

### Running a SLURM Job
@@ -258,34 +258,34 @@ In the SLURM job script, add the following lines to start and stop the data
collection before and after running the application.

```
export OMNIWATCH_CONFIG=~/omniwatch/omniwatch.config
export OMNISTAT_CONFIG=~/omnistat/omnistat.config
# Start data collector
~/omniwatch/bin/omniwatch-util --start --interval 1
~/omnistat/bin/omnistat-util --start --interval 1
# Run application
sleep 10
# Stop data collector
~/omniwatch/bin/omniwatch-util --stop
~/omnistat/bin/omnistat-util --stop
# Query server to generate job report
~/omniwatch/bin/omniwatch-util --startserver
~/omniwatch/bin/omniwatch-util --job ${SLURM_JOB_ID}
~/omniwatch/bin/omniwatch-util --stopserver
~/omnistat/bin/omnistat-util --startserver
~/omnistat/bin/omnistat-util --job ${SLURM_JOB_ID}
~/omnistat/bin/omnistat-util --stopserver
```
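
The start/stop pattern in the job script can also be wrapped so that the collector is stopped even when the application fails. The following Python sketch is illustrative only: `run_with_collection` and its `runner` parameter are hypothetical, while the `--start`, `--interval`, and `--stop` flags are the ones used in the script above. It assumes `OMNISTAT_CONFIG` is already exported in the environment.

```python
import subprocess


def run_with_collection(app_argv, util="omnistat-util", interval=1,
                        runner=subprocess.run):
    """Run app_argv between collector start and stop.

    Hypothetical helper: `runner` is injectable so the call sequence can
    be exercised without Omnistat installed.
    """
    runner([util, "--start", "--interval", str(interval)], check=True)
    try:
        return runner(app_argv, check=True)
    finally:
        # Always stop the collector, even if the application raised.
        runner([util, "--stop"], check=True)
```

With the installation path used in this guide, `util` would be the full path to `~/omnistat/bin/omnistat-util` (expanded, since no shell is involved).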

### Exploring results with a local Docker environment

To explore results generated for user-mode executions of Omniwatch, we provide
To explore results generated for user-mode executions of Omnistat, we provide
a Docker environment that will automatically launch the required services
locally. That includes Prometheus to read and query the stored data, and
Grafana as visualization platform to display time series and other metrics.

To explore results:

1. Copy Prometheus data collected with Omniwatch to `./prometheus-data`. The
entire `datadir` defined in the Omniwatch configuration needs to be copied
1. Copy Prometheus data collected with Omnistat to `./prometheus-data`. The
entire `datadir` defined in the Omnistat configuration needs to be copied
(e.g. a `data` directory should be present under `./prometheus-data`).
2. Start services:
```
22 changes: 11 additions & 11 deletions docs/introduction.md
@@ -6,13 +6,13 @@
:maxdepth: 4
```

Welcome to the documentation area for the **Omniwatch** project. Use the navigation links on the left-hand side of this page to access more information on installation and capabilities.
Welcome to the documentation area for the **Omnistat** project. Use the navigation links on the left-hand side of this page to access more information on installation and capabilities.

[Browse Omniwatch source code on GitHub](https://github.com/AMDResearch/omniwatch)
[Browse Omnistat source code on GitHub](https://github.com/AMDResearch/omnistat)

## What is Omniwatch?
## What is Omnistat?

Omniwatch provides a set of utilities to aid cluster administrators or individual application developers to aggregate scale-out system metrics via low-overhead sampling across all hosts in a cluster or, alternatively on a subset of hosts associated with a specific user job. At its core, Omniwatch was designed to aid collection of key telemetry from AMD Instinct(tm) accelerators (on a per-GPU basis). Relevant target metrics include:
Omnistat provides a set of utilities to aid cluster administrators or individual application developers to aggregate scale-out system metrics via low-overhead sampling across all hosts in a cluster or, alternatively on a subset of hosts associated with a specific user job. At its core, Omnistat was designed to aid collection of key telemetry from AMD Instinct(tm) accelerators (on a per-GPU basis). Relevant target metrics include:

* GPU utilization (occupancy)
* High-bandwidth memory (HBM) usage
@@ -22,27 +22,27 @@ Omniwatch provides a set of utilities to aid cluster administrators or individua
* GPU memory clock frequency (MHz)
* GPU throttling events

To enable scalable collection of these metrics, Omniwatch provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server.
To enable scalable collection of these metrics, Omnistat provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server.
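
To illustrate what a polled sample looks like on the consumer side, here is a simplified Python sketch that reads Prometheus text exposition output into a dictionary. `parse_metrics` is a hypothetical helper, not an Omnistat or Prometheus API; it assumes one sample per line with no trailing timestamp, and any labels simply stay part of the key. The metric name `card0_rocm_utilization` in the usage note comes from the client output quoted in the installation guide.

```python
def parse_metrics(text):
    """Parse samples from Prometheus text exposition output.

    Simplified sketch: comment lines (# HELP / # TYPE) and blank lines are
    skipped, and every other line is split into a name and a float value
    at its last space. Timestamps are not handled.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```

For example, feeding it a response containing the line `card0_rocm_utilization 0.0` yields a dictionary with that name mapped to `0.0`.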

## User-mode vs System-level monitoring

Omniwatch utilities can be deployed with two primary use-cases in mind that differ based on the end-consumer and whether the user has administrative rights or not. The use cases are denoted as follows:
Omnistat utilities can be deployed with two primary use-cases in mind that differ based on the end-consumer and whether the user has administrative rights or not. The use cases are denoted as follows:

1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and optionally, provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) workload manager.
1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment but Omniwatch includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job.
1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job.

To demonstrate the overall data collection architecture employed by Omniwatch in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases.
To demonstrate the overall data collection architecture employed by Omnistat in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases.

![System Mode](images/architecture_system-mode.png)
![User Mode](images/architecture_user-mode.png)

In the __system-wide monitoring__ case, a system administrator enables data collectors permanently on all relevant hosts within the cluster and configures a Prometheus server to periodically poll these nodes (e.g. at 1 minute or 5 minute intervals). The Prometheus server typically runs on the cluster head node (or separate administrative host) and does not require GPU resources locally. For real-time and historical queries, the system administrator also enables a Grafana instance that queries the Prometheus datastore to provide a variety of visualizations with collected data. Example visualization panels using this approach are highlighted in the [Grafana](./grafana.md) section.

Conversely, in the __user-mode__ case, Omniwatch data collector(s) and a companion Prometheus server are deployed temporarily on hosts assigned to a user's SLURM job. At the end of the job, Omniwatch utilities can query cached telemetry data to summarize GPU utilization details, or the data can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:
Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion Prometheus server are deployed temporarily on hosts assigned to a user's SLURM job. At the end of the job, Omnistat utilities can query cached telemetry data to summarize GPU utilization details, or the data can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:

```none
----------------------------------------
Omniwatch Report Card for Job # 44092
Omnistat Report Card for Job # 44092
----------------------------------------
Job Overview (Num Nodes = 1, Machine = Snazzy Cluster)
@@ -66,7 +66,7 @@ Version = 0.2.0

## Software dependencies

The basic minimum dependencies to enable data collection via Omniwatch tools in user-mode are as follows:
The basic minimum dependencies to enable data collection via Omnistat tools in user-mode are as follows:

* [ROCm](https://rocm.docs.amd.com/en/latest)
* Python dependencies (see top-level requirements.txt)