docs: Update docs with libvirt sections
Signed-off-by: Mahendra Paipuri <[email protected]>
mahendrapaipuri committed Oct 10, 2024
1 parent f160bcc commit 190b609
Showing 10 changed files with 176 additions and 49 deletions.
14 changes: 7 additions & 7 deletions etc/nvidia-dcgm-exporter/counters.csv
@@ -78,14 +78,14 @@ DCGM_FI_DRIVER_VERSION, label, Driver Version

# Profiling metrics. Ref: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling-metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Fraction of time any portion of the graphics or compute engines were active.
-DCGM_FI_PROF_SM_ACTIVE, gauge, Fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors.
-DCGM_FI_PROF_SM_OCCUPANCY, gauge, Fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor.
+DCGM_FI_PROF_SM_ACTIVE, gauge, Fraction of time at least one warp was active on a multiprocessor averaged over all multiprocessors.
+DCGM_FI_PROF_SM_OCCUPANCY, gauge, Fraction of resident warps on a multiprocessor relative to the maximum number of concurrent warps supported on a multiprocessor.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Fraction of cycles the tensor (HMMA / IMMA) pipe was active.
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Fraction of cycles the FP64 (double precision) pipe was active.
-DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Fraction of cycles the FMA (FP32 (single precision), and integer) pipe was active.
+DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Fraction of cycles the FMA (FP32 (single precision) and integer) pipe was active.
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Fraction of cycles the FP16 (half precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Fraction of cycles where data was sent to or received from device memory.
-DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, Total rate of data transmitted over NVLink, not including protocol headers, in bytes per second.
-DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, Total rate of data received over NVLink, not including protocol headers, in bytes per second.
-DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total rate of data transmitted over PCIE, not including protocol headers, in bytes per second.
-DCGM_FI_PROF_PCIE_RX_BYTES, gauge, Total rate of data received over PCIE, not including protocol headers, in bytes per second.
+DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, Total rate of data transmitted over NVLink not including protocol headers in bytes per second.
+DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, Total rate of data received over NVLink not including protocol headers in bytes per second.
+DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total rate of data transmitted over PCIE not including protocol headers in bytes per second.
+DCGM_FI_PROF_PCIE_RX_BYTES, gauge, Total rate of data received over PCIE not including protocol headers in bytes per second.
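For reference, a minimal sketch of how a counters file like this is wired into `dcgm-exporter`; the flag name and default port below are assumptions, so verify them against `dcgm-exporter --help` on your installation:

```bash
# Start dcgm-exporter with the counters file above so that only the listed DCGM
# fields are collected (flag name and port are assumptions, not verified here).
dcgm-exporter -f /etc/nvidia-dcgm-exporter/counters.csv &

# The profiling fields should then appear on the metrics endpoint.
curl -s localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
```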
10 changes: 2 additions & 8 deletions etc/slurm/README.md
@@ -12,11 +12,5 @@ This directory provides those scripts that should be used with SLURM.
An example [systemd service file](https://github.com/mahendrapaipuri/ceems/blob/main/init/systemd/ceems_exporter_no_privs.service)
is also provided in the repo that can be used along with these prolog and epilog scripts.

-> [!IMPORTANT]
-> The CLI argument `--collector.slurm.gpu-job-map-path`
-is hidden and cannot be seen in `ceems_exporter --help` output. However, this argument
-exists in the exporter and can be used.

-Even with such prolog and epilog scripts, operators should grant the user running CEEMS
-exporter permissions to run `ipmi-dcmi` command as this command can be executable by only
-`root` by default.
+Even with such prolog and epilog scripts, operators should grant the CEEMS exporter
+process additional privileges for collectors like `ipmi_dcmi`, `ebpf`, _etc_.
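A hedged sketch of how such privileges could be granted through a systemd drop-in is shown below; the unit name, capability set and device path are assumptions and should be adapted to the collectors that are actually enabled:

```bash
# Sketch: grant the exporter unit the extra privileges some collectors need.
# Unit name, capabilities and device path are assumptions, not CEEMS requirements.
sudo mkdir -p /etc/systemd/system/ceems_exporter.service.d
sudo tee /etc/systemd/system/ceems_exporter.service.d/10-privileges.conf > /dev/null <<'EOF'
[Service]
# perf/ebpf style collectors typically need CAP_PERFMON and CAP_BPF (kernel >= 5.8);
# in-band IPMI readings need access to the IPMI device node.
AmbientCapabilities=CAP_PERFMON CAP_BPF
DeviceAllow=/dev/ipmi0 rw
EOF
sudo systemctl daemon-reload
sudo systemctl restart ceems_exporter.service
```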
3 changes: 2 additions & 1 deletion website/cspell.json
@@ -52,7 +52,8 @@
"cpus",
"memsw",
"retrans",
"Mellanox"
"Mellanox",
"blkio"
],
// flagWords - list of words to be always considered incorrect
// This is useful for offensive words and common spelling errors.
4 changes: 2 additions & 2 deletions website/docs/00-introduction.md
@@ -45,7 +45,7 @@ with Grafana and Prometheus to show statistics to users.

:::important[Note]

-Currently, only SLURM is supported as a resource manager. In future support for Openstack
-and Kubernetes will be added.
+Currently, only SLURM and Openstack are supported as resource managers. In the future,
+support for Kubernetes will be added.

:::
30 changes: 15 additions & 15 deletions website/docs/02-objectives.md
@@ -2,31 +2,31 @@

The objectives of the current stack are two-fold:

- For end users to be able to monitor their compute units in real time. Besides the
conventional metrics like CPU usage, memory usage, _etc_, the stack also exposes
metrics like energy consumption and equivalent emissions in real time. The stack is
also capable of showing the aggregate usage metrics of a given project/tenant/namespace.

- For the operators/admins to be able to monitor the usage of the cluster in terms of
CPU usage, memory, energy, _etc_. With the current stack the operators will be able to
identify the top consumers of resources in the cluster, users/projects that are
under-consuming the allocated resources, _etc_.

CEEMS has been designed to be modular and extensible, _i.e.,_ CEEMS is meant to support
multiple clusters at the same time. For instance, imagine a Data Center (DC) has a SLURM
cluster and an Openstack cluster. A single deployment of CEEMS should be able to
consolidate the metrics data of SLURM jobs and Openstack VMs and expose it to
end users using a single instance of Grafana.

## End user's perspective

The following screenshots show some of the capabilities of CEEMS when used with
Grafana.

:::note[Note]

These are _only a few_ dashboards built to
demonstrate the capabilities of CEEMS and the operators are free to create more
dashboards according to their business requirements.

:::
@@ -60,8 +60,8 @@ dashboards according to their business requirements.
:::important[Important]

This is an interesting metric as we can clearly see there is a considerable reduction
in the emissions even when the overall energy consumption remained the same. This is due
to the fact that we use real-time emission factors which can be dynamic, and a small
change in the factor can have huge implications for the emissions of big data centers.

:::
94 changes: 92 additions & 2 deletions website/docs/components/ceems-exporter.md
@@ -10,16 +10,39 @@ sidebar_position: 1
metrics, RAPL energy, IPMI power consumption, emission factor and GPU to compute unit
mapping.

-Currently, the exporter supports only SLURM resource manager.
-`ceems_exporter` provides following collectors:
+`ceems_exporter` collectors can be categorized as follows:

### Resource manager collectors

These collectors export metrics from different resource managers.

- Slurm collector: Exports SLURM job metrics like CPU, memory and GPU index to job ID mappings
- Libvirt collector: Exports metrics of libvirt-managed VMs like CPU, memory, IO, _etc_.

### Energy related collectors

These collectors export energy-related metrics from different
sources on the compute node.

- IPMI collector: Exports power usage reported by `ipmi` tools
- RAPL collector: Exports RAPL energy metrics

### Emissions related collectors

This collector exports emissions-related metrics that are used
in estimating the carbon footprint.

- Emissions collector: Exports emission factor (g eCO2/kWh)

### Node metrics collectors

These collectors export node-level metrics.

- CPU collector: Exports CPU time in different modes (at node level)
- Meminfo collector: Exports memory related statistics (at node level)
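As an illustration, enabling a subset of these collectors could look like the node_exporter-style invocation sketched below; the flag names and port are assumptions, so consult `ceems_exporter --help` for the flags that actually exist:

```bash
# Hypothetical invocation: flag names follow the node_exporter convention and are
# assumptions; check `ceems_exporter --help` for the real collector flags.
ceems_exporter \
  --collector.slurm \
  --collector.ipmi_dcmi \
  --collector.rapl \
  --collector.emissions \
  --web.listen-address=":9010" &

# Verify that the enabled collectors report metrics on the scrape endpoint.
curl -s localhost:9010/metrics | head
```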

### Perf related collectors

In addition to the above stated collectors, there are common "sub-collectors" that
can be reused with different collectors. These sub-collectors provide auxiliary
metrics like IO, networking, performance, _etc_. Currently available sub-collectors are:
@@ -283,6 +306,73 @@ and [eBPF](./ceems-exporter.md#ebpf-sub-collector) sub-collectors. Hence, in
addition to above stated metrics, all the metrics available in the sub-collectors
can also be reported for each cgroup.

### Libvirt collector

Similar to the slurm collector, the libvirt collector exports metrics of VMs managed
by libvirt. This collector is useful for monitoring Openstack clusters where
[nova](https://docs.openstack.org/nova/latest/) uses libvirt to manage the lifecycle
of the VMs. The exported metrics include CPU, DRAM and block IO usage retrieved
from cgroups. The collector supports both cgroups v1 and v2.
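For instance, on a host running cgroups v2 with the systemd cgroup driver, the per-VM accounting that the collector reads lives under `machine.slice`; a quick manual check could look like the sketch below (the paths and scope name are illustrative and depend on the cgroup driver in use):

```bash
# List the per-VM cgroups created by libvirt/QEMU (cgroups v2, systemd driver).
ls /sys/fs/cgroup/machine.slice/ | grep qemu

# Inspect raw CPU and memory accounting for one VM; the scope name below is
# illustrative and will differ on your host.
cat "/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dinstance\x2d00000001.scope/cpu.stat"
cat "/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dinstance\x2d00000001.scope/memory.current"
```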

When GPUs are present on the compute node, as in the case of Slurm, we will
need information on which GPU is used by which VM. This information can be
obtained from libvirt's XML file that keeps the state of the VM (a sketch of
inspecting this XML by hand follows the caveats below). However, there
are a few caveats here:

- If a GPU is added to a VM using PCI passthrough, this GPU will not be available
to the hypervisor and hence, it cannot be queried or monitored. This is due to
the fact that the GPU will be unbound from the hypervisor and bound to the guest.
Thus, energy consumption and GPU metrics for GPUs using PCI passthrough
**will only be available in the guest**.

- NVIDIA's vGPU uses mediated devices to expose GPUs in the guest and thus,
GPUs can be queried and monitored from both the hypervisor and the guest. However,
CEEMS relies on [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) to
export GPU energy consumption and usage metrics, and it does not support
usage and energy consumption metrics for vGPUs.

- NVIDIA's MIG instances use a similar approach to vGPU to expose GPUs inside
guests and hence, similar limitations apply.

Thus, currently it is not possible to reliably monitor the energy and usage
metrics of libvirt instances with GPUs. In any case, the exporter will always
export the GPU UUID to instance UUID mapping to keep track of which instance is
using which GPU. If the above stated limitations are addressed upstream, CEEMS
will be able to track usage metrics of GPU instances as well.
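As mentioned above, the GPU-to-instance information comes from the domain XML that libvirt keeps for each VM; a manual way to inspect it is sketched below (the domain name is illustrative):

```bash
# List domains and dump the XML libvirt keeps for one of them, then look for GPU
# devices: <hostdev> entries cover PCI passthrough and mediated (vGPU/MIG) devices.
virsh list --all
virsh dumpxml instance-00000042 | grep -A 6 "<hostdev"
```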

Currently, the metrics exported by the libvirt collector are as follows:

- Instance current CPU time in user and system mode
- Instance CPUs limit (number of CPUs allocated to the instance)
- Instance current total memory usage
- Instance total memory limit (memory allocated to the instance)
- Instance current RSS memory usage
- Instance current cache memory usage
- Instance current number of memory usage limit hits
- Instance current memory and swap usage
- Instance current number of memory and swap usage limit hits
- Instance total memory and swap limit
- Instance block IO read and write bytes
- Instance block IO read and write requests
- Instance CPU, memory and IO pressures
- Instance to GPU ordinal mapping (when GPUs are found on the compute node)
- Current number of instances on the compute node

Similar to Slurm, the libvirt collector supports the
[perf](./ceems-exporter.md#perf-sub-collector)
and [eBPF](./ceems-exporter.md#ebpf-sub-collector) sub-collectors.

:::warning[WARNING]

Libvirt has no information about the processes running inside the
guest and hence, it is not possible to profile individual processes
inside the guest. Therefore, metrics exported by the
[perf](./ceems-exporter.md#perf-sub-collector) sub-collector are for the
entire VM and it is not possible to have more fine-grained control over
which processes inside the guest can be profiled.

:::

### IPMI collector

The IPMI collector reports the current power usage by the node reported by