docs: Update docs with libvirt sections
Signed-off-by: Mahendra Paipuri <[email protected]>
mahendrapaipuri committed Oct 10, 2024
1 parent f160bcc commit 190b609
Showing 10 changed files with 176 additions and 49 deletions.
14 changes: 7 additions & 7 deletions etc/nvidia-dcgm-exporter/counters.csv
@@ -78,14 +78,14 @@ DCGM_FI_DRIVER_VERSION, label, Driver Version

# Profiling metrics. Ref: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling-metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Fraction of time any portion of the graphics or compute engines were active.
-DCGM_FI_PROF_SM_ACTIVE, gauge, Fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors.
-DCGM_FI_PROF_SM_OCCUPANCY, gauge, Fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor.
+DCGM_FI_PROF_SM_ACTIVE, gauge, Fraction of time at least one warp was active on a multiprocessor averaged over all multiprocessors.
+DCGM_FI_PROF_SM_OCCUPANCY, gauge, Fraction of resident warps on a multiprocessor relative to the maximum number of concurrent warps supported on a multiprocessor.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Fraction of cycles the tensor (HMMA / IMMA) pipe was active.
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Fraction of cycles the FP64 (double precision) pipe was active.
-DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Fraction of cycles the FMA (FP32 (single precision), and integer) pipe was active.
+DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Fraction of cycles the FMA (FP32 (single precision) and integer) pipe was active.
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Fraction of cycles the FP16 (half precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Fraction of cycles where data was sent to or received from device memory.
-DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, Total rate of data transmitted over NVLink, not including protocol headers, in bytes per second.
-DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, Total rate of data received over NVLink, not including protocol headers, in bytes per second.
-DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total rate of data transmitted over PCIE, not including protocol headers, in bytes per second.
-DCGM_FI_PROF_PCIE_RX_BYTES, gauge, Total rate of data received over PCIE, not including protocol headers, in bytes per second.
+DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, Total rate of data transmitted over NVLink not including protocol headers in bytes per second.
+DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, Total rate of data received over NVLink not including protocol headers in bytes per second.
+DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total rate of data transmitted over PCIE not including protocol headers in bytes per second.
+DCGM_FI_PROF_PCIE_RX_BYTES, gauge, Total rate of data received over PCIE not including protocol headers in bytes per second.
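For reference, a minimal sketch of how a counters file like this is wired into `dcgm-exporter`; the flag name and default port below are assumptions, so verify them against `dcgm-exporter --help` on your installation:

```bash
# Start dcgm-exporter with the counters file above so that only the listed DCGM
# fields are collected (flag name and port are assumptions, not verified here).
dcgm-exporter -f /etc/nvidia-dcgm-exporter/counters.csv &

# The profiling fields should then appear on the metrics endpoint.
curl -s localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
```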
10 changes: 2 additions & 8 deletions etc/slurm/README.md
@@ -12,11 +12,5 @@ This directory provides those scripts that should be used with SLURM.
An example [systemd service file](https://github.com/mahendrapaipuri/ceems/blob/main/init/systemd/ceems_exporter_no_privs.service)
is also provided in the repo that can be used along with these prolog and epilog scripts.

-> [!IMPORTANT]
-> The CLI argument `--collector.slurm.gpu-job-map-path`
-is hidden and cannot be seen in `ceems_exporter --help` output. However, this argument
-exists in the exporter and can be used.

-Even with such prolog and epilog scripts, operators should grant the user running CEEMS
-exporter permissions to run `ipmi-dcmi` command as this command can be executable by only
-`root` by default.
+Even with such prolog and epilog scripts, operators should grant the CEEMS exporter
+process additional privileges for collectors like `ipmi_dcmi`, `ebpf`, _etc_.
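A hedged sketch of how such privileges could be granted through a systemd drop-in is shown below; the unit name, capability set and device path are assumptions and should be adapted to the collectors that are actually enabled:

```bash
# Sketch: grant the exporter unit the extra privileges some collectors need.
# Unit name, capabilities and device path are assumptions, not CEEMS requirements.
sudo mkdir -p /etc/systemd/system/ceems_exporter.service.d
sudo tee /etc/systemd/system/ceems_exporter.service.d/10-privileges.conf > /dev/null <<'EOF'
[Service]
# perf/ebpf style collectors typically need CAP_PERFMON and CAP_BPF (kernel >= 5.8);
# in-band IPMI readings need access to the IPMI device node.
AmbientCapabilities=CAP_PERFMON CAP_BPF
DeviceAllow=/dev/ipmi0 rw
EOF
sudo systemctl daemon-reload
sudo systemctl restart ceems_exporter.service
```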
3 changes: 2 additions & 1 deletion website/cspell.json
@@ -52,7 +52,8 @@
"cpus",
"memsw",
"retrans",
"Mellanox"
"Mellanox",
"blkio"
],
// flagWords - list of words to be always considered incorrect
// This is useful for offensive words and common spelling errors.
4 changes: 2 additions & 2 deletions website/docs/00-introduction.md
@@ -45,7 +45,7 @@ with Grafana and Prometheus to show statistics to users.

:::important[Note]

-Currently, only SLURM is supported as a resource manager. In future support for Openstack
-and Kubernetes will be added.
+Currently, only SLURM and Openstack are supported as resource managers. In the future,
+support for Kubernetes will be added.

:::
30 changes: 15 additions & 15 deletions website/docs/02-objectives.md
@@ -2,31 +2,31 @@

The objectives of the current stack are two-fold:

- For end users to be able to monitor their compute units in real time. Besides the
conventional metrics like CPU usage, memory usage, _etc_, the stack also exposes
metrics like energy consumption and equivalent emissions in real time. The stack is
also capable of showing the aggregate usage metrics of a given project/tenant/namespace.

- For the operators/admins to be able to monitor the usage of the cluster in terms of
CPU usage, memory, energy, _etc_. With the current stack the operators will be able to
identify the top consumers of resources in the cluster, users/projects that are
under-consuming the allocated resources, _etc_.

CEEMS has been designed to be modular and extensible, _i.e.,_ CEEMS is meant to support
multiple clusters at the same time. For instance, imagine a Data Center (DC) has a SLURM
cluster and an Openstack cluster. A single deployment of CEEMS should be able to
consolidate the metrics data of SLURM jobs and Openstack VMs and expose it to
end users using a single instance of Grafana.

## End user's perspective

The following screenshots show some of the capabilities of CEEMS when used with
Grafana.

:::note[Note]

These are _only a few_ dashboards built to
demonstrate the capabilities of CEEMS and the operators are free to create more
dashboards according to their business requirements.

:::
@@ -60,8 +60,8 @@ dashboards according to their business requirements.
:::important[Important]

This is an interesting metric as we can clearly see there is a considerable reduction
in the emissions even when the overall energy consumption remained the same. This is due
to the fact that we use real-time emission factors which can be dynamic, and a small
change in the factor can have huge implications for the emissions of big data centers.

:::
94 changes: 92 additions & 2 deletions website/docs/components/ceems-exporter.md
@@ -10,16 +10,39 @@ sidebar_position: 1
metrics, RAPL energy, IPMI power consumption, emission factor and GPU to compute unit
mapping.

-Currently, the exporter supports only SLURM resource manager.
-`ceems_exporter` provides following collectors:
+`ceems_exporter` collectors can be categorized as follows:

### Resource manager collectors

These collectors export metrics from different resource managers.

- Slurm collector: Exports SLURM job metrics like CPU, memory and GPU index to job ID mappings
- Libvirt collector: Exports metrics of libvirt-managed VMs like CPU, memory, IO, _etc_.

### Energy related collectors

These collectors export energy-related metrics from different
sources on the compute node.

- IPMI collector: Exports power usage reported by `ipmi` tools
- RAPL collector: Exports RAPL energy metrics

### Emissions related collectors

This collector exports emissions-related metrics that are used
in estimating the carbon footprint.

- Emissions collector: Exports emission factor (g eCO2/kWh)

### Node metrics collectors

These collectors export node-level metrics.

- CPU collector: Exports CPU time in different modes (at node level)
- Meminfo collector: Exports memory related statistics (at node level)
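As an illustration, enabling a subset of these collectors could look like the node_exporter-style invocation sketched below; the flag names and port are assumptions, so consult `ceems_exporter --help` for the flags that actually exist:

```bash
# Hypothetical invocation: flag names follow the node_exporter convention and are
# assumptions; check `ceems_exporter --help` for the real collector flags.
ceems_exporter \
  --collector.slurm \
  --collector.ipmi_dcmi \
  --collector.rapl \
  --collector.emissions \
  --web.listen-address=":9010" &

# Verify that the enabled collectors report metrics on the scrape endpoint.
curl -s localhost:9010/metrics | head
```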

### Perf related collectors

In addition to the above stated collectors, there are common "sub-collectors" that
can be reused with different collectors. These sub-collectors provide auxiliary
metrics like IO, networking, performance, _etc_. Currently available sub-collectors are:
@@ -283,6 +306,73 @@ and [eBPF](./ceems-exporter.md#ebpf-sub-collector) sub-collectors. Hence, in
addition to above stated metrics, all the metrics available in the sub-collectors
can also be reported for each cgroup.

### Libvirt collector

Similar to the slurm collector, the libvirt collector exports metrics of VMs managed
by libvirt. This collector is useful for monitoring Openstack clusters where
[nova](https://docs.openstack.org/nova/latest/) uses libvirt to manage the lifecycle
of the VMs. The exported metrics include CPU, DRAM and block IO usage retrieved
from cgroups. The collector supports both cgroups v1 and v2.
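For instance, on a host running cgroups v2 with the systemd cgroup driver, the per-VM accounting that the collector reads lives under `machine.slice`; a quick manual check could look like the sketch below (the paths and scope name are illustrative and depend on the cgroup driver in use):

```bash
# List the per-VM cgroups created by libvirt/QEMU (cgroups v2, systemd driver).
ls /sys/fs/cgroup/machine.slice/ | grep qemu

# Inspect raw CPU and memory accounting for one VM; the scope name below is
# illustrative and will differ on your host.
cat "/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dinstance\x2d00000001.scope/cpu.stat"
cat "/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dinstance\x2d00000001.scope/memory.current"
```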

When GPUs are present on the compute node, as in the case of Slurm, we will
need information on which GPU is used by which VM. This information can be
obtained from libvirt's XML file that keeps the state of the VM (a sketch of
inspecting this XML by hand follows the caveats below). However, there
are a few caveats here:

- If a GPU is added to a VM using PCI passthrough, this GPU will not be available
to the hypervisor and hence, it cannot be queried or monitored. This is due to
the fact that the GPU will be unbound from the hypervisor and bound to the guest.
Thus, energy consumption and GPU metrics for GPUs using PCI passthrough
**will only be available in the guest**.

- NVIDIA's vGPU uses mediated devices to expose GPUs in the guest and thus,
GPUs can be queried and monitored from both the hypervisor and the guest. However,
CEEMS relies on [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) to
export GPU energy consumption and usage metrics, and it does not support
usage and energy consumption metrics for vGPUs.

- NVIDIA's MIG instances use a similar approach to vGPU to expose GPUs inside
guests and hence, similar limitations apply.

Thus, currently it is not possible to reliably monitor the energy and usage
metrics of libvirt instances with GPUs. In any case, the exporter will always
export the GPU UUID to instance UUID mapping to keep track of which instance is
using which GPU. If the above stated limitations are addressed upstream, CEEMS
will be able to track usage metrics of GPU instances as well.
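As mentioned above, the GPU-to-instance information comes from the domain XML that libvirt keeps for each VM; a manual way to inspect it is sketched below (the domain name is illustrative):

```bash
# List domains and dump the XML libvirt keeps for one of them, then look for GPU
# devices: <hostdev> entries cover PCI passthrough and mediated (vGPU/MIG) devices.
virsh list --all
virsh dumpxml instance-00000042 | grep -A 6 "<hostdev"
```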

Currently, the metrics exported by the libvirt collector are as follows:

- Instance current CPU time in user and system mode
- Instance CPUs limit (number of CPUs allocated to the instance)
- Instance current total memory usage
- Instance total memory limit (memory allocated to the instance)
- Instance current RSS memory usage
- Instance current cache memory usage
- Instance current number of memory usage limit hits
- Instance current memory and swap usage
- Instance current number of memory and swap usage limit hits
- Instance total memory and swap limit
- Instance block IO read and write bytes
- Instance block IO read and write requests
- Instance CPU, memory and IO pressures
- Instance to GPU ordinal mapping (when GPUs are found on the compute node)
- Current number of instances on the compute node

Similar to Slurm, the libvirt collector supports the
[perf](./ceems-exporter.md#perf-sub-collector)
and [eBPF](./ceems-exporter.md#ebpf-sub-collector) sub-collectors.

:::warning[WARNING]

Libvirt has no information about the processes running inside the
guest and hence, it is not possible to profile individual processes
inside the guest. Therefore, metrics exported by the
[perf](./ceems-exporter.md#perf-sub-collector) sub-collector are for the
entire VM and it is not possible to have more fine-grained control over
which processes inside the guest can be profiled.

:::

### IPMI collector

The IPMI collector reports the current power usage by the node reported by