@@ -34,6 +34,21 @@ class DeviceStatsMonitor(Callback):
r"""Automatically monitors and logs device stats during training, validation and testing stage.
``DeviceStatsMonitor`` is a special callback as it requires a ``logger`` to be passed as an argument to the ``Trainer``.
+ Logged Metrics:
+ Device statistics are logged with keys prefixed as
+ ``DeviceStatsMonitor.{hook_name}/{base_metric_name}`` (e.g.,
+ ``DeviceStatsMonitor.on_train_batch_start/cpu_percent``). The source of these metrics depends on the ``cpu_stats`` flag and the active accelerator.
+
+ CPU (via ``psutil``): Logs ``cpu_percent``, ``cpu_vm_percent``, ``cpu_swap_percent``.
+ All are percentages (%).
+ CUDA GPU (via :func:`torch.cuda.memory_stats`): Logs detailed memory statistics from
+ PyTorch's allocator (e.g., ``allocated_bytes.all.current``, ``num_ooms``; all in bytes).
+ GPU compute utilization is not logged by default.
+ Other Accelerators (e.g., TPU, MPS): Logs device-specific stats.
+ - TPU example: ``avg. free memory (MB)``.
+ - MPS example: ``mps.current_allocated_bytes``.
+ Observe logs or check accelerator documentation for details.
+
Args:
cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.
If ``True``, it will log CPU stats regardless of the accelerator.
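The ``cpu_stats`` semantics described above can be sketched as a small predicate. This is a hypothetical helper for illustration only, not the Lightning implementation:

```python
from typing import Optional

def should_log_cpu_stats(cpu_stats: Optional[bool], accelerator_is_cpu: bool) -> bool:
    """Mirror the documented cpu_stats behaviour of DeviceStatsMonitor.

    Hypothetical helper: not part of the Lightning API.
    """
    if cpu_stats is None:
        # Default: log CPU stats only when running on the CPU accelerator.
        return accelerator_is_cpu
    # An explicit True/False overrides the accelerator check.
    return cpu_stats
```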
@@ -45,6 +60,7 @@ class DeviceStatsMonitor(Callback):
ModuleNotFoundError:
If ``psutil`` is not installed and CPU stats are monitored.
+
Example::
from lightning import Trainer
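The key naming scheme documented under "Logged Metrics" can be illustrated with a tiny helper (hypothetical, for illustration only; not part of the Lightning API):

```python
def device_stats_key(hook_name: str, base_metric_name: str) -> str:
    """Build a metric key in the documented DeviceStatsMonitor format:

    DeviceStatsMonitor.{hook_name}/{base_metric_name}
    """
    return f"DeviceStatsMonitor.{hook_name}/{base_metric_name}"

# e.g. the CPU utilization metric logged at the start of each training batch:
print(device_stats_key("on_train_batch_start", "cpu_percent"))
# DeviceStatsMonitor.on_train_batch_start/cpu_percent
```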