DOC: Clarify DeviceStatsMonitor logged metrics #20895
base: master
Conversation
@@ -45,6 +45,23 @@ class DeviceStatsMonitor(Callback):
        ModuleNotFoundError:
            If ``psutil`` is not installed and CPU stats are monitored.

    Logged Metrics:
`Raises:` or `Args:` are Sphinx-specific keywords, unlike `Logged Metrics:`, so please let's move it just to the top of this docstring.
Thanks, @Borda! I've moved the 'Logged Metrics' section to the top of the docstring as requested in commit dcd1042.
Well, I still see it without the change.
Force-pushed: cf7e36a → 798f9c9 → 2461f52
    CPU (via ``psutil``): Logs ``cpu_percent``, ``cpu_vm_percent``, ``cpu_swap_percent``.
        All are percentages (%).
    CUDA GPU (via :func:`torch.cuda.memory_stats`): Logs detailed memory statistics from
        PyTorch's allocator (e.g., ``allocated_bytes.all.current``, ``num_ooms``; all in Bytes).
        GPU compute utilization is not logged by default.
    Other Accelerators (e.g., TPU, MPS): Logs device-specific stats:

        - TPU example: ``avg. free memory (MB)``.
        - MPS example: ``mps.current_allocated_bytes``.

        Observe logs or check accelerator documentation for details.
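(Illustration, not part of the diff: the CUDA allocator counters mentioned above can be inspected directly with `torch.cuda.memory_stats`; the two keys shown are among the documented entries.)

```python
# Sketch: inspect the CUDA allocator statistics that DeviceStatsMonitor forwards to the logger.
import torch

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats()  # flat dict of allocator counters
    print(stats["allocated_bytes.all.current"])  # bytes currently held by the caching allocator
    print(stats["num_ooms"])                     # number of CUDA out-of-memory errors seen so far
```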
Let's make this a complete list, and you can validate the compiled docs in the Read the Docs link 📚: pytorch-lightning--20895.org.readthedocs.build/en/20895
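For context while reviewing, a minimal usage sketch (assuming the current `DeviceStatsMonitor(cpu_stats=...)` signature and a Trainer with a logger attached):

```python
# Sketch: attach DeviceStatsMonitor so device stats are recorded by the Trainer's logger.
# Which metrics appear depends on the accelerator in use (CPU stats require psutil).
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor

trainer = Trainer(callbacks=[DeviceStatsMonitor(cpu_stats=True)], max_epochs=1)
# trainer.fit(model)  # metrics show up under keys like "DeviceStatsMonitor.on_train_batch_start/..."
```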
What does this PR do?

This PR addresses issue #20807 by adding detailed documentation for the metrics logged by `DeviceStatsMonitor`. The key clarifications include:

- The sources of the metrics: CPU via `psutil`, CUDA GPU via `torch.cuda.memory_stats`, and other accelerators via `accelerator.get_device_stats()`.
- The naming convention for logged metrics: `DeviceStatsMonitor.{hook_name}/{base_metric_name}` (see the sketch after this description).
- A pointer to `torch.cuda.memory_stats()` for the full list of memory metrics.
- Updates to `profiler_basic.rst` to align with these clarifications and link to the API docs.

This documentation aims to help users understand what statistics to expect when using `DeviceStatsMonitor` with different hardware configurations.

Fixes #20807

No breaking changes are introduced by this documentation update.
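For illustration only (not part of the PR text), a tiny sketch of the documented key format; the hook and stat names are hypothetical examples:

```python
# Logged keys follow DeviceStatsMonitor.{hook_name}/{base_metric_name}.
hook_name = "on_train_batch_start"  # hook during which the stats were collected
base_metric_name = "cpu_percent"    # example stat name when psutil CPU stats are enabled
print(f"DeviceStatsMonitor.{hook_name}/{base_metric_name}")
# -> DeviceStatsMonitor.on_train_batch_start/cpu_percent
```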
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
📚 Documentation preview 📚: https://pytorch-lightning--20895.org.readthedocs.build/en/20895/