[BUG] Windows Server Unexpectedly Shuts Down When Using Nvitop to Monitor GPU Usage #136

Open
3 tasks done
NI-MingCheng opened this issue Oct 23, 2024 · 3 comments
Labels
bug Something isn't working

Comments

NI-MingCheng commented Oct 23, 2024

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.2

Operating system and version

Windows Server 2022 Datacenter

NVIDIA driver version

516.01

NVIDIA-SMI

Wed Oct 23 15:33:05 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.01       Driver Version: 516.01       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000   WDDM  | 00000000:02:00.0 Off |                  Off |
| 30%   38C    P8    17W / 230W |    464MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000   WDDM  | 00000000:21:00.0 Off |                  Off |
| 30%   34C    P8     6W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000   WDDM  | 00000000:49:00.0 Off |                  Off |
| 30%   30C    P8     5W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000   WDDM  | 00000000:4A:00.0 Off |                  Off |
| 30%   33C    P8     4W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9436    C+G   ...lPanel\SystemSettings.exe    N/A      |
|    0   N/A  N/A     12172    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     15244    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     17912    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     18180    C+G   ...y\ShellExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+

Python environment

python -m pip freeze
(base) C:\Users\Administrator>python -m pip freeze
absl-py==2.1.0
accelerate==0.24.1
aiofiles==23.2.0
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.1.2
anaconda-anon-usage @ file:///C:/b/abs_95v3x0wy8p/croot/anaconda-anon-usage_1697038984188/work
anaconda-client==1.12.0
anaconda-cloud-auth @ file:///C:/b/abs_410afndtyf/croot/anaconda-cloud-auth_1697462767853/work
anaconda-navigator @ file:///C:/b/abs_cfvv8k_j21/croot/anaconda-navigator_1704813334508/work
anaconda-project @ file:///C:/ci_311/anaconda-project_1676458365912/work
annotated-types==0.6.0
ansicon==1.89.0
antlr4-python3-runtime==4.9.3
anyio==3.7.1
archspec @ file:///croot/archspec_1709217642129/work
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==24.2.0
Babel==2.14.0
backports.functools-lru-cache @ file:///tmp/build/80754af9/backports.functools_lru_cache_1618170165463/work
backports.tempfile @ file:///home/linux1/recipes/ci/backports.tempfile_1610991236607/work
backports.weakref==1.0.post1
beautifulsoup4 @ file:///C:/b/abs_0agyz1wsr4/croot/beautifulsoup4-split_1681493048687/work
bleach==6.1.0
blessed==1.20.0
blinker==1.7.0
boltons @ file:///C:/ci_311/boltons_1677729932371/work
Brotli @ file:///C:/ci_311/brotli-split_1676435766766/work
cachetools==5.3.2
certifi @ file:///C:/b/abs_1fw_exq1si/croot/certifi_1725551736618/work/certifi
cffi @ file:///C:/b/abs_924gv1kxzj/croot/cffi_1700254355075/work
chardet @ file:///C:/ci_311/chardet_1676436134885/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click @ file:///C:/b/abs_f9ihnt72pu/croot/click_1698129847492/work
clip==0.2.0
clyent==1.2.1
colorama @ file:///C:/ci_311/colorama_1676422310965/work
coloredlogs==15.0.1
comm==0.2.1
conda @ file:///C:/b/abs_85jnuwc__u/croot/conda_1729193917673/work
conda-build @ file:///C:/b/abs_3ed9gavxgz/croot/conda-build_1708025907525/work
conda-content-trust @ file:///tmp/build/80754af9/conda-content-trust_1617045594566/work
conda-libmamba-solver @ file:///croot/conda-libmamba-solver_1727775630457/work/src
conda-pack @ file:///tmp/build/80754af9/conda-pack_1611163042455/work
conda-package-handling @ file:///C:/b/abs_b9wp3lr1gn/croot/conda-package-handling_1691008700066/work
conda-repo-cli==1.0.75
conda-token @ file:///Users/paulyim/miniconda3/envs/c3i/conda-bld/conda-token_1662660369760/work
conda-verify @ file:///D:/bld/conda-verify_1667049856137/work
conda_index @ file:///croot/conda-index_1706633791028/work
conda_package_streaming @ file:///C:/b/abs_6c28n38aaj/croot/conda-package-streaming_1690988019210/work
contourpy==1.2.0
cpm-kernels==1.0.11
cryptography @ file:///C:/b/abs_531eqmhgsd/croot/cryptography_1707523768330/work
cycler==0.12.1
ddddocr==1.5.5
debugpy==1.8.1
decorator==5.1.1
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
distro @ file:///C:/b/abs_a3uni_yez3/croot/distro_1701455052240/work
easydict==1.12
einops==0.7.0
executing==2.0.1
fastapi==0.104.1
fastjsonschema @ file:///C:/ci_311/python-fastjsonschema_1679500568724/work
ffmpy==0.3.1
filelock @ file:///C:/b/abs_f2gie28u58/croot/filelock_1700591233643/work
flatbuffers==24.3.25
fonttools==4.49.0
fqdn==1.5.1
frozendict @ file:///C:/b/abs_2alamqss6p/croot/frozendict_1713194885124/work
frozenlist==1.4.1
fsspec==2023.10.0
ftfy==6.1.3
future @ file:///C:/ci_311_rebuilds/future_1678998246262/work
gitdb==4.0.11
GitPython==3.1.40
gmpy2 @ file:///C:/ci_311/gmpy2_1677743390134/work
gpustat==1.1.1
gradio==3.39.0
gradio_client==0.7.0
grpcio==1.60.1
h11==0.14.0
httpcore==1.0.1
httpx==0.25.1
huggingface-hub==0.19.0
humanfriendly==10.0
idna @ file:///C:/ci_311/idna_1676424932545/work
importlib-metadata==6.8.0
ipykernel==6.29.2
ipython==8.21.0
ipywidgets==8.1.2
isoduration==20.11.0
jaraco.classes @ file:///tmp/build/80754af9/jaraco.classes_1620983179379/work
jedi==0.19.1
Jinja2 @ file:///C:/b/abs_f7x5a8op2h/croot/jinja2_1706733672594/work
jinxed==1.2.1
json5==0.9.14
jsonpatch @ file:///tmp/build/80754af9/jsonpatch_1615747632069/work
jsonpointer==2.1
jsonschema @ file:///C:/b/abs_d1c4sm8drk/croot/jsonschema_1699041668863/work
jsonschema-specifications @ file:///C:/b/abs_0brvm6vryw/croot/jsonschema-specifications_1699032417323/work
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core @ file:///C:/b/abs_c769pbqg9b/croot/jupyter_core_1698937367513/work
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.1.1
jupyterlab-language-pack-zh-CN==4.0.post3
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.3
jupyterlab_widgets==3.0.10
keyring @ file:///C:/b/abs_dbjc7g0dh2/croot/keyring_1678999228878/work
kiwisolver==1.4.5
latex2mathml==3.76.0
libarchive-c @ file:///tmp/build/80754af9/python-libarchive-c_1617780486945/work
libmambapy @ file:///C:/b/abs_2euls_1a38/croot/mamba-split_1704219444888/work/libmambapy
linkify-it-py==2.0.3
Markdown==3.7
markdown-it-py==2.2.0
MarkupSafe @ file:///C:/b/abs_ecfdqh67b_/croot/markupsafe_1704206030535/work
matplotlib==3.8.3
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.3
mdtex2html==1.2.0
mdurl==0.1.2
menuinst @ file:///C:/b/abs_099kybla52/croot/menuinst_1706732987063/work
mistune==3.0.2
mkl-fft @ file:///C:/b/abs_19i1y8ykas/croot/mkl_fft_1695058226480/work
mkl-random @ file:///C:/b/abs_edwkj1_o69/croot/mkl_random_1695059866750/work
mkl-service==2.4.0
more-itertools @ file:///C:/b/abs_36p38zj5jx/croot/more-itertools_1700662194485/work
mpmath @ file:///C:/b/abs_7833jrbiox/croot/mpmath_1690848321154/work
multidict==6.1.0
navigator-updater @ file:///C:/b/abs_895otdwmo9/croot/navigator-updater_1695210220239/work
nbclient==0.9.0
nbconvert==7.16.1
nbformat @ file:///C:/b/abs_5a2nea1iu2/croot/nbformat_1694616866197/work
nest-asyncio==1.6.0
networkx @ file:///C:/b/abs_e6gi1go5op/croot/networkx_1690562046966/work
notebook==7.1.0
notebook_shim==0.2.4
numpy @ file:///C:/b/abs_16b2j7ad8n/croot/numpy_and_numpy_base_1704311752418/work/dist/numpy-1.26.3-cp311-cp311-win_amd64.whl#sha256=5f2c4b54fd5d52b9fb18e32607c79b03cf14665cecce8a5a10e2950559df4651
nvidia-ml-py==12.535.161
nvitop==1.3.2
omegaconf==2.3.0
onnxruntime==1.19.2
opencv-python-headless==4.10.0.84
orjson==3.9.10
outcome==1.3.0.post0
overrides==7.7.0
packaging @ file:///C:/b/abs_28t5mcoltc/croot/packaging_1693575224052/work
pandas==2.0.3
pandocfilters==1.5.1
parso==0.8.3
pathlib @ file:///Users/ktietz/demo/mc3/conda-bld/pathlib_1629713961906/work
pillow @ file:///C:/b/abs_e22m71t0cb/croot/pillow_1707233126420/work
pkce @ file:///C:/b/abs_d0z4444tb0/croot/pkce_1690384879799/work
pkginfo @ file:///C:/b/abs_d18srtr68x/croot/pkginfo_1679431192239/work
platformdirs @ file:///C:/b/abs_b6z_yqw_ii/croot/platformdirs_1692205479426/work
pluggy @ file:///C:/ci_311/pluggy_1676422178143/work
ply==3.11
prettytable==3.9.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
propcache==0.2.0
protobuf==4.25.0
psutil @ file:///C:/ci_311_rebuilds/psutil_1679005906571/work
pure-eval==0.2.2
pyarrow==12.0.1
pycosat @ file:///C:/b/abs_31zywn1be3/croot/pycosat_1696537126223/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic @ file:///C:/b/abs_9byjrk31gl/croot/pydantic_1695798904828/work
pydantic_core==2.10.1
pydeck==0.8.1b0
pydub==0.25.1
Pygments==2.17.2
PyJWT @ file:///C:/ci_311/pyjwt_1676438890509/work
pyparsing==3.1.1
PyQt5==5.15.10
PyQt5-sip @ file:///C:/b/abs_c0pi2mimq3/croot/pyqt-split_1698769125270/work/pyqt_sip
pyreadline3==3.5.4
PySocks @ file:///C:/ci_311/pysocks_1676425991111/work
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-dotenv @ file:///C:/ci_311/python-dotenv_1676455170580/work
python-json-logger==2.0.7
python-multipart==0.0.6
pytz @ file:///C:/b/abs_19q3ljkez4/croot/pytz_1695131651401/work
pywin32==305.1
pywin32-ctypes @ file:///C:/ci_311/pywin32-ctypes_1676427747089/work
pywinpty==2.0.12
PyYAML @ file:///C:/b/abs_782o3mbw7z/croot/pyyaml_1698096085010/work
pyzmq==25.1.2
qtconsole==5.5.1
QtPy @ file:///C:/b/abs_derqu__3p8/croot/qtpy_1700144907661/work
referencing @ file:///C:/b/abs_09f4hj6adf/croot/referencing_1699012097448/work
regex==2023.12.25
requests @ file:///C:/b/abs_474vaa3x9e/croot/requests_1707355619957/work
requests-mock==1.12.1
requests-toolbelt @ file:///C:/b/abs_2fsmts66wp/croot/requests-toolbelt_1690874051210/work
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.6.0
rpds-py @ file:///C:/b/abs_76j4g4la23/croot/rpds-py_1698947348047/work
ruamel-yaml-conda @ file:///C:/ci_311/ruamel_yaml_1676455799258/work
ruamel.yaml @ file:///C:/ci_311/ruamel.yaml_1676439214109/work
safetensors==0.3.3
selenium==4.25.0
semantic-version==2.10.0
semver @ file:///tmp/build/80754af9/semver_1603822362442/work
Send2Trash==1.8.2
sentencepiece==0.1.99
sip @ file:///C:/b/abs_edevan3fce/croot/sip_1698675983372/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve @ file:///C:/b/abs_bbsvy9t4pl/croot/soupsieve_1696347611357/work
sse-starlette==1.6.5
stack-data==0.6.3
starlette==0.27.0
streamlit==1.28.1
sympy @ file:///C:/b/abs_82njkonm7f/croot/sympy_1701397685028/work
tenacity==8.2.3
tensorboard==2.16.2
tensorboard-data-server==0.7.2
termcolor==2.4.0
terminado==0.18.0
timm==0.9.12
tinycss2==1.2.1
tokenizers==0.13.3
toml==0.10.2
toolz==1.0.0
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
tornado @ file:///C:/b/abs_0cbrstidzg/croot/tornado_1696937003724/work
tqdm @ file:///C:/b/abs_f76j9hg7pv/croot/tqdm_1679561871187/work
traitlets @ file:///C:/ci_311/traitlets_1676423290727/work
transformers==4.30.2
trio==0.26.2
trio-websocket==0.11.1
truststore @ file:///C:/b/abs_55z7b3r045/croot/truststore_1695245455435/work
types-python-dateutil==2.8.19.20240106
typing_extensions==4.12.2
tzdata==2023.3
tzlocal==5.2
uc-micro-py==1.0.3
ujson @ file:///C:/ci_311/ujson_1676434714224/work
uri-template==1.3.0
urllib3 @ file:///C:/b/abs_4etpfrkumr/croot/urllib3_1707770616184/work
uvicorn==0.24.0.post1
validators==0.22.0
watchdog==3.0.0
wcwidth==0.2.13
webcolors==1.13
webdriver-manager==4.0.2
webencodings==0.5.1
websocket-client==1.8.0
websockets==11.0.3
Werkzeug==3.0.4
widgetsnbextension==4.0.10
win-inet-pton @ file:///C:/ci_311/win_inet_pton_1676425458225/work
windows-curses==2.3.3
wsproto==1.2.0
yarl==1.14.0
zipp @ file:///C:/b/abs_b0beoc27oa/croot/zipp_1704206963359/work
zstandard==0.19.0

Problem description

When nvitop is used to monitor GPU usage on Windows Server, the system shuts down unexpectedly. The cause appears to be a compatibility conflict between nvitop and Windows Server's hardware monitoring system. The System event log recorded the following Kernel-Power event (ID 41) for the restart:

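Log Name:      System
Source:        Microsoft-Windows-Kernel-Power
Date:          2024/10/23 15:18:48
Event ID:      41
Task Category: (63)
Level:         Critical
Keywords:      (70368744177664),(2)
User:          SYSTEM
Computer:      WIN-3I9RKHAQAH5
Description:
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Event Xml: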
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" Guid="{331c3b3a-2005-44c2-ac5e-77220c37d6b4}" />
    <EventID>41</EventID>
    <Version>8</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000400000000002</Keywords>
    <TimeCreated SystemTime="2024-10-23T07:18:48.0040790Z" />
    <EventRecordID>210129</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>WIN-3I9RKHAQAH5</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">80</Data>
    <Data Name="BugcheckParameter1">0xffffdc8d2fdfa000</Data>
    <Data Name="BugcheckParameter2">0x2</Data>
    <Data Name="BugcheckParameter3">0xfffff8029b2505e6</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">0</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
    <Data Name="BootAppStatus">0</Data>
    <Data Name="Checkpoint">0</Data>
    <Data Name="ConnectedStandbyInProgress">true</Data>
    <Data Name="SystemSleepTransitionsToOn">0</Data>
    <Data Name="CsEntryScenarioInstanceId">136</Data>
    <Data Name="BugcheckInfoFromEFI">false</Data>
    <Data Name="CheckpointStatus">0</Data>
    <Data Name="CsEntryScenarioInstanceIdV2">136</Data>
    <Data Name="LongPowerButtonPressDetected">false</Data>
  </EventData>
</Event>
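
For triage, note that Kernel-Power event 41 records `BugcheckCode` in decimal, while bug check references list stop codes in hexadecimal. The sketch below uses only the Python standard library to convert the value; `bugcheck_fields` is a hypothetical helper name, and the XML is a trimmed copy of the event above. Decimal 80 is 0x50, which Microsoft's bug check reference lists as PAGE_FAULT_IN_NONPAGED_AREA. This identifies the stop code only, not the driver or program that triggered it.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML (taken from the event above).
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event from this report; only the bug check fields are kept.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>41</EventID>
  </System>
  <EventData>
    <Data Name="BugcheckCode">80</Data>
    <Data Name="BugcheckParameter1">0xffffdc8d2fdfa000</Data>
    <Data Name="BugcheckParameter2">0x2</Data>
    <Data Name="BugcheckParameter3">0xfffff8029b2505e6</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
  </EventData>
</Event>"""


def bugcheck_fields(event_xml):
    """Return the stop code as a hex string plus the four bug check parameters."""
    root = ET.fromstring(event_xml)
    data = {d.get("Name"): d.text for d in root.findall("e:EventData/e:Data", NS)}
    code = int(data["BugcheckCode"])  # stored in decimal in event 41
    params = [data[f"BugcheckParameter{i}"] for i in range(1, 5)]
    return f"0x{code:08X}", params


code_hex, params = bugcheck_fields(EVENT_XML)
print(code_hex)  # prints 0x00000050
print(params)
```

The hexadecimal stop code and the four parameters are the values to look up in the bug check reference or to compare against a memory dump, if one was written.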

Steps to Reproduce

1. Start a deep learning training job on the GPUs.
2. Run nvitop to view GPU usage.
3. The system shuts down unexpectedly.

Traceback

None

Logs

None

Expected behavior

None

Additional context

None

NI-MingCheng added the bug label Oct 23, 2024
XuehaiPan (Owner) commented Oct 23, 2024

@NI-MingCheng Thanks for the report. I wonder whether the R515 driver works with CUDA 11.7 on Windows. The latest production driver I found for the RTX A5000 on Windows Server 2022 is the R550 driver: NVIDIA RTX Server Driver Release 550 R550 U10 (553.24) | Windows Server 2022.

It would be helpful if you could run the following Python code in a REPL (e.g. ipython or just type python in the terminal) manually:

>>> from nvitop import CudaDevice

>>> cuda0 = CudaDevice(0)
>>> print(cuda0.as_snapshot())

>>> cuda1 = CudaDevice(1)
>>> print(cuda1.as_snapshot())

>>> cuda2 = CudaDevice(2)
>>> print(cuda2.as_snapshot())

>>> cuda3 = CudaDevice(3)
>>> print(cuda3.as_snapshot())
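
If typing each call by hand is inconvenient, the same probe can be sketched as a loop. This is only an illustration: `probe_devices` is a hypothetical helper, not part of nvitop, and it takes the device class as an argument so the snippet itself has no nvitop dependency. It writes a line before every query, so if the machine goes down mid-run, the last line visible on the console may show which device was being queried.

```python
import sys


def probe_devices(make_device, indices, log=sys.stderr):
    """Snapshot each device in turn, logging before and after every call.

    make_device is any callable that returns an object with as_snapshot(),
    for example nvitop's CudaDevice (passed in by the caller).
    """
    results = {}
    for index in indices:
        # Flush before the query so the message is emitted even if the
        # query itself never returns.
        print(f"querying device {index} ...", file=log, flush=True)
        try:
            results[index] = make_device(index).as_snapshot()
        except Exception as exc:  # record the error and keep probing the rest
            results[index] = exc
        print(f"device {index}: {results[index]!r}", file=log, flush=True)
    return results
```

With nvitop installed, this would be called as `probe_devices(CudaDevice, range(4))` after `from nvitop import CudaDevice`, which issues the same four queries as the session above.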

NI-MingCheng (Author) commented Oct 23, 2024 via email

NI-MingCheng (Author) commented Oct 23, 2024 via email
