Skip pyNVML to support Tegra devices #402
base: branch-0.17
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-0.16 #402 +/- ##
===============================================
- Coverage 59.65% 56.99% -2.66%
===============================================
Files 17 19 +2
Lines 1331 1451 +120
===============================================
+ Hits 794 827 +33
- Misses 537 624 +87
Continue to review full report at Codecov.
Would this make sense to do generally, or have we found issues with this approach in some cases? Asking as we now know of two cases where NVML is not an option: Tegra and WSL.
Using Numba to do that implies creating a context, and that can be tricky in situations where we need context creation to be delayed. We should evaluate context creation on a case-by-case basis, as it has proven problematic in several instances in the past. TBH, I don't know if things are 100% correct for Tegra, but it's very difficult to evaluate much further without being able to test. Long story short: I hope this PR will provide some support for Tegra, but we can't make any guarantees without extensive testing; the same applies to WSL or any other platforms we may want to support in the future.
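The trade-off described above can be sketched as a small backend selector. This is a hypothetical helper (not part of dask-cuda): it prefers NVML because querying it creates no CUDA context, and only falls back to a Numba-style query when the side effect of context creation is acceptable.

```python
def choose_gpu_info_backend(nvml_available, context_creation_allowed):
    """Hypothetical helper: decide how to query device information.

    Prefer NVML, which needs no CUDA context. Fall back to a
    Numba-style backend (which creates a context as a side effect)
    only when that side effect is acceptable -- e.g. not during the
    early startup phase where context creation must be delayed.
    """
    if nvml_available:
        return "pynvml"
    if context_creation_allowed:
        return "numba"  # creating the context here may be too early
    raise RuntimeError(
        "No way to query device info without creating a CUDA context"
    )
```

On Tegra or WSL, `nvml_available` would be `False`, forcing the context-creating path, which is exactly the situation this PR has to accommodate.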
Won't it already be tricky to initialize the context? AIUI we wanted this for multi-GPU devices.
Fair enough. I think we have to rely on folks being pretty engaged to keep things working. This is also why I'm wondering if we can standardize on a common code path that we are able to test in other cases, at least.
In a discussion with @JasonAtNvidia and @jakirkham, we decided to move this to 0.17. One of the limitations we found is getting GPU information, such as memory, without NVML. That requires Numba, and since we replace its memory manager with RMM's, and RMM doesn't implement that functionality, we get errors:
============================= test session starts ==============================
platform linux -- Python 3.7.8, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /mnt/data/miniforge3/envs/dasktest/bin/python
cachedir: .pytest_cache
rootdir: /home/nvidia/Documents/dask-cuda
collecting ... collected 998 items / 1 error / 1 skipped / 996 selected
==================================== ERRORS ====================================
________________ ERROR collecting dask_cuda/tests/test_spill.py ________________
dask_cuda/tests/test_spill.py:18: in <module>
if utils.get_device_total_memory() < 1e10:
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/dask_cuda-0.16.0a0+95.g04dcbb6.dirty-py3.7.egg/dask_cuda/utils.py:175: in get_device_total_memory
return numba.cuda.current_context().get_memory_info()[1]
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py:1040: in get_memory_info
return self.memory_manager.get_memory_info()
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/rmm/rmm.py:214: in get_memory_info
raise NotImplementedError()
E NotImplementedError
=============================== warnings summary ===============================
../../../../mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/pandas/util/__init__.py:12
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/pandas/util/__init__.py:12: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing
dask_cuda/tests/test_local_cuda_cluster.py:76
/home/nvidia/Documents/dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py:76: PytestUnknownMarkWarning: Unknown pytest.mark.asyncio - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
@pytest.mark.asyncio
dask_cuda/tests/test_local_cuda_cluster.py:89
/home/nvidia/Documents/dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py:89: PytestUnknownMarkWarning: Unknown pytest.mark.asyncio - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
@pytest.mark.asyncio
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
ERROR dask_cuda/tests/test_spill.py - NotImplementedError
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=================== 1 skipped, 3 warnings, 1 error in 4.28s ====================
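The collection error above comes from RMM's plugged-in memory manager raising `NotImplementedError` for `get_memory_info()`. One way to sidestep it (a sketch of the failure pattern and a guarded fallback, not the PR's actual fix; the stand-in class and default value below are illustrative) is to catch the exception and return a conservative default:

```python
class RMMLikeManager:
    """Stand-in for RMM's memory manager, whose get_memory_info()
    raises NotImplementedError (as seen in the traceback above)."""

    def get_memory_info(self):
        raise NotImplementedError()


def get_device_total_memory(manager, default=8 * 2**30):
    """Return total device memory in bytes.

    Falls back to `default` when the active memory manager (e.g.
    RMM's, plugged into Numba) does not implement get_memory_info().
    """
    try:
        # Numba's context returns (free, total); index 1 is total.
        return manager.get_memory_info()[1]
    except NotImplementedError:
        return default
```

With a guard like this, test collection would proceed instead of aborting, at the cost of using an assumed default when the real total cannot be queried.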
Even single-GPU devices can benefit from these changes for Tegra clusters.
It seems we'll soon begin testing more with Tegra, so that will help us here. To be frank, I don't want to move away from NVML only because of Tegra: NVML is much better in various ways (e.g., no need to create a CUDA context) and supports things like gathering CPU affinity for each GPU that other tools don't, so I'd rather have two code paths.
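The CPU-affinity query mentioned above looks roughly like this with the pynvml bindings. This is an illustration, not production code: the function returns `None` whenever pynvml or the NVIDIA driver is unavailable (e.g., on Tegra, where NVML is absent), which is precisely why a second code path is needed there.

```python
def gpu_cpu_affinity(device_index=0, cpu_set_size=4):
    """Query the CPU-affinity bitmask for one GPU via NVML.

    Returns a list of 64-bit bitmask words describing which CPUs are
    close to the GPU, or None when pynvml or the NVIDIA driver is not
    available (e.g., Tegra or other platforms without NVML).
    """
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, cpu_set_size)
        pynvml.nvmlShutdown()
        return list(affinity)
    except Exception:
        # ImportError, NVMLError, etc. -- no NVML-backed answer here.
        return None
```

A Numba-based fallback cannot provide this information at all, which is one of the concrete reasons for keeping the NVML path as the default.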
It seems we'll have some Tegra devices in CI soon, we can then push this PR forward.
This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.
I believe we are still waiting on Tegra devices.
@mike-wendt @raydouglass based on the discussion from Ops Demo Tuesday, this is a PR we would like to test with Tegra devices when they are available in CI.
@pentschev @quasiben as an update @Ethyling and I are working on getting the L4T images integrated on Monday. Once we get that done we'll have a platform to easily test these builds for CUDA 10.2 as that's the only current CUDA version supported on L4T/AGX. |
Sounds great, thanks @mike-wendt and let us know if/how we can assist. |
Fixes #400