Skip pyNVML to support Tegra devices #402
base: branch-0.17
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-0.16 #402 +/- ##
===============================================
- Coverage 59.65% 56.99% -2.66%
===============================================
Files 17 19 +2
Lines 1331 1451 +120
===============================================
+ Hits 794 827 +33
- Misses 537 624 +87
Continue to review full report at Codecov.
Would this make sense to do generally, or have we found issues with this approach in some cases? Asking as we now know of two cases where NVML is not an option: Tegra and WSL.
Using Numba to do that implies creating a context, and that can be tricky in situations where we need context creation to be delayed. We should evaluate context creation on a case-by-case basis, as it has proven problematic in several instances in the past. TBH, I don't know if things are 100% correct for Tegra, but it's very difficult to evaluate much further without being able to test. Long story short: I hope this PR will provide some support for Tegra, but we can't make any guarantees without extensive testing; the same applies to WSL or any other platforms we may want to support in the future.
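The trade-off described above can be sketched as a small backend selector. This is a hypothetical helper (not part of dask-cuda): it prefers NVML because querying it creates no CUDA context, and only falls back to a Numba-style query when the side effect of context creation is acceptable.

```python
def choose_gpu_info_backend(nvml_available, context_creation_allowed):
    """Hypothetical helper: decide how to query device information.

    Prefer NVML, which needs no CUDA context. Fall back to a
    Numba-style backend (which creates a context as a side effect)
    only when that side effect is acceptable -- e.g. not during the
    early startup phase where context creation must be delayed.
    """
    if nvml_available:
        return "pynvml"
    if context_creation_allowed:
        return "numba"  # creating the context here may be too early
    raise RuntimeError(
        "No way to query device info without creating a CUDA context"
    )
```

On Tegra or WSL, `nvml_available` would be `False`, forcing the context-creating path, which is exactly the situation this PR has to accommodate.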
Won't it already be tricky to initialize the context? AIUI we wanted this for multi-GPU devices.
Fair enough. I think we have to rely on folks being pretty engaged to keep things working. This is also why I'm wondering if we can standardize on a common code path that we are able to test in other cases, at least.
In a discussion with @JasonAtNvidia and @jakirkham, we decided to move this to 0.17. One of the limitations we found is getting GPU information, such as memory, without NVML. That requires Numba, and since we replace its memory manager with RMM's, and RMM doesn't implement that functionality, we get errors:
============================= test session starts ==============================
platform linux -- Python 3.7.8, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /mnt/data/miniforge3/envs/dasktest/bin/python
cachedir: .pytest_cache
rootdir: /home/nvidia/Documents/dask-cuda
collecting ... collected 998 items / 1 error / 1 skipped / 996 selected
==================================== ERRORS ====================================
________________ ERROR collecting dask_cuda/tests/test_spill.py ________________
dask_cuda/tests/test_spill.py:18: in <module>
if utils.get_device_total_memory() < 1e10:
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/dask_cuda-0.16.0a0+95.g04dcbb6.dirty-py3.7.egg/dask_cuda/utils.py:175: in get_device_total_memory
return numba.cuda.current_context().get_memory_info()[1]
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py:1040: in get_memory_info
return self.memory_manager.get_memory_info()
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/rmm/rmm.py:214: in get_memory_info
raise NotImplementedError()
E NotImplementedError
=============================== warnings summary ===============================
../../../../mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/pandas/util/__init__.py:12
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/pandas/util/__init__.py:12: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing
dask_cuda/tests/test_local_cuda_cluster.py:76
/home/nvidia/Documents/dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py:76: PytestUnknownMarkWarning: Unknown pytest.mark.asyncio - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
@pytest.mark.asyncio
dask_cuda/tests/test_local_cuda_cluster.py:89
/home/nvidia/Documents/dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py:89: PytestUnknownMarkWarning: Unknown pytest.mark.asyncio - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
@pytest.mark.asyncio
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
ERROR dask_cuda/tests/test_spill.py - NotImplementedError
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=================== 1 skipped, 3 warnings, 1 error in 4.28s ====================
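The collection error above comes from RMM's plugged-in memory manager raising `NotImplementedError` for `get_memory_info()`. One way to sidestep it (a sketch of the failure pattern and a guarded fallback, not the PR's actual fix; the stand-in class and default value below are illustrative) is to catch the exception and return a conservative default:

```python
class RMMLikeManager:
    """Stand-in for RMM's memory manager, whose get_memory_info()
    raises NotImplementedError (as seen in the traceback above)."""

    def get_memory_info(self):
        raise NotImplementedError()


def get_device_total_memory(manager, default=8 * 2**30):
    """Return total device memory in bytes.

    Falls back to `default` when the active memory manager (e.g.
    RMM's, plugged into Numba) does not implement get_memory_info().
    """
    try:
        # Numba's context returns (free, total); index 1 is total.
        return manager.get_memory_info()[1]
    except NotImplementedError:
        return default
```

With a guard like this, test collection would proceed instead of aborting, at the cost of using an assumed default when the real total cannot be queried.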
Even single-GPU devices can benefit from these changes for Tegra clusters.
It seems we'll soon begin testing more with Tegra, so that will help us here. To be frank, I don't want to move away from NVML only because of Tegra: NVML is much better in various ways (e.g., no need to create a CUDA context) and supports things like gathering CPU affinity for each GPU that other tools don't, so I'd rather have two code paths.
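The CPU-affinity query mentioned above looks roughly like this with the pynvml bindings. This is an illustration, not production code: the function returns `None` whenever pynvml or the NVIDIA driver is unavailable (e.g., on Tegra, where NVML is absent), which is precisely why a second code path is needed there.

```python
def gpu_cpu_affinity(device_index=0, cpu_set_size=4):
    """Query the CPU-affinity bitmask for one GPU via NVML.

    Returns a list of 64-bit bitmask words describing which CPUs are
    close to the GPU, or None when pynvml or the NVIDIA driver is not
    available (e.g., Tegra or other platforms without NVML).
    """
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, cpu_set_size)
        pynvml.nvmlShutdown()
        return list(affinity)
    except Exception:
        # ImportError, NVMLError, etc. -- no NVML-backed answer here.
        return None
```

A Numba-based fallback cannot provide this information at all, which is one of the concrete reasons for keeping the NVML path as the default.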
It seems we'll have some Tegra devices in CI soon, we can then push this PR forward.
This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.
I believe we are still waiting on Tegra devices.
@mike-wendt @raydouglass based on the discussion from Ops Demo Tuesday, this is a PR we would like to test with Tegra devices when they are available in CI.
@pentschev @quasiben as an update @Ethyling and I are working on getting the L4T images integrated on Monday. Once we get that done we'll have a platform to easily test these builds for CUDA 10.2 as that's the only current CUDA version supported on L4T/AGX. |
Sounds great, thanks @mike-wendt and let us know if/how we can assist. |
Fixes #400