
Fix Python Cuda loading issues #7939

Merged 16 commits into master on Jun 25, 2021

Conversation

@RyanUnderhill (Member) commented Jun 4, 2021

Description: Previously, if built with CUDA enabled, the Python bindings required the CUDA provider to load successfully. Now that the provider is a shared library, a failure to load it no longer needs to be fatal.

This removes a bunch of now-unnecessary code from _pybind_state.py that our internal shared-provider DLL loading code already handles. It also enables running the GPU package on a CPU-only machine. If CUDA_PATH is set and CUDA cannot be loaded, an error is generated, since in that case the user is likely expecting CUDA to be used.

I changed GetProviderInfo_CUDA to return a reference (which makes it clearer that it can't fail; instead it throws an exception, as it already did), and added TryGetProviderInfo_CUDA, which returns nullptr if CUDA is not available. Python now uses TryGetProviderInfo_CUDA.
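
A minimal sketch of the user-visible behaviour this aims for (the model path is illustrative; the package, provider names, and calls are the existing Python binding ones):

    import onnxruntime as ort  # with this change, import succeeds even if CUDA can't be loaded

    # On a CPU-only machine with CUDA_PATH unset, session creation quietly falls back to
    # the CPU provider; if CUDA_PATH is set but the CUDA provider can't be loaded, session
    # creation fails instead, since the user clearly expected CUDA to be used.
    sess = ort.InferenceSession("model.onnx",
                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    print(sess.get_providers())  # lists the providers that were actually registered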

@RyanUnderhill RyanUnderhill requested a review from a team as a code owner June 4, 2021 01:13
@pranavsharma (Contributor) left a comment

Is it possible to add an automated test? For example, installing the onnxruntime-gpu package on a CPU-only machine and running inference on a model with it?

return info->GetCurrentGpuDeviceId(device_id);
return CreateStatus(ORT_FAIL, "CUDA execution provider is not enabled.");
return CreateStatus(ORT_FAIL, "CUDA execution provider is not available.");
Contributor

Suggested change
return CreateStatus(ORT_FAIL, "CUDA execution provider is not available.");
return CreateStatus(ORT_FAIL, "CUDA execution provider is either not enabled or not available.");

Member Author

Changed, though I could probably make it conditional on #if USE_CUDA to say whether we simply weren't built with it at all (and similarly for other providers).

return nullptr;
}

ProviderInfo_CUDA& GetProviderInfo_CUDA() {
Contributor

Is there a better error message we can give to users? Even if the user enables the CUDA provider, this might fail, and the error message doesn't make it clear why. The user cannot take any action other than filing a GitHub issue, which increases support costs.

Member

Second this. We need to somehow propagate the actual error.

Member Author

I worked on improving the error messages. Previously, it showed this on failure:

"2021-06-09 18:24:21.8575539 [E:onnxruntime:Default, provider_bridge_ort.cc:1000 onnxruntime::ProviderLibrary::Get] Failed to load library, error code: 126"

With these changes, now it will show this:

2021-06-09 18:54:17.7950730 [E:onnxruntime:Default, provider_bridge_ort.cc:1000 onnxruntime::ProviderLibrary::Get] LoadLibrary failed with error 126 "The specified module could not be found." when trying to load "C:\Code\Github\onnxruntime\build\Windows\Debug\Debug\onnxruntime\capi\onnxruntime_providers_cuda.dll"

So now, on Windows, it translates the error code into text instead of just showing a number.
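
As an aside, the same Win32 error-code-to-text translation can be reproduced from Python for quick debugging (a Windows-only ctypes call; shown purely as an illustration, not part of this PR):

    import ctypes

    # Translate a Win32 error code into the human-readable text now included in the log.
    # Error 126 is the "missing DLL" case shown above.
    print(ctypes.FormatError(126))  # "The specified module could not be found."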

}
else
{
LOGS_DEFAULT(INFO) << "CUDA execution provider is not available.";
Member

It's too dangerous to log this at INFO level; it should be WARNING.
A common scenario is that the user thinks they are running with CUDA enabled, but underneath it's failing silently and everything is running on the CPU.

Contributor

When the GPU package is installed on a CPU-only machine, logging at WARNING might wrongly indicate that something is wrong.

Member Author

I didn't want it to be a warning, since a CPU-only user wouldn't expect a warning just because CUDA isn't on their system, right? The goal here is to have one version for both CPU and CUDA, so I wouldn't expect a warning.

But I do think it's a problem that Python users can't explicitly say they expect the CUDA provider to be used.

Member

When the GPU package is installed on a CPU-only machine, logging at WARNING might wrongly indicate that something is wrong.

But the common case is that the user has a GPU; that is why they are installing the GPU package in the first place.
So it's better to warn that it's not working than to silently execute on the CPU.
Nobody uses the INFO level by default, so it would be a silent failure, which isn't what we want.

@jywu-msft (Member) Jun 4, 2021

Put another way, we shouldn't be optimizing for the case where the GPU package is installed on a CPU-only machine.
As long as we continue to publish both CPU and GPU packages, we have to assume that the common case for someone installing the GPU package is that they have a GPU on their system.
Once we have a single CPU/GPU package, then I agree that WARNING would be inappropriate.

Member

I didn't want it to be a warning, since a CPU-only user wouldn't expect a warning just because CUDA isn't on their system, right? The goal here is to have one version for both CPU and CUDA, so I wouldn't expect a warning.

But I do think it's a problem that Python users can't explicitly say they expect the CUDA provider to be used.

Yes, the goal is to have a single package, but we don't have a single package currently, right?

Member Author

Makes sense. I'll make sure it reports the same information the C API provides, and keep it as a warning. Then we can change it later if we ever remove the CPU-only package.

Member Author

I removed this block since it's redundant: the C API already logs an error, and this just logged an extra one.

Contributor

Are you referring to this one?

LOGS_DEFAULT(ERROR) << error.ErrorMessage();

@snnn (Member) commented Jun 4, 2021

I think we need to finish #7854 first. Otherwise it is still not tested.

@snnn (Member) commented Jun 17, 2021

To test it on a CPU-only machine: I suggest opening https://github.com/microsoft/onnxruntime/blob/master/tools/ci_build/github/azure-pipelines/linux-gpu-ci-pipeline.yml and duplicating the "Linux_Test" job. Please give the new job a different name and change the machine pool from "centos7gpu" to "centos7cpu". The centos7cpu machine pool has CUDA installed but doesn't have GPU cards.

Once that is done, I will give you a new machine pool without CUDA and GPUs.

f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
f"in your CUDNN installation if necessary.")
print(f"cuDNN {version_info.cudnn_version} not installed in {cudnn_bin_dir}. "
f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
Contributor

What is the user expected to do with this print statement if cuDNN is obtained from the PATH?

Contributor

I don't think this alone will fix #7965. For Python versions >= 3.8, PATH won't be consulted to find the dependent DLLs.

Member Author

The user is doing something that isn't recommended; I don't think we can help them in that case.
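
(For context on the Python >= 3.8 point above: on Windows, the interpreter no longer searches PATH for a module's dependent DLLs, so CUDA/cuDNN in a non-default location would have to be registered explicitly. A rough sketch, not part of this PR; the path handling is illustrative:)

    import os

    # Python 3.8+ on Windows no longer searches PATH for dependent DLLs, so a
    # CUDA/cuDNN install outside the default loader search order must be added
    # explicitly before the CUDA provider library is loaded.
    cuda_path = os.environ.get("CUDA_PATH")
    if cuda_path and hasattr(os, "add_dll_directory"):
        os.add_dll_directory(os.path.join(cuda_path, "bin"))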

--enable_onnx_tests --use_cuda --cuda_version=11.1 --cuda_home=/usr/local/cuda-11.1 --cudnn_home=/usr/local/cuda-11.1 \
--enable_pybind --build_java --ctest_path ''

- job: Linux_Test
Member

Indentation matters.

You may try manually queuing a build for this pipeline; I'm afraid this YAML change doesn't work.

}
else
{
if(!Env::Default().GetEnvironmentVar("CUDA_PATH").empty()) {
Contributor

Can you please add a note here saying why we're checking for this? Something like: we want to allow the flexibility of using the GPU package on a CPU-only machine, unless the user intended to run inference on the GPU.

Member

Would it be nice to add something else to the message for the user to check? Like making sure they have all the correct-version prerequisites and that they are on the user's PATH?

Member Author

Good idea, error message updated.

@pranavsharma (Contributor)

LGTM 👍 I'll let @snnn verify the pipeline changes.

pranavsharma previously approved these changes Jun 22, 2021
@snnn (Member) commented Jun 23, 2021

Linux GPU CI Pipeline doesn't work yet.

@ivanst0 (Member) commented Jun 23, 2021

@jywu-msft, @pranavsharma, @RyanUnderhill, @snnn
I reviewed the entire discussion under this PR and I still don't understand why this change is needed (especially the changes on the Python side). Is there a bug being fixed? Is there a new scenario we want to enable?

Specifically, why would we allow a successful import of the onnxruntime-gpu package if CUDA/cuDNN can't be found?

One of the most common pain points with using the onnxruntime-gpu package was a cryptic error message when any of the required DLLs is not found (bug #6435). PR #6436 added (among other things) more informative error messages, but those seem to be removed here.

@snnn (Member) commented Jun 23, 2021

@ivanst0 We want to make the onnxruntime-gpu package work on CPU-only devices, so that most users only need one package: "onnxruntime-gpu". The "onnxruntime" package is then for those who are concerned about package size.

@ivanst0 (Member) commented Jun 23, 2021

I see, thanks for the context. Is the intention to automatically fall back to the CPU version when loading any of the dependent DLLs fails? How will users debug such failures?

My personal take is that in any serious application the DNN model is of fixed size and is either (a) small enough to be evaluated on CPU without problems, or (b) large enough to require GPU/CUDA (i.e., too slow to be usable on CPU). In this context I wouldn't find the automatic fallback useful, especially if it makes it harder to troubleshoot the common errors with loading CUDA/cuDNN. This is based just on the use cases I have observed. I assume this change was driven by user feedback/requests (although I wasn't able to find the corresponding GitHub issue).

@pranavsharma (Contributor) commented Jun 23, 2021

I see, thanks for the context. Is the intention to automatically fall back to the CPU version when loading any of the dependent DLLs fails? How will users debug such failures?

My personal take is that in any serious application the DNN model is of fixed size and is either (a) small enough to be evaluated on CPU without problems, or (b) large enough to require GPU/CUDA (i.e., too slow to be usable on CPU). In this context I wouldn't find the automatic fallback useful, especially if it makes it harder to troubleshoot the common errors with loading CUDA/cuDNN. This is based just on the use cases I have observed. I assume this change was driven by user feedback/requests (although I wasn't able to find the corresponding GitHub issue).

We should not fall back to CPU automatically. That's why we're checking for the presence of the CUDA_PATH environment variable and throwing an exception only if it is populated and the dependent libraries are not found. A user will install CUDA (and hence have that environment variable populated) only on a GPU machine where they intend to run the model.
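
As a rough Python-level sketch of that policy (the real check lives in the C++ bindings; try_load_cuda is a hypothetical stand-in for the shared-provider load, used only for illustration):

    import os

    def register_cuda_or_fallback(try_load_cuda):
        """Fail loudly only when the user clearly expected CUDA to be used."""
        if try_load_cuda():  # hypothetical: attempt to load onnxruntime_providers_cuda
            return "CUDAExecutionProvider"
        if os.environ.get("CUDA_PATH"):
            # CUDA_PATH is set, so the user installed CUDA and expects it to be used.
            raise RuntimeError("CUDA_PATH is set but CUDA wasn't able to be loaded")
        return "CPUExecutionProvider"  # CPU-only machine: intentional, quiet fallback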

@ivanst0 (Member) commented Jun 24, 2021

Is import onnxruntime supposed to load the CUDA DLLs, or is this done only later (on demand)? Which call actually triggers the loading of the CUDA/cuDNN DLLs?

@RyanUnderhill (Member Author) commented Jun 25, 2021

Is import onnxruntime supposed to load the CUDA DLLs, or is this done only later (on demand)? Which call actually triggers the loading of the CUDA/cuDNN DLLs?

Calling onnxruntime.InferenceSession(...) is what causes the CUDA DLLs to be loaded.

Edit: As an example, here's what happens if you run it and CUDA can't be loaded:

RuntimeError: D:\a_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:497 onnxruntime::python::RegisterExecutionProviders CUDA_PATH is set but CUDA wasn't able to be loaded

@RyanUnderhill changed the title from "Allow Python to run even if Cuda provider can't be loaded" to "Fix Python Cuda loading issues" on Jun 25, 2021
@ivanst0 (Member) commented Jun 25, 2021

Thanks for the explanation, Ryan!

I still find the new version to be a regression compared to the previous state.

  1. The previous code successfully handled situations where multiple versions of CUDA are installed on the machine, even if CUDA_PATH points to the "wrong" version of CUDA.

  2. The previous error message was more informative and user-friendly:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\lib\site-packages\onnxruntime\__init__.py", line 24, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \
  File "D:\Python\lib\site-packages\onnxruntime\capi\_pybind_state.py", line 43, in <module>
    raise ImportError(f"cuDNN {version_info.cudnn_version} not installed in {cudnn_bin_dir}.")
ImportError: cuDNN 8 not installed in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin.

vs.

2021-06-25 08:50:21.8704770 [E:onnxruntime:Default, provider_bridge_ort.cc:997 onnxruntime::ProviderLibrary::Get] LoadLibrary failed with error 126 "The specified module could not be found." when trying to load "D:\Python\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 283, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\Python\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 321, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: D:\git\onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc:497 onnxruntime::python::RegisterExecutionProviders CUDA_PATH is set but CUDA wasn't able to be loaded. Please install the correct version of CUDA and cuDNN as mentioned in the GPU requirements page, make sure they're in the PATH, and that your GPU is supported.

(1) could potentially be addressed by prepending PATH with the correct CUDA path before loading the CUDA EP.
(2) could potentially be addressed by examining the current state before trying to load the CUDA EP and presenting the user with a more specific error message if the load is going to fail.

@RyanUnderhill (Member Author)

@ivanst0 Since the shared-library version calls LoadLibraryEx directly and doesn't use CUDA_PATH to determine where to load the libraries from, if the correct version is anywhere on the PATH we will load it. That code was only there to work around a Python restriction that our internal shared-library loading bypasses (which is why we wanted to remove it).

I agree the error messages are not perfect, but at least they will no longer cause failures of their own (the previous code would abort even when the load might have succeeded had we tried it, which is bad).

We want to improve the error messages for all languages as time goes on, based on what errors people typically run into. So this is not over yet...

@RyanUnderhill RyanUnderhill merged commit 49938cc into master Jun 25, 2021
@RyanUnderhill RyanUnderhill deleted the ryanunderhill/python_optional_cuda branch June 25, 2021 09:26
@snnn (Member) commented Jun 25, 2021

@RyanUnderhill will you add some tests for it?

harshithapv pushed a commit that referenced this pull request Jun 25, 2021
harshithapv added a commit that referenced this pull request Jun 26, 2021
* fix boost download url (#7843)

* Topo sort the model before saving (#7913)

* checkin toposort

* review comments

* revert and add TODO

* Add shape inference to custom symbolic functions (#7937)

**Description**: As title.

**Motivation and Context**
- PyTorch ONNX exporter heavily depends on ONNX shape inference to export accurate and efficient model. Custom symbolic function exports the op as contrib ops, thus exporter is unable to perform standard onnx shape inference. Models with dynamic shape inputs are affected.

* Fix missing files on linux (#8066)

* [Mobile package] Update required operator config with additional ops for wav2vec2. (#8079)

Add some additional ops to the mobile package that are needed for the wav2vec2 model.

* Add module attribute to ORTModule to support HuggingFace Trainer save_model (#8088)

* Fix input schema extrator for ORTModule (#8098)

* Fix 32bit Android java API crash (#8122)

* Fix 32bit Android java API crash

* fix code formating

* [Mobile package] Update required operator config with additional ops for newer version of Wav2Vec 2. (#8123)

This is an update to #8079
The sample application motivating the original update changed to use an updated version of the model. Now, fewer ops are required. This change removes the previously added ops which are no longer needed.

* Add int64 as a required type to ConstantOfShape as it's used by the pytorch converter for Pad. (#8128)

It's also used pointlessly for torch.tensor.repeat (although that usage should always be able to be constant folded).

* Update logic in props.xml to account for shared provider library changes (#8138)

* Ortmodule override torch.manual_seed() (#8131)

* Ortmodule override torch.manual_seed()

* Fix Python Cuda loading issues (#7939)

* Fix mac shared_provider warning (#8153)

Co-authored-by: Guoyu Wang <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Bowen Bao <[email protected]>
Co-authored-by: Ryan Hill <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: baijumeswani <[email protected]>
Co-authored-by: Thiago Crepaldi <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Sherlock <[email protected]>