
Fix Python Cuda loading issues #7939

Merged 16 commits into master on Jun 25, 2021

Conversation

@RyanUnderhill (Member) commented Jun 4, 2021

Description: Previously, if built with CUDA enabled, the Python bindings required the CUDA provider to load successfully. Now that the provider is a shared library, a failure to load it no longer needs to be fatal.

This removes a bunch of now-unnecessary code from _pybind_state.py that our internal shared-provider DLL loading code already handles. It also enables running the GPU package on a CPU-only machine. If CUDA_PATH is set and CUDA cannot be loaded, an error is generated, since in that case the user is likely expecting CUDA to be used.

I changed GetProviderInfo_CUDA to return a reference (which makes it clearer that it can't fail; instead it throws an exception, as it already did), and added TryGetProviderInfo_CUDA, which returns nullptr if CUDA is not available. Python now uses TryGetProviderInfo_CUDA.
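
A minimal sketch of the user-visible behaviour this aims for (the model path is illustrative; the package, provider names, and calls are the existing Python binding ones):

    import onnxruntime as ort  # with this change, import succeeds even if CUDA can't be loaded

    # On a CPU-only machine with CUDA_PATH unset, session creation quietly falls back to
    # the CPU provider; if CUDA_PATH is set but the CUDA provider can't be loaded, session
    # creation fails instead, since the user clearly expected CUDA to be used.
    sess = ort.InferenceSession("model.onnx",
                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    print(sess.get_providers())  # lists the providers that were actually registered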

@RyanUnderhill RyanUnderhill requested a review from a team as a code owner June 4, 2021 01:13
@pranavsharma (Contributor) left a comment

Is it possible to add an automated test? For example, installing the onnxruntime-gpu package on a CPU-only machine and running inference on a model with it?

return info->GetCurrentGpuDeviceId(device_id);
return CreateStatus(ORT_FAIL, "CUDA execution provider is not enabled.");
return CreateStatus(ORT_FAIL, "CUDA execution provider is not available.");
Contributor

Suggested change
return CreateStatus(ORT_FAIL, "CUDA execution provider is not available.");
return CreateStatus(ORT_FAIL, "CUDA execution provider is either not enabled or not available.");

Member Author

Changed, though I could probably make it conditional on #if USE_CUDA to say whether we simply weren't built with it at all (and similarly for other providers).

return nullptr;
}

ProviderInfo_CUDA& GetProviderInfo_CUDA() {
Contributor

Is there a better error message we can give to users? Even if the user enables the CUDA provider, this might fail, and the error message doesn't make it clear why. The user cannot take any action other than filing a GitHub issue, which increases support costs.

Member

Second this. We need to somehow propagate the actual error.

Member Author

I worked on improving the error messages. Previously, it showed this on failure:

"2021-06-09 18:24:21.8575539 [E:onnxruntime:Default, provider_bridge_ort.cc:1000 onnxruntime::ProviderLibrary::Get] Failed to load library, error code: 126"

With these changes, now it will show this:

2021-06-09 18:54:17.7950730 [E:onnxruntime:Default, provider_bridge_ort.cc:1000 onnxruntime::ProviderLibrary::Get] LoadLibrary failed with error 126 "The specified module could not be found." when trying to load "C:\Code\Github\onnxruntime\build\Windows\Debug\Debug\onnxruntime\capi\onnxruntime_providers_cuda.dll"

So now, on Windows, it translates the error code into text instead of just showing a number.
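
As an aside, the same Win32 error-code-to-text translation can be reproduced from Python for quick debugging (a Windows-only ctypes call; shown purely as an illustration, not part of this PR):

    import ctypes

    # Translate a Win32 error code into the human-readable text now included in the log.
    # Error 126 is the "missing DLL" case shown above.
    print(ctypes.FormatError(126))  # "The specified module could not be found."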

}
else
{
LOGS_DEFAULT(INFO) << "CUDA execution provider is not available.";
Member

It's too dangerous to log this at INFO level; it should be WARNING.
A common scenario is that the user thinks they are running with CUDA enabled, but underneath it's failing silently and everything is running on the CPU.

Contributor

When the GPU package is installed on a CPU-only machine, logging at WARNING might wrongly indicate that something is wrong.

Member Author

I didn't want it to be a warning, since a CPU-only user wouldn't expect a warning just because CUDA isn't on their system, right? The goal here is to have one version for both CPU and CUDA, so I wouldn't expect a warning.

But I do think it's a problem that Python users can't explicitly say they expect the CUDA provider to be used.

Member

When the GPU package is installed on a CPU-only machine, logging at WARNING might wrongly indicate that something is wrong.

But the common case is that the user has a GPU; that is why they are installing the GPU package in the first place.
So it's better to warn that it's not working than to silently execute on the CPU.
Nobody uses the INFO level by default, so it would be a silent failure, which isn't what we want.

@jywu-msft (Member) Jun 4, 2021

Put another way, we shouldn't be optimizing for the case where the GPU package is installed on a CPU-only machine.
As long as we continue to publish both CPU and GPU packages, we have to assume that the common case for someone installing the GPU package is that they have a GPU on their system.
Once we have a single CPU/GPU package, then I agree that WARNING would be inappropriate.

Member

I didn't want it to be a warning, since a CPU-only user wouldn't expect a warning just because CUDA isn't on their system, right? The goal here is to have one version for both CPU and CUDA, so I wouldn't expect a warning.

But I do think it's a problem that Python users can't explicitly say they expect the CUDA provider to be used.

Yes, the goal is to have a single package, but we don't have a single package currently, right?

Member Author

Makes sense. I'll make sure it reports the same information the C API provides, and keep it as a warning. Then we can change it later if we ever remove the CPU-only package.

Member Author

I removed this block since it's redundant: the C API already logs an error, and this just logged an extra one.

Contributor

Are you referring to this one?

LOGS_DEFAULT(ERROR) << error.ErrorMessage();

@snnn (Member) commented Jun 4, 2021

I think we need to finish #7854 first. Otherwise it is still not tested.

@snnn (Member) commented Jun 17, 2021

To test it on a CPU-only machine: I suggest opening https://github.com/microsoft/onnxruntime/blob/master/tools/ci_build/github/azure-pipelines/linux-gpu-ci-pipeline.yml and duplicating the "Linux_Test" job. Please give the new job a different name and change the machine pool from "centos7gpu" to "centos7cpu". The centos7cpu machine pool has CUDA installed but doesn't have GPU cards.

Once that is done, I will give you a new machine pool without CUDA and GPUs.

f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
f"in your CUDNN installation if necessary.")
print(f"cuDNN {version_info.cudnn_version} not installed in {cudnn_bin_dir}. "
f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
Contributor

What is the user expected to do with this print statement if cuDNN is obtained from the PATH?

Contributor

I don't think this alone will fix #7965. For Python versions >= 3.8, PATH won't be consulted to find the dependent DLLs.

Member Author

The user is doing something that isn't recommended; I don't think we can help them in that case.
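
(For context on the Python >= 3.8 point above: on Windows, the interpreter no longer searches PATH for a module's dependent DLLs, so CUDA/cuDNN in a non-default location would have to be registered explicitly. A rough sketch, not part of this PR; the path handling is illustrative:)

    import os

    # Python 3.8+ on Windows no longer searches PATH for dependent DLLs, so a
    # CUDA/cuDNN install outside the default loader search order must be added
    # explicitly before the CUDA provider library is loaded.
    cuda_path = os.environ.get("CUDA_PATH")
    if cuda_path and hasattr(os, "add_dll_directory"):
        os.add_dll_directory(os.path.join(cuda_path, "bin"))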

--enable_onnx_tests --use_cuda --cuda_version=11.1 --cuda_home=/usr/local/cuda-11.1 --cudnn_home=/usr/local/cuda-11.1 \
--enable_pybind --build_java --ctest_path ''

- job: Linux_Test
Member

Indentation matters.

You may try manually queuing a build for this pipeline; I'm afraid this YAML change doesn't work.

}
else
{
if(!Env::Default().GetEnvironmentVar("CUDA_PATH").empty()) {
Contributor

Can you please add a note here saying why we're checking for this? Something like: we want to allow the flexibility of using the GPU package on a CPU-only machine, unless the user intended to run inference on the GPU.

Member

Would it be nice to add something else to the message for the user to check? Like making sure they have all the correct-version prerequisites and that they are on the user's PATH?

Member Author

Good idea, error message updated.

@pranavsharma (Contributor)

LGTM 👍 I'll let @snnn verify the pipeline changes.

pranavsharma previously approved these changes Jun 22, 2021
@snnn (Member) commented Jun 23, 2021

Linux GPU CI Pipeline doesn't work yet.

@ivanst0 (Member) commented Jun 23, 2021

@jywu-msft, @pranavsharma, @RyanUnderhill, @snnn
I reviewed the entire discussion under this PR and I still don't understand why this change is needed (especially the changes on the Python side). Is there a bug being fixed? Is there a new scenario we want to enable?

Specifically, why would we allow a successful import of the onnxruntime-gpu package if CUDA/cuDNN can't be found?

One of the most common pain points with using the onnxruntime-gpu package was a cryptic error message when any of the required DLLs is not found (bug #6435). PR #6436 added (among other things) more informative error messages, but those seem to be removed here.

@snnn (Member) commented Jun 23, 2021

@ivanst0 We want to make the onnxruntime-gpu package work on CPU-only devices, so that most users only need one package: "onnxruntime-gpu". The "onnxruntime" package is then for those who are concerned about package size.

@ivanst0 (Member) commented Jun 23, 2021

I see, thanks for the context. Is the intention to automatically fall back to the CPU version when loading any of the dependent DLLs fails? How will users debug such failures?

My personal take is that in any serious application the DNN model is of fixed size and is either (a) small enough to be evaluated on CPU without problems, or (b) large enough to require GPU/CUDA (i.e., too slow to be usable on CPU). In this context I wouldn't find the automatic fallback useful, especially if it makes it harder to troubleshoot the common errors with loading CUDA/cuDNN. This is based just on the use cases I have observed. I assume this change was driven by user feedback/requests (although I wasn't able to find the corresponding GitHub issue).

@pranavsharma (Contributor) commented Jun 23, 2021

I see, thanks for the context. Is the intention to automatically fall back to the CPU version when loading any of the dependent DLLs fails? How will users debug such failures?

My personal take is that in any serious application the DNN model is of fixed size and is either (a) small enough to be evaluated on CPU without problems, or (b) large enough to require GPU/CUDA (i.e., too slow to be usable on CPU). In this context I wouldn't find the automatic fallback useful, especially if it makes it harder to troubleshoot the common errors with loading CUDA/cuDNN. This is based just on the use cases I have observed. I assume this change was driven by user feedback/requests (although I wasn't able to find the corresponding GitHub issue).

We should not fall back to CPU automatically. That's why we're checking for the presence of the CUDA_PATH environment variable and throwing an exception only if it is populated and the dependent libraries are not found. A user will install CUDA (and hence have that environment variable populated) only on a GPU machine where they intend to run the model.
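
As a rough Python-level sketch of that policy (the real check lives in the C++ bindings; try_load_cuda is a hypothetical stand-in for the shared-provider load, used only for illustration):

    import os

    def register_cuda_or_fallback(try_load_cuda):
        """Fail loudly only when the user clearly expected CUDA to be used."""
        if try_load_cuda():  # hypothetical: attempt to load onnxruntime_providers_cuda
            return "CUDAExecutionProvider"
        if os.environ.get("CUDA_PATH"):
            # CUDA_PATH is set, so the user installed CUDA and expects it to be used.
            raise RuntimeError("CUDA_PATH is set but CUDA wasn't able to be loaded")
        return "CPUExecutionProvider"  # CPU-only machine: intentional, quiet fallback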

@ivanst0 (Member) commented Jun 24, 2021

Is import onnxruntime supposed to load the CUDA DLLs, or is this done only later (on demand)? Which call actually triggers the loading of the CUDA/cuDNN DLLs?

@RyanUnderhill (Member Author) commented Jun 25, 2021

Is import onnxruntime supposed to load the CUDA DLLs, or is this done only later (on demand)? Which call actually triggers the loading of the CUDA/cuDNN DLLs?

Calling onnxruntime.InferenceSession(...) is what causes the CUDA DLLs to be loaded.

Edit: As an example, here's what happens if you run it and CUDA can't be loaded:

RuntimeError: D:\a_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:497 onnxruntime::python::RegisterExecutionProviders CUDA_PATH is set but CUDA wasn't able to be loaded

@RyanUnderhill changed the title from "Allow Python to run even if Cuda provider can't be loaded" to "Fix Python Cuda loading issues" on Jun 25, 2021
@ivanst0 (Member) commented Jun 25, 2021

Thanks for the explanation, Ryan!

I still find the new version to be a regression compared to the previous state.

  1. The previous code successfully handled situations where multiple versions of CUDA are installed on the machine, even if CUDA_PATH points to the "wrong" version of CUDA.

  2. The previous error message was more informative and user-friendly:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\lib\site-packages\onnxruntime\__init__.py", line 24, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \
  File "D:\Python\lib\site-packages\onnxruntime\capi\_pybind_state.py", line 43, in <module>
    raise ImportError(f"cuDNN {version_info.cudnn_version} not installed in {cudnn_bin_dir}.")
ImportError: cuDNN 8 not installed in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin.

vs.

2021-06-25 08:50:21.8704770 [E:onnxruntime:Default, provider_bridge_ort.cc:997 onnxruntime::ProviderLibrary::Get] LoadLibrary failed with error 126 "The specified module could not be found." when trying to load "D:\Python\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 283, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\Python\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 321, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: D:\git\onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc:497 onnxruntime::python::RegisterExecutionProviders CUDA_PATH is set but CUDA wasn't able to be loaded. Please install the correct version of CUDA and cuDNN as mentioned in the GPU requirements page, make sure they're in the PATH, and that your GPU is supported.

(1) could potentially be addressed by prepending PATH with the correct CUDA path before loading the CUDA EP.
(2) could potentially be addressed by examining the current state before trying to load the CUDA EP and presenting the user with a more specific error message if the load is going to fail.

@RyanUnderhill (Member Author)

@ivanst0 Since the shared-library version calls LoadLibraryEx directly and doesn't use CUDA_PATH to determine where to load the libraries from, if the correct version is anywhere on the PATH we will load it. That code was only there to work around a Python restriction that our internal shared-library loading bypasses (which is why we wanted to remove it).

I agree the error messages are not perfect, but at least they will no longer cause failures of their own (the previous code would abort even when the load might have succeeded had we tried it, which is bad).

We want to improve the error messages for all languages as time goes on, based on what errors people typically run into. So this is not over yet...

@RyanUnderhill RyanUnderhill merged commit 49938cc into master Jun 25, 2021
@RyanUnderhill RyanUnderhill deleted the ryanunderhill/python_optional_cuda branch June 25, 2021 09:26
@snnn (Member) commented Jun 25, 2021

@RyanUnderhill will you add some tests for it?

harshithapv pushed a commit that referenced this pull request Jun 25, 2021
harshithapv added a commit that referenced this pull request Jun 26, 2021
* fix boost download url (#7843)

* Topo sort the model before saving (#7913)

* checkin toposort

* review comments

* revert and add TODO

* Add shape inference to custom symbolic functions (#7937)

**Description**: As title.

**Motivation and Context**
- PyTorch ONNX exporter heavily depends on ONNX shape inference to export accurate and efficient model. Custom symbolic function exports the op as contrib ops, thus exporter is unable to perform standard onnx shape inference. Models with dynamic shape inputs are affected.

* Fix missing files on linux (#8066)

* [Mobile package] Update required operator config with additional ops for wav2vec2. (#8079)

Add some additional ops to the mobile package that are needed for the wav2vec2 model.

* Add module attribute to ORTModule to support HuggingFace Trainer save_model (#8088)

* Fix input schema extrator for ORTModule (#8098)

* Fix 32bit Android java API crash (#8122)

* Fix 32bit Android java API crash

* fix code formating

* [Mobile package] Update required operator config with additional ops for newer version of Wav2Vec 2. (#8123)

This is an update to #8079
The sample application motivating the original update changed to use an updated version of the model. Now, fewer ops are required. This change removes the previously added ops which are no longer needed.

* Add int64 as a required type to ConstantOfShape as it's used by the pytorch converter for Pad. (#8128)

It's also used pointlessly for torch.tensor.repeat (although that usage should always be able to be constant folded).

* Update logic in props.xml to account for shared provider library changes (#8138)

* Ortmodule override torch.manual_seed() (#8131)

* Ortmodule override torch.manual_seed()

* Fix Python Cuda loading issues (#7939)

* Fix mac shared_provider warning (#8153)

Co-authored-by: Guoyu Wang <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Bowen Bao <[email protected]>
Co-authored-by: Ryan Hill <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: baijumeswani <[email protected]>
Co-authored-by: Thiago Crepaldi <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Sherlock <[email protected]>