Fix Python Cuda loading issues #7939
Conversation
Is it possible to add an automated test? Like installing the onnxruntime-gpu pkg on a CPU-only machine and inferencing a model with it?
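One possible shape for that test, as a minimal sketch assuming the onnxruntime-gpu wheel is installed on a CPU-only machine (the "model.onnx" path is a placeholder):

```python
# Hypothetical test sketch: the GPU package should still import and run
# inference on a machine without CUDA, falling back to the CPU provider.
import onnxruntime as ort

def test_gpu_package_on_cpu_only_machine():
    sess = ort.InferenceSession("model.onnx")  # placeholder model path
    # The session should have been created with the CPU provider available.
    assert "CPUExecutionProvider" in sess.get_providers()
    # Running a dummy input through sess.run(...) would then complete on CPU.
```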
return info->GetCurrentGpuDeviceId(device_id);
return CreateStatus(ORT_FAIL, "CUDA execution provider is not enabled.");
return CreateStatus(ORT_FAIL, "CUDA execution provider is not available.");
Suggested change:
return CreateStatus(ORT_FAIL, "CUDA execution provider is not available.");
return CreateStatus(ORT_FAIL, "CUDA execution provider is either not enabled or not available.");
Changed, though I could probably make it conditional on #if USE_CUDA so it can say whether we just weren't built with it at all (and similarly for other providers).
return nullptr;
}

ProviderInfo_CUDA& GetProviderInfo_CUDA() {
Is there a better error message we can give to users? Even if the user enables the CUDA provider, this might fail, and the error message doesn't make it clear why. The user cannot take any action other than filing a GitHub issue, which increases support costs.
Second this. We need to somehow propagate the actual error.
I worked on improving the error messages. It used to show this on failure:
"2021-06-09 18:24:21.8575539 [E:onnxruntime:Default, provider_bridge_ort.cc:1000 onnxruntime::ProviderLibrary::Get] Failed to load library, error code: 126"
With these changes, now it will show this:
2021-06-09 18:54:17.7950730 [E:onnxruntime:Default, provider_bridge_ort.cc:1000 onnxruntime::ProviderLibrary::Get] LoadLibrary failed with error 126 "The specified module could not be found." when trying to load "C:\Code\Github\onnxruntime\build\Windows\Debug\Debug\onnxruntime\capi\onnxruntime_providers_cuda.dll"
So now, on Windows, it translates the error code into text instead of just showing a number.
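For reference, the same translation can be reproduced from Python on Windows with the standard library, in case anyone needs to decode a raw error code from an older log (illustration only, not code from this PR):

```python
# Windows only: map a raw Win32 error code (e.g. 126 from an old log line)
# to the human-readable text that the improved message now includes.
import ctypes

print(ctypes.FormatError(126))  # -> "The specified module could not be found."
```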
}
else
{
LOGS_DEFAULT(INFO) << "CUDA execution provider is not available.";
It's too dangerous to log this at INFO; it should be WARNING.
A common scenario is that the user thinks they are running with CUDA enabled, but underneath it's failing silently and everything is running on CPU.
When the GPU package is installed on a CPU-only machine, logging it at WARNING might wrongly indicate something is wrong.
I didn't want it to be a warning, since a CPU-only user wouldn't expect a warning just because CUDA isn't on their system, right? The goal here is to have one package for both CPU and CUDA, so I wouldn't expect a warning.
But I do think it's a problem that Python users can't explicitly say they expect the CUDA provider to be used.
> When the GPU package is installed on a CPU-only machine, logging it at WARNING might wrongly indicate something is wrong.
But the common case is that the user has a GPU; that is why they are installing the GPU package in the first place.
So it's better to warn that it's not working than to silently execute on CPU.
Nobody uses the INFO level by default, so it would be a silent failure, which isn't what we want.
Put another way, we shouldn't be optimizing for the case where the GPU package is installed on a CPU-only machine.
As long as we continue to publish both CPU and GPU packages, we have to assume that the common reason someone installs the GPU package is that they have a GPU on their system.
Once we have a single CPU/GPU package, then I agree that WARNING would be inappropriate.
> I didn't want it to be a warning, since a CPU-only user wouldn't expect a warning just because CUDA isn't on their system, right? The goal here is to have one package for both CPU and CUDA, so I wouldn't expect a warning.
> But I do think it's a problem that Python users can't explicitly say they expect the CUDA provider to be used.
Yes, the goal is to have a single package, but we don't have a single package currently, right?
Makes sense. I'll make sure it reports the same information the C API provides and keep it as a warning. Then we can change it later if we ever remove the CPU-only package.
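As a side note for users who do want to see these messages, the Python API already lets you lower the default logger severity; a small sketch (the model path is a placeholder):

```python
# Sketch: make INFO/WARNING log messages visible from Python.
# Severity levels: 0=VERBOSE, 1=INFO, 2=WARNING, 3=ERROR, 4=FATAL.
import onnxruntime as ort

ort.set_default_logger_severity(1)  # show INFO and above
sess = ort.InferenceSession("model.onnx")  # placeholder model path
```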
I removed this block since it's redundant: the C API will already log an error, and this just logged an extra one.
Are you referring to this one?
LOGS_DEFAULT(ERROR) << error.ErrorMessage();
I think we need to finish #7854 first. Otherwise it is still not tested.
To test it on a CPU-only machine: I suggest opening https://github.com/microsoft/onnxruntime/blob/master/tools/ci_build/github/azure-pipelines/linux-gpu-ci-pipeline.yml and duplicating the "Linux_Test" job. Please give the new job a different name, and change the machine pool from "centos7gpu" to "centos7cpu". The centos7cpu machine pool has CUDA but doesn't have GPU cards. Once that is done, I will give you a new machine pool without CUDA and GPUs.
onnxruntime/python/_pybind_state.py
f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
f"in your CUDNN installation if necessary.")
print(f"cuDNN {version_info.cudnn_version} not installed in {cudnn_bin_dir}. "
f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
What is the user expected to do with this print statement if cudnn is obtained from the PATH?
I don't think this alone will fix #7965. For Python versions >= 3.8, PATH won't be consulted to find the dependent DLLs.
The user is doing something that isn't recommended; I don't think we can help them in that case.
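For context, a sketch of the restriction being discussed: on Windows with Python >= 3.8, dependent DLLs are no longer resolved from PATH, so a directory has to be registered explicitly (the CUDNN_HOME fallback path here is just an assumption for illustration):

```python
# Sketch for Python >= 3.8 on Windows: register the cuDNN bin directory
# explicitly, since PATH is no longer searched for DLL dependencies.
import os

cudnn_home = os.environ.get("CUDNN_HOME", r"C:\tools\cudnn")  # assumed install location
cudnn_bin_dir = os.path.join(cudnn_home, "bin")
if os.path.isdir(cudnn_bin_dir):
    os.add_dll_directory(cudnn_bin_dir)

import onnxruntime  # load onnxruntime after the DLL directory is registered
```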
--enable_onnx_tests --use_cuda --cuda_version=11.1 --cuda_home=/usr/local/cuda-11.1 --cudnn_home=/usr/local/cuda-11.1 \
--enable_pybind --build_java --ctest_path ''

- job: Linux_Test
Indentation matters. You may try to manually queue a build for this pipeline; I'm afraid this YAML change doesn't work.
}
else
{
if(!Env::Default().GetEnvironmentVar("CUDA_PATH").empty()) {
Can you please add a note here saying why we're checking for this? Something like: we want to allow the flexibility of using a GPU package on a CPU-only machine, unless the user intended to run inferencing on the GPU.
Would it be nice to add something else to the message for the user to check? Like making sure they have all the correct version prerequisites and that they are on the user's PATH?
Good idea, error message updated.
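Roughly the kind of check the updated message points users toward, sketched from the user's side (this is not the code added in this PR):

```python
# User-side diagnostics sketch: is CUDA_PATH set, and is its bin directory
# actually on PATH where the CUDA DLLs would be found?
import os

cuda_path = os.environ.get("CUDA_PATH")
print("CUDA_PATH:", cuda_path)
if cuda_path:
    cuda_bin = os.path.normcase(os.path.join(cuda_path, "bin"))
    path_entries = [os.path.normcase(p) for p in os.environ.get("PATH", "").split(os.pathsep)]
    print("CUDA bin directory on PATH:", cuda_bin in path_entries)
```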
LGTM 👍 I'll let @snnn verify the pipeline changes.
Linux GPU CI Pipeline doesn't work yet.
@jywu-msft, @pranavsharma, @RyanUnderhill, @snnn Specifically, why would we allow a successful import of the onnxruntime-gpu package if CUDA/cuDNN can't be found? One of the most common pain points with the onnxruntime-gpu package was a cryptic error message when any of the required DLLs was not found (bug #6435). PR #6436 added (among other things) more informative error messages, but those seem to be removed here.
@ivanst0 We want to make the onnxruntime-gpu package work on CPU-only devices, so that most users only need one package: "onnxruntime-gpu". The "onnxruntime" package is for those who are concerned about package size.
I see, thanks for the context. Is the intention to automatically fall back to the CPU version when loading any of the dependent DLLs fails? How will users debug such failures? My personal take is that in any serious application the DNN model is of a fixed size and is either (a) small enough that it can be evaluated on CPU without problems, or (b) large enough to require GPU/CUDA (i.e. too slow to be usable on CPU). In this context I wouldn't find the automatic fallback useful, especially if it makes it harder to troubleshoot the common errors with loading CUDA/cuDNN. This is based just on the use cases I have observed. I assume this change was driven by user feedback/requests (although I wasn't able to find the corresponding GitHub issue).
We should not fall back to CPU automatically. That's why we're checking for the presence of the CUDA_PATH environment variable and throwing an exception only if it is populated and the dependent libraries are not found. A user will install CUDA (and hence have that env var populated) only on a GPU machine where she intends to run the model.
Is importing onnxruntime what causes the CUDA DLLs to be loaded?
Calling 'onnxruntime.InferenceSession(...)' is what causes the CUDA DLLs to be loaded. Edit: As an example, here's what happens if you run it and CUDA can't be loaded:
RuntimeError: D:\a_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:497 onnxruntime::python::RegisterExecutionProviders CUDA_PATH is set but CUDA wasn't able to be loaded
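Until Python users can explicitly require the CUDA provider, one way to avoid a silent CPU fallback is to request it and then check what the session actually loaded; a sketch assuming the public Python API (the model path is a placeholder):

```python
# Sketch: request CUDA explicitly and fail loudly if the session fell back to CPU.
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
if "CUDAExecutionProvider" not in sess.get_providers():
    raise RuntimeError("CUDA execution provider was not loaded; inference would run on CPU.")
```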
Thanks for the explanation, Ryan! I still find the new version to be a regression compared to the previous state.
vs.
(1) could potentially be addressed by prepending PATH with the correct CUDA path before loading the CUDA EP.
@ivanst0 Since the shared library version calls LoadLibraryEx directly, we don't use CUDA_PATH to determine where to load the libraries from; if the correct version is anywhere on the PATH, we will load it. That code was just there to work around a Python restriction that our internal shared library loading bypasses (which is why we wanted to remove it). I agree the error messages are not perfect, but at least they will no longer cause failures of their own (the previous code would abort even when the load might have succeeded had we tried it, which is bad). We want to improve the error messages for all languages as time goes on, based on what errors people typically run into. So this is not over yet...
@RyanUnderhill will you add some tests for it?
Description: Previously, if built with CUDA enabled, the Python bindings would require the CUDA provider to be loaded. But as it's now a shared library, a failure to load it isn't fatal.
This removes a bunch of now-unnecessary code from _pybind_state.py that our internal shared provider DLL loading code already handles. This also enables running the GPU package on a CPU-only machine. If CUDA_PATH is set, it will generate an error, as in that case the user is likely expecting CUDA to be used.
I changed GetProviderInfo_CUDA to return a reference (which makes it clearer that it can't fail; instead it will throw an exception, as it already did), and added TryGetProviderInfo_CUDA, which returns nullptr if CUDA is not available. Python now uses TryGetProviderInfo_CUDA.