
v1.8.0 raises Exception if cudnn not found in Program Files #7965

Closed

iperov opened this issue Jun 5, 2021 · 26 comments

iperov (Contributor) commented Jun 5, 2021

v1.8.0 raises an Exception if cuDNN is not found in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin,

but my app is standalone, made for end-users who will not install the CUDA/cuDNN SDK.

The OS LoadLibrary() call automatically picks up CUDA DLLs from the PATH environment variable.

Everything works fine in v1.7.0. Can you fix it?

snnn (Member) commented Jun 7, 2021

When possible, don't use PATH for locating dependent DLLs.

If you put these CUDA DLLs in the same directory as your application exe, it should be fine.

"for end-users who will not install CUDA/CUDNN sdk", why do you provide the onnx runtime GPU build to them? Please tell us more about your usage. Is it a C/C++ program or python?

iperov (Contributor, Author) commented Jun 7, 2021

> If you put these CUDA DLLs in the same directory as your application exe, it should be fine.

I know. My CUDA DLLs are located in the project directory.

But 1.8.0 raises a hard exception if cuDNN is not found in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin:

```python
if not os.path.isfile(os.path.join(cudnn_bin_dir, f"cudnn64_{version_info.cudnn_version}.dll")):
    raise ImportError(f"cuDNN {version_info.cudnn_version} not installed in {cudnn_bin_dir}. "
                      f"Set the CUDNN_HOME environment variable to the path of the 'cuda' directory "
                      f"in your CUDNN installation if necessary.")
```

There was no such code in 1.7.0, and 1.7.0 works fine.

oliviajain assigned oliviajain and skottmckay, and unassigned oliviajain, on Jun 7, 2021
oliviajain (Contributor) commented Jun 7, 2021

The cuDNN documentation asks you to copy the cuDNN files into the CUDA Toolkit directory located in Program Files. Maybe @skottmckay can give more context.

snnn (Member) commented Jun 7, 2021

Now I get it. 1.8 assumes the cuDNN files are located either in the CUDA directory or in %CUDNN_HOME% (an ONNX Runtime-specific environment variable). 1.7 had no such requirement: as long as the DLLs are in %PATH%, it is fine. So this is a breaking change.

iperov (Contributor, Author) commented Jun 7, 2021

> The cuDNN documentation asks you to copy the cuDNN files into the CUDA Toolkit directory located in Program Files. Maybe @skottmckay can give more context.

That is for developers.

I am making an app for END-users. They will use a stand-alone / portable app that includes all necessary dependencies and libraries.

Requiring end-users to install CUDA/cuDNN manually is suicide.

Wake up, developers! What the hell are you doing??

jywu-msft (Member) commented

> > The cuDNN documentation asks you to copy the cuDNN files into the CUDA Toolkit directory located in Program Files. Maybe @skottmckay can give more context.
>
> That is for developers.
>
> I am making an app for END-users. They will use a stand-alone / portable app that includes all necessary dependencies and libraries.
>
> Requiring end-users to install CUDA/cuDNN manually is suicide.
>
> Wake up, developers! What the hell are you doing??

I believe the change was done to address new restrictions for secure Python DLL loading on Windows:
https://bugs.python.org/issue36085
https://docs.python.org/3/whatsnew/3.8.html#bpo-36085-whatsnew
https://docs.python.org/3/library/os.html#os.add_dll_directory
Toblerity/Fiona#851

This is currently only needed for Python 3.8 and above.
So one option could be to move that check to

```python
# Python 3.8 (and later) doesn't search system PATH when loading DLLs, so the CUDA location needs to be
```

?

However, this change will eventually be required for all users as they update their Python version on Windows, so I suspect that is why it is consistently enforced across all versions.
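
For illustration, a minimal sketch of what the Python 3.8+ loading rules require (the DLL path below is hypothetical):

```python
import os
import sys

# On Windows, Python 3.8+ no longer consults PATH when resolving a DLL's
# dependencies; directories must be registered explicitly instead.
if sys.platform == "win32" and sys.version_info >= (3, 8):
    os.add_dll_directory(r"C:\MyApp\cuda\bin")  # hypothetical bundled-DLL location

import onnxruntime  # imported after the DLL directory is registered
```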

skottmckay (Contributor) commented

@ivanst0 added the bulk of the new behavior in #6436, although that PR seemed to be more about handling multiple CUDA versions on one machine.

I included the additional CUDNN_HOME check to be consistent with what the ORT build uses for an explicitly specified path to the cuDNN libraries (part of the build uses the Python bindings for tests). Previously the cuDNN documentation involved putting the binaries in a location separate from CUDA_HOME, but now that that has changed we could remove the usage of CUDNN_HOME from the build etc. That seems like a side issue, though.

Is the requirement that a user install CUDA/cuDNN, or that os.environ needs an entry saying where to find the CUDA DLLs? If it's the latter, short term, could that entry be added prior to importing the onnxruntime Python module, pointing to wherever the CUDA DLLs you want loaded are?

Long term, would it be valid to not fail if the CUDA environment variables aren't found (that doesn't mean the DLLs aren't available), and instead do a check via ctypes.util.find_library after any calls to os.add_dll_directory (if any) are made? i.e. add the CUDA paths we look for using add_dll_directory if found, but also allow for a user having added path information.
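
A rough sketch of the proposed fallback (hypothetical; `cudart64_110` is only an example of the version-dependent library name):

```python
import ctypes.util

# Proposed behavior: register any known CUDA locations first (via
# os.add_dll_directory), then fail only if the runtime library still
# cannot be resolved from any searched location.
if ctypes.util.find_library("cudart64_110") is None:
    raise ImportError("Required CUDA libraries were not found.")
```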

iperov (Contributor, Author) commented Jun 8, 2021

Setting the CUDA/cuDNN bin path via os.environ is fine for me.

ivanst0 (Member) commented Jun 8, 2021

Yes, if you are distributing CUDA/cuDNN DLLs with your Python app/package (in <LIB_DIR>\bin), I recommend setting the appropriate environment variable (e.g. CUDA_PATH_V11_2) to <LIB_DIR>, instead of prepending <LIB_DIR>\bin to PATH, before importing the onnxruntime package. This works across all supported Python versions (3.6-3.9).
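
For illustration, a minimal sketch of this approach (LIB_DIR is a hypothetical bundle location; the variable name matches the CUDA version ORT was built with):

```python
import os

# Hypothetical layout: the app ships CUDA/cuDNN DLLs in <LIB_DIR>\bin.
LIB_DIR = r"C:\MyApp\cuda"

# Point onnxruntime at the bundled libraries before importing it; the
# variable name corresponds to the CUDA version ORT was built with (11.2 here).
os.environ["CUDA_PATH_V11_2"] = LIB_DIR

import onnxruntime  # resolves its CUDA DLLs from LIB_DIR\bin
```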

@iperov, if this solution works for you please feel free to close this issue.

skottmckay (Contributor) commented

@ivanst0 Is there a reason why we need to force someone to set the environment variable if the CUDA library would have been found anyway?

i.e. what would the issue be with making the calls to add_dll_directory if paths are available via the environment variable(s), but only failing if ctypes.util.find_library can't find the required CUDA libraries?

If possible, that seems slightly cleaner and more user-friendly to me, as the user doesn't need to discover the correct incantation for the CUDA_PATH_V... environment variable name (given it's based on the CUDA version ORT was built with).

iperov (Contributor, Author) commented Jun 8, 2021

Agree.

Also, I am using onnxruntime with pytorch (latest version), which ships CUDA libraries in site-packages, so onnxruntime 1.7.0 uses them automatically because they are accessible through PATH.

[screenshot: explorer_2021-06-08_15-47-41]

snnn (Member) commented Jun 8, 2021

> onnxruntime 1.7.0 uses them automatically because they are accessible through PATH

It just happened to work. ONNX Runtime and pytorch require different CUDA and cuDNN versions. Even when the file names are the same, the versions are different.

iperov (Contributor, Author) commented Jun 8, 2021

Why different, if I chose torch==1.8.1+cu111?
onnxruntime 1.7.0 uses CUDA 11.0,
onnxruntime 1.8.0 uses CUDA 11.1,
and CUDA provides minor-version backward compatibility.

snnn (Member) commented Jun 8, 2021

> onnxruntime 1.8.0 uses 11.0

CUDA provides minor-version backward compatibility starting from 11.1.

And what about the cuDNN version?

And what if onnxruntime was built with a newer CUDA version than pytorch?

iperov (Contributor, Author) commented Jun 8, 2021

Please don't go off-topic.

jywu-msft (Member) commented

We plan on patching the 1.8 release to fix this issue.

iperov (Contributor, Author) commented Jul 13, 2021

Seems like someone removed that code from pybind_state, and it now works like 1.7.0.

What is the solution for secure DLL loading going forward?

iperov closed this as completed on Jul 14, 2021
iperov (Contributor, Author) commented Aug 21, 2022

Looks like onnxruntime-gpu==1.12.1 does not work with CUDA 11.5+. The error is:

Please make sure cudnn_cnn_infer64_8.dll is in your library path!

CUDA 11.3 is fine.

@jywu-msft Can you write the CUDA version requirements on the release page?

jywu-msft (Member) commented

> Looks like onnxruntime-gpu==1.12.1 does not work with CUDA 11.5+. The error is: Please make sure cudnn_cnn_infer64_8.dll is in your library path!
>
> CUDA 11.3 is fine.
>
> @jywu-msft Can you write the CUDA version requirements on the release page?

11.5 should work. I just tested it with the onnxruntime-gpu 1.12.1 Python package and it worked fine.
cudnn_cnn_infer64_8.dll is part of the CUDA 11.5 installation.
Any other details about your environment? Where are the CUDA 11.5 libs? What's in your PATH?

iperov (Contributor, Author) commented Aug 22, 2022

I don't use a CUDA "installation". CUDA is not an installation to me, just a bunch of DLLs.

I am using the CUDA DLLs from the torch pip package.

torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html

works fine with 1.12.1, but the DLLs from

python -m pip install torch==1.11.0+cu115 torchvision==0.12.0+cu115 -f https://download.pytorch.org/whl/torch_stable.html

do not work.

jywu-msft (Member) commented

> I don't use a CUDA "installation". CUDA is not an installation to me, just a bunch of DLLs.
>
> I am using the CUDA DLLs from the torch pip package. torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html works fine with 1.12.1,
>
> but the DLLs from python -m pip install torch==1.11.0+cu115 torchvision==0.12.0+cu115 -f https://download.pytorch.org/whl/torch_stable.html do not work.

I tested torch==1.11.0+cu115 and that worked too.
I added the location of the CUDA 11.5 libs to my PATH (in my environment, that's c:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\Lib\site-packages\torch\lib) and it worked.
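
For reference, a minimal sketch of that setup in Python (assumes torch is importable; the lib directory is derived rather than hard-coded):

```python
import os
import torch

# Derive torch's bundled CUDA/cuDNN DLL directory (site-packages\torch\lib)
# and prepend it to PATH so dependent DLLs such as cudnn_cnn_infer64_8.dll
# can be resolved when cuDNN loads its sub-libraries.
torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")
os.environ["PATH"] = torch_lib + os.pathsep + os.environ.get("PATH", "")

import onnxruntime  # imported after PATH is prepared
```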

iperov (Contributor, Author) commented Aug 22, 2022

Because it uses DLLs from your already-installed "CUDA kit", or from other dirs in PATH.

I am using a builder of a portable all-in-one folder for the DeepFaceLive project (written by me) (https://github.com/iperov/DeepFaceLive), where PATH is limited to the folder.
It has a CUDA bin directory with DLLs from torch==1.11.0+cu115, and it does not work: Could not load library cudnn_cnn_infer64_8.dll. Error code 126
But cu113 works. Thus I cannot upgrade the project to cu115 due to this issue.
I can send you this folder for testing.

[screenshot: cmd_2022-08-22_21-58-57]

[screenshot: cmd_2022-08-22_21-59-56]

jywu-msft (Member) commented

> Because it uses DLLs from your already-installed "CUDA kit", or from other dirs in PATH.
>
> I am using a builder of a portable all-in-one folder for the DeepFaceLive project (written by me) (https://github.com/iperov/DeepFaceLive), where PATH is limited to the folder. It has a CUDA bin directory with DLLs from torch==1.11.0+cu115, and it does not work: Could not load library cudnn_cnn_infer64_8.dll. Error code 126. But cu113 works. Thus I cannot upgrade the project to cu115 due to this issue. I can send you this folder for testing.

That error message says it could not load cudnn_cnn_infer64_8.dll, but it can also mean that one of that DLL's dependencies is missing. I suspect that is the most likely cause.
I tested against the CUDA lib location installed by torch (and removed all references in PATH to any system NVIDIA toolkit locations) and it worked.
Try the suggestions in #6435 to see if they help (e.g. try running Dependency Walker on cudnn_cnn_infer64_8.dll).
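
As an alternative to Dependency Walker, if Visual Studio's tools are available, dumpbin can list a DLL's direct imports, which is a quick way to spot the missing dependency (run from a Developer Command Prompt):

```
dumpbin /dependents cudnn_cnn_infer64_8.dll
```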

iperov (Contributor, Author) commented Aug 24, 2022

OK, I will check.

There is another issue.

The same model produces different results with onnxruntime-gpu==1.12.1 and onnxruntime-gpu==1.11.0. The new version produces buggy inference.

Every new release you introduce new bugs!! So tired.

jywu-msft (Member) commented Aug 24, 2022

> OK, I will check.
>
> There is another issue.
>
> The same model produces different results with onnxruntime-gpu==1.12.1 and onnxruntime-gpu==1.11.0. The new version produces buggy inference.
>
> Every new release you introduce new bugs!! So tired.

I understand your frustration about bugs. Unfortunately, bugs come with new features and changes. We will try our best to do better testing and avoid regressions. You can help us by opting into our Release Candidate testing: every release there is a period of a couple of weeks where we publish Release Candidates and users can report issues before we finalize the release.
e.g. see #12133
Thank you!

Can you please file a separate issue with repro steps and assets for the "buggy inference" problem? Otherwise it gets buried in this closed issue and others won't see it.
It's difficult to say whether there is an ORT regression at this point. (I think you also updated the CUDA version, right?)
Which execution provider do you use? CUDAExecutionProvider? Does the issue occur with CPUExecutionProvider?

iperov (Contributor, Author) commented Aug 24, 2022

Check issue #12706.
