
CUDNN_STATUS_NOT_INITIALIZED #16

Open
sak96 opened this issue Dec 4, 2022 · 6 comments
sak96 commented Dec 4, 2022

Building the autoencoder.
Building the unet.
Timestep 0/30
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("cuDNN error: CUDNN_STATUS_NOT_INITIALIZED\nException raised from createCuDNNHandle at /build/python-pytorch/src/pytorch-1.13.0-cuda/aten/src/ATen/cudnn/Handle.cpp:9 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x92 (0x7fdb34705bf2 in /usr/lib/libc10.so)\nframe #1: <unknown function> + 0xd89413 (0x7fdaeab89413 in /usr/lib/libtorch_cuda.so)\nframe #2: at::native::getCudnnHandle() + 0x7b8 (0x7fdaeaeded18 in /usr/lib/libtorch_cuda.so)\nframe #3: <unknown function> + 0x1055f89 (0x7fdaeae55f89 in /usr/lib/libtorch_cuda.so)\nframe #4: <unknown function> + 0x10505f4 (0x7fdaeae505f4 in /usr/lib/libtorch_cuda.so)\nframe #5: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0xad (0x7fdaeae50a2d in /usr/lib/libtorch_cuda.so)\nframe #6: <unknown function> + 0x3882bf4 (0x7fdaed682bf4 in /usr/lib/libtorch_cuda.so)\nframe #7: <unknown function> + 0x3882cad (0x7fdaed682cad in /usr/lib/libtorch_cuda.so)\nframe #8: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x226 (0x7fdae0416e46 in /usr/lib/libtorch_cpu.so)\nframe #9: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1097 (0x7fdadf821ff7 in /usr/lib/libtorch_cpu.so)\nframe #10: <unknown function> + 0x258518e (0x7fdae078518e in /usr/lib/libtorch_cpu.so)\nframe #11: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, 
long, bool, bool, bool, bool) + 0x299 (0x7fdadff96759 in /usr/lib/libtorch_cpu.so)\nframe #12: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x111 (0x7fdadf814bd1 in /usr/lib/libtorch_cpu.so)\nframe #13: <unknown function> + 0x2584c3e (0x7fdae0784c3e in /usr/lib/libtorch_cpu.so)\nframe #14: at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x15d (0x7fdadff43b3d in /usr/lib/libtorch_cpu.so)\nframe #15: <unknown function> + 0x43b0226 (0x7fdae25b0226 in /usr/lib/libtorch_cpu.so)\nframe #16: <unknown function> + 0x43b1060 (0x7fdae25b1060 in /usr/lib/libtorch_cpu.so)\nframe #17: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x247 (0x7fdadff95a27 in /usr/lib/libtorch_cpu.so)\nframe #18: at::native::conv2d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x20d (0x7fdadf81908d in /usr/lib/libtorch_cpu.so)\nframe #19: <unknown function> + 0x273d3f6 (0x7fdae093d3f6 in /usr/lib/libtorch_cpu.so)\nframe #20: at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x202 (0x7fdae053fad2 in /usr/lib/libtorch_cpu.so)\nframe #21: <unknown function> + 0x2f359e (0x55ad31d2559e in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #22: <unknown function> + 0x2fe5da (0x55ad31d305da in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #23: <unknown function> + 0x2b79bd (0x55ad31ce99bd in 
$HOME.cargo-target/debug/examples/stable-diffusion)\nframe #24: <unknown function> + 0x2bd561 (0x55ad31cef561 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #25: <unknown function> + 0x2e1ad0 (0x55ad31d13ad0 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #26: <unknown function> + 0x2c05f1 (0x55ad31cf25f1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #27: <unknown function> + 0xd90f5 (0x55ad31b0b0f5 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #28: <unknown function> + 0x96438 (0x55ad31ac8438 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #29: <unknown function> + 0x975a1 (0x55ad31ac95a1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #30: <unknown function> + 0xb496b (0x55ad31ae696b in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #31: <unknown function> + 0xa10ae (0x55ad31ad30ae in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #32: <unknown function> + 0xacbf1 (0x55ad31adebf1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #33: <unknown function> + 0x621aee (0x55ad32053aee in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #34: <unknown function> + 0xacbc0 (0x55ad31adebc0 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #35: <unknown function> + 0x9b83c (0x55ad31acd83c in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #36: <unknown function> + 0x23290 (0x7fdade03c290 in /usr/lib/libc.so.6)\nframe #37: __libc_start_main + 0x8a (0x7fdade03c34a in /usr/lib/libc.so.6)\nframe #38: <unknown function> + 0x91905 (0x55ad31ac3905 in $HOME.cargo-target/debug/examples/stable-diffusion)\n")', $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/wrappers/tensor_generated.rs:6457:72
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: core::result::unwrap_failed
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1785:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1078:23
   4: tch::wrappers::tensor_generated::<impl tch::wrappers::tensor::Tensor>::conv2d
             at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/wrappers/tensor_generated.rs:6457:9
   5: <tch::nn::conv::Conv<[i64; 2]> as tch::nn::module::Module>::forward
             at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/nn/conv.rs:216:9
   6: tch::nn::module::<impl tch::wrappers::tensor::Tensor>::apply
             at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/nn/module.rs:47:9
   7: diffusers::models::unet_2d::UNet2DConditionModel::forward
             at ./src/models/unet_2d.rs:237:18
   8: stable_diffusion::run
             at ./examples/stable-diffusion/main.rs:167:30
   9: stable_diffusion::main
             at ./examples/stable-diffusion/main.rs:200:9
  10: core::ops::function::FnOnce::call_once
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Not sure whether this is an issue with the library or something else in my setup.

ajmwagar commented Dec 7, 2022

Dumb question: do you have a CUDA-enabled GPU on your system?

sak96 commented Dec 8, 2022

Oh yeah, I forgot to give details about the machine.

% lspci | grep -i vga
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
05:00.0 VGA compatible controller: Advanced Micro Devices ....

% nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   32C    P3    N/A /  N/A |      5MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       666      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

6 GB RTX 3060.

% nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

I am using `export TORCH_CUDA_VERSION=cu118`.
Would any other details be required?

I am using the f16 model:

% sha256sum unet16.bin
5019a4fbb455dd9b75192afc3ecf8a8ec875e83812fd51029d2e19277edddebc  unet16.bin
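For reference, `TORCH_CUDA_VERSION` is the variable the torch-sys build script reads to pick the CUDA flavour of libtorch; a minimal sketch of setting it before `cargo build` (the `cu118` value matches the nvcc 11.8 output above):

```shell
# TORCH_CUDA_VERSION tells the torch-sys build script which CUDA flavour
# of libtorch to use; cu118 matches the CUDA 11.8 toolkit shown above.
export TORCH_CUDA_VERSION=cu118
echo "TORCH_CUDA_VERSION=$TORCH_CUDA_VERSION"
```

Note this only affects which libtorch torch-sys builds against, not which system libraries the binary loads at runtime.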

@rockerBOO

You could try something like

println!("Cuda available: {}", tch::Cuda::is_available());
println!("Cudnn available: {}", tch::Cuda::cudnn_is_available());

to see whether the tch library can see it.

sak96 commented Jan 2, 2023

Cuda available: true
Cudnn available: true
Cuda available: true
Cudnn available: true

I am not sure why it printed everything twice, though.

--- a/examples/stable-diffusion/main.rs
+++ b/examples/stable-diffusion/main.rs
@@ -196,6 +196,8 @@ fn run(args: Args) -> anyhow::Result<()> {

 fn main() -> anyhow::Result<()> {
     let args = Args::parse();
+    println!("Cuda available: {}", tch::Cuda::is_available());
+    println!("Cudnn available: {}", tch::Cuda::cudnn_is_available());
     if !args.autocast {
         run(args)
     } else {

EDIT:
I found that the CUDA check is also part of the existing code, which would explain the duplicate output.

sak96 commented Jan 10, 2023

One solution I found was pytorch/pytorch#16831 (comment):

tch::Cuda::cudnn_set_benchmark(false);

This did not help.

There is another issue with the same symptoms: tensorflow/tensorflow#6698 (comment) and https://stackoverflow.com/a/52634209,
but I don't see any API in tch-rs that could be used to do the same. If you have any idea, please let me know.

eoriont commented May 25, 2023

Do you have libtorch installed? I had this issue, then fixed it, then forgot exactly what fixed it 😑
The things I tried were: installing the CUDA version of PyTorch, installing CUDA 11.8, installing cuDNN 8.9.1, and installing libtorch. After libtorch, it just worked magically.
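For anyone landing here, a minimal sketch of what "installing libtorch" looks like for tch: the `LIBTORCH` and `LD_LIBRARY_PATH` variables are the ones the tch-rs README describes for pointing the build at a manually downloaded libtorch; the path below is an assumption, adjust it to wherever the libtorch zip was extracted.

```shell
# Hedged sketch: point the tch build and the dynamic linker at a
# manually downloaded libtorch (example path, not a real install location).
export LIBTORCH="$HOME/libtorch"
export LD_LIBRARY_PATH="$LIBTORCH/lib:$LD_LIBRARY_PATH"
echo "LIBTORCH=$LIBTORCH"
```

Using a libtorch that matches the CUDA/cuDNN versions on the system, rather than a distro-packaged `/usr/lib/libtorch_cuda.so` built against a different toolkit, is one plausible reason this fixed the error.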
