
[MeshGraphNets] cuda_blas.cc:428, failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED #464

Open
hjyu94 opened this issue Oct 9, 2023 · 4 comments

hjyu94 commented Oct 9, 2023

Hi. I'm trying to run the MeshGraphNets model but encountered an error with this command:

python -m meshgraphnets.run_model --mode=train --model=cloth --checkpoint_dir=meshgraphnets/dataset/chk --dataset_dir=meshgraphnets/dataset/flag_simple

The error is:

2023-10-09 14:41:01.689006: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-10-09 14:41:01.689021: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-10-09 14:41:01.689035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2023-10-09 14:41:01.689048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2023-10-09 14:41:01.689061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2023-10-09 14:41:01.689074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2023-10-09 14:41:01.689088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

...

2023-10-09 14:41:14.669500: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2023-10-09 14:41:14.669603: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 130, in <module>
    app.run(main)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 125, in main
    learner(model, params)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 82, in learner
    _, step, loss = sess.run([train_op, global_step, loss_op])
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/hyojeong/.local/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I've seen suggestions on Stack Overflow to upgrade TensorFlow to 2.x,
but meshgraphnets/requirements.txt specifies tensorflow-gpu>=1.15,<2.

Has anyone faced this issue? Should I upgrade TensorFlow? I did try once, but it caused another problem.

Please let me know if you need the full error details or package versions.
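
For reference, here is a minimal standalone check (my own sketch, not part of the repo) that runs the same GEMM shape reported in the log. If this also fails with CUBLAS_STATUS_EXECUTION_FAILED, the problem is the CUDA/driver/GPU combination rather than MeshGraphNets itself:

import tensorflow as tf  # tensorflow-gpu 1.15, as pinned by requirements.txt

# Reproduce the failing GEMM shape from the error log: (9212, 7) x (7, 128).
a = tf.random.normal([9212, 7])
b = tf.random.normal([7, 128])
c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c).shape)  # expect (9212, 128) if cuBLAS works on this GPU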

@BoyuanTang331

Hi, I also ran into this problem when running Learning to Simulate.

Solution:

  1. Check that your GPU driver, CUDA, cuDNN, and TensorFlow versions are compatible with one another.
  2. Enable memory growth on the physical GPU device in TensorFlow (see the sketch below).
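
A minimal sketch of point 2 for the TF 1.x API pinned by requirements.txt (where exactly to pass the config inside meshgraphnets/run_model.py is my assumption, not verified against the repo):

import tensorflow as tf  # tensorflow-gpu 1.15

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("memory growth enabled")))

# In meshgraphnets/run_model.py the training session is created with
# tf.train.MonitoredTrainingSession(...); that constructor accepts the same
# config= keyword argument, which is where it would be passed for training.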

@Xiaozl11

I ran into a similar problem.
The explanation I found online is that tensorflow-gpu==1.15 is tied to CUDA 10.0, but CUDA 10 only runs on GPUs up to the RTX 20 series, and mine is a 40-series card, so I can only train on the CPU.

Of course, it could also be something else.
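
If you do fall back to CPU-only training, a minimal sketch (my own suggestion, not from this thread) is to hide the GPU before TensorFlow is imported:

import os
# Hide all CUDA devices so tensorflow-gpu 1.15 falls back to CPU kernels.
# This must run before TensorFlow is imported anywhere in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.test.is_gpu_available())  # expect False; ops now run on the CPU

The equivalent from the shell is prefixing the training command with CUDA_VISIBLE_DEVICES=-1.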

@kikispy

kikispy commented Mar 12, 2024

Hi, I've hit the problem above on a 40-series GPU as well. Do you have any ideas for solving it? Would upgrading to TF 2.0 be possible? Thanks very much for any reply.

@Xiaozl11

Xiaozl11 commented Mar 12, 2024 via email
