
[MeshGraphNets] cuda_blas.cc:428, failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED #464

Open
hjyu94 opened this issue Oct 9, 2023 · 4 comments

hjyu94 commented Oct 9, 2023

Hi. I'm trying to run the MeshGraphNets model but encountered an error with this command:

python -m meshgraphnets.run_model --mode=train --model=cloth --checkpoint_dir=meshgraphnets/dataset/chk --dataset_dir=meshgraphnets/dataset/flag_simple

The error is:

2023-10-09 14:41:01.689006: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-10-09 14:41:01.689021: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-10-09 14:41:01.689035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2023-10-09 14:41:01.689048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2023-10-09 14:41:01.689061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2023-10-09 14:41:01.689074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2023-10-09 14:41:01.689088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

...

2023-10-09 14:41:14.669500: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2023-10-09 14:41:14.669603: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 130, in <module>
    app.run(main)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 125, in main
    learner(model, params)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 82, in learner
    _, step, loss = sess.run([train_op, global_step, loss_op])
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/hyojeong/.local/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I've seen suggestions on Stack Overflow to upgrade TensorFlow to 2.x,
but meshgraphnets/requirements.txt specifies tensorflow-gpu>=1.15,<2.

Has anyone faced this issue? Should I upgrade TensorFlow? I did try once, but it caused another problem.

Please let me know if you need the full error details or package versions.
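
For reference, here is a minimal standalone check (my own sketch, not part of the repo) that runs the same GEMM shape reported in the log. If this also fails with CUBLAS_STATUS_EXECUTION_FAILED, the problem is the CUDA/driver/GPU combination rather than MeshGraphNets itself:

import tensorflow as tf  # tensorflow-gpu 1.15, as pinned by requirements.txt

# Reproduce the failing GEMM shape from the error log: (9212, 7) x (7, 128).
a = tf.random.normal([9212, 7])
b = tf.random.normal([7, 128])
c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c).shape)  # expect (9212, 128) if cuBLAS works on this GPU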

@BoyuanTang331

Hi, I also ran into this problem when running Learning to Simulate.

Solution:

  1. Check that your GPU driver, CUDA, cuDNN, and TensorFlow versions are compatible with one another.
  2. Enable memory growth on the physical GPU device in TensorFlow (see the sketch below).
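
A minimal sketch of point 2 for the TF 1.x API pinned by requirements.txt (where exactly to pass the config inside meshgraphnets/run_model.py is my assumption, not verified against the repo):

import tensorflow as tf  # tensorflow-gpu 1.15

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("memory growth enabled")))

# In meshgraphnets/run_model.py the training session is created with
# tf.train.MonitoredTrainingSession(...); that constructor accepts the same
# config= keyword argument, which is where it would be passed for training.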

@Xiaozl11

I ran into a similar problem.
The explanation I found online is that tensorflow-gpu==1.15 is tied to CUDA 10.0, but CUDA 10 only runs on GPUs up to the RTX 20 series, and mine is a 40-series card, so I can only train on the CPU.

Of course, it could also be something else.
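
If you do fall back to CPU-only training, a minimal sketch (my own suggestion, not from this thread) is to hide the GPU before TensorFlow is imported:

import os
# Hide all CUDA devices so tensorflow-gpu 1.15 falls back to CPU kernels.
# This must run before TensorFlow is imported anywhere in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.test.is_gpu_available())  # expect False; ops now run on the CPU

The equivalent from the shell is prefixing the training command with CUDA_VISIBLE_DEVICES=-1.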

@kikispy

kikispy commented Mar 12, 2024

Hi, I've hit the problem above on a 40-series GPU as well. Do you have any ideas for solving it? Would upgrading to TF 2.0 be possible? Thanks very much for any reply.

@Xiaozl11

Xiaozl11 commented Mar 12, 2024 via email
