InternalError #57

Open
qianlong0502 opened this issue Jan 11, 2023 · 0 comments

qianlong0502 commented Jan 11, 2023

Has anyone else encountered this error? I am very confused: I have tried running this code on several different hosts and GPUs, but I still get the same error.

python NGCF.py --dataset gowalla --regs [1e-5] --embed_size 64 --layer_size [64,64,64] --lr 0.0001 --save_flag 1 --pretrain 0 --batch_size 1024 --epoch 400 --verbose 1 --node_dropout [0.1] --mess_dropout [0.1,0.1,0.1]
n_users=29858, n_items=40981
n_interactions=1027370
n_train=810128, n_test=217242, sparsity=0.00084
already load adj matrix (70839, 70839) 0.16373920440673828
use the normalized adjacency matrix
using xavier initialization
2023-01-11 11:21:00.770564: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:884] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
without pretraining.
Epoch 0 [102.3s]: train==[549.91321=549.87742 + 0.00000]
Epoch 1 [100.1s]: train==[550.38794=550.35171 + 0.00000]
Epoch 2 [100.2s]: train==[551.38123=551.34350 + 0.00000]
Epoch 3 [100.2s]: train==[552.64801=552.60893 + 0.00000]
Epoch 4 [100.1s]: train==[554.05444=554.01392 + 0.00000]
Epoch 5 [100.2s]: train==[555.21527=555.17249 + 0.00000]
Epoch 6 [100.3s]: train==[557.17529=557.13013 + 0.00000]
Epoch 7 [100.2s]: train==[559.49915=559.45179 + 0.00000]
Epoch 8 [100.1s]: train==[561.67401=561.62488 + 0.00000]
2023-01-11 11:37:45.543238: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
         [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
         [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NGCF.py", line 490, in <module>
    ret = test(sess, model, users_to_test, drop_flag=True)
  File "/root/our-mm-learning/codes/NGCF2/NGCF/utility/batch_test.py", line 167, in test
    model.mess_dropout: [0.] * len(eval(args.layer_size))})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
         [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
         [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'MatMul_6', defined at:
  File "NGCF.py", line 360, in <module>
    model = NGCF(data_config=config, pretrain_data=pretrain_data)
  File "NGCF.py", line 101, in __init__
    self.batch_ratings = tf.matmul(self.u_g_embeddings, self.pos_i_g_embeddings, transpose_a=False, transpose_b=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4279, in mat_mul
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
         [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
         [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]