Has anyone run into this error? I am confused: I have tried running this code on several different hosts and GPUs, but I still get the same error. The command and its full output are below.
python NGCF.py --dataset gowalla --regs [1e-5] --embed_size 64 --layer_size [64,64,64] --lr 0.0001 --save_flag 1 --pretrain 0 --batch_size 1024 --epoch 400 --verbose 1 --node_dropout [0.1] --mess_dropout [0.1,0.1,0.1]

n_users=29858, n_items=40981
n_interactions=1027370
n_train=810128, n_test=217242, sparsity=0.00084
already load adj matrix (70839, 70839) 0.16373920440673828
use the normalized adjacency matrix
using xavier initialization
2023-01-11 11:21:00.770564: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:884] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
without pretraining.
Epoch 0 [102.3s]: train==[549.91321=549.87742 + 0.00000]
Epoch 1 [100.1s]: train==[550.38794=550.35171 + 0.00000]
Epoch 2 [100.2s]: train==[551.38123=551.34350 + 0.00000]
Epoch 3 [100.2s]: train==[552.64801=552.60893 + 0.00000]
Epoch 4 [100.1s]: train==[554.05444=554.01392 + 0.00000]
Epoch 5 [100.2s]: train==[555.21527=555.17249 + 0.00000]
Epoch 6 [100.3s]: train==[557.17529=557.13013 + 0.00000]
Epoch 7 [100.2s]: train==[559.49915=559.45179 + 0.00000]
Epoch 8 [100.1s]: train==[561.67401=561.62488 + 0.00000]
2023-01-11 11:37:45.543238: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
  [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
  [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NGCF.py", line 490, in <module>
    ret = test(sess, model, users_to_test, drop_flag=True)
  File "/root/our-mm-learning/codes/NGCF2/NGCF/utility/batch_test.py", line 167, in test
    model.mess_dropout: [0.] * len(eval(args.layer_size))})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
  [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
  [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'MatMul_6', defined at:
  File "NGCF.py", line 360, in <module>
    model = NGCF(data_config=config, pretrain_data=pretrain_data)
  File "NGCF.py", line 101, in __init__
    self.batch_ratings = tf.matmul(self.u_g_embeddings, self.pos_i_g_embeddings, transpose_a=False, transpose_b=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4279, in mat_mul
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
  [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
  [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
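For anyone trying to reproduce this, it may help to see how large the failing matrix product is. The sketch below is only arithmetic on the shapes printed in the error message (m=2048, n=40981, k=256, dtype DT_FLOAT). The helper name matmul_footprint is made up for illustration and is not part of NGCF or TensorFlow, and the numbers alone do not show that memory is the cause of the failure.

```python
# Shapes copied from the "Blas GEMM launch failed" message in the log.
M, N, K = 2048, 40981, 256
BYTES_PER_FLOAT32 = 4  # the failing node is MatMul[T=DT_FLOAT]


def matmul_footprint(m, n, k, itemsize=BYTES_PER_FLOAT32):
    """Hypothetical helper: byte sizes of a (m, k), b (n, k) and a @ b.T (m, n)."""
    return {
        "a": m * k * itemsize,
        "b": n * k * itemsize,
        "out": m * n * itemsize,
    }


sizes = matmul_footprint(M, N, K)
for name in ("a", "b", "out"):
    nbytes = sizes[name]
    print("{:>3}: {:>12,d} bytes ({:.1f} MiB)".format(name, nbytes, nbytes / 2**20))
```

Under these assumptions the result tensor is 2048 x 40981 float32 values, roughly 320 MiB, on top of whatever the training graph already holds on the GPU. Note also that n=40981 equals n_items, so the product scores a batch of users against every item, and that the traceback places the failure inside the test(...) call rather than in a training step.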