Created train.bin and test.bin following the HugeCTR DLRM sample. The md5sums are the same.
Split the data using the SOK preprocessing script split_bin.py, replacing --slot_size_array with the list from train.py in the HugeCTR DLRM sample. All other arguments are left at their defaults. Do I need to change the default dtype (int32) for label_raw_type, dense_raw_type, and category_raw_type?
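On the raw-type question, the sketch below only illustrates why the dtype matters; it does not say what split_bin.py expects. The same bytes decode to a different number of values depending on the element width. The helper `count_items` and the sample buffer are hypothetical, and only Python's standard `struct` module is used.

```python
import struct

def count_items(raw: bytes, fmt: str) -> int:
    """Number of elements in raw when decoded with a single struct format code."""
    size = struct.calcsize(fmt)
    if len(raw) % size != 0:
        raise ValueError(f"{len(raw)} bytes is not a multiple of {size}")
    return len(raw) // size

# Hypothetical buffer: four little-endian int32 values (16 bytes).
raw = struct.pack("<4i", 1, 2, 3, 4)

# Decoding as int32 gives back the four original values.
as_int32 = struct.unpack(f"<{count_items(raw, '<i')}i", raw)

# Decoding the same bytes as int64 gives two unrelated, much larger values.
as_int64 = struct.unpack(f"<{count_items(raw, '<q')}q", raw)
```

If the raw types passed to the split script do not match how train.bin was written, element counts downstream would be off, so it may be worth confirming them against the HugeCTR sample's preprocessing.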
After running to iteration 3790, the following error occurs. It looks like something is wrong with the dataset.
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 146, in <module>
[1,6]<stderr>: trainer.train(eval_in_last=False, early_stop=args.early_stop, epochs=args.epochs)
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 247, in train
[1,6]<stderr>: auc = evaluate(self._model, self._test_dataset, self._auc_thresholds)
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 20, in evaluate
[1,6]<stderr>: for idx, (samples, labels) in enumerate(dataset):
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 152, in __getitem__
[1,6]<stderr>: return self._prefetch_queue.get().result()
[1,6]<stderr>: File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[1,6]<stderr>: return self.__get_result()
[1,6]<stderr>: File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[1,6]<stderr>: raise self._exception
[1,6]<stderr>: File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[1,6]<stderr>: result = self.fn(*self.args, **self.kwargs)
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 205, in _get
[1,6]<stderr>: tf.RaggedTensor.from_row_lengths(flat_values, row_lengths[i])
[1,6]<stderr>: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[1,6]<stderr>: raise e.with_traceback(filtered_tb) from None
[1,6]<stderr>: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/check_ops.py", line 485, in _binary_assert
[1,6]<stderr>: raise errors.InvalidArgumentError(
[1,6]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: Arguments to _from_row_partition do not form a valid RaggedTensor
[1,6]<stderr>:Condition x == y did not hold.
[1,6]<stderr>:First 1 elements of x:
[1,6]<stderr>:[8192]
[1,6]<stderr>:First 1 elements of y:
[1,6]<stderr>:[2]
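The assertion above compares two counts that have to agree for a ragged tensor: the total implied by the row lengths and the number of flat values. A minimal sketch of that check in plain Python follows; it could be run on a batch before the `tf.RaggedTensor.from_row_lengths` call to see which side is off. The helper name `check_ragged_inputs` is hypothetical, and mapping 8192 to the row-length total and 2 to the flat-value count is an assumption made for illustration.

```python
def check_ragged_inputs(flat_values, row_lengths):
    """Return None if the inputs are consistent, else a description of the problem."""
    if any(n < 0 for n in row_lengths):
        return "row_lengths contains a negative entry"
    expected = sum(row_lengths)   # values the row lengths claim to cover
    actual = len(flat_values)     # values actually present
    if expected != actual:
        return f"row_lengths sum to {expected} but there are {actual} flat values"
    return None

# Consistent case: two rows of lengths 2 and 1 over three values.
ok = check_ragged_inputs([10, 11, 12], [2, 1])

# Assumed shape of the failing batch: 8192 rows of length 1, only 2 values read.
bad = check_ragged_inputs([7, 9], [1] * 8192)
```

A mismatch like this on a single batch late in the run would be consistent with a short read at the end of a split file, though that is a guess rather than a diagnosis.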
Hi @Orca-bit, is this bug reproducible every time? If so, I will try to reproduce it and then get back to you with an answer. I will also test the issue mentioned in #463.
@kanghui0204 Yes, it is reproducible. By the way, could you share the md5sums of the SOK split datasets? I have already checked the md5sums of the HugeCTR datasets, i.e. train.bin, test.bin, and val.bin.