datasets.Dataset.map too slow even with num_proc #10
However, after some time, the process failed unexpectedly with no apparent cause, and the speed is also very slow.
No error was reported.
I compared my setup with LLaMA-Factory. The parameters for 360-LLaMA-Factory are as follows:
cutoff_len: 120,000
Map (num_proc=8): 66%|██████▋ | 68000/102561 [31:28<12:53, 44.69 examples/s]

For LLaMA-Factory:
dataset: open_thoughts_114k
33%|███▎ | 474/1425 [10:12<20:30, 1.29s/it]
33%|███▎ | 475/1425 [10:14<27:00, 1.71s/it]
33%|███▎ | 476/1425 [10:16<27:53, 1.76s/it]
33%|███▎ | 477/1425 [10:18<25:08, 1.59s/it]
34%|███▎ | 478/1425 [10:18<20:27, 1.30s/it]
34%|███▎ | 479/1425 [10:20<22:19, 1.42s/it]
34%|███▎ | 480/1425 [10:21<18:49, 1.20s/it]

There is roughly a 100x difference in processing speed between the two setups.
Running padding split on dataset (num_proc=8): 100%|█| 6284/6284 [01:24<00:00, 74.77 examples/s]

We used the open-source dataset Yukang/LongAlpaca-16k-length for testing. The specific settings and results are shown above. Our test results are all normal.
After modifying my dataset processing pipeline, the speed of datasets.Dataset.map() remains slow (40-70 examples/sec). The dataset mapping operations are as follows:
padded_dataset = dataset.map(pad_sequence, batched=True, num_proc=data_args.preprocessing_num_workers)
sp_dataset = padded_dataset.map(sp_split, batched=True, num_proc=data_args.preprocessing_num_workers)
However, the speed is still too slow even after optimizing num_proc. The expected speed is much higher.
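One way to narrow this down is to time the mapped function by itself, outside datasets.Dataset.map(), on a single in-memory batch. The sketch below is only an illustration: this pad_sequence is a hypothetical stand-in (not the project's implementation), and CUTOFF_LEN, PAD_ID, and examples_per_second are names invented for the example.

```python
import time

CUTOFF_LEN = 4096  # hypothetical value; the thread reports cutoff_len = 120,000
PAD_ID = 0         # hypothetical pad token id


def pad_sequence(batch):
    """Hypothetical stand-in: truncate or right-pad every sequence to CUTOFF_LEN."""
    padded = []
    for ids in batch["input_ids"]:
        ids = ids[:CUTOFF_LEN]
        padded.append(ids + [PAD_ID] * (CUTOFF_LEN - len(ids)))
    return {"input_ids": padded}


def examples_per_second(fn, batch, repeats=3):
    """Call fn directly on one batch and return the best observed throughput."""
    n = len(batch["input_ids"])
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return n / max(best, 1e-9)


# A synthetic batch of 256 short sequences with varying lengths.
batch = {"input_ids": [list(range(100 + i)) for i in range(256)]}
rate = examples_per_second(pad_sequence, batch)
print(f"pad_sequence alone: {rate:.0f} examples/s")
```

If the function alone runs far faster than the 40-70 examples/sec seen in the progress bar, the time is probably being spent elsewhere, for example in writing very long padded sequences back to the dataset or in moving them between worker processes. That is a guess to check, not a diagnosis.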
[rank1]: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
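The error message suggests disabling multiprocessing. With datasets, that means leaving num_proc at its default of None so that map() runs in the main process and any Python exception is raised directly. The same idea can be sketched without the library by running the batched function serially and reporting the first batch that fails. Everything here is hypothetical: iter_batches and find_failing_batch are helper names made up for this example, and sp_split is a simplified stand-in that splits each sequence into two halves.

```python
import traceback


def iter_batches(columns, batch_size):
    """Yield (start_index, batch) slices from a dict of equal-length columns."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield start, {k: v[start:start + batch_size] for k, v in columns.items()}


def find_failing_batch(fn, columns, batch_size=1000):
    """Run fn over every batch in this process.

    Returns (start_index, traceback_text) for the first batch that raises,
    or None if every batch succeeds.
    """
    for start, batch in iter_batches(columns, batch_size):
        try:
            fn(batch)
        except Exception:
            return start, traceback.format_exc()
    return None


def sp_split(batch):
    """Hypothetical stand-in: split each sequence into two equal halves."""
    world_size = 2  # assumed number of sequence-parallel ranks
    out = []
    for ids in batch["input_ids"]:
        if len(ids) % world_size:
            raise ValueError(f"length {len(ids)} not divisible by {world_size}")
        step = len(ids) // world_size
        out.append([ids[:step], ids[step:]])
    return {"input_ids": out}


# The third sequence has an odd length, so the second batch (start index 2) fails.
columns = {"input_ids": [[0] * 8, [0] * 8, [0] * 7, [0] * 8]}
result = find_failing_batch(sp_split, columns, batch_size=2)
if result is not None:
    start, tb = result
    print(f"first failing batch starts at row {start}")
    print(tb)
```

A serial run only helps when the worker died from a Python exception. If the subprocess was instead killed by the operating system, for example for using too much memory, no traceback exists, and watching memory use during the serial run may be more informative.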