
datasets.Dataset.map too slow even with num_proc #10

Open
luckyfan-cs opened this issue Feb 2, 2025 · 5 comments

Comments

@luckyfan-cs

After modifying my dataset processing pipeline, the speed of datasets.Dataset.map() remains slow (40-70 examples/sec). The dataset mapping operations are as follows:

padded_dataset = dataset.map(pad_sequence, batched=True, num_proc=data_args.preprocessing_num_workers)
sp_dataset = padded_dataset.map(sp_split, batched=True, num_proc=data_args.preprocessing_num_workers)
However, throughput is still far below expectations even after tuning num_proc. Additionally, the run eventually fails with:

[rank1]: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

@HaoshengZou
Collaborator

  1. Is your [rank1]: RuntimeError related to the speed issue?
  2. We tested on small data that preprocessing_num_workers=8 achieves a near-8x speedup compared to not setting preprocessing_num_workers (which defaults to no multiprocessing). How much speedup do you see when you turn on multiprocessing with preprocessing_num_workers set?

@luckyfan-cs
Copy link
Author

  1. It is not related to the speed issue.
  2. I set the map call with num_proc=8, and initially the progress was steady:
    Map (num_proc=8): 66%|██████▋ | 68000/102561 [31:28<12:53, 44.69 examples/s]
    Map (num_proc=8): 67%|██████▋ | 69000/102561 [31:29<08:59, 62.20 examples/s]
    Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:29<06:13, 87.11 examples/s]
    Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:32<08:07, 66.84 examples/s]
    Map (num_proc=8): 70%|███████ | 72000/102561 [31:29<03:14, 157.17 examples/s]
    Map (num_proc=8): 69%|██████▉ | 71000/102561 [31:32<05:35, 94.03 examples/s]

However, after some time the process failed unexpectedly with no apparent cause, and the speed was also very slow.

@luckyfan-cs
Author

No error was reported.

@luckyfan-cs
Author

I compared my setup with LLaMA-Factory. The parameters for 360-LLaMA-Factory are as follows:

cutoff_len: 120,000
overwrite_cache: true
preprocessing_num_workers: 16
max_new_tokens: 24,000

Map (num_proc=8): 66%|██████▋ | 68000/102561 [31:28<12:53, 44.69 examples/s]
Map (num_proc=8): 67%|██████▋ | 69000/102561 [31:29<08:59, 62.20 examples/s]
Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:29<06:13, 87.11 examples/s]
Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:32<08:07, 66.84 examples/s]
Map (num_proc=8): 70%|███████ | 72000/102561 [31:29<03:14, 157.17 examples/s]
Map (num_proc=8): 69%|██████▉ | 71000/102561 [31:32<05:35, 94.03 examples/s]

For LLaMA-Factory:

dataset: open_thoughts_114k
template: qwen
cutoff_len: 24000
overwrite_cache: true
preprocessing_num_workers: 16

33%|███▎ | 474/1425 [10:12<20:30, 1.29s/it]
33%|███▎ | 475/1425 [10:14<27:00, 1.71s/it]
33%|███▎ | 476/1425 [10:16<27:53, 1.76s/it]
33%|███▎ | 477/1425 [10:18<25:08, 1.59s/it]
34%|███▎ | 478/1425 [10:18<20:27, 1.30s/it]
34%|███▎ | 479/1425 [10:20<22:19, 1.42s/it]
34%|███▎ | 480/1425 [10:21<18:49, 1.20s/it]

There is roughly a 100x difference in processing speed between the two setups.

@gom168
Collaborator

gom168 commented Feb 3, 2025

    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --preprocessing_num_workers 8 \
    --template qwen \
    --cutoff_len 120000 \
    --cache_dir .cache \
    --overwrite_cache \
    --sequence_parallel_size 2 \

Running padding split on dataset (num_proc=8): 100%|█| 6284/6284 [01:24<00:00, 74.77 examples/s]
Running padding split on dataset (num_proc=8): 100%|█| 6284/6284 [01:24<00:00, 73.95 examples/s]
Running sequence parallel split on dataset (num_proc=8): 100%|█| 6284/6284 [04:51<00:00, 21.58 examples/s]
Running sequence parallel split on dataset (num_proc=8): 100%|█| 6284/6284 [05:03<00:00, 20.71 examples/s]

We used the open-source dataset Yukang/LongAlpaca-16k-length for testing, with the settings and results shown above. On our side, everything looks normal.
