
datasets.Dataset.map too slow even with num_proc #10

Open
luckyfan-cs opened this issue Feb 2, 2025 · 5 comments

Comments

@luckyfan-cs

After modifying my dataset processing pipeline, the speed of datasets.Dataset.map() remains slow (40-70 examples/sec). The dataset mapping operations are as follows:

padded_dataset = dataset.map(pad_sequence, batched=True, num_proc=data_args.preprocessing_num_workers)
sp_dataset = padded_dataset.map(sp_split, batched=True, num_proc=data_args.preprocessing_num_workers)
However, throughput is still far below expectations even after tuning num_proc. Additionally, the run eventually fails with:

[rank1]: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

@HaoshengZou
Collaborator

  1. Is your [rank1]: RuntimeError related to the speed issue?
  2. We tested on small data that preprocessing_num_workers=8 achieves a near-8x speedup compared to not setting preprocessing_num_workers (which defaults to no multiprocessing). How much speedup do you see when you turn on multiprocessing with preprocessing_num_workers set?

@luckyfan-cs
Copy link
Author

  1. It is not related to the speed issue.
  2. I set the map call with num_proc=8, and initially the progress was steady:
    Map (num_proc=8): 66%|██████▋ | 68000/102561 [31:28<12:53, 44.69 examples/s]
    Map (num_proc=8): 67%|██████▋ | 69000/102561 [31:29<08:59, 62.20 examples/s]
    Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:29<06:13, 87.11 examples/s]
    Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:32<08:07, 66.84 examples/s]
    Map (num_proc=8): 70%|███████ | 72000/102561 [31:29<03:14, 157.17 examples/s]
    Map (num_proc=8): 69%|██████▉ | 71000/102561 [31:32<05:35, 94.03 examples/s]

However, after some time the process failed unexpectedly with no apparent cause, and the speed was also very slow.

@luckyfan-cs
Author

No error was reported.

@luckyfan-cs
Author

I compared my setup with LLaMA-Factory. The parameters for 360-LLaMA-Factory are as follows:

cutoff_len: 120,000
overwrite_cache: true
preprocessing_num_workers: 16
max_new_tokens: 24,000

Map (num_proc=8): 66%|██████▋ | 68000/102561 [31:28<12:53, 44.69 examples/s]
Map (num_proc=8): 67%|██████▋ | 69000/102561 [31:29<08:59, 62.20 examples/s]
Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:29<06:13, 87.11 examples/s]
Map (num_proc=8): 68%|██████▊ | 70000/102561 [31:32<08:07, 66.84 examples/s]
Map (num_proc=8): 70%|███████ | 72000/102561 [31:29<03:14, 157.17 examples/s]
Map (num_proc=8): 69%|██████▉ | 71000/102561 [31:32<05:35, 94.03 examples/s]

For LLaMA-Factory:

dataset: open_thoughts_114k
template: qwen
cutoff_len: 24000
overwrite_cache: true
preprocessing_num_workers: 16

33%|███▎ | 474/1425 [10:12<20:30, 1.29s/it]
33%|███▎ | 475/1425 [10:14<27:00, 1.71s/it]
33%|███▎ | 476/1425 [10:16<27:53, 1.76s/it]
33%|███▎ | 477/1425 [10:18<25:08, 1.59s/it]
34%|███▎ | 478/1425 [10:18<20:27, 1.30s/it]
34%|███▎ | 479/1425 [10:20<22:19, 1.42s/it]
34%|███▎ | 480/1425 [10:21<18:49, 1.20s/it]

There is roughly a 100x difference in processing speed between the two setups.

@gom168
Collaborator

gom168 commented Feb 3, 2025

    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --preprocessing_num_workers 8 \
    --template qwen \
    --cutoff_len 120000 \
    --cache_dir .cache \
    --overwrite_cache \
    --sequence_parallel_size 2 \

Running padding split on dataset (num_proc=8): 100%|█| 6284/6284 [01:24<00:00, 74.77 examples/s]
Running padding split on dataset (num_proc=8): 100%|█| 6284/6284 [01:24<00:00, 73.95 examples/s]
Running sequence parallel split on dataset (num_proc=8): 100%|█| 6284/6284 [04:51<00:00, 21.58 examples/s]
Running sequence parallel split on dataset (num_proc=8): 100%|█| 6284/6284 [05:03<00:00, 20.71 examples/s]

We used the open-source dataset Yukang/LongAlpaca-16k-length for testing, with the settings and results shown above. On our side, everything looks normal.
