[Dataset Performance] Add num workers on dataset processing - labels, tokenization (#1189)

SUMMARY:
* Add `preprocessing_num_workers` to run dataset processing (adding labels, tokenization) in parallel in the 2:4 example.

Before:
* Tokenizing: 371.12 examples/s
* Adding labels: 1890.18 examples/s
* Tokenizing: 333.39 examples/s
```bash
Tokenizing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:34<00:00, 371.12 examples/s]
Adding labels: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:06<00:00, 1890.18 examples/s]
Tokenizing:   9%|█████████▌                                                                                                     | 22077/256032 [00:59<11:41, 333.39 examples/s
```


After (num_proc=8):
* Tokenizing: 2703.93 examples/s
* Adding labels: 5524.98 examples/s
* Tokenizing: 2925.98 examples/s
```bash
Tokenizing (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:04<00:00, 2703.93 examples/s]
Adding labels (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:02<00:00, 5524.98 examples/s]
Tokenizing (num_proc=8): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 256032/256032 [01:27<00:00, 2925.98 examples/s]
```
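The `(num_proc=8)` in the progress bars above is Hugging Face `datasets`' per-example map running across worker processes. As a rough illustration of why this helps, here is a minimal stdlib-only sketch of process-parallel per-example mapping; it is not llmcompressor's actual code path, and the `tokenize`/`preprocess` names are illustrative stand-ins:

```python
# Illustrative sketch only: process-parallel per-example preprocessing,
# mirroring the effect of datasets.Dataset.map(..., num_proc=N).
from multiprocessing import Pool


def tokenize(example: str) -> list[str]:
    # Stand-in for a real tokenizer: whitespace split.
    return example.split()


def preprocess(examples: list[str], num_workers: int = 8) -> list[list[str]]:
    # With num_workers > 1, the per-example map is fanned out across
    # processes, so CPU-bound steps like tokenization scale with cores.
    if num_workers <= 1:
        return [tokenize(e) for e in examples]
    with Pool(processes=num_workers) as pool:
        return pool.map(tokenize, examples)


if __name__ == "__main__":
    print(preprocess(["hello world", "a b c"], num_workers=2))
    # → [['hello', 'world'], ['a', 'b', 'c']]
```

Throughput gains flatten once workers saturate available cores or the per-example work is too small to amortize process overhead, which is consistent with the ~7-8x speedup seen above at `num_proc=8`.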

TEST PLAN:
* Pass existing tests

Co-authored-by: Dipika Sikka <[email protected]>
horheynm and dsikka authored Feb 25, 2025
1 parent d3d2d1d commit 77e4f4c
Showing 1 changed file with 2 additions and 0 deletions.
```diff
@@ -33,6 +33,7 @@
 bf16 = False  # using full precision for training
 lr_scheduler_type = "cosine"
 warmup_ratio = 0.1
+preprocessing_num_workers = 8

 # this will run the recipe stage by stage:
 # oneshot sparsification -> finetuning -> oneshot quantization
@@ -52,6 +53,7 @@
     learning_rate=learning_rate,
     lr_scheduler_type=lr_scheduler_type,
     warmup_ratio=warmup_ratio,
+    preprocessing_num_workers=preprocessing_num_workers,
 )
 logger.info(
     "llmcompressor does not currently support running compressed models in the marlin24 format."  # noqa
```
