Add the example script for data pretokenization #101

mryab · 2024-04-18T11:39:20Z

This PR provides a minimal example of how to tokenize data locally before submitting it to the Finetuning API. The script supports loss masking (for padding tokens in this case) by adding the labels field.

Running to get the dataset with just the token IDs and the attention mask (no labels):

python examples/tokenize_data.py

Running with loss labels as well (padding tokens get a label of -100 and are ignored in the loss):

python examples/tokenize_data.py --add-labels --out-filename data_with_labels.parquet
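To illustrate the loss masking described above, here is a minimal sketch of how padding positions could be given a label of -100. It is not the code from examples/tokenize_data.py: the function name pad_and_mask, the input_ids and attention_mask field names, and the pad token ID of 0 are assumptions made for illustration; only the labels field and the -100 value come from the description.

```python
IGNORE_INDEX = -100  # label value that the PR description says is ignored in the loss


def pad_and_mask(token_ids, max_length, pad_token_id):
    """Pad (or truncate) token_ids to max_length and build matching labels.

    Hypothetical helper: real tokens keep their ID as the label,
    padding positions get IGNORE_INDEX so they do not contribute to the loss.
    """
    token_ids = list(token_ids)[:max_length]
    num_pad = max_length - len(token_ids)
    return {
        "input_ids": token_ids + [pad_token_id] * num_pad,
        "attention_mask": [1] * len(token_ids) + [0] * num_pad,
        "labels": token_ids + [IGNORE_INDEX] * num_pad,
    }


# Assumed pad token ID of 0; a real tokenizer defines its own.
example = pad_and_mask([11, 12, 13], max_length=5, pad_token_id=0)
print(example["labels"])
```

Copying the token IDs into labels and overwriting only the padded positions keeps the three fields the same length, which is what a per-row columnar format such as Parquet expects.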

I checked that both "together files check" and "together files upload" work with the outputs; the file type is reported as Parquet.

TODO:

Adding a reference in README/website docs
Checking everything through the API

Nutlope

LGTM!

azahed98

LGTM!

Co-authored-by: Ben Athiwaratkun <[email protected]>

mryab requested a review from orangetin April 18, 2024 11:39

mryab changed the title ~~Add example of data pretokenization~~ Add the example script for data pretokenization Apr 19, 2024

mryab marked this pull request as ready for review April 19, 2024 15:12

mryab requested a review from azahed98 April 19, 2024 15:12

orangetin added the fine-tuning label Apr 22, 2024

mryab requested review from Nutlope and removed request for orangetin April 24, 2024 10:37

Nutlope approved these changes Apr 24, 2024

mryab force-pushed the add-parquet-examples branch from ffd5476 to 2d0dc35 April 24, 2024 21:53

azahed98 approved these changes Apr 24, 2024

mryab and others added 3 commits April 30, 2024 14:46

Add example of data pretokenization

2e5a78c

Co-authored-by: Ben Athiwaratkun <[email protected]>

Replace underscores with dashes in argument names

e55eead

Add support for packing

a697c29

Co-authored-by: Ben Athiwaratkun <[email protected]>

mryab force-pushed the add-parquet-examples branch from 639a904 to a697c29 April 30, 2024 13:46

Update poetry.lock

0911ad4

orangetin approved these changes Apr 30, 2024

orangetin merged commit 236edca into main Apr 30, 2024
6 of 11 checks passed

orangetin deleted the add-parquet-examples branch April 30, 2024 17:07
