Add support for pre-tokenized streaming dataset finetuning #601

Closed
boomanaiden154 wants to merge 11 commits

Conversation

boomanaiden154
Contributor

This PR adds support for fine-tuning with a streaming dataset that has already been pre-tokenized.
Passing the test depends on the fix in #600.
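For context, here is a minimal sketch of how such a pre-tokenized split might be written with mosaicml-streaming's MDSWriter. The tokens/labels column names and the int64-bytes encoding are illustrative assumptions for this sketch, not necessarily the exact schema this PR expects.

```python
# Illustrative sketch only: writes a pre-tokenized fine-tuning split with
# mosaicml-streaming. The 'tokens'/'labels' column names and the int64-bytes
# encoding are assumptions, not necessarily the schema this PR implements.
import numpy as np
from streaming import MDSWriter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

columns = {'tokens': 'bytes', 'labels': 'bytes'}
samples = [{'prompt': 'What is 2 + 2?', 'response': '4'}]

with MDSWriter(out='./pretokenized-train', columns=columns, compression='zstd') as writer:
    for sample in samples:
        prompt_ids = tokenizer(sample['prompt'])['input_ids']
        response_ids = tokenizer(sample['response'])['input_ids']
        writer.write({
            'tokens': np.asarray(prompt_ids, dtype=np.int64).tobytes(),
            'labels': np.asarray(response_ids, dtype=np.int64).tobytes(),
        })
```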

@dakinggg
Collaborator

Thanks! Will take a look after we resolve #600

This patch fixes a pyright string concatenation warning and also adds
typing information where necessary.
@alextrott16
Contributor

I am not opposed to supporting pre-tokenized streaming datasets. But I am concerned that the tokens/labels structure used to identify pre-tokenized formats is too arbitrary. It's an issue mostly because we don't have any tools that someone can use to build a streaming dataset with the tokens/labels structure.

Rather than having a "backdoor" in StreamingFinetuningDataset that recognizes pre-tokenized formats from the tokens/labels structure, I'd prefer that we stick with the prompt/response structure and simply let the code determine whether tokenization is required based on whether the prompt/response values are strings or bytes. I think this will make it easier to add a pre-tokenization option to our existing tooling.

Please let me know if this ask is unclear. Thanks for helping to grow our codebase!
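For illustration, a minimal sketch of the suggestion above: keep the prompt/response schema and decide at read time whether a sample still needs tokenization, based on whether the stored values are raw text or already-encoded token ids. The helper name and the int64-bytes encoding are assumptions for this sketch, not the actual llm-foundry implementation.

```python
# Sketch of the reviewer's suggestion, not the actual llm-foundry code:
# keep prompt/response columns and tokenize only when the values are raw text.
import numpy as np

def maybe_tokenize(sample, tokenizer, max_seq_len):
    """Return input_ids/labels, tokenizing only if the sample holds raw strings."""
    prompt, response = sample['prompt'], sample['response']
    if isinstance(prompt, bytes) and isinstance(response, bytes):
        # Pre-tokenized sample: bytes are assumed to hold serialized int64 token ids.
        input_ids = np.frombuffer(prompt, dtype=np.int64).tolist()
        labels = np.frombuffer(response, dtype=np.int64).tolist()
    else:
        # Raw text sample: tokenize on the fly.
        input_ids = tokenizer(prompt, truncation=True, max_length=max_seq_len)['input_ids']
        labels = tokenizer(response, truncation=True, max_length=max_seq_len)['input_ids']
    return {'input_ids': input_ids, 'labels': labels}
```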

llmfoundry/data/finetuning/tasks.py (review comment, outdated, resolved)
@boomanaiden154
Contributor Author

That's a pretty reasonable request, especially given there's nothing upstream using this! I'll change the implementation to follow your suggestion.

I'm eventually planning to write some tooling for upstream use that produces this format (probably as part of #611), but that'll come later.

@alextrott16
Contributor

Thanks for adding this and incorporating the suggestions! LGTM!!

@dakinggg
Collaborator

dakinggg commented Mar 6, 2024

This has been done by #945

@dakinggg closed this Mar 6, 2024