Add support for pre-tokenized streaming dataset finetuning #601

Closed
boomanaiden154 wants to merge 11 commits

Conversation

boomanaiden154
Contributor

This PR adds support for fine-tuning with a streaming dataset that has already been pre-tokenized.
Passing the test depends on the fix in #600.
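For context, here is a minimal sketch of how such a pre-tokenized split might be written with mosaicml-streaming's MDSWriter. The tokens/labels column names and the int64-bytes encoding are illustrative assumptions for this sketch, not necessarily the exact schema this PR expects.

```python
# Illustrative sketch only: writes a pre-tokenized fine-tuning split with
# mosaicml-streaming. The 'tokens'/'labels' column names and the int64-bytes
# encoding are assumptions, not necessarily the schema this PR implements.
import numpy as np
from streaming import MDSWriter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

columns = {'tokens': 'bytes', 'labels': 'bytes'}
samples = [{'prompt': 'What is 2 + 2?', 'response': '4'}]

with MDSWriter(out='./pretokenized-train', columns=columns, compression='zstd') as writer:
    for sample in samples:
        prompt_ids = tokenizer(sample['prompt'])['input_ids']
        response_ids = tokenizer(sample['response'])['input_ids']
        writer.write({
            'tokens': np.asarray(prompt_ids, dtype=np.int64).tobytes(),
            'labels': np.asarray(response_ids, dtype=np.int64).tobytes(),
        })
```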

@dakinggg
Collaborator

Thanks! Will take a look after we resolve #600

This patch fixes a pyright string concatenation warning and also adds
typing information where necessary.
@alextrott16
Contributor

I am not opposed to supporting pre-tokenized streaming datasets. But I am concerned that the tokens/labels structure used to identify pre-tokenized formats is too arbitrary. It's an issue mostly because we don't have any tools that someone can use to build a streaming dataset with the tokens/labels structure.

Rather than having a "backdoor" in StreamingFinetuningDataset that recognizes pre-tokenized formats from the tokens/labels structure, I'd prefer that we stick with the prompt/response structure and simply let the code determine whether tokenization is required based on whether the prompt/response values are strings or bytes. I think this will make it easier to add a pre-tokenization option to our existing tooling.

Please let me know if this ask is unclear. Thanks for helping to grow our codebase!
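For illustration, a minimal sketch of the suggestion above: keep the prompt/response schema and decide at read time whether a sample still needs tokenization, based on whether the stored values are raw text or already-encoded token ids. The helper name and the int64-bytes encoding are assumptions for this sketch, not the actual llm-foundry implementation.

```python
# Sketch of the reviewer's suggestion, not the actual llm-foundry code:
# keep prompt/response columns and tokenize only when the values are raw text.
import numpy as np

def maybe_tokenize(sample, tokenizer, max_seq_len):
    """Return input_ids/labels, tokenizing only if the sample holds raw strings."""
    prompt, response = sample['prompt'], sample['response']
    if isinstance(prompt, bytes) and isinstance(response, bytes):
        # Pre-tokenized sample: bytes are assumed to hold serialized int64 token ids.
        input_ids = np.frombuffer(prompt, dtype=np.int64).tolist()
        labels = np.frombuffer(response, dtype=np.int64).tolist()
    else:
        # Raw text sample: tokenize on the fly.
        input_ids = tokenizer(prompt, truncation=True, max_length=max_seq_len)['input_ids']
        labels = tokenizer(response, truncation=True, max_length=max_seq_len)['input_ids']
    return {'input_ids': input_ids, 'labels': labels}
```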

llmfoundry/data/finetuning/tasks.py (review comment, outdated, resolved)
@boomanaiden154
Contributor Author

That's a pretty reasonable request, especially given there's nothing upstream using this! I'll change the implementation to follow your suggestion.

I'm eventually planning to write some tooling for upstream use that produces this format (probably as part of #611), but that'll come later.

@alextrott16
Contributor

Thanks for adding this and incorporating the suggestions! LGTM!!

@dakinggg
Collaborator

dakinggg commented Mar 6, 2024

This has been done by #945

@dakinggg closed this Mar 6, 2024