Added MessagesDataloader so we can just use messages in our datasets rather than tokenized inputs #92
Conversation
Discussed offline with @SeanKski. There is a limitation in the current `unified_tokenize_dataset.py`, where it only supports converting single prompts to messages. We would like to have an additional preprocessing script that directly ingests a JSONL file for data preprocessing, where each row contains the following keys:
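For illustration, a row in such a JSONL file might look like the sketch below. The actual set of required keys isn't spelled out in this thread, so everything beyond the `messages` key is a placeholder.

```python
# Hypothetical example of a single JSONL row for the proposed preprocessing
# script. Only the `messages` key comes from this discussion; the row contents
# and file name are placeholders.
import json

example_row = {
    'messages': [
        {'role': 'user', 'content': 'What is 2 + 2?'},
        {'role': 'assistant', 'content': '2 + 2 = 4.'},
    ],
}

with open('train.jsonl', 'a') as f:
    f.write(json.dumps(example_row) + '\n')
```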
```python
elif key == 'messages':
    # the messages should be [num_batches_per_update, batch_size, num_turns]
    # need to flatten this to [num_batches_per_update * batch_size, num_turns]
    ret_batch[key] = [
        message_chain for batch in curr_values
        for message_chain in batch
    ]
```
Does this flatten this correctly? `curr_values` is just a raw list of messages, so I think this would be unpacking all of the messages for multi-turn. Let me know if I'm interpreting the code incorrectly.
Hmmm, good point! I originally thought this was `[num_batches_per_update, batch_size, num_turns]` (especially since there's concatting and flattening happening for the other keys), but you're right, this is already in `[num_batches_per_update * batch_size, num_turns]`. So I just removed this part and now we should be good!
@@ -0,0 +1,170 @@
# Copyright 2025 MosaicML ComposeRL authors
Is this file supposed to be a part of the PR? Or is it still okay to be using `unified_tokenize_dataset.py`?

done! check the new readme and
This adds the `MessagesStreamingDataset`, which allows us to use chat-ml messages as our main data interface. This can essentially be thought of as the same as the `PromptStreamingDataset`, but with the tokenizing of the prompts happening on the fly rather than requiring the samples to be pre-tokenized.

What this allows us to do:

- keep `messages` in our batches so we don't have to untokenize + un-chat-template to use chat interfaces such as bpt

I've updated the README and example yamls to use messages rather than tokenized prompts.
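As a rough illustration of the on-the-fly tokenization idea (this is not the actual `MessagesStreamingDataset` implementation, and the class and argument names are made up), the general pattern is to apply the chat template inside `__getitem__`:

```python
# Illustrative sketch only: tokenize chat-ml messages at access time instead
# of storing pre-tokenized prompts in the dataset.
class ToyMessagesDataset:

    def __init__(self, samples, tokenizer):
        self.samples = samples      # each sample is a dict with a 'messages' key
        self.tokenizer = tokenizer  # any HF tokenizer with a chat template

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        messages = self.samples[idx]['messages']
        # Apply the chat template and tokenize on the fly; note this re-runs
        # tokenization on every access (the caching caveat noted below).
        prompt_ids = self.tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
        )
        return {'messages': messages, 'prompt_ids': prompt_ids}
```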
Couple things to note:

- This sets us up to use `vllm.chat` rather than `vllm.generate`, but I'm keeping the PRs separate for house-keeping reasons.
- Since tokenization now happens on the fly and isn't cached, `[dataset.__getitem__(0) for _ in range(10)]` will freshly tokenize the same sample 10 times. It's a minor thing, but if anyone has ideas on how to properly do that caching that'd be dope! (see the sketch at the end)

MCLI runs:
- `mcli logs grpo-reorg-tVrNI0`
- `PromptDataset`: `mcli logs grpo-reorg-prompt-Az2nwg`
- `MessagesDataset`: `mcli logs grpo-reorg-messages-HMdrix` (note: since this uses @abaheti95's open-r1 filtered dataset, which he only has tokenized versions of, I had to untokenize + un-chat-template these to convert them to messages, so there might be some differences caused from that)
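On the caching caveat above, a toy sketch of one possible approach (not part of this PR) would be to memoize tokenized samples per index; note that with streaming datasets and multiple dataloader workers a per-instance cache like this wouldn't be shared across processes, so it's only a starting point:

```python
# Hypothetical per-index cache so repeated __getitem__ calls on the same index
# don't re-run the chat template + tokenizer. Names are illustrative.
class ToyCachedMessagesDataset:

    def __init__(self, samples, tokenizer):
        self.samples = samples
        self.tokenizer = tokenizer
        self._cache = {}  # idx -> token ids

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.tokenizer.apply_chat_template(
                self.samples[idx]['messages'],
                add_generation_prompt=True,
                tokenize=True,
            )
        return {
            'messages': self.samples[idx]['messages'],
            'prompt_ids': self._cache[idx],
        }
```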