Added MessagesDataloader so we can just use messages in our datasets rather than tokenized inputs #92

Open · wants to merge 22 commits into main

Conversation

@SeanKski (Collaborator) commented Jun 18, 2025

This adds the MessagesStreamingDataset, which lets us use ChatML-style messages as our main data interface. It can be thought of as the same as the PromptStreamingDataset, but with prompt tokenization happening on the fly rather than requiring the samples to be pre-tokenized.

What this allows us to do:

  1. Re-use the same dataset across models, rather than having to create a separate pre-tokenized MDS dataset for each tokenizer
  2. Have the raw messages in our batches, so we don't have to de-tokenize and strip the chat template to use chat interfaces such as bpt
  3. More easily support tool use and multi-turn conversations

I've updated the README and the example YAMLs to use messages rather than tokenized prompts.
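
For illustration, here's a minimal sketch of the on-the-fly tokenization idea using Hugging Face's apply_chat_template (the model name and sample structure are illustrative; the actual MessagesStreamingDataset implementation may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')

# A sample stored as ChatML-style messages rather than pre-tokenized ids.
sample = {
    'messages': [
        {'role': 'user', 'content': 'What is 2 + 2?'},
    ],
}

# Tokenize at __getitem__ time; the same MDS dataset then works with any tokenizer.
token_ids = tokenizer.apply_chat_template(
    sample['messages'],
    add_generation_prompt=True,  # append the assistant header so the model responds
)
```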

A couple of things to note:

  • This is part 1 of a closely linked pair of PRs that moves to using vllm.chat rather than vllm.generate, but I'm keeping the PRs separate for housekeeping reasons
  • I couldn't come up with a clean way of caching the tokenizations for samples, so currently [dataset.__getitem__(0) for _ in range(10)] will freshly tokenize the same sample 10 times. It's a minor thing, but if anyone has ideas on how to properly do that caching, that'd be dope!
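
One possible approach to that caching (an untested sketch, not part of this PR; the class and attribute names are hypothetical):

```python
class CachedMessagesDataset:
    """Wrap a messages dataset and memoize tokenization per sample index."""

    def __init__(self, messages_dataset, tokenizer):
        self.dataset = messages_dataset
        self.tokenizer = tokenizer
        self._cache = {}  # idx -> token ids; unbounded, so best for small datasets

    def __getitem__(self, idx):
        if idx not in self._cache:
            sample = self.dataset[idx]
            self._cache[idx] = self.tokenizer.apply_chat_template(sample['messages'])
        return self._cache[idx]
```

With this, [dataset.__getitem__(0) for _ in range(10)] tokenizes once and returns the cached ids nine times.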

MCLI runs:

  • the control run (a recreation of @jdchang1's run from his reorg PR; runs from compose-rl main): mcli logs grpo-reorg-tVrNI0
  • a run with PromptDataset: mcli logs grpo-reorg-prompt-Az2nwg
  • a run with MessagesDataset: mcli logs grpo-reorg-messages-HMdrix (note: since this uses @abaheti95's open-r1 filtered dataset, which he only has tokenized versions of, I had to de-tokenize and strip the chat template to convert them to messages, so there might be some differences caused by that)

@abaheti95 (Collaborator)

Discussed offline with @SeanKski. There is a limitation in the current unified_tokenize_dataset.py: it only supports converting single prompts to messages. We would like an additional preprocessing script that directly ingests a JSONL file, where each row contains the keys messages and verified_answer, and converts it to MDS that the dataloader can handle.
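
For concreteness, a row in such a JSONL file might look like this (values are illustrative):

```json
{"messages": [{"role": "user", "content": "What is 2 + 2?"}], "verified_answer": "4"}
```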

Comment on lines 686 to 692
```python
elif key == 'messages':
    # the messages should be [num_batches_per_update, batch_size, num_turns];
    # need to flatten this to [num_batches_per_update * batch_size, num_turns]
    ret_batch[key] = [
        message_chain for batch in curr_values
        for message_chain in batch
    ]
```
Collaborator:

Does this flatten correctly?

curr_values is just a raw list of message chains, so I think this would be unpacking all of the individual messages for multi-turn samples. Let me know if I'm interpreting the code incorrectly.

@SeanKski (Collaborator, Author):

Hmmm, good point! I originally thought this was [num_batches_per_update, batch_size, num_turns] (especially since there's concatenation and flattening happening for the other keys), but you're right: it's already [num_batches_per_update * batch_size, num_turns]. So I just removed this part and now we should be good!
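
A toy example of the shape confusion (not from the PR): if messages is already [num_samples, num_turns], flattening one more level merges turns from different samples:

```python
curr_values = [
    [{'role': 'user', 'content': 'hi'}, {'role': 'assistant', 'content': 'hello'}],
    [{'role': 'user', 'content': '2+2?'}, {'role': 'assistant', 'content': '4'}],
]
flattened = [message for chain in curr_values for message in chain]
print(len(curr_values))  # 2 message chains, one per sample
print(len(flattened))    # 4 individual messages; sample boundaries are lost
```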

@@ -0,0 +1,170 @@
# Copyright 2025 MosaicML ComposeRL authors
Collaborator:

Is this file supposed to be a part of the PR? Or is it still okay to be using unified_tokenize_dataset.py?

@SeanKski (Collaborator, Author) commented Jun 23, 2025

> Discussed offline with @SeanKski. There is a limitation in the current unified_tokenize_dataset.py: it only supports converting single prompts to messages. We would like an additional preprocessing script that directly ingests a JSONL file, where each row contains the keys messages and verified_answer, and converts it to MDS that the dataloader can handle.

Done! Check the new README and scripts/data/messages_dataset_to_mds.py. Each messages sample gets written to MDS with messages and metadata columns, so the verified_answer lives inside the metadata JSON.
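
A minimal sketch of what such a JSONL-to-MDS conversion could look like, assuming mosaicml-streaming's MDSWriter with 'json' column encodings (the actual scripts/data/messages_dataset_to_mds.py may differ; paths are illustrative):

```python
import json

from streaming import MDSWriter

# Columns mirror the layout described above: raw messages plus a metadata
# blob that carries the verified_answer.
columns = {'messages': 'json', 'metadata': 'json'}

with MDSWriter(out='mds-out/', columns=columns) as writer:
    with open('data.jsonl') as f:
        for line in f:
            row = json.loads(line)
            writer.write({
                'messages': row['messages'],
                'metadata': {'verified_answer': row['verified_answer']},
            })
```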
