Added MessagesDataloader so we can just use messages in our datasets rather than tokenized inputs #92

Open · wants to merge 22 commits into main

Conversation

@SeanKski (Collaborator) commented Jun 18, 2025

This adds the MessagesStreamingDataset, which lets us use ChatML-style messages as our main data interface. It can be thought of as the same as the PromptStreamingDataset, but with prompt tokenization happening on the fly rather than requiring the samples to be pre-tokenized.

What this allows us to do:

  1. Re-use the same dataset across models, rather than having to create a separate pre-tokenized MDS dataset for each tokenizer
  2. Have the raw messages in our batches, so we don't have to de-tokenize and strip the chat template to use chat interfaces such as bpt
  3. More easily support tool use and multi-turn conversations

I've updated the README and the example YAMLs to use messages rather than tokenized prompts.
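
For illustration, here's a minimal sketch of the on-the-fly tokenization idea using Hugging Face's apply_chat_template (the model name and sample structure are illustrative; the actual MessagesStreamingDataset implementation may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')

# A sample stored as ChatML-style messages rather than pre-tokenized ids.
sample = {
    'messages': [
        {'role': 'user', 'content': 'What is 2 + 2?'},
    ],
}

# Tokenize at __getitem__ time; the same MDS dataset then works with any tokenizer.
token_ids = tokenizer.apply_chat_template(
    sample['messages'],
    add_generation_prompt=True,  # append the assistant header so the model responds
)
```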

A couple of things to note:

  • This is part 1 of a closely linked pair of PRs that moves to using vllm.chat rather than vllm.generate, but I'm keeping the PRs separate for housekeeping reasons
  • I couldn't come up with a clean way of caching the tokenizations for samples, so currently [dataset.__getitem__(0) for _ in range(10)] will freshly tokenize the same sample 10 times. It's a minor thing, but if anyone has ideas on how to properly do that caching, that'd be dope!
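
One possible approach to that caching (an untested sketch, not part of this PR; the class and attribute names are hypothetical):

```python
class CachedMessagesDataset:
    """Wrap a messages dataset and memoize tokenization per sample index."""

    def __init__(self, messages_dataset, tokenizer):
        self.dataset = messages_dataset
        self.tokenizer = tokenizer
        self._cache = {}  # idx -> token ids; unbounded, so best for small datasets

    def __getitem__(self, idx):
        if idx not in self._cache:
            sample = self.dataset[idx]
            self._cache[idx] = self.tokenizer.apply_chat_template(sample['messages'])
        return self._cache[idx]
```

With this, [dataset.__getitem__(0) for _ in range(10)] tokenizes once and returns the cached ids nine times.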

MCLI runs:

  • the control run (a recreation of @jdchang1's run from his reorg PR; runs from compose-rl main): mcli logs grpo-reorg-tVrNI0
  • a run with PromptDataset: mcli logs grpo-reorg-prompt-Az2nwg
  • a run with MessagesDataset: mcli logs grpo-reorg-messages-HMdrix (note: since this uses @abaheti95's open-r1 filtered dataset, which he only has tokenized versions of, I had to de-tokenize and strip the chat template to convert them to messages, so there might be some differences caused by that)

@abaheti95 (Collaborator)

Discussed offline with @SeanKski. There is a limitation in the current unified_tokenize_dataset.py: it only supports converting single prompts to messages. We would like an additional preprocessing script that directly ingests a JSONL file, where each row contains the keys messages and verified_answer, and converts it to MDS that the dataloader can handle.
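
For concreteness, a row in such a JSONL file might look like this (values are illustrative):

```json
{"messages": [{"role": "user", "content": "What is 2 + 2?"}], "verified_answer": "4"}
```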

Comment on lines 686 to 692
```python
elif key == 'messages':
    # the messages should be [num_batches_per_update, batch_size, num_turns];
    # need to flatten this to [num_batches_per_update * batch_size, num_turns]
    ret_batch[key] = [
        message_chain for batch in curr_values
        for message_chain in batch
    ]
```
Collaborator:

Does this flatten correctly?

curr_values is just a raw list of message chains, so I think this would be unpacking all of the individual messages for multi-turn samples. Let me know if I'm interpreting the code incorrectly.

@SeanKski (Collaborator, Author):

Hmmm, good point! I originally thought this was [num_batches_per_update, batch_size, num_turns] (especially since there's concatenation and flattening happening for the other keys), but you're right: it's already [num_batches_per_update * batch_size, num_turns]. So I just removed this part and now we should be good!
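
A toy example of the shape confusion (not from the PR): if messages is already [num_samples, num_turns], flattening one more level merges turns from different samples:

```python
curr_values = [
    [{'role': 'user', 'content': 'hi'}, {'role': 'assistant', 'content': 'hello'}],
    [{'role': 'user', 'content': '2+2?'}, {'role': 'assistant', 'content': '4'}],
]
flattened = [message for chain in curr_values for message in chain]
print(len(curr_values))  # 2 message chains, one per sample
print(len(flattened))    # 4 individual messages; sample boundaries are lost
```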

@@ -0,0 +1,170 @@
# Copyright 2025 MosaicML ComposeRL authors
Collaborator:

Is this file supposed to be a part of the PR? Or is it still okay to be using unified_tokenize_dataset.py?

@SeanKski (Collaborator, Author) commented Jun 23, 2025

> Discussed offline with @SeanKski. There is a limitation in the current unified_tokenize_dataset.py: it only supports converting single prompts to messages. We would like an additional preprocessing script that directly ingests a JSONL file, where each row contains the keys messages and verified_answer, and converts it to MDS that the dataloader can handle.

Done! Check the new README and scripts/data/messages_dataset_to_mds.py. Each messages sample gets written to MDS with messages and metadata columns, so the verified_answer lives inside the metadata JSON.
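
A minimal sketch of what such a JSONL-to-MDS conversion could look like, assuming mosaicml-streaming's MDSWriter with 'json' column encodings (the actual scripts/data/messages_dataset_to_mds.py may differ; paths are illustrative):

```python
import json

from streaming import MDSWriter

# Columns mirror the layout described above: raw messages plus a metadata
# blob that carries the verified_answer.
columns = {'messages': 'json', 'metadata': 'json'}

with MDSWriter(out='mds-out/', columns=columns) as writer:
    with open('data.jsonl') as f:
        for line in f:
            row = json.loads(line)
            writer.write({
                'messages': row['messages'],
                'metadata': {'verified_answer': row['verified_answer']},
            })
```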
