Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConcatDataset allows Packed sub datasets #1708

Closed
pbontrager opened this issue Sep 28, 2024 · 4 comments
Closed

ConcatDataset allows Packed sub datasets #1708

pbontrager opened this issue Sep 28, 2024 · 4 comments

Comments

@pbontrager
Copy link
Contributor

Currently we don’t allow data packing with concat dataset. We can discuss enabling that, but in the meantime there’s a big in recipes wheee a user can set packed =True to each dataset in the list . This will individually pack each one but not use the right collate function, also, all of the datasets in the list should probably be packed together. To fix this for now, we can have ConcatDataset check that there are no packed datasets in the list.

@krammnic
Copy link
Contributor

krammnic commented Oct 9, 2024

I see that this issue without "community welcome" label, but because it became little bit stale here are few thoughts. Probably something like:

for dataset in self._datasets:
    if isinstance(dataset, PackedDataset):
        raise ...

While initializing ConcatDataset will solve it, but knowing that we can import PackedDataset to ConcatDataset due to circular import, probably the better idea is to do something from this list:

  1. Introduce some is_packed: bool property
  2. Add utility to check this condition in utils(looks like bad idea)

So final solution will consist of:

  1. Do some of the checks described above
  2. Add unit test
  3. Add some information in docs?

If you not mind, I might try to do a PR.

@RdoubleA
Copy link
Contributor

RdoubleA commented Oct 9, 2024

There shouldn't be a circular import if ConcatDataset imports PackedDataset, as _packed.py does not currently depend on _concat.py. Would prefer doing the check within ConcatDataset, otherwise you would have to update all recipes.

@krammnic
Copy link
Contributor

krammnic commented Oct 9, 2024

Ok, will recheck this, as I see torchtune does not have any custom exceptions, so I assume that raising some ValueError will be fine. Will create PR soon for this

@joecummings
Copy link
Contributor

Closed by #1796

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants