Fixing the data collator #34

Closed
wants to merge 1 commit into from
@EmYassir (Collaborator) commented on Mar 26, 2024

Changelogs

  • enumerate the changes of that PR.
    Fixed the data collator.
    #fix

Checklist:

  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted. Eventually consider making a new tutorial for new features.
  • Write concise and explanatory changelogs below.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

discussion related to that PR

@EmYassir changed the title from "fixing the data collator" to "Fixing the data collator" on Mar 26, 2024
@@ -86,7 +85,7 @@ def __call__(self, samples: List[Union[List[int], Any, Dict[str, Any]]]):
             )
         else:
             batch = tokenizer.pad(
-                examples,
+                samples,
Member

pad will fail unless you pass it a copy of samples from which the unused keys have been popped.

The collator is designed to either tokenize on the fly or pad pretokenized data. If the data is pretokenized, your change will raise an error.
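
To illustrate the point of this comment, here is a minimal sketch of the copy-and-pop step, not the collator's actual code. `strip_unused_keys`, `PAD_KEYS` and `pad_batch` are hypothetical names, and `pad_batch` is a pure-Python stand-in for `tokenizer.pad`. The set of keys to keep is an assumption, since the real list is not shown in this PR.

```python
from typing import Any, Dict, List, Sequence

# Assumption: the keys the padding step understands. The real collator's
# list is not shown in this PR.
PAD_KEYS = ("input_ids", "attention_mask")


def strip_unused_keys(
    samples: List[Dict[str, Any]], keep: Sequence[str] = PAD_KEYS
) -> List[Dict[str, Any]]:
    """Return shallow copies of the samples with every key outside `keep` popped."""
    cleaned = []
    for sample in samples:
        sample = dict(sample)  # copy, so the caller's samples are left untouched
        for key in list(sample):
            if key not in keep:
                sample.pop(key)
        cleaned.append(sample)
    return cleaned


def pad_batch(samples: List[Dict[str, List[int]]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Stand-in for tokenizer.pad: right-pad every field to the longest sample."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch: Dict[str, List[List[int]]] = {key: [] for key in samples[0]}
    for s in samples:
        for key, values in s.items():
            batch[key].append(list(values) + [pad_id] * (max_len - len(values)))
    return batch


# Pretokenized samples that still carry a raw-text field.
samples = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "text": "abc"},
    {"input_ids": [4], "attention_mask": [1], "text": "d"},
]
cleaned = strip_unused_keys(samples)
batch = pad_batch(cleaned)
```

Working on a copy keeps the original samples intact, so fields such as the raw text remain available to the caller after padding.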

@maclandrol maclandrol closed this Sep 10, 2024