Why are there many \x00 tokens around the title? #2

Open
thsno02 opened this issue Jan 9, 2024 · 1 comment

thsno02 commented Jan 9, 2024

In the OWT data, I'm puzzled by the \x00 tokens. They appear in the following patterns:

  • Between files: the last sentence of the previous txt + \x00 + the current txt's title + \x00 + the first sentence of the current txt
  • On the first line: the current txt's title + \x00 + the first sentence of the current txt
  • On the last line: the last sentence of the current txt + '%'
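
To check these patterns, it may help to print the text around every run of \x00. This is only a rough sketch: `nul_contexts`, the `width` parameter, and the `sample` string are hypothetical names and data made up for illustration, not anything taken from the dataset itself.

```python
import re


def nul_contexts(text: str, width: int = 20) -> list[str]:
    """Return a repr of the text surrounding each run of \\x00 characters."""
    contexts = []
    for match in re.finditer(r"\x00+", text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # repr() makes the otherwise invisible \x00 characters visible.
        contexts.append(repr(text[start:end]))
    return contexts


# Hypothetical sample that mimics the first pattern in the list above.
sample = "End of previous txt.\x00\x00Current title\x00\x00First sentence."
for ctx in nul_contexts(sample):
    print(ctx)
```

Running this on a real file (read with `open(path, encoding="utf-8").read()`) should show whether the \x00 runs really do sit on both sides of each title.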

thsno02 commented Jan 9, 2024

At first, I thought I should either pre-process the data to handle \x00, so that the model does not learn it, or re-download the dataset.
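
The pre-processing option could look roughly like this. It is a minimal sketch, assuming that each run of \x00 should collapse into a single newline; `strip_nul` and its `replacement` parameter are hypothetical, and a different replacement (for example an empty string) may suit the data better.

```python
import re


def strip_nul(text: str, replacement: str = "\n") -> str:
    """Replace each run of \\x00 characters with one replacement string."""
    return re.sub(r"\x00+", replacement, text)


# Hypothetical input in the shape described in the issue.
cleaned = strip_nul("Previous txt.\x00\x00Title\x00\x00First sentence.")
print(repr(cleaned))
```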

If I accept it as a feature, all I need to do is concatenate the files into a single file and split it into train.bin and val.bin.
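
A byte-level sketch of that concatenate-and-split step is below. Several things here are assumptions rather than part of the issue: the `*.txt` glob, the `val_fraction` default of 0.1, the function name `concat_and_split`, and the choice to write raw bytes. A real training pipeline would most likely tokenize the text first and store token ids instead.

```python
import tempfile
from pathlib import Path


def concat_and_split(src_dir: Path, out_dir: Path, val_fraction: float = 0.1) -> tuple[int, int]:
    """Concatenate every *.txt file in src_dir, then write train.bin and val.bin as raw bytes."""
    # Sort so the concatenation order is reproducible; \x00 bytes are kept as-is.
    data = b"".join(p.read_bytes() for p in sorted(src_dir.glob("*.txt")))
    # Assumption: the tail of the data becomes the validation set.
    # A byte-level cut can land in the middle of a multi-byte UTF-8 character.
    split = len(data) - int(len(data) * val_fraction)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "train.bin").write_bytes(data[:split])
    (out_dir / "val.bin").write_bytes(data[split:])
    return split, len(data) - split


# Usage with two hypothetical files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "txt"
    src.mkdir()
    (src / "a.txt").write_bytes(b"Title A\x00text")
    (src / "b.txt").write_bytes(b"Title B\x00text")
    n_train, n_val = concat_and_split(src, Path(tmp) / "out", val_fraction=0.5)
    print(n_train, n_val)
```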

thsno02 added the dataset label Jan 9, 2024