Why are there many `\x00` tokens around the title? #2

thsno02 · 2024-01-09T01:49:04Z

In OWT (OpenWebText), I'm puzzled by the \x00 tokens. They appear in these positions:

The last sentence of the previous txt + \x00 + current txt title + \x00 + the first sentence of the current txt
For the first line, current txt title + \x00 + the first sentence of the current txt
For the last line, the last sentence of the current txt + '%'

thsno02 · 2024-01-09T01:54:36Z

At first, I thought I should either do some pre-processing to handle \x00, so the model does not learn it, or re-download the dataset.
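For reference, the pre-processing option could look roughly like the sketch below. It is only an illustration: `strip_nul` is a hypothetical helper name, and replacing each run of \x00 with a newline is an assumption about what a reasonable separator would be, not something the dataset prescribes.

```python
import re

# Matches one or more consecutive NUL characters.
_NUL_RUN = re.compile(r"\x00+")


def strip_nul(text: str, replacement: str = "\n") -> str:
    """Replace every run of NUL characters with `replacement`.

    Hypothetical helper: the newline default is an assumption, chosen so
    that the title stays separated from the surrounding sentences.
    """
    return _NUL_RUN.sub(replacement, text)
```

Passing `replacement=""` would drop the tokens entirely instead, at the cost of gluing the title to the neighbouring sentences.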

https://huggingface.co/datasets/Skylion007/openwebtext/discussions/4 shows that others hit the same issue, so it looks like a feature of the dataset rather than a problem with my download. I haven't found a blog post discussing it, so I assume it doesn't matter for reproducing GPT-2.

If I accept it as a feature, all I need to do is concatenate the files into a single file and split it into train.bin and val.bin.
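A minimal sketch of that concatenate-and-split step, under some assumptions: train.bin and val.bin are taken to be flat files of unsigned 16-bit token ids, `concat_and_split` is a hypothetical helper name, the `val_fraction` default is arbitrary, and `encode` is whatever tokenizer function you supply (for GPT-2 that could be `tiktoken.get_encoding("gpt2").encode_ordinary`, which is not imported here). It loads everything into memory, so for the full OWT it would need to be done in chunks.

```python
from array import array
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def concat_and_split(
    paths: Iterable[Path],
    encode: Callable[[str], List[int]],
    out_dir: Path,
    val_fraction: float = 0.1,
) -> Tuple[int, int]:
    """Concatenate text files, tokenize them, and write train.bin / val.bin.

    Returns (n_train_tokens, n_val_tokens). Any \\x00 characters in the
    input are passed to `encode` unchanged.
    """
    # Unsigned 16-bit ids: enough for GPT-2's 50257-token vocabulary.
    ids = array("H")
    for path in sorted(Path(p) for p in paths):
        ids.extend(encode(path.read_text(encoding="utf-8")))

    # The last `val_fraction` of the token stream becomes the validation set.
    n_val = int(len(ids) * val_fraction)
    n_train = len(ids) - n_val

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "train.bin", "wb") as f:
        ids[:n_train].tofile(f)
    with open(out_dir / "val.bin", "wb") as f:
        ids[n_train:].tofile(f)
    return n_train, n_val
```

Splitting on the token stream rather than on file boundaries keeps the sketch short, but it means one document may straddle the train/val boundary.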

thsno02 added the dataset label Jan 9, 2024
