While processing the data with the filtering scripts, I found that a lot of text in the corpus has a very low num_tokens.
Could you please explain how you handle such data when training the model?
Do you have a num_tokens threshold below which you discard the training data, or
do you concatenate all the data with the <|endoftext|> token and then use it?
Note: I use the o200k_base tokenizer for tokenization.
I am trying to train a similar model from scratch as well, so any help would be greatly appreciated!
Example cases I found:
{'text': 'Logan Township is the name of some places in the U.S. state of Minnesota:\nLogan Township, Aitkin County, Minnesota\nLogan Township, Grant County, Minnesota\n', 'meta': {'id': '141260', 'title': 'Logan Township, Minnesota', 'url': 'https://en.wikipedia.org/wiki/Logan%20Township%2C%20Minnesota'}, 'tokens': 'Logan Township is the name of some places in the U.S. state of Minnesota:\nLogan Township, Aitkin County, Minnesota\nLogan Township, Grant County, Minnesota\n', 'token_ids': [2719, 270, 74101, 382, 290, 1308, 328, 1236, 9610, 306, 290, 601, 1242, 13, 2608, 328, 31680, 734, 2719, 270, 74101, 11, 355, 278, 6342, 8269, 11, 31680, 198, 2719, 270, 74101, 11, 36945, 8269, 11, 31680, 198], 'num_tokens': 38}
{'text': 'Wallace is the name of two unincorporated communities in the State of Michigan:\nWallace, Menominee County, Michigan\nWallace, Alcona County, Michigan', 'meta': {'id': '183821', 'title': 'Wallace, Michigan', 'url': 'https://en.wikipedia.org/wiki/Wallace%2C%20Michigan'}, 'tokens': 'Wallace is the name of two unincorporated communities in the State of Michigan:\nWallace, Menominee County, Michigan\nWallace, Alcona County, Michigan', 'token_ids': [35120, 675, 382, 290, 1308, 328, 1920, 537, 2768, 52166, 780, 15061, 306, 290, 5388, 328, 23349, 734, 35120, 675, 11, 10841, 310, 45409, 8269, 11, 23349, 198, 35120, 675, 11, 1667, 135931, 8269, 11, 23349], 'num_tokens': 36}
{'text': 'Breuillet may refer to places in France:\n\n Breuillet, Charente-Maritime\n Breuillet, Essonne', 'meta': {'id': '450658', 'title': 'Breuillet', 'url': 'https://en.wikipedia.org/wiki/Breuillet'}, 'tokens': 'Breuillet may refer to places in France:\n\n Breuillet, Charente-Maritime\n Breuillet, Essonne', 'token_ids': [49930, 39788, 1340, 6716, 316, 9610, 306, 10128, 1402, 13014, 39788, 11, 9331, 1576, 100478, 46951, 198, 13014, 39788, 11, 12256, 25335], 'num_tokens': 22}
{'text': 'This is a list of episodes from the anime Burst Angel.\n\nBurst Angel', 'meta': {'id': '7526726', 'title': 'List of Burst Angel episodes', 'url': 'https://en.wikipedia.org/wiki/List%20of%20Burst%20Angel%20episodes'}, 'tokens': 'This is a list of episodes from the anime Burst Angel.\n\nBurst Angel', 'token_ids': [2500, 382, 261, 1562, 328, 29873, 591, 290, 35868, 194427, 32801, 364, 169652, 32801], 'num_tokens': 14}
{'text': 'Internet slang', 'meta': {'id': '10220550', 'title': 'TYVM', 'url': 'https://en.wikipedia.org/wiki/TYVM'}, 'tokens': 'Internet slang', 'token_ids': [34831, 102448], 'num_tokens': 2}
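For reference, here is a rough sketch of the two options I'm asking about (dropping documents below a token threshold vs. concatenating everything with <|endoftext|>), using tiktoken's o200k_base. The MIN_TOKENS value and the pack_documents helper are just placeholders of mine, not something taken from this repo:

```python
# Minimal sketch of my current assumption: drop documents below a token
# threshold, then concatenate the rest with <|endoftext|> into one long
# token stream for packing into training sequences.
import tiktoken

MIN_TOKENS = 32  # placeholder threshold; the "right" value is what I'm asking about

enc = tiktoken.get_encoding("o200k_base")
eot_id = enc.eot_token  # id of <|endoftext|> for o200k_base


def pack_documents(docs):
    """docs: iterable of dicts like the examples above, with 'token_ids' / 'num_tokens'."""
    stream = []
    for doc in docs:
        if doc["num_tokens"] < MIN_TOKENS:
            continue  # option 1: discard very short documents
        stream.extend(doc["token_ids"])
        stream.append(eot_id)  # option 2: separate documents with <|endoftext|>
    return stream
```

Is either of these close to what you actually do, or is there a different heuristic for short documents?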