While processing the data with the filtering scripts, I found that a lot of text in the corpus has a very low num_tokens.
Could you please explain how you handle such data when training the model?
Do you have a num_tokens threshold below which you discard the training data, or
do you concatenate all the data with the <|endoftext|> token and then use it?
Note: I use the o200k_base tokenizer for tokenization.
I am trying to train a similar model from scratch as well, so any help would be greatly appreciated!
Example cases I found:
{'text': 'Logan Township is the name of some places in the U.S. state of Minnesota:\nLogan Township, Aitkin County, Minnesota\nLogan Township, Grant County, Minnesota\n', 'meta': {'id': '141260', 'title': 'Logan Township, Minnesota', 'url': 'https://en.wikipedia.org/wiki/Logan%20Township%2C%20Minnesota'}, 'tokens': 'Logan Township is the name of some places in the U.S. state of Minnesota:\nLogan Township, Aitkin County, Minnesota\nLogan Township, Grant County, Minnesota\n', 'token_ids': [2719, 270, 74101, 382, 290, 1308, 328, 1236, 9610, 306, 290, 601, 1242, 13, 2608, 328, 31680, 734, 2719, 270, 74101, 11, 355, 278, 6342, 8269, 11, 31680, 198, 2719, 270, 74101, 11, 36945, 8269, 11, 31680, 198], 'num_tokens': 38}
{'text': 'Wallace is the name of two unincorporated communities in the State of Michigan:\nWallace, Menominee County, Michigan\nWallace, Alcona County, Michigan', 'meta': {'id': '183821', 'title': 'Wallace, Michigan', 'url': 'https://en.wikipedia.org/wiki/Wallace%2C%20Michigan'}, 'tokens': 'Wallace is the name of two unincorporated communities in the State of Michigan:\nWallace, Menominee County, Michigan\nWallace, Alcona County, Michigan', 'token_ids': [35120, 675, 382, 290, 1308, 328, 1920, 537, 2768, 52166, 780, 15061, 306, 290, 5388, 328, 23349, 734, 35120, 675, 11, 10841, 310, 45409, 8269, 11, 23349, 198, 35120, 675, 11, 1667, 135931, 8269, 11, 23349], 'num_tokens': 36}
{'text': 'Breuillet may refer to places in France:\n\n Breuillet, Charente-Maritime\n Breuillet, Essonne', 'meta': {'id': '450658', 'title': 'Breuillet', 'url': 'https://en.wikipedia.org/wiki/Breuillet'}, 'tokens': 'Breuillet may refer to places in France:\n\n Breuillet, Charente-Maritime\n Breuillet, Essonne', 'token_ids': [49930, 39788, 1340, 6716, 316, 9610, 306, 10128, 1402, 13014, 39788, 11, 9331, 1576, 100478, 46951, 198, 13014, 39788, 11, 12256, 25335], 'num_tokens': 22}
{'text': 'This is a list of episodes from the anime Burst Angel.\n\nBurst Angel', 'meta': {'id': '7526726', 'title': 'List of Burst Angel episodes', 'url': 'https://en.wikipedia.org/wiki/List%20of%20Burst%20Angel%20episodes'}, 'tokens': 'This is a list of episodes from the anime Burst Angel.\n\nBurst Angel', 'token_ids': [2500, 382, 261, 1562, 328, 29873, 591, 290, 35868, 194427, 32801, 364, 169652, 32801], 'num_tokens': 14}
{'text': 'Internet slang', 'meta': {'id': '10220550', 'title': 'TYVM', 'url': 'https://en.wikipedia.org/wiki/TYVM'}, 'tokens': 'Internet slang', 'token_ids': [34831, 102448], 'num_tokens': 2}
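For reference, here is a rough sketch of the two options I'm asking about (dropping documents below a token threshold vs. concatenating everything with <|endoftext|>), using tiktoken's o200k_base. The MIN_TOKENS value and the pack_documents helper are just placeholders of mine, not something taken from this repo:

```python
# Minimal sketch of my current assumption: drop documents below a token
# threshold, then concatenate the rest with <|endoftext|> into one long
# token stream for packing into training sequences.
import tiktoken

MIN_TOKENS = 32  # placeholder threshold; the "right" value is what I'm asking about

enc = tiktoken.get_encoding("o200k_base")
eot_id = enc.eot_token  # id of <|endoftext|> for o200k_base


def pack_documents(docs):
    """docs: iterable of dicts like the examples above, with 'token_ids' / 'num_tokens'."""
    stream = []
    for doc in docs:
        if doc["num_tokens"] < MIN_TOKENS:
            continue  # option 1: discard very short documents
        stream.extend(doc["token_ids"])
        stream.append(eot_id)  # option 2: separate documents with <|endoftext|>
    return stream
```

Is either of these close to what you actually do, or is there a different heuristic for short documents?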