Handling filter data with low token len #40

Open
KeshavSingh29 opened this issue Sep 6, 2024 · 0 comments

While processing the data with the filtering scripts, I found that many documents in the corpus have a very low num_tokens.

Can you please explain how you handle such data for training the model?

  • Do you have a threshold for num_tokens below which you discard such training data, or
  • do you concatenate all the data with the <|endoftext|> token and then use it?
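
To make the two options concrete, here is a rough sketch of what I have in mind. It is not taken from this repo: `filter_short`, `pack`, `min_tokens`, and `block_size` are hypothetical names, the records are assumed to already carry `token_ids` and `num_tokens` as in the examples below, and the `<|endoftext|>` id is an assumption that should be checked against the tokenizer.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: id of <|endoftext|> in o200k_base. With tiktoken installed it can
# be read from tiktoken.get_encoding("o200k_base").eot_token instead.
EOT_ID = 199999


def filter_short(records: Iterable[Dict], min_tokens: int) -> Iterator[Dict]:
    """Option 1: keep only records with at least `min_tokens` tokens."""
    for rec in records:
        if rec["num_tokens"] >= min_tokens:
            yield rec


def pack(records: Iterable[Dict], block_size: int,
         eot_id: int = EOT_ID) -> Iterator[List[int]]:
    """Option 2: join all documents with an EOT token after each one,
    then cut the stream into fixed-size blocks."""
    buffer: List[int] = []
    for rec in records:
        buffer.extend(rec["token_ids"])
        buffer.append(eot_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # A trailing partial block is dropped here; padding it is another choice.


if __name__ == "__main__":
    # Two of the short records from this issue, reduced to the relevant fields.
    records = [
        {"token_ids": [34831, 102448], "num_tokens": 2},
        {"token_ids": [2500, 382, 261, 1562, 328, 29873, 591, 290,
                       35868, 194427, 32801, 364, 169652, 32801],
         "num_tokens": 14},
    ]
    print(len(list(filter_short(records, min_tokens=10))))
    print(next(pack(records, block_size=4)))
```

With the second option a 2-token document is not wasted; it just shares a block with its neighbours. The first option would drop it entirely.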

Note: I use the o200k_base tokenizer for tokenization.

I am trying to train a similar model from scratch as well, so any help would be great!

Example cases I found:

{'text': 'Logan Township is the name of some places in the U.S. state of Minnesota:\nLogan Township, Aitkin County, Minnesota\nLogan Township, Grant County, Minnesota\n', 'meta': {'id': '141260', 'title': 'Logan Township, Minnesota', 'url': 'https://en.wikipedia.org/wiki/Logan%20Township%2C%20Minnesota'}, 'tokens': 'Logan Township is the name of some places in the U.S. state of Minnesota:\nLogan Township, Aitkin County, Minnesota\nLogan Township, Grant County, Minnesota\n', 'token_ids': [2719, 270, 74101, 382, 290, 1308, 328, 1236, 9610, 306, 290, 601, 1242, 13, 2608, 328, 31680, 734, 2719, 270, 74101, 11, 355, 278, 6342, 8269, 11, 31680, 198, 2719, 270, 74101, 11, 36945, 8269, 11, 31680, 198], 'num_tokens': 38}
{'text': 'Wallace is the name of two unincorporated communities in the State of Michigan:\nWallace, Menominee County, Michigan\nWallace, Alcona County, Michigan', 'meta': {'id': '183821', 'title': 'Wallace, Michigan', 'url': 'https://en.wikipedia.org/wiki/Wallace%2C%20Michigan'}, 'tokens': 'Wallace is the name of two unincorporated communities in the State of Michigan:\nWallace, Menominee County, Michigan\nWallace, Alcona County, Michigan', 'token_ids': [35120, 675, 382, 290, 1308, 328, 1920, 537, 2768, 52166, 780, 15061, 306, 290, 5388, 328, 23349, 734, 35120, 675, 11, 10841, 310, 45409, 8269, 11, 23349, 198, 35120, 675, 11, 1667, 135931, 8269, 11, 23349], 'num_tokens': 36}
{'text': 'Breuillet may refer to places in France:\n\n Breuillet, Charente-Maritime\n Breuillet, Essonne', 'meta': {'id': '450658', 'title': 'Breuillet', 'url': 'https://en.wikipedia.org/wiki/Breuillet'}, 'tokens': 'Breuillet may refer to places in France:\n\n Breuillet, Charente-Maritime\n Breuillet, Essonne', 'token_ids': [49930, 39788, 1340, 6716, 316, 9610, 306, 10128, 1402, 13014, 39788, 11, 9331, 1576, 100478, 46951, 198, 13014, 39788, 11, 12256, 25335], 'num_tokens': 22}
{'text': 'This is a list of episodes from the anime Burst Angel.\n\nBurst Angel', 'meta': {'id': '7526726', 'title': 'List of Burst Angel episodes', 'url': 'https://en.wikipedia.org/wiki/List%20of%20Burst%20Angel%20episodes'}, 'tokens': 'This is a list of episodes from the anime Burst Angel.\n\nBurst Angel', 'token_ids': [2500, 382, 261, 1562, 328, 29873, 591, 290, 35868, 194427, 32801, 364, 169652, 32801], 'num_tokens': 14}
{'text': 'Internet slang', 'meta': {'id': '10220550', 'title': 'TYVM', 'url': 'https://en.wikipedia.org/wiki/TYVM'}, 'tokens': 'Internet slang', 'token_ids': [34831, 102448], 'num_tokens': 2}