Certain tokenizers can outperform the fixed factor of "4", which leaves the horde believing the worker is generating tokens faster than is possible, when in reality the tokenizer may simply produce more tokens than that on average. You can analyze the effect different tokenizers have on the ratio of characters_input / tokens_generated here: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
The horde text model reference could have a tokenizer_efficiency field added, and the AI-Horde updated to use it, to reduce this problem. The text reference uses the huggingface names as the canonical names, so the huggingface client library could be used to retrieve each model's tokenizer.
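For illustration, a reference entry might gain the two proposed fields roughly like this. Only tokenizer_efficiency and tokenizer_vocab_size come from this proposal; the other keys and all values are placeholders, not the reference's actual schema:

```python
# Hypothetical shape of a text model reference entry with the proposed fields.
example_entry = {
    "name": "some-org/some-model",      # canonical huggingface name (placeholder)
    "parameters": 7_000_000_000,        # placeholder for existing fields
    "tokenizer_vocab_size": 32000,      # scraped from tokenizer.json (illustrative value)
    "tokenizer_efficiency": 3.6,        # characters_input / tokens_generated (illustrative value)
}
```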
I propose the following process (a rough sketch of the measurement script follows the list):
A tokenizer_efficiency and a tokenizer_vocab_size (for posterity) field are added to the model reference
A script is written to download all of the tokenizers and their configurations (each is on the order of megabytes; de-duplicating may also be possible). The tokenizer_vocab_size is scraped from tokenizer.json (it is the length of the vocab list in the "model" object)
A large amount of random (read: representative) text is generated and saved as a fixed dataset that will be used against all tokenizers. The character count of this dataset is also saved.
Each tokenizer tokenizes the fixed dataset and the resulting number of tokens is counted.
tokenizer_efficiency = characters_input / tokens_generated
The existing text model reference is updated with these fields
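As a sketch only, the measurement script could look something like the following, assuming the transformers and huggingface_hub libraries, that each repo ships a tokenizer.json, and with the dataset path and model list as placeholders:

```python
import json
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Placeholder inputs: the fixed representative corpus and the canonical
# huggingface names taken from the text model reference.
FIXED_DATASET_PATH = "fixed_dataset.txt"
MODEL_NAMES = ["org/model-a", "org/model-b"]

with open(FIXED_DATASET_PATH, encoding="utf-8") as f:
    dataset = f.read()
characters_input = len(dataset)

results = {}
for name in MODEL_NAMES:
    # Download the tokenizer and its configuration (each is on the order of megabytes).
    tokenizer = AutoTokenizer.from_pretrained(name)

    # Tokenize the fixed dataset and count the resulting tokens.
    tokens_generated = len(tokenizer.encode(dataset, add_special_tokens=False))

    # Scrape the vocab size from tokenizer.json ("model" -> "vocab").
    tokenizer_json_path = hf_hub_download(repo_id=name, filename="tokenizer.json")
    with open(tokenizer_json_path, encoding="utf-8") as f:
        vocab = json.load(f)["model"]["vocab"]

    results[name] = {
        "tokenizer_efficiency": characters_input / tokens_generated,
        "tokenizer_vocab_size": len(vocab),
    }

print(json.dumps(results, indent=2))
```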
A potential alternative approach would involve downloading/using the tokenizers API-side somehow (microservice?), but I suspect this would introduce enormous and unnecessary complications, as well as add an unacceptable delay to generations.
The central issue revolves around the following function: AI-Horde/horde/classes/kobold/processing_generation.py, lines 90 to 101 at f6952ed
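I haven't reproduced that function here, but the horde-side change could be as small as swapping the fixed divisor for the per-model value, roughly along these lines (the function and field names below are hypothetical illustrations, not the actual code at those lines):

```python
# Hypothetical illustration only -- not the actual code at lines 90-101.
def estimate_tokens(generated_text: str, model_reference_entry: dict) -> float:
    # Fall back to the current fixed factor of 4 characters per token when the
    # model reference has no measured tokenizer_efficiency for this model.
    efficiency = model_reference_entry.get("tokenizer_efficiency", 4.0)
    return len(generated_text) / efficiency
```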