Improved text token counting #463

Open
tazlin opened this issue Oct 19, 2024 · 0 comments
tazlin commented Oct 19, 2024

The central issue revolves around the following function:

def get_things_count(self, generation=None):
    if generation is None:
        if self.generation is None:
            return 0
        generation = self.generation
    quick_token_count = math.ceil(len(generation) / 4)
    if quick_token_count < 20:
        quick_token_count = 20
    if self.wp.things > quick_token_count:
        # logger.debug([self.wp.things,quick_token_count])
        return quick_token_count
    return self.wp.things

Certain tokenizers produce more tokens per character than the fixed factor of "4" assumes, which leaves the horde believing the worker is generating tokens faster than is possible, when in reality the tokenizer simply produces more tokens than that estimate on average. You can analyze how different tokenizers affect the ratio of characters_input / tokens_generated here:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
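
The same comparison can be reproduced locally with the Hugging Face tokenizers. A minimal sketch (the model names and sample text below are illustrative only, not taken from the model reference):

from transformers import AutoTokenizer

# Illustrative sample text; the proposal below uses a fixed, representative dataset instead.
sample_text = "The quick brown fox jumps over the lazy dog. " * 50

# Example (non-gated) tokenizers; any huggingface model name from the reference would work.
for model_name in ["gpt2", "EleutherAI/gpt-neox-20b"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens_generated = len(tokenizer.encode(sample_text, add_special_tokens=False))
    print(f"{model_name}: {len(sample_text) / tokens_generated:.2f} characters per token")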

The horde text model reference could have a tokenizer_efficiency field added, and the AI-Horde updated to use it, to reduce this problem. The text reference uses the huggingface name as the canonical name for each model, so the huggingface client library could be used to retrieve the tokenizer.
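
As an illustration of how the AI-Horde side might consume such a field, here is a hedged sketch of the function above with the fixed factor replaced by a per-model value (the model_reference lookup and the 4.0 fallback are assumptions, not existing AI-Horde code):

def get_things_count(self, generation=None):
    if generation is None:
        if self.generation is None:
            return 0
        generation = self.generation
    # Assumed lookup: fall back to the current fixed factor of 4 when the
    # model reference has no tokenizer_efficiency entry for this model.
    chars_per_token = self.model_reference.get("tokenizer_efficiency", 4.0)
    quick_token_count = math.ceil(len(generation) / chars_per_token)
    if quick_token_count < 20:
        quick_token_count = 20
    if self.wp.things > quick_token_count:
        return quick_token_count
    return self.wp.things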

I propose the following process:

  • A tokenizer_efficiency and a tokenizer_vocab_size (the latter for posterity) field are added to the model reference.
  • A script is written to download all of the tokenizers and their configurations (each is on the order of megabytes; de-duplicating may also be possible). The tokenizer_vocab_size is scraped from tokenizer.json (it is the length of the vocab list in the model object). A sketch of such a script follows this list.
  • A large amount of random (read: representative) text is generated and saved as a fixed dataset that is used against all tokenizers. The character count of this dataset is also saved.
  • Each tokenizer tokenizes the fixed dataset and the resulting number of tokens is counted.
    tokenizer_efficiency = characters_input / tokens_generated
  • The existing text model reference is updated with these fields.
  • A CI workflow is added to automate the collection and enforcement of these new fields for https://github.com/Haidra-Org/AI-Horde-text-model-reference.
  • The relevant AI-Horde code is updated to utilize them.
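
A rough sketch of the collection script described above (the reference file layout, the "huggingface_name" key, and the dataset path are assumptions; the vocab size is read via get_vocab() here rather than by parsing tokenizer.json directly):

import json
from transformers import AutoTokenizer

# Fixed, representative dataset shared by all tokenizers (path is illustrative).
with open("fixed_dataset.txt", encoding="utf-8") as f:
    dataset = f.read()
characters_input = len(dataset)

# Assumed layout: the reference maps model names to entries that carry the
# canonical huggingface name under "huggingface_name".
with open("text_model_reference.json", encoding="utf-8") as f:
    reference = json.load(f)

for model_name, entry in reference.items():
    tokenizer = AutoTokenizer.from_pretrained(entry["huggingface_name"])
    tokens_generated = len(tokenizer.encode(dataset, add_special_tokens=False))
    entry["tokenizer_efficiency"] = characters_input / tokens_generated
    entry["tokenizer_vocab_size"] = len(tokenizer.get_vocab())

with open("text_model_reference.json", "w", encoding="utf-8") as f:
    json.dump(reference, f, indent=4)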

A potential alternative approach would be to download and use the tokenizers API-side somehow (a microservice?), but I suspect this would introduce enormous and unnecessary complications as well as add an unacceptable delay to generations.

@tazlin tazlin added the enhancement New feature or request label Oct 19, 2024
@tazlin tazlin self-assigned this Oct 19, 2024