This ALEA project contains the research pipeline and output artifacts for the KL3M tokenizers, which are used as part of the KL3M family of embedding and generative models.
Read on to learn how the KL3M tokenizers were trained and how they differ from other tokenizers.
If you just want to use these tokenizers, they are available from the Hugging Face Hub like most other tokenizers:
With `transformers`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-003-64k")
```
With `tokenizers`:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-003-64k")
```
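
Either object can then encode and decode text. A quick round-trip with the `tokenizers` object loaded above (the example token strings are taken from the comparison tables below):

```python
# Encode a sample phrase and inspect the token strings and ids.
encoding = tokenizer.encode("The Comptroller of the Currency shall")
print(encoding.tokens)
# e.g. ['The', 'ĠComptroller', 'Ġof', 'Ġthe', 'ĠCurrency', 'Ġshall']

# Decode the ids back to text.
print(tokenizer.decode(encoding.ids))
```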
Like many other tokenizers, the KL3M tokenizers are BPE tokenizers trained with the `tokenizers` library. However, unlike most other tokenizers:
- The KL3M tokenizers were trained on data sources that are free of copyright or licensing issues.
- The KL3M tokenizers were trained primarily on legal, financial, and governmental works, resulting in:
- A vocabulary that is better aligned with professional use cases.
- A vocabulary that is less likely to include toxic or informal language.
- The KL3M tokenizers include a large number of controlled tokens for parsing common formats like JSON and HTML.
- The KL3M tokenizers include a large number of controlled tokens for legal citations and financial abbreviations (see the inspection sketch after this list).
- The KL3M tokenizers include support for a variety of tasks including causal (generative) and masked (embedding) tasks.
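
One quick way to see the effect of these controlled tokens is to encode a few domain strings and inspect the splits. A minimal sketch using `tokenizers`; the sample strings are arbitrary, and the exact splits depend on the vocabulary you load:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-003-64k")

# Encode a few format- and domain-specific strings and print how each is
# split; fewer, longer tokens indicate better vocabulary coverage.
for text in ['{"name": "value"}', "</div>", "26 U.S.C. 501(c)(3)"]:
    encoding = tokenizer.encode(text)
    print(f"{text!r} -> {encoding.tokens}")
```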
We also build "character" tokenizers designed for low-level tasks like spelling or OCR correction; a usage sketch follows the list below.
- kl3m-001-32k: README | Hugging Face | Original KL3M tokenizer (November 2023)
- kl3m-003-64k: README | Hugging Face | Updated KL3M tokenizer (March 2024)
- kl3m-004-128k-uncased: README | Hugging Face | Updated KL3M tokenizer (November 2024)
- kl3m-004-128k-cased: README | Hugging Face | Updated KL3M tokenizer (November 2024)
- kl3m-004-char-4k-cased: README | Hugging Face | Updated KL3M tokenizer (December 2024)
- kl3m-004-char-8k-cased: README | Hugging Face | Updated KL3M tokenizer (December 2024)
- kl3m-004-char-16k-cased: README | Hugging Face | Updated KL3M tokenizer (December 2024)
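
The character tokenizers load the same way; a minimal sketch, assuming the same `alea-institute/` namespace as above:

```python
from tokenizers import Tokenizer

# Character-level vocabularies split text into very small units, which
# helps when the input itself is noisy, e.g. raw OCR output.
char_tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-4k-cased")
print(char_tokenizer.encode("KEGULATED NATURAL GAS").tokens)
```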
You can generate your own comparisons against `tiktoken` like this (a standalone sketch also appears after the tables below):

```bash
poetry run python3 examples/compare_tokenizers.py
```

```text
+=============+================+================+=======================+=========================+
| gpt-4o | kl3m-001-32k | kl3m-003-64k | kl3m-004-128k-cased | kl3m-004-128k-uncased |
+=============+================+================+=======================+=========================+
| The | The | The | The | the |
+-------------+----------------+----------------+-----------------------+-------------------------+
| Compt | | ĠComptroller | ĠComptroller | Ġcomptroller |
+-------------+----------------+----------------+-----------------------+-------------------------+
| roller | Comp | Ġof | Ġof | Ġof |
+-------------+----------------+----------------+-----------------------+-------------------------+
| of | troller | Ġthe | Ġthe | Ġthe |
+-------------+----------------+----------------+-----------------------+-------------------------+
| the | | ĠCurrency | ĠCurrency | Ġcurrency |
+-------------+----------------+----------------+-----------------------+-------------------------+
| Currency | of | Ġshall | Ġshall | Ġshall |
+-------------+----------------+----------------+-----------------------+-------------------------+
| shall | | | | |
+-------------+----------------+----------------+-----------------------+-------------------------+
| | the | | | |
+-------------+----------------+----------------+-----------------------+-------------------------+
| | | | | |
+-------------+----------------+----------------+-----------------------+-------------------------+
| | C | | | |
+-------------+----------------+----------------+-----------------------+-------------------------+
| | urrency | | | |
+-------------+----------------+----------------+-----------------------+-------------------------+
| | | | | |
+-------------+----------------+----------------+-----------------------+-------------------------+
| | shall | | | |
+=============+================+================+=======================+=========================+
```

```text
+===============+================+================+=======================+=========================+
| gpt-4o | kl3m-001-32k | kl3m-003-64k | kl3m-004-128k-cased | kl3m-004-128k-uncased |
+===============+================+================+=======================+=========================+
| This | This | This | This | this |
+---------------+----------------+----------------+-----------------------+-------------------------+
| Securities | | ĠSecurities | ĠSecurities | Ġsecurities |
+---------------+----------------+----------------+-----------------------+-------------------------+
| Purchase | Securities | ĠPurchase | ĠPurchase | Ġpurchase |
+---------------+----------------+----------------+-----------------------+-------------------------+
| Agreement | | ĠAgreement | ĠAgreement | Ġagreement |
+---------------+----------------+----------------+-----------------------+-------------------------+
| ( | Purchase | Ġ( | Ġ( | Ġ( |
+---------------+----------------+----------------+-----------------------+-------------------------+
| this | | this | this | this |
+---------------+----------------+----------------+-----------------------+-------------------------+
| " | Agreement | Ġ" | Ġ" | Ġ" |
+---------------+----------------+----------------+-----------------------+-------------------------+
| Agreement | | Agreement | Agreement | agreement |
+---------------+----------------+----------------+-----------------------+-------------------------+
| ") | ( | ") | ") | ") |
+---------------+----------------+----------------+-----------------------+-------------------------+
| is | this | Ġis | Ġis | Ġis |
+---------------+----------------+----------------+-----------------------+-------------------------+
| dated | | Ġdated | Ġdated | Ġdated |
+---------------+----------------+----------------+-----------------------+-------------------------+
| as | " | Ġas | Ġas | Ġas |
+---------------+----------------+----------------+-----------------------+-------------------------+
| of | Agreement | Ġof | Ġof | Ġof |
+---------------+----------------+----------------+-----------------------+-------------------------+
| November | ") | ĠNovember | ĠNovember | Ġnovember |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | | Ġ21 | Ġ21 | Ġ21 |
+---------------+----------------+----------------+-----------------------+-------------------------+
| 21 | is | , | , | , |
+---------------+----------------+----------------+-----------------------+-------------------------+
| , | | Ġ2017 | Ġ2017 | Ġ2017 |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | dated | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| 201 | | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| 7 | as | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | of | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | November | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | 21 | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | , | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | | | | |
+---------------+----------------+----------------+-----------------------+-------------------------+
| | 2017 | | | |
+===============+================+================+=======================+=========================+
```

The character tokenizers split text into much smaller units. Here they are run on a sample containing OCR-style artifacts:

```text
+=======================+=======================+========================+
| kl3m-004-char-4k-cased| kl3m-004-char-8k-cased| kl3m-004-char-16k-cased|
+=======================+=======================+========================+
| KE | KE | KE |
+-----------------------+-----------------------+------------------------+
| GUL | GUL | G |
+-----------------------+-----------------------+------------------------+
| AT | AT | UL |
+-----------------------+-----------------------+------------------------+
| ED | ED | ATED |
+-----------------------+-----------------------+------------------------+
| ĠN | ĠN | ĠNAT |
+-----------------------+-----------------------+------------------------+
| AT | AT | URAL |
+-----------------------+-----------------------+------------------------+
| UR | UR | ĠGAS |
+-----------------------+-----------------------+------------------------+
| AL | AL | .âĢĶ |
+-----------------------+-----------------------+------------------------+
| ĠG | ĠG | The |
+-----------------------+-----------------------+------------------------+
| AS | AS | Ġt |
+-----------------------+-----------------------+------------------------+
| . | . | Ġe |
+-----------------------+-----------------------+------------------------+
| âĢĶ | âĢĶ | Ġr |
+-----------------------+-----------------------+------------------------+
| The | The | Ġm |
+-----------------------+-----------------------+------------------------+
| Ġt | Ġt | Ġ' |
+-----------------------+-----------------------+------------------------+
| Ġe | Ġe | Ġr |
+-----------------------+-----------------------+------------------------+
| Ġr | Ġr | Ġe |
+-----------------------+-----------------------+------------------------+
| Ġm | Ġm | Ġg |
+-----------------------+-----------------------+------------------------+
| Ġ' | Ġ' | Ġu |
+-----------------------+-----------------------+------------------------+
| Ġr | Ġr | Ġl |
+-----------------------+-----------------------+------------------------+
| Ġe | Ġe | Ġa |
+-----------------------+-----------------------+------------------------+
| Ġg | Ġg | Ġt |
+-----------------------+-----------------------+------------------------+
| Ġu | Ġu | Ġe |
+-----------------------+-----------------------+------------------------+
| Ġl | Ġl | Ġd |
+-----------------------+-----------------------+------------------------+
| Ġa | Ġa | Ġn |
+-----------------------+-----------------------+------------------------+
| Ġt | Ġt | Ġa |
+-----------------------+-----------------------+------------------------+
| Ġe | Ġe | Ġt |
+-----------------------+-----------------------+------------------------+
| Ġd | Ġd | Ġu |
+-----------------------+-----------------------+------------------------+
| Ġn | Ġn | |
+-----------------------+-----------------------+------------------------+
| Ġa | Ġa | |
+-----------------------+-----------------------+------------------------+
| Ġt | Ġt | |
+-----------------------+-----------------------+------------------------+
| Ġu | Ġu | |
+=======================+=======================+========================+
```
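
If you want to build a comparison like the tables above outside of this repository, here is a minimal sketch, assuming `tiktoken` and `tokenizers` are installed (`examples/compare_tokenizers.py` is the full version):

```python
import tiktoken
from tokenizers import Tokenizer

text = "The Comptroller of the Currency shall"

# gpt-4o tokenization via tiktoken: encode to ids, then decode each id
# back to its surface string.
gpt4o = tiktoken.encoding_for_model("gpt-4o")
gpt4o_tokens = [gpt4o.decode([i]) for i in gpt4o.encode(text)]

# KL3M tokenization via the tokenizers library.
kl3m = Tokenizer.from_pretrained("alea-institute/kl3m-003-64k")
kl3m_tokens = kl3m.encode(text).tokens

print("gpt-4o:      ", gpt4o_tokens)
print("kl3m-003-64k:", kl3m_tokens)
```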
You can replicate the training process for tokenizers like this:
kl3m-001-32k:
```bash
$ PYTHONPATH=. poetry run python3 kl3m_tokenizers/tokenizers/kl3m_001/train_tokenizer.py --vocab_size 4096 --pad2 samples/usc.1000.jsonl.gz tokenizer-4k
```

Output:

```text
[00:00:00] Pre-processing sequences ████████ 0 / 0
[00:00:00] Tokenize words           ████████ 31728 / 31728
[00:00:00] Count pairs              ████████ 31728 / 31728
[00:00:00] Compute merges           ████████ 2204 / 2204
Added 1612 custom tokens to size=4023.
Adding power-of-2 padding tokens.
Padded vocab to 4096 tokens.
Special tokens: 7
Custom tokens: 1612
Power-of-2 pad tokens: 73
Final vocab size: 4096
Training time: 0.67 seconds
Output path: tokenizer-4k/
```
kl3m-003-64k:
```bash
$ PYTHONPATH=. poetry run python3 kl3m_tokenizers/tokenizers/kl3m_003/train_tokenizer.py --vocab_size 8192 --pad2 samples/usc.1000.jsonl.gz tokenizer-8k
```

Output:

```text
Training tokenizer.
[00:00:00] Pre-processing sequences ████████ 9788 / 0
Finished parsing samples/usc.1000.jsonl.gz: 10000 records
[00:00:00] Pre-processing sequences ████████ 0 / 0
[00:00:00] Tokenize words           ████████ 31699 / 31699
[00:00:00] Count pairs              ████████ 31699 / 31699
[00:00:00] Compute merges           ████████ 3472 / 3472
Adding custom tokens.
Adding power-of-2 padding tokens.
Padded vocab to 8192 tokens.
Special tokens: 13
Custom tokens: 4393
Power-of-2 pad tokens: 58
Final vocab size: 8192
Training time: 1.31 seconds
Output path: tokenizer-8k/
```
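
Both scripts are built on the `tokenizers` library's BPE trainer; the custom-token and power-of-2 padding steps shown in the logs above are applied afterwards. A minimal sketch of just the core training loop, assuming a plain-text `corpus.txt` and hypothetical special-token names (not the actual KL3M token strings):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, as used by most modern tokenizers.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train merges up to a target vocabulary size; the special tokens here
# are placeholders for illustration only.
trainer = trainers.BpeTrainer(
    vocab_size=4096,
    special_tokens=["<|start|>", "<|end|>", "<|pad|>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")
```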
This ALEA project is released under the MIT License. See the LICENSE file for details.
If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.
To learn more about ALEA and its software and research projects like KL3M, visit the ALEA website.