
updated char comparison sorting and readme
mjbommar committed Dec 31, 2024
1 parent 0d8134f commit 511e9ce
Showing 2 changed files with 77 additions and 4 deletions.
75 changes: 74 additions & 1 deletion README.md
@@ -36,6 +36,8 @@ the `tokenizers` library. However, unlike most other tokenizers:
4. The KL3M tokenizers include a large number of controlled tokens related to legal citations and financial abbreviations.
5. The KL3M tokenizers include support for a variety of tasks including causal (generative) and masked (embedding) tasks.

We also build "character" tokenizers designed for low-level tasks like spelling or OCR correction.


## Roadmap

@@ -46,9 +48,11 @@ the `tokenizers` library. However, unlike most other tokenizers:
* [x] **kl3m-004-char-4k-cased**: [README](kl3m-004-char-4k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-4k-cased) | Updated KL3M tokenizer (December 2024)
* [x] **kl3m-004-char-8k-cased**: [README](kl3m-004-char-8k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-8k-cased) | Updated KL3M tokenizer (December 2024)
* [x] **kl3m-004-char-16k-cased**: [README](kl3m-004-char-16k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-16k-cased) | Updated KL3M tokenizer (December 2024)

## Examples

### "Normal" Tokenizers

You can generate your own comparisons against `tiktoken` like this:
```bash
poetry run python3 examples/compare_tokenizers.py
@@ -154,6 +158,75 @@ poetry run python3 examples/compare_tokenizers.py
+===============+================+================+=======================+=========================+
```

### Character Tokenizers

You can generate a similar comparison for the character tokenizers with `examples/compare_char_tokenizers.py`:

```
+=======================+=======================+========================+
| kl3m-004-char-4k-cased| kl3m-004-char-8k-cased| kl3m-004-char-16k-cased|
+=======================+=======================+========================+
| KE                    | KE                    | KE                     |
+-----------------------+-----------------------+------------------------+
| GUL                   | GUL                   | G                      |
+-----------------------+-----------------------+------------------------+
| AT                    | AT                    | UL                     |
+-----------------------+-----------------------+------------------------+
| ED                    | ED                    | ATED                   |
+-----------------------+-----------------------+------------------------+
| ĠN                    | ĠN                    | ĠNAT                   |
+-----------------------+-----------------------+------------------------+
| AT                    | AT                    | URAL                   |
+-----------------------+-----------------------+------------------------+
| UR                    | UR                    | ĠGAS                   |
+-----------------------+-----------------------+------------------------+
| AL                    | AL                    | .âĢĶ                   |
+-----------------------+-----------------------+------------------------+
| ĠG                    | ĠG                    | The                    |
+-----------------------+-----------------------+------------------------+
| AS                    | AS                    | Ġt                     |
+-----------------------+-----------------------+------------------------+
| .                     | .                     | Ġe                     |
+-----------------------+-----------------------+------------------------+
| âĢĶ                   | âĢĶ                   | Ġr                     |
+-----------------------+-----------------------+------------------------+
| The                   | The                   | Ġm                     |
+-----------------------+-----------------------+------------------------+
| Ġt                    | Ġt                    | Ġ'                     |
+-----------------------+-----------------------+------------------------+
| Ġe                    | Ġe                    | Ġr                     |
+-----------------------+-----------------------+------------------------+
| Ġr                    | Ġr                    | Ġe                     |
+-----------------------+-----------------------+------------------------+
| Ġm                    | Ġm                    | Ġg                     |
+-----------------------+-----------------------+------------------------+
| Ġ'                    | Ġ'                    | Ġu                     |
+-----------------------+-----------------------+------------------------+
| Ġr                    | Ġr                    | Ġl                     |
+-----------------------+-----------------------+------------------------+
| Ġe                    | Ġe                    | Ġa                     |
+-----------------------+-----------------------+------------------------+
| Ġg                    | Ġg                    | Ġt                     |
+-----------------------+-----------------------+------------------------+
| Ġu                    | Ġu                    | Ġe                     |
+-----------------------+-----------------------+------------------------+
| Ġl                    | Ġl                    | Ġd                     |
+-----------------------+-----------------------+------------------------+
| Ġa                    | Ġa                    | Ġn                     |
+-----------------------+-----------------------+------------------------+
| Ġt                    | Ġt                    | Ġa                     |
+-----------------------+-----------------------+------------------------+
| Ġe                    | Ġe                    | Ġt                     |
+-----------------------+-----------------------+------------------------+
| Ġd                    | Ġd                    | Ġu                     |
+-----------------------+-----------------------+------------------------+
| Ġn                    | Ġn                    |                        |
+-----------------------+-----------------------+------------------------+
| Ġa                    | Ġa                    |                        |
+-----------------------+-----------------------+------------------------+
| Ġt                    | Ġt                    |                        |
+-----------------------+-----------------------+------------------------+
| Ġu                    | Ġu                    |                        |
+=======================+=======================+========================+
```
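Tokens such as `ĠN` and `âĢĶ` in the table are not garbled output: byte-level BPE tokenizers (the convention introduced by GPT-2 and used by the `tokenizers` library's `ByteLevel` pre-tokenizer) map every byte to a printable stand-in character, so a leading space appears as `Ġ` (U+0120) and the em dash's three UTF-8 bytes appear as `âĢĶ`. A minimal sketch of that mapping (the function names here are illustrative, not part of the KL3M codebase):

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style byte-to-unicode table: every byte gets a printable character."""
    # Printable bytes keep their own code point ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ... and the rest are shifted up past U+00FF so they stay printable.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_MAP = bytes_to_unicode()

def visualize(text: str) -> str:
    """Render text the way byte-level tokenizer vocabularies store it."""
    return "".join(BYTE_MAP[b] for b in text.encode("utf-8"))

print(visualize(" natural"))  # Ġnatural -- the leading space becomes Ġ
print(visualize("—"))         # âĢĶ      -- the em dash's three UTF-8 bytes
```

Decoding simply inverts the table and reassembles the original UTF-8 bytes, which is why the odd-looking tokens round-trip losslessly.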


## Training a New Tokenizer

You can replicate the training process for tokenizers like this:
6 changes: 3 additions & 3 deletions examples/compare_char_tokenizers.py
@@ -84,9 +84,9 @@ def print_separator(char="-"):
 # pretty formatting for these so they are as easy to inspect as possible
 print_token_sequences(
     {
-        "kl3m-004-4k": kl3m_4k_tokens,
-        "kl3m-004-8k": kl3m_8k_tokens,
-        "kl3m-004-16k": kl3m_16k_tokens,
+        "0 kl3m-004-4k-cased": kl3m_4k_tokens,
+        "1 kl3m-004-8k-cased": kl3m_8k_tokens,
+        "2 kl3m-004-16k-cased": kl3m_16k_tokens,
     }
 )
 except KeyboardInterrupt:
