
updated char comparison sorting and readme
mjbommar committed Dec 31, 2024
1 parent 0d8134f commit 511e9ce
Showing 2 changed files with 77 additions and 4 deletions.
75 changes: 74 additions & 1 deletion README.md
@@ -36,6 +36,8 @@ the `tokenizers` library. However, unlike most other tokenizers:
4. The KL3M tokenizers include a large number of controlled tokens related to legal citations and financial abbreviations.
5. The KL3M tokenizers include support for a variety of tasks including causal (generative) and masked (embedding) tasks.

We also build "character" tokenizers designed for low-level tasks like spelling or OCR correction.


## Roadmap

@@ -46,9 +48,11 @@ the `tokenizers` library. However, unlike most other tokenizers:
* [x] **kl3m-004-char-4k-cased**: [README](kl3m-004-char-4k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-4k-cased) | Updated KL3M tokenizer (December 2024)
* [x] **kl3m-004-char-8k-cased**: [README](kl3m-004-char-8k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-8k-cased) | Updated KL3M tokenizer (December 2024)
* [x] **kl3m-004-char-16k-cased**: [README](kl3m-004-char-16k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-16k-cased) | Updated KL3M tokenizer (December 2024)

## Examples

### "Normal" Tokenizers

You can generate your own comparisons against `tiktoken` like this:
```bash
poetry run python3 examples/compare_tokenizers.py
@@ -154,6 +158,75 @@ poetry run python3 examples/compare_tokenizers.py
+===============+================+================+=======================+=========================+
```

### Character Tokenizers

You can generate a similar comparison for the character tokenizers with `examples/compare_char_tokenizers.py`:

```
+=======================+=======================+========================+
| kl3m-004-char-4k-cased| kl3m-004-char-8k-cased| kl3m-004-char-16k-cased|
+=======================+=======================+========================+
| KE                    | KE                    | KE                     |
+-----------------------+-----------------------+------------------------+
| GUL                   | GUL                   | G                      |
+-----------------------+-----------------------+------------------------+
| AT                    | AT                    | UL                     |
+-----------------------+-----------------------+------------------------+
| ED                    | ED                    | ATED                   |
+-----------------------+-----------------------+------------------------+
| ĠN                    | ĠN                    | ĠNAT                   |
+-----------------------+-----------------------+------------------------+
| AT                    | AT                    | URAL                   |
+-----------------------+-----------------------+------------------------+
| UR                    | UR                    | ĠGAS                   |
+-----------------------+-----------------------+------------------------+
| AL                    | AL                    | .âĢĶ                   |
+-----------------------+-----------------------+------------------------+
| ĠG                    | ĠG                    | The                    |
+-----------------------+-----------------------+------------------------+
| AS                    | AS                    | Ġt                     |
+-----------------------+-----------------------+------------------------+
| .                     | .                     | Ġe                     |
+-----------------------+-----------------------+------------------------+
| âĢĶ                   | âĢĶ                   | Ġr                     |
+-----------------------+-----------------------+------------------------+
| The                   | The                   | Ġm                     |
+-----------------------+-----------------------+------------------------+
| Ġt                    | Ġt                    | Ġ'                     |
+-----------------------+-----------------------+------------------------+
| Ġe                    | Ġe                    | Ġr                     |
+-----------------------+-----------------------+------------------------+
| Ġr                    | Ġr                    | Ġe                     |
+-----------------------+-----------------------+------------------------+
| Ġm                    | Ġm                    | Ġg                     |
+-----------------------+-----------------------+------------------------+
| Ġ'                    | Ġ'                    | Ġu                     |
+-----------------------+-----------------------+------------------------+
| Ġr                    | Ġr                    | Ġl                     |
+-----------------------+-----------------------+------------------------+
| Ġe                    | Ġe                    | Ġa                     |
+-----------------------+-----------------------+------------------------+
| Ġg                    | Ġg                    | Ġt                     |
+-----------------------+-----------------------+------------------------+
| Ġu                    | Ġu                    | Ġe                     |
+-----------------------+-----------------------+------------------------+
| Ġl                    | Ġl                    | Ġd                     |
+-----------------------+-----------------------+------------------------+
| Ġa                    | Ġa                    | Ġn                     |
+-----------------------+-----------------------+------------------------+
| Ġt                    | Ġt                    | Ġa                     |
+-----------------------+-----------------------+------------------------+
| Ġe                    | Ġe                    | Ġt                     |
+-----------------------+-----------------------+------------------------+
| Ġd                    | Ġd                    | Ġu                     |
+-----------------------+-----------------------+------------------------+
| Ġn                    | Ġn                    |                        |
+-----------------------+-----------------------+------------------------+
| Ġa                    | Ġa                    |                        |
+-----------------------+-----------------------+------------------------+
| Ġt                    | Ġt                    |                        |
+-----------------------+-----------------------+------------------------+
| Ġu                    | Ġu                    |                        |
+=======================+=======================+========================+
```
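Tokens such as `ĠN` and `âĢĶ` in the table are not garbled output: byte-level BPE tokenizers (the convention introduced by GPT-2 and used by the `tokenizers` library's `ByteLevel` pre-tokenizer) map every byte to a printable stand-in character, so a leading space appears as `Ġ` (U+0120) and the em dash's three UTF-8 bytes appear as `âĢĶ`. A minimal sketch of that mapping (the function names here are illustrative, not part of the KL3M codebase):

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style byte-to-unicode table: every byte gets a printable character."""
    # Printable bytes keep their own code point ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ... and the rest are shifted up past U+00FF so they stay printable.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_MAP = bytes_to_unicode()

def visualize(text: str) -> str:
    """Render text the way byte-level tokenizer vocabularies store it."""
    return "".join(BYTE_MAP[b] for b in text.encode("utf-8"))

print(visualize(" natural"))  # Ġnatural -- the leading space becomes Ġ
print(visualize("—"))         # âĢĶ      -- the em dash's three UTF-8 bytes
```

Decoding simply inverts the table and reassembles the original UTF-8 bytes, which is why the odd-looking tokens round-trip losslessly.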


## Training a New Tokenizer

You can replicate the training process for tokenizers like this:
6 changes: 3 additions & 3 deletions examples/compare_char_tokenizers.py
@@ -84,9 +84,9 @@ def print_separator(char="-"):
 # pretty formatting for these so they are as easy to inspect as possible
 print_token_sequences(
     {
-        "kl3m-004-4k": kl3m_4k_tokens,
-        "kl3m-004-8k": kl3m_8k_tokens,
-        "kl3m-004-16k": kl3m_16k_tokens,
+        "0 kl3m-004-4k-cased": kl3m_4k_tokens,
+        "1 kl3m-004-8k-cased": kl3m_8k_tokens,
+        "2 kl3m-004-16k-cased": kl3m_16k_tokens,
     }
 )
 except KeyboardInterrupt:
