
Commit

adding character tokenizer training source and tokenizers
mjbommar committed Dec 31, 2024
1 parent 02f1740 commit 0d8134f
Showing 22 changed files with 169,391 additions and 550 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -43,7 +43,10 @@ the `tokenizers` library. However, unlike most other tokenizers:
* [x] **kl3m-003-64k**: [README](kl3m-003-64k/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-003-64k) | Updated KL3M tokenizer (March 2024)
* [x] **kl3m-004-128k-uncased**: [README](kl3m-004-128k-uncased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-128k-uncased) | Updated KL3M tokenizer (November 2024)
* [x] **kl3m-004-128k-cased**: [README](kl3m-004-128k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-128k-cased) | Updated KL3M tokenizer (November 2024)

* [x] **kl3m-004-char-4k-cased**: [README](kl3m-004-char-4k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-4k-cased) | Updated KL3M tokenizer (December 2024)
* [x] **kl3m-004-char-8k-cased**: [README](kl3m-004-char-8k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-8k-cased) | Updated KL3M tokenizer (December 2024)
* [x] **kl3m-004-char-16k-cased**: [README](kl3m-004-char-16k-cased/README.md) | [Hugging Face](https://huggingface.co/alea-institute/kl3m-004-char-16k-cased) | Updated KL3M tokenizer (December 2024)
## Examples

You can generate your own comparisons against `tiktoken` like this:
93 changes: 93 additions & 0 deletions examples/compare_char_tokenizers.py
@@ -0,0 +1,93 @@
"""
Compare the ALEA character tokenizers with each other.
"""

# imports

# packages
from transformers import AutoTokenizer


def print_token_sequences(sequences: dict[str, list[str]], padding: int = 1):
    """
    Print running columns of token sequences for easy comparison with improved formatting.
    Args:
        sequences (dict[str, list[str]]): A mapping of tokenizer names to token sequences.
        padding (int): Number of spaces to add as padding on each side of cell content.
    Returns:
        None
    """
    # get the keys and max widths
    keys = sorted(list(sequences.keys()))
    max_widths = [
        max(len(key), max(len(token) for token in sequences[key])) for key in keys
    ]
    cell_widths = [width + 2 * padding for width in max_widths]

    # helper functions for rows and horizontal rules
    def print_row(row_data):
        print("|", end="")
        for i, item in enumerate(row_data):
            print(f" {item:{cell_widths[i]}} |", end="")
        print()

    def print_separator(char="-"):
        print(f"+{'+'.join([char * (width + 2) for width in cell_widths])}+")

    # headers
    print_separator("=")
    print_row(keys)
    print_separator("=")

    # get the maximum number of rows across sequences
    max_rows = max(len(sequence) for sequence in sequences.values())
    for row in range(max_rows):
        row_data = []
        for key in keys:
            token = sequences[key][row] if row < len(sequences[key]) else ""
            row_data.append(token)
        print_row(row_data)
        if row < max_rows - 1:
            print_separator()

    # footer
    print_separator("=")


if __name__ == "__main__":
    # get the tokenizers
    kl3m_4k = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-4k-cased")
    kl3m_8k = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")
    kl3m_16k = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")

    # input loop
    while True:
        try:
            # get the input
            text = input("Enter text to tokenize (q to quit): ")
            if text.lower() == "q":
                break

            # tokenize
            kl3m_4k_encoded = kl3m_4k(text)
            kl3m_8k_encoded = kl3m_8k(text)
            kl3m_16k_encoded = kl3m_16k(text)

            # get human-readable tokens
            kl3m_4k_tokens = kl3m_4k.convert_ids_to_tokens(kl3m_4k_encoded["input_ids"])
            kl3m_8k_tokens = kl3m_8k.convert_ids_to_tokens(kl3m_8k_encoded["input_ids"])
            kl3m_16k_tokens = kl3m_16k.convert_ids_to_tokens(kl3m_16k_encoded["input_ids"])

            # pretty formatting for these so they are as easy to inspect as possible
            print_token_sequences(
                {
                    "kl3m-004-4k": kl3m_4k_tokens,
                    "kl3m-004-8k": kl3m_8k_tokens,
                    "kl3m-004-16k": kl3m_16k_tokens,
                }
            )
        except KeyboardInterrupt:
            break
138 changes: 138 additions & 0 deletions kl3m-004-char-16k-cased/README.md
@@ -0,0 +1,138 @@
---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-004
- alea
- legal
- financial
date: '2024-12-30T00:00:00.000Z'
---

# kl3m-004-char-16k-cased

The `kl3m-004-char-16k-cased` **case-sensitive** tokenizer is a domain-specific **character-based** tokenizer trained
on a stratified sample of nearly 2M documents across general, legal, and financial domains from the `kl3m-data` project,
including American English, British English, Spanish, German, French, Italian, and other common EU languages.

This tokenizer uses the standard Byte-Pair Encoding (BPE) tokenizer from `tokenizers`/`transformers`, but modifies the
training process to restrict the vocabulary to tokens that are at most 4 characters long. Models trained with this tokenizer
should be able to handle a number of use cases that are otherwise difficult for standard tokenizers, such as
low-resource spell-checking, OCR correction, whitespace normalization, and other tasks that require a high degree of character-level
granularity.
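
As a rough illustration of that granularity, the tokenizer can be loaded through `transformers` (as the comparison
script added in this commit does) and applied to a deliberately misspelled string. This is only a sketch, not
repository code, and the exact token split will depend on the trained vocabulary:

```python
# Sketch (not part of this repository): inspect the character-level granularity
# of the kl3m-004-char-16k-cased vocabulary via the transformers API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
encoded = tokenizer("Teh Securites Act of 1933")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# expect short, character-level pieces (at most 4 characters) rather than whole words
```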

## Model Details

### Summary

- **Vocabulary**: 16,384 tokens
- **Tokenizer type:** BPE with 1-4 character tokens
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, French, with a small percentage of other EU languages.
- **Data Sources**: See [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository.
- **Developed by:** [ALEA Institute](https://aleainstitute.ai).
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

For more information about the `kl3m-004` tokenizers, see the [kl3m-004-128k-cased tokenizer](https://huggingface.co/alea-institute/kl3m-004-128k-cased).

#### Special Tokens for both Embedding and Generative Models

For both training and inference efficiency, we intended this tokenizer vocabulary to be
usable for both embedding and generative models. As such, we included special tokens
suitable for both causal and masked language modeling tasks.

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`

We also added a number of chat and instruction tokens that were not included in `kl3m-001-32k`, including:

* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

These tokens are identical to those used in the `kl3m-003-64k` tokenizer.
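
As a quick sanity check, the IDs listed above can be confirmed through the same `transformers` loading path used by
the comparison script in this commit (a sketch, not repository code):

```python
# Sketch: verify the special-token IDs documented above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
special_tokens = [
    "<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>", "<|cls|>", "<|mask|>",
    "<|system|>", "</|system|>", "<|user|>", "</|user|>",
    "<|instruction|>", "</|instruction|>",
]
for token in special_tokens:
    print(token, tokenizer.convert_tokens_to_ids(token))
```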

### Replication

The entire data collection and preprocessing pipeline is being made available, along with
training data, as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at:
[https://github.com/alea-institute/kl3m-embedding-research](https://github.com/alea-institute/kl3m-embedding-research)

The data pipeline will be available on GitHub and S3 in the near future.

This specific tokenizer was trained using the following command:

```bash
PYTHONPATH=. poetry run python3 \
kl3m_tokenizers/tokenizers/kl3m_004/train_char_tokenizer.py \
--min_frequency 1000 \
--vocab_size 16384 \
--pad2 \
--max_chars 4 \
sample.20241223173012.jsonl.gz \
./kl3m-004-char-16k-cased/
```

```text
Training tokenizer.
[00:33:12] Pre-processing sequences █████████████████████████████████████████████████████████████ 1849344 / 0
[00:33:32] Pre-processing sequences █████████████████████████████████████████████████████████████ 0 / 0
[00:00:21] Tokenize words █████████████████████████████████████████████████████████████ 20286360 / 20286360
[00:01:01] Count pairs █████████████████████████████████████████████████████████████ 20286360 / 20286360
[00:12:39] Compute merges █████████████████████████████████████████████████████████████ 16036 / 16036
Adding power-of-2 padding tokens.
Padded vocab to 16384 tokens.
Special tokens: 13
Power-of-2 pad tokens: 13
Final vocab size: 16384
Training time: 2863.67 seconds
Output path: kl3m-004-char-16k-cased
```
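
The effect of `--pad2` can be read from the log above: after training, the vocabulary is rounded up to the next power
of two by adding filler tokens. A minimal sketch of that arithmetic follows; `pad_tokens_needed` is an illustrative
helper, not the repository's implementation (which lives in `train_char_tokenizer.py`):

```python
# Minimal sketch of the power-of-2 padding arithmetic implied by the log above.
def pad_tokens_needed(vocab_size: int) -> int:
    """Number of filler tokens needed to round the vocabulary up to the next power of two."""
    target = 1
    while target < vocab_size:
        target *= 2
    return target - vocab_size

# The log reports 13 power-of-2 pad tokens and a final size of 16,384,
# implying 16,371 tokens before padding.
print(pad_tokens_needed(16_371))  # 13 -> final vocab of 16,384 (2**14)
```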

### Uses
This tokenizer is intended to be used for English, Spanish, German, or French language tasks where
character-level details are important, such as OCR correction, spell-checking, or tasks where word boundaries
are not well-defined.

For a standard BPE "word" tokenizer with a larger vocabulary size, consider using the `kl3m-004-128k-cased` or
`kl3m-004-128k-uncased` tokenizers.

### Recommendations
The `kl3m-004-char-16k-cased` tokenizer may be particularly useful when character-level details are important but
resource constraints are not as severe. For smaller vocabularies with better resource efficiency, consider using the
`kl3m-004-char-4k-cased` or `kl3m-004-char-8k-cased` tokenizers.

### How to Get Started with the Model
Use the code below to get started with the model.

```python
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-char-16k-cased')
```
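
The returned encoding exposes the tokens and IDs directly; a minimal usage sketch:

```python
# Minimal usage sketch with the tokenizers API shown above.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
encoding = tokenizer.encode("Hello, world!")
print(encoding.tokens)  # character-level pieces, each at most a few characters
print(encoding.ids)
```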

### Citation
Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-embedding-research).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
