Releases · llm-jp/llm-jp-tokenizer
Release v3.0b2
- Update for Hugging Face Tokenizer
- Following the Llama tokenizer's convention, prepend BOS instead of appending EOD (see the sketch below)
- No changes to the SentencePiece model
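A minimal sketch of the new BOS behavior, assuming a checkpoint that bundles the v3.0b2 tokenizer files; the model path below is a placeholder, not a real release:

```python
from transformers import AutoTokenizer

# Placeholder path; substitute a model that bundles the v3.0b2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/<model-with-v3.0b2-tokenizer>")

ids = tokenizer.encode("こんにちは")
# Llama-style handling: BOS is prepended to the sequence,
# and no EOD token is appended at the end.
assert ids[0] == tokenizer.bos_token_id
```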
Release ver2.1
Hugging Face Fast Tokenizer
Specification
- tokenizer class: PreTrainedTokenizerFast
- Unigram Byte-fallback model
- vocab size: 50,570
Requirements
transformers>=4.34.0
tokenizers>=0.14.0
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
- The tokenizer configuration files are bundled with the LLM-jp models distributed via the Hugging Face Hub
- The tokenizer can be instantiated in the usual way with AutoTokenizer.from_pretrained(model_name_or_path)
- A minimal set of the HF tokenizer files is placed in /hf/ver2.1/code10k_en20k_ja30k.ver2.1_hf_fast
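For reference, a minimal round-trip sketch built on the snippet above; the sample sentence is an arbitrary illustration, and the exact IDs depend on the model's special-token configuration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")

text = "自然言語処理"
ids = tokenizer.encode(text)                    # list of token IDs
tokens = tokenizer.convert_ids_to_tokens(ids)   # corresponding subword pieces
print(tokens)
print(tokenizer.decode(ids, skip_special_tokens=True))  # recovers the input text
```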
SentencePiece Tokenizer
Specification
- SentencePiece Unigram Byte-fallback model
- vocab size: 50,570
Requirements
sentencepiece>=0.1.99
protobuf<3.21.0
Usage
from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")
- The SentencePiece model file is placed in
/models/ver2.1/code10k_en20k_ja30k.ver2.1.model
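A minimal encode/decode sketch with the SentencePiece model; the sample text is illustrative:

```python
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")

text = "自然言語処理"
pieces = sp.encode(text, out_type=str)  # subword pieces (Unigram segmentation)
ids = sp.encode(text)                   # token IDs; byte-fallback covers rare characters
print(pieces)
print(sp.decode(ids))                   # decodes back to the input text
```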
Release ver2.2
What's Changed
- Improve for gpt neox by @hiroshi-matsuda-rit in #3
- Release hf-slow ver2.1 alpha1 by @hiroshi-matsuda-rit in #4
- Strictly distinguishing slow and fast tokenizers by @hiroshi-matsuda-rit in #5
- update docstrings by @hiroshi-matsuda-rit in #6
- Release hf_fast.a2 by @hiroshi-matsuda-rit in #7
- Release hf_slow.a2 by @hiroshi-matsuda-rit in #8
- add trust_remote_code setting files by @hiroshi-matsuda-rit in #9
- Apply Apache 2.0 by @hiroshi-matsuda-rit in #10
- Release ver2.2 by @hiroshi-matsuda-rit in #11
New Contributors
- @hiroshi-matsuda-rit made their first contribution in #3
Full Changelog: https://github.com/llm-jp/llm-ja-tokenizer/commits/v2.2