Releases: llm-jp/llm-jp-tokenizer

Release v3.0b2

07 Aug 04:20
c693f9d
  • Update for the Hugging Face tokenizer
    • Following the Llama tokenizer's convention, a BOS token is now added instead of EOD (see the sketch below)
  • No changes to the SentencePiece model
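
A minimal sketch for checking the new behavior (the tokenizer path below is hypothetical; substitute the actual location of the v3.0b2 files):

from transformers import AutoTokenizer

# Hypothetical path to the v3.0b2 HF tokenizer files
tokenizer = AutoTokenizer.from_pretrained("path/to/v3.0b2_hf_tokenizer")
ids = tokenizer("こんにちは")["input_ids"]
# With the Llama-style convention, the first token should be BOS
print(tokenizer.convert_ids_to_tokens(ids))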

Release ver2.1

18 Oct 06:36
132f216

Hugging Face Fast Tokenizer

Specification

  • tokenizer class: PreTrainedTokenizerFast
    • Unigram Byte-fallback model
  • vocab size: 50,570

Requirements

  • transformers>=4.34.0
  • tokenizers>=0.14.0

Usage

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
  • The tokenizer configuration files are bundled with the LLM-jp models distributed via the Hugging Face Hub
  • The tokenizer can be instantiated in the usual way with AutoTokenizer.from_pretrained(model_name_or_path)
  • A minimal set of HF tokenizer files is placed in /hf/ver2.1/code10k_en20k_ja30k.ver2.1_hf_fast
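
A brief usage sketch continuing from the snippet above (the sample text and printed outputs are illustrative):

# Round-trip a sample string and check the vocabulary size
text = "こんにちは、世界"
ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))
print(len(tokenizer))  # should match the vocab size above: 50,570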

SentencePiece Tokenizer

Specification

  • SentencePiece Unigram Byte-fallback model
  • vocab size: 50,570

Requirements

  • sentencepiece>=0.1.99
  • protobuf<3.21.0

Usage

from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")
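
A brief usage sketch continuing from the snippet above (the sample text is illustrative; with byte-fallback, characters outside the vocabulary decompose into byte pieces rather than becoming unknown tokens):

pieces = sp.encode("こんにちは、世界", out_type=str)
print(pieces)
ids = sp.encode("こんにちは、世界")
print(sp.decode(ids))
print(sp.get_piece_size())  # should match the vocab size above: 50,570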

Release ver2.2

09 Oct 02:54
f0cd49d

What's Changed

Full Changelog: https://github.com/llm-jp/llm-ja-tokenizer/commits/v2.2