Releases · llm-jp/llm-jp-tokenizer
Release v3.0b2
- Update for Hugging Face Tokenizer
- Following the Llama tokenizer's convention, prepend BOS instead of appending EOD (see the sketch below)
- No changes to the SentencePiece model
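A minimal sketch of the new BOS behavior, assuming a checkpoint that bundles the v3.0b2 tokenizer files; the model path below is a placeholder, not a real release:

```python
from transformers import AutoTokenizer

# Placeholder path; substitute a model that bundles the v3.0b2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/<model-with-v3.0b2-tokenizer>")

ids = tokenizer.encode("こんにちは")
# Llama-style handling: BOS is prepended to the sequence,
# and no EOD token is appended at the end.
assert ids[0] == tokenizer.bos_token_id
```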
Release ver2.1
Hugging Face Fast Tokenizer
Specification
- tokenizer class: PreTrainedTokenizerFast
- Unigram Byte-fallback model
- vocab size: 50,570
Requirements
transformers>=4.34.0
tokenizers>=0.14.0
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
- The tokenizer configuration files are bundled with the LLM-jp models distributed via the Hugging Face Hub
- The tokenizer can be instantiated in the usual way with AutoTokenizer.from_pretrained(model_name_or_path)
- A minimal set of the HF tokenizer files is placed in /hf/ver2.1/code10k_en20k_ja30k.ver2.1_hf_fast
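For reference, a minimal round-trip sketch built on the snippet above; the sample sentence is an arbitrary illustration, and the exact IDs depend on the model's special-token configuration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")

text = "自然言語処理"
ids = tokenizer.encode(text)                    # list of token IDs
tokens = tokenizer.convert_ids_to_tokens(ids)   # corresponding subword pieces
print(tokens)
print(tokenizer.decode(ids, skip_special_tokens=True))  # recovers the input text
```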
SentencePiece Tokenizer
Specification
- SentencePiece Unigram Byte-fallback model
- vocab size: 50,570
Requirements
sentencepiece>=0.1.99
protobuf<3.21.0
Usage
from sentencepiece import SentencePieceProcessor
sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")
- The SentencePiece model file is placed in
/models/ver2.1/code10k_en20k_ja30k.ver2.1.model
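A minimal encode/decode sketch with the SentencePiece model; the sample text is illustrative:

```python
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor("models/ver2.1/code10k_en20k_ja30k.ver2.1.model")

text = "自然言語処理"
pieces = sp.encode(text, out_type=str)  # subword pieces (Unigram segmentation)
ids = sp.encode(text)                   # token IDs; byte-fallback covers rare characters
print(pieces)
print(sp.decode(ids))                   # decodes back to the input text
```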
Release ver2.2
What's Changed
- Improve for gpt neox by @hiroshi-matsuda-rit in #3
- Release hf-slow ver2.1 alpha1 by @hiroshi-matsuda-rit in #4
- Strictly distinguishing slow and fast tokenizers by @hiroshi-matsuda-rit in #5
- update docstrings by @hiroshi-matsuda-rit in #6
- Release hf_fast.a2 by @hiroshi-matsuda-rit in #7
- Release hf_slow.a2 by @hiroshi-matsuda-rit in #8
- add trust_remote_code setting files by @hiroshi-matsuda-rit in #9
- Apply Apache 2.0 by @hiroshi-matsuda-rit in #10
- Release ver2.2 by @hiroshi-matsuda-rit in #11
New Contributors
- @hiroshi-matsuda-rit made their first contribution in #3
Full Changelog: https://github.com/llm-jp/llm-ja-tokenizer/commits/v2.2