
Overview of Japanese LLMs

[ English | Français | 日本語 ]

Evolution of parameter sizes for Japanese LLMs and non-Japanese LLMs. The information on the Japanese models is derived from this article, while the information on the non-Japanese models is taken from the Models table on LifeArchitect.ai. Due to space constraints in the figure, some models have been omitted, and the parameter counts for non-Japanese models include estimates. Please notify us of any corrections, additions, or updates.

A list of publicly available LLMs trained with a focus on Japanese, along with their evaluation benchmarks. It is maintained by volunteers who draw on sources such as academic papers and other public resources.

::: warning Caution

We can't guarantee the accuracy or completeness of any information here.
Some information is based on conjecture and might not reflect your specific use case.
While many models are released under permissive licenses like MIT or Apache 2.0, some are subject to more restrictive terms, including non-commercial use clauses (e.g., CC BY-NC-SA 4.0) or other stipulations.
:::

Please point out any errors on the issues page. Feel free to contribute directly with a pull request.

::: details Table of Contents
[[toc]]
:::

Text Generation Models

For multimodal models, see below.

Models built from scratch

General purpose

	Architecture	Max Context Length	Training Data	Developer	License / Terms of Use
Sarashina2-8x70B	Mixtral (8x70b (465b))	8,192	Sparse Upcycling on Sarashina2 (70B)	SB Intuitions	Sarashina Model NonCommercial License
LLM-jp-3 172B	Llama (172b, 172b-instruct3)	4,096	Pre-training: llm-jp-corpus-v3 (2.1T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k DPO: synthetic data	Research and Development Center for Large Language Models	Pre-trained model: LLM-jp-3 172B Terms of Use Post-trained model: llm-jp-3-172b-instruct3 Terms of Use
LLM-jp-3 172B beta2	Llama (172b-beta2, 172b-beta2-instruct2)	4,096	Pre-training: part of llm-jp-corpus-v3 (1.4T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k	Research and Development Center for Large Language Models	LLM-jp-3 172B beta2 Terms of Use
LLM-jp-3 172B beta1	Llama (172b-beta1, 172b-beta1-instruct)	4,096	Pre-training: part of llm-jp-corpus-v3 (0.7T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN	Research and Development Center for Large Language Models	LLM-jp-3 172B beta1 Terms of Use
LLM-jp-3 172B alpha	Llama (172b-alpha1, 172b-alpha1-instruct, 172b-alpha2, 172b-alpha2-instruct)	4,096	Pre-training: part of llm-jp-corpus-v3 (alpha1: 0.7T tokens, alpha2: 1.4T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN	Research and Development Center for Large Language Models	Apache 2.0
Stockmark-100b	Llama (100b, 100b-instruct-v0.1)	4,096	Pre-training: RedPajama, Japanese Wikipedia, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus (910B tokens) Instruction Tuning (LoRA): ichikara-instruction	Stockmark	MIT
PLaMo-100B-Pretrained	Llama¹ (100b)	4,096	Pre-training: Japanese CommonCrawl, RefinedWeb, undisclosed (2.0T tokens)	Preferred Elements (Preferred Networks)	PLaMo Non-Commercial License
Sarashina2	Llama (7b, 13b, 70b)	7b, 13b: 4,096 70b: 8,192	Pre-training: Japanese Common Crawl, SlimPajama, StarCoder (2.1T tokens)	SB Intuitions	MIT
Sarashina1	GPT-NeoX (7b, 13b, 65b)	2,048	Pre-training: Japanese Common Crawl (1T tokens)	SB Intuitions	MIT
Tanuki-8×8B	Tanuki (MoE) (47b) (v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF)	4,096	Pre-training: various Web & synthetic datasets (1.7T tokens) SFT, DPO: various synthetic datasets ²	Matsuo Lab LLM Development Project	Apache 2.0
CyberAgentLM3 (CALM3)	Llama (22b-chat)	16,384	undisclosed (2.0T tokens)	CyberAgent	Apache 2.0
LLM-jp-3 13B	Llama (1.8b, 1.8b-instruct, 3.7b, 3.7b-instruct, 13b, 13b-instruct)	4,096	Pre-training: llm-jp-corpus-v3 (2.1T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k	Research and Development Center for Large Language Models	Apache 2.0
llm-jp-3-3.7b-instruct-EZO	Llama (3.7b-instruct-EZO-Common, 3.7b-instruct-EZO-Humanities)	4,096	additionally trained on LLM-jp-3 (3.7B)	Axcxept	Apache 2.0
LLM-jp-13B v2.0	Llama (13b-v2.0, 13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0)	4,096	Pre-training: llm-jp-corpus-v2 (260B tokens) Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2	LLM-jp	Apache 2.0
Fugaku-LLM	GPT (13B, 13B-instruct, 13B-instruct-gguf)	2,048	Pre-training: undisclosed dataset Instruction Tuning: OASST1, Dolly Dataset, GSM8K	Titech, Tohoku Univ., Fujitsu, RIKEN, Nagoya Univ., CyberAgent, Kotoba Technologies	Fugaku-LLM Terms of Use
LLM-jp-13B v1.1	GPT (13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-dpo-lora-hh_rlhf_ja-v1.1)	2,048	Instruction Tuning (LoRA or Full-parameter FT): Dolly Dataset, OASST1, ichikara-instruction DPO (LoRA): HH RLHF	LLM-jp	Apache 2.0
LLM-jp-13B	GPT (1.3b-v1.0, 13b-v1.0, 13b-instruct-full-jaster-v1.0, 13b-instruct-full-jaster-dolly-oasst-v1.0, 13b-instruct-full-dolly-oasst-v1.0, 13b-instruct-lora-jaster-v1.0, 13b-instruct-lora-jaster-dolly-oasst-v1.0, 13b-instruct-lora-dolly-oasst-v1.0)	2,048	Pre-training: llm-jp-corpus (Wikipedia, Japanese mC4, The Pile, Stack) (300B tokens) Instruction Tuning (Full-parameter FT or LoRA): jaster, Dolly Dataset, OASST1	LLM-jp	Apache 2.0
PLaMo-13B	Llama³ (13b, 13b-instruct, 13b-instruct-nc)	base: 4,096 instruct, instruct-nc: 8,192	Pre-training: C4, Project Gutenberg, RedPajama, Japanese Wikipedia, Japanese mC4 (1.5T tokens) Instruction Tuning: Dolly, HH RLHF, OASST1, wikinews (+Alpaca in NC model)	Preferred Networks	Apache 2.0 (CC BY-NC 4.0 as for NC model)
Stockmark-13b	Llama (13b, 13b-instruct)	2,048	Pre-training: Japanese Wikipedia, Japanese CC-100, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus (220B tokens) Instruction Tuning (LoRA): ichikara-instruction	Stockmark	base: MIT instruct: CC BY-NC-SA 4.0
Weblab-10B	GPT-NeoX (10b, 10b-instruction-sft)	2,048	Japanese mC4, The Pile (600B tokens) Instruction Tuning: Alpaca, FLAN	University of Tokyo Matsuo Lab	CC BY‑NC 4.0
Tanuki-8B	Tanuki (8b) (v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF)	4,096	Pre-training: various Web & synthetic datasets (1.3T tokens) SFT, DPO: various synthetic datasets ²	Matsuo Lab LLM Development Project	Apache 2.0
Japanese StableLM Alpha	GPT-NeoX (base-alpha-7b, instruct-alpha-7b, instruct-alpha-7b-v2)	2,048	Wikipedia, Japanese CC‑100, Japanese mC4, Japanese OSCAR, RedPajama, private datasets⁴ (750B tokens) Instruction Tuning: Dolly, HH‑RLHF, wikinews, Alpaca (discarded in v2)	Stability AI	base: Apache 2.0 instruct (v1): Research license instruct (v2): Apache 2.0
CyberAgentLM2 (CALM2)	Llama (7b, 7b-chat, 7b-chat-dpo-experimental)	base: 4,096 chat: 32,768	publicly available Japanese and English datasets (details unknown) (1.3T tokens) DPO: Chatbot Arena Conversations JA (calm2) Dataset	CyberAgent	Apache 2.0 (CC BY 4.0 as for DPO model)
OpenCALM	GPT-NeoX (small, medium, large, 1b(1.4b), 3b(2.7b), 7b(6.8b))	2,048	Japanese Wikipedia, Japanese mC4, Japanese CC‑100	CyberAgent	CC BY‑SA 4.0
Stormy	GPT-NeoX (7b(6.8b))	2,048	OpenCALM fine-tuned on llm-japanese-dataset v0 non-translation tasks	University of Tokyo Izumi Lab	CC BY‑SA 4.0
rinna GPT (En-Ja Bilingual)	GPT-NeoX (4b(3.8b), 4b(3.8b)-8k, 4b(3.8b)-instruction-sft, 4b(3.8b)-instruction-ppo)	8k model: 8,192 others: 2,048	Wikipedia, Japanese CC‑100, Japanese C4, RedPajama, The Pile (524B tokens) Instruction Tuning: HH‑RLHF, FLAN PPO: HH‑RLHF for reinforcement learning 8k: trained with long context	rinna	MIT
japanese-large-lm	GPT-NeoX (1.7b, 3.6b, 1.7b-instruction-sft, 3.6b-instruction-sft)	2,048	Japanese Wikipedia, Japanese CC‑100, Japanese C4, Japanese OSCAR and private datasets (650GB) Instruction Tuning: OASST1	LINE	Apache 2.0
rinna GPT (Japanese only)	GPT / GPT-NeoX (xsmall, small, medium, 1b, neox-small, neox-3.6b, neox-3.6b-instruction-sft, neox-3.6b-instruction-sft-v2, neox-3.6b-instruction-ppo)	≤ 2,048	Japanese Wikipedia, Japanese CC‑100 (1b and up models add Japanese mC4) Instruction Tuning: HH‑RLHF, FLAN, SHP PPO: HH‑RLHF for reinforcement learning	rinna	MIT
RetrievaT5	T5 (small (short), small (medium), small (long), base (short), base (medium), base (long), large (short), large (medium), large (long), xl(3b))		Japanese Wikipedia, Japanese mC4	Retrieva	CC BY‑SA 4.0
Spiral-RetNet-3b-base	RetNet (3b)	2,048	Wikipedia, Japanese CC-100, CulturaX	Spiral.AI	MIT
kotomamba-2.8B	Mamba (2.8B-v1.0)	2,048	Japanese Wikipedia, Swallow Corpus, SlimPajama	Kotoba Technologies	Apache 2.0
ABEJA GPT	GPT / GPT-NeoX (large, neox-2.7b)		Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR	ABEJA	MIT
WasedaGPT	GPT (small, xl(1.5b))		Japanese Wikipedia, Japanese CC‑100	Waseda Kawahara Lab	CC BY‑SA 4.0
StockmarkGPT	GPT-NeoX (1.4b)		Japanese Wikipedia (0.88B tokens), Japanese CC‑100 (10.5B tokens), private data (8.6B tokens)	Stockmark	MIT
YellowbackGPT	GPT-NeoX (1.3b)		Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR	Yellowback	Apache 2.0
Sarashina2.1-1B	Llama (1b)	8,192	Japanese and English data on the web (10T tokens)	SB Intuitions	Sarashina Model NonCommercial License
colorfulscoop GPT	GPT (small)		Japanese Wikipedia	Colorful Scoop	CC BY‑SA 3.0
TitechGPT	GPT (medium, medium-reversed) ⁵		Japanese Wikipedia, Japanese CC‑100	Titech Okazaki Lab	CC BY‑SA 4.0
KyotoUniversityGPT	GPT (small, medium, large)		Japanese Wikipedia (3.2GB), Japanese CC‑100 (85GB), Japanese OSCAR (54GB)	Kyoto University Language Media Processing Lab	CC BY‑SA 4.0
JapaneseBART	BART (base, large)		Japanese Wikipedia (18M sentences)	Kyoto University Language Media Processing Lab	CC BY‑SA 4.0
Megagon Labs T5	T5 (base)		Japanese mC4 (782 GB), Japanese wiki40b (2 GB)	Megagon Labs (Recruit Co.,Ltd.)	Apache 2.0

Domain Specific

	Domain	Architecture	Training Data	Developer	License
Japanese Dialog Transformer	Dialog	Transformer	Twitter Japanese reply pairs	NTT	Evaluation Licence
Japanese News BART	Business	BART (base)	Japanese business news articles (21M articles)	Stockmark	MIT
AcademicBART	Science	BART (base)	CiNii Japanese Papers	Ehime University AI Lab	Apache 2.0

Models built off non-Japanese LLMs (w/ continual pre-training on Japanese)

General purpose

	Base Model	Training Data	Developer	License / Terms of Use
Llama 3.1 Swallow 70B (70B-v0.1, 70B-Instruct-v0.1, 70B-Instruct-v0.3)	Llama 3.1 (70b)	Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie	Swallow Project	Llama 3.1 Community License (Gemma Terms of Use is also applied to the Instruct model)
cyberagent/Llama-3.1-70B-Japanese-Instruct-2407	Llama 3.1 (70b)	undisclosed	CyberAgent	Llama 3.1 Community License
Llama 3 Swallow 70B (70B-v0.1, 70B-Instruct-v0.1)	Llama 3 (70b)	Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath Instruction Tuning: OASST1 ⁶	Swallow Project	Llama 3 Community License
turing-motors/Llama-3-heron-brain-70B-v0.3	Llama 3 (70b)	additionally trained on Llama 3 Swallow 70B (details undisclosed)	Turing	Llama 3 Community License
Llama 3 Youko 70B (70b, 70b-instruct, 70b-gptq, 70b-instruct-gptq)	Llama 3 (70b)	Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (5B tokens) Instruction Tuning: undisclosed dataset⁷	rinna	Llama 3 Community License
Swallow 70B (70b-hf, 70b-instruct-hf, 70b-instruct-v0.1, 70b-NVE-hf, 70b-NVE-instruct-hf)	Llama 2 (70b)	Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 *v0.1: OASST1, OASST2	Swallow Project	Llama 2 Community License
KARAKURI LM (70b-v0.1, 70b-chat-v0.1)	Llama 2 (70b)	Pre-training: mC4, CC100, OSCAR, RedPajama, undisclosed dataset (16B tokens) SteerLM: OASST2, undisclosed dataset	KARAKURI	Llama 2 Community License⁸
Japanese Stable LM Beta 70B (base-beta-70b, instruct-beta-70b)	Llama 2 (70b)	Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, OASST1	Stability AI	Llama 2 Community License
Swallow-MX 8x7B (8x7b-NVE-v0.1)	Mixtral-8x7B-Instruct-v0.1 (46.7b)	Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile, The Vault	Swallow Project	Apache 2.0
KARAKURI LM 8x7B Instruct v0.1 (8x7b-instruct-v0.1)	Mixtral-8x7B-Instruct-v0.1 (46.7b)	trained Swallow-MX 8x7B on the following datasets: Dolly Dataset, OASST2, HelpSteer, glaive-code-assistant-v3, glaive-function-calling-v2, synthetic_text_to_sql, MetaMathQA, orca-math-word-problems-200k, rag-dataset-12000, rag-hallucination-dataset-1000, undisclosed dataset	KARAKURI	Apache 2.0 (?)⁹
KARAKURI LM 8x7B Chat v0.1 (8x7b-chat-v0.1)	Mixtral-8x7B-Instruct-v0.1 (46.7b)	trained Swallow-MX 8x7B on OASST2, HelpSteer, and undisclosed datasets using SteerLM	KARAKURI	Apache 2.0
ABEJA-Mixtral-8x7B-japanese (8x7B-v0.1-japanese, 8x7B-Instruct-v0.1-japanese, 8x7B-Instruct-v0.1-japanese-alpha, 8x7B-Instruct-v0.1-japanese-alpha-merged)	Mixtral-8x7B-Instruct-v0.1 (46.7b) *The model without "Instruct" in its name is based on Mixtral-8x7B-v0.1	Pre-training: Japanese CC, RedPajama, undisclosed dataset (450B tokens)	ABEJA	Apache 2.0
Nekomata 14B (14b, 14b-instruction, 14b-gguf, 14b-instruction-gguf)	Qwen (14b)	Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (66B tokens) Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset	rinna	Tongyi Qianwen LICENSE
Swallow 13B (13b-hf, 13b-instruct-hf, 13b-instruct-v0.1, 13b-NVE-hf)	Llama 2 (13b)	Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 *v0.1: OASST1, OASST2	Swallow Project	Llama 2 Community License
LEIA-Swallow-13B (13b)	Llama 2 (13b)	additionally trained Swallow 13B using LEIA	Individual (Ikuya Yamada, Ryokan Ri)	Llama 2 Community License
ELYZA-japanese-Llama-2-13b (13b, 13b-instruct, 13b-fast, 13b-fast-instruct)	Llama 2 (13b)	Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data (18B tokens) Instruction Tuning: undisclosed dataset	ELYZA	Llama 2 Community License
cyberagent/Mistral-Nemo-Japanese-Instruct-2408	Mistral NeMo (12b)	undisclosed	CyberAgent	Apache 2.0
Llama 3.1 Swallow 8B (8B-v0.1, 8B-Instruct-v0.1, 8B-v0.2, 8B-Instruct-v0.2, 8B-Instruct-v0.3)	Llama 3.1 (8b)	Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie	Swallow Project	Llama 3.1 Community License (Gemma Terms of Use is also applied to the Instruct model)
Llama 3 Swallow 8B (8B-v0.1, 8B-Instruct-v0.1)	Llama 3 (8b)	Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath Instruction Tuning: OASST1 ⁶	Swallow Project	Llama 3 Community License
turing-motors/Llama-3-heron-brain-8B-v0.3	Llama 3 (8b)	additionally trained on Llama 3 Swallow 8B (details undisclosed)	Turing	Llama 3 Community License
Llama 3 Youko 8B (8b, 8b-instruct, 8b-gptq, 8b-instruct-gptq)	Llama 3 (8b)	Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (22B tokens) Instruction Tuning⁷: Aya Dataset (Japanese subset), FLAN, Dolly Dataset, HH RLHF, OASST1, OASST2, MetaMathQA, CodeAlpaca Dataset, undisclosed dataset DPO: HelpSteer, HelpSteer2, undisclosed dataset	rinna	Llama 3 Community License
Llama 3 ELYZA JP 8B (8B, 8B-GGUF, 8B-AWQ)	Llama 3 (8b)	undisclosed	ELYZA	Llama 3 Community License
Llama 3 neoAI 8B Chat v0.1 (8B-Chat-v0.1)	Llama 3 (8b)	undisclosed	neoAI	Llama 3 Community License
Llama 3 tedllm (v0)	Llama 3 (8b)	Pre-training: Japanese generic corpus	Tokyo Electron Device	Llama 3 Community License
Swallow 7B (7b-hf, 7b-instruct-hf, 7b-instruct-v0.1, 7b-NVE-hf, 7b-NVE-instruct-hf, 7b-plus-hf)	Llama 2 (7b)	Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 *v0.1: OASST1, OASST2	Swallow Project	Llama 2 Community License
LEIA-Swallow-7B (7b)	Llama 2 (7b)	additionally trained Swallow 7B using LEIA	Individual (Ikuya Yamada, Ryokan Ri)	Llama 2 Community License
ELYZA-japanese-Llama-2-7b (7b, 7b-instruct, 7b-fast, 7b-fast-instruct)	Llama 2 (7b)	Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data (18B tokens) Instruction Tuning: undisclosed dataset	ELYZA	Llama 2 Community License
Youri 7B (7b, 7b-instruction, 7b-chat, 7b-gptq, 7b-instruction-gptq, 7b-chat-gptq)	Llama 2 (7b)	Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (40B tokens) Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset	rinna	Llama 2 Community License
houou-7b (instruction-7b-v1, instruction-7b-v2, instruction-7b-v3)	Llama 2 (7b)	Instruction-tuned Youri 7B (base) on ichikara-instruction	MoneyForward	Llama 2 Community License
Japanese Stable LM Beta 7B (base-beta-7b, base-ja_vocab-beta-7b, instruct-beta-7b, instruct-ja_vocab-beta-7b)	Llama 2 (7b)	Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, OASST1	Stability AI	Llama 2 Community License
SambaLingo-Japanese (Base, Chat)	Llama 2 (7b)	Pre-training: CulturaX Instruction Tuning: ultrachat_200k DPO: ultrafeedback, cai-conversation-harmless	SambaNova Systems	Llama 2 Community License (?)⁹
blue-lizard (blue-lizard)	Llama 2 (7b)	undisclosed	Deepreneur	Llama 2 Community License
Swallow-MS 7B (7b-v0.1, 7b-instruct-v0.1)	Mistral-7B-v0.1 (7b)	Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, OASST1	Swallow Project	Apache 2.0
RakutenAI-7B (7B, 7B-instruct, 7B-chat)	Mistral-7B-v0.1 (7b)	Pre-training: undisclosed Instruction Tuning: Dolly Dataset, OASST1, datasets converted from the train split of NLU datasets (like jaster), undisclosed dataset	Rakuten	Apache 2.0
Japanese Stable LM Gamma 7B (base-gamma-7b, instruct-gamma-7b)	Mistral-7B-v0.1 (7b)	Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset	Stability AI	Apache 2.0
ChatNTQ JA 7B (7b-v1.0)	Mistral-7B-v0.1 (7b)	Instruction-tuned Japanese Stable LM Gamma 7B (base) on their own datasets	NTQ Solution	Apache 2.0
Shisa Gamma 7B (7b-v1)	Mistral-7B-v0.1 (7b)	Instruction-tuned Japanese Stable LM Gamma 7B (base) on ultra-orca-boros-en-ja	AUGMXNT	Apache 2.0 (?)⁹
Shisa 7B (base-7b-v1, 7b-v1)	Mistral-7B-v0.1 (7b)	Pre-training: shisa-pretrain-en-ja-v1 (8B tokens) Instruction Tuning & DPO: ultra-orca-boros-en-ja, shisa-en-ja-dpo-v1	AUGMXNT	Apache 2.0 (?)⁹
Karasu (7B, 7B-chat, 7B-chat-plus, 7B-chat-plus-unleashed)	Mistral-7B-v0.1 (7b)	Additionally trained Shisa 7B (base) on Aozora Bunko, Japanese Law Precedent Dataset, Japanese Wikipedia, Japanese domain webscrapes from the Japanese subset of CulturaX, UltraChat 200k (7B tokens) Instruction Tuning: ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset	Lightblue	Apache 2.0 (?)⁹
Nekomata 7B (7b, 7b-instruction, 7b-gguf, 7b-instruction-gguf)	Qwen (7b)	Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (66B tokens) Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset	rinna	Tongyi Qianwen LICENSE
lightblue/japanese-mpt-7b	MPT (7b)	Japanese mC4	Lightblue	Apache 2.0
Japanese Stable LM 3B-4E1T (3b-4e1t-base, 3b-4e1t-instruct)	StableLM-3B-4E1T (3b)	Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset	Stability AI	Apache 2.0
kotomamba-2.8B-CL	mamba-2.8b-slimpj (2.8b)	Japanese Wikipedia, Swallow Corpus, SlimPajama	Kotoba Technologies	Apache 2.0
Gemma 2 Baku 2B (2b, 2b-it)	Gemma 2 (2b)	Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (80B tokens) OPRO: undisclosed dataset ¹⁰	rinna	Gemma Terms of Use
Japanese Stable LM 2 1.6B (base, instruct)	Stable LM 2 1.6B (1.6b)	Pre-training: Wikipedia, CulturaX Instruction Tuning: jaster, ichikara-instruction, alpaca-gpt4-japanese, ultra-orca-boros-en-ja-v1	Stability AI	STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE
karasu-1.1B	TinyLlama (1.1b)	Pre-training: Japanese OSCAR, Japanese mC4 (3B tokens)	Lightblue	Apache 2.0

Domain specific

	Domain	Base Model	Developer	License
Llama3-Preferred-MedSwallow-70B (70B)	Medicine	Llama 3 (70b)	Preferred Networks	Llama 3 Community License
AIgroup-CVM-utokyohospital/MedSwallow-70b	Medicine	Llama 2 (70b)	University of Tokyo Hospital Department of Cardiovascular Medicine AI Group	CC BY-NC-SA 4.0
nekomata-14b-pfn-qfin (qfin, qfin-inst-merge)	Finance	Qwen (14b)	Preferred Networks	Tongyi Qianwen LICENSE
Watashiha-Llama-2-13B-Ogiri-sft (sft, sft-neuron)	Oogiri	Llama 2 (13b)	Watashiha	Llama 2 Community License
ELYZA-japanese-CodeLlama-7b (7b, 7b-instruct)	Coding	Code Llama (7b)	ELYZA	Llama 2 Community License
AIBunCho/japanese-novel-gpt-j-6b	Storytelling	GPT-J (6b)	Individual (Hiroyuki Osone)	CreativeML OpenRAIL-M License
NovelAI/genji-jp	Storytelling	GPT-J (6b)	NovelAI	？

Models built off non-Japanese LLMs (w/ post-training on Japanese)

General purpose

	Base Model	Training Data	Developer	License / Terms of Use
AXCXEPT/EZO-Qwen2.5-72B-Instruct AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-72B-Instruct_q4	Qwen2.5 (72b)		Axcxept	Qwen License
ao-Karasu (72B)	Qwen1.5 (72b)	ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, Japanese technical blogs, News stories, QA site answers, undisclosed dataset	Lightblue	Tongyi Qianwen LICENSE (?)⁹
AXCXEPT/Llama-3.1-70B-EZO-1.1-it	Llama 3.1 (70b)		Axcxept	Llama 3.1 Community License
Llama 3 shisa-v1-llama3-70b (70b)	Llama 3 (70b)	ultra-orca-boros-en-ja-v1	Shisa.AI	Llama 3 Community License (?)⁹
AIgroup-CVM-utokyohospital/Llama-2-70b-chat-4bit-japanese	Llama 2 (70b)		University of Tokyo Hospital Department of Cardiovascular Medicine AI Group	Llama 2 Community License
doshisha-mil/llama-2-70b-chat-4bit-japanese-v1	Llama 2 (70b)		Doshisha University Media Informatics Lab	？
AXCXEPT/EZO-Qwen2.5-32B-Instruct AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-32B-Instruct	Qwen2.5 (32b)		Axcxept	Apache 2.0
Qarasu (14B-chat-plus-unleashed)	Qwen (14b)	ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset	Lightblue	Tongyi Qianwen LICENSE (?)⁹
Sparticle/llama-2-13b-chat-japanese-lora	Llama 2 (13b)		Sparticle	？
izumi-lab/llama-13b-japanese-lora-v0-1ep	Llama (13b)		University of Tokyo Izumi Lab	？
AXCXEPT/EZO-Common-9B-gemma-2-it	Gemma 2 (9b)		Axcxept	Gemma Terms of Use
AXCXEPT/EZO-Humanities-9B-gemma-2-it	Gemma 2 (9b)		Axcxept	Gemma Terms of Use
AXCXEPT/Llama-3.1-8B-EZO-1.1-it	Llama 3.1 (8b)		Axcxept	Llama 3.1 Community License
Llama 3 Suzume 8B (8B-japanese, 8B-japanese-gguf)	Llama 3 (8b)	megagonlabs/instruction_ja, ShareGPT, undisclosed dataset	Lightblue	Llama 3 Community License (?)⁹
Llama 3 shisa-v1-llama3-8b (8b)	Llama 3 (8b)	ultra-orca-boros-en-ja-v1	Shisa.AI	Llama 3 Community License (?)⁹
AXCXEPT/Llama-3-EZO-8b-Common-it	Llama 3 (8b)		Axcxept	Llama 3 Community License
ganchengguang/Yoko-7B-Japanese-v1	Llama 2 (7b)		Yokohama National University Mori Lab	？
Sparticle/llama-2-7b-chat-japanese-lora	Llama 2 (7b)		Sparticle	？
izumi-lab/llama-7b-japanese-lora-v0-5ep	Llama (7b)		University of Tokyo Izumi Lab	？
lightblue/jod	Mistral-7B-SlimOrca (7b)		Lightblue	Apache 2.0
NTQAI/chatntq-7b-jpntuned	RWKV-4 World (7b)		NTQ Solution	？
Borea (Jp, Common, Coding)	Phi-3.5 (3.8b)		Axcxept	MIT
AXCXEPT/EZO-Llama-3.2-3B-Instruct-dpoE	Llama 3.2 (3b)		Axcxept	Llama 3.2 Community License
Gemma-2-JPN (2b-jpn-it)	Gemma 2 (2b)		Google	Gemma Terms of Use
AXCXEPT/EZO-gemma-2-2b-jpn-it	Gemma 2 (2b)		Axcxept	Gemma Terms of Use
AXCXEPT/EZO-Common-T2-2B-gemma-2-it	Gemma 2 (2b)		Axcxept	Gemma Terms of Use

Domain specific

	Domain	Base Model	Developer	License
JMedLoRA (llama2-jmedlora-6.89ep)	Medicine	Llama 2 (70b)	University of Tokyo Hospital Department of Cardiovascular Medicine AI Group	CC BY-NC 4.0
AXCXEPT/Qwen2.5-Math-7B-Instruct-jp-EZO_OREO	Mathematics	Qwen2.5-Math-7B-Instruct (7b)	Axcxept	Apache 2.0

Merged models

	Original Models (Japanese LLMs in bold)	Developer	License
EQUES/MedLLama3-JP-v2	Llama 3 Swallow 8B (Instruct), OpenBioLLM-8B, MMed-Llama 3 8B, Llama 3 ELYZA JP 8B	EQUES	Llama 3 Community License
EvoLLM-JP-A (v1-7B)	Shisa Gamma 7B (v1), Arithmo2 Mistral 7B, Abel 7B 002	Sakana AI	Apache 2.0
EvoLLM-JP (v1-7B, v1-10B)	Shisa Gamma 7B (v1), WizardMath-7B-V1.1, Abel 7B 002	Sakana AI	MICROSOFT RESEARCH LICENSE

API-based models

	Max Context Length	Developer	Platform
Solar mini chat ja (solar-1-mini-chat-ja)	32,768	Upstage	self-owned
AI Novelist	2,400 ~ 8,192	Bit192	self-owned
LHTM-OPT		alt Inc.	AWS Marketplace
tsuzumi (tsuzumi-7b)		NTT	Azure AI Foundry

Encoder models

General purpose

	Architecture	Max Input Length	Training Data	Developer	License	HuggingFace? ¹¹
KyotoUniBERT	BERT (base, large)	512	Japanese Wikipedia (18M articles)	Kyoto University Language Media Processing Lab	Apache 2.0	△
TohokuUniversityBERT	BERT (base, large)	512	base (v1): Japanese Wikipedia (17M articles / 2.6GB) base (v2) & large: Japanese Wikipedia (4.0GB) base (v3) & large (v2): Japanese Wikipedia (4.9GB), Japanese CC‑100 (74.3GB)	Tohoku University NLP Group	base (v1, v2) & large: CC BY‑SA 3.0 base (v3) & large (v2): Apache 2.0	◯ (base (v1), base (v1, char-level), base (v2), base (v2, char-level), large, large (char-level), base (v3), base (v3, char-level), large (v2), large (v2, char-level))
TohokuNLP BERT-alpha 500M	Llama-based encoder¹²	4,096 or 8,192	Japanese subset of llm-jp-corpus-v3	Tohoku University NLP Group	Apache 2.0	◯ (sq4096-alpha, sq8192-alpha)
NICT BERT	BERT (base)	512	Japanese Wikipedia	NICT	CC BY 4.0	△
Laboro BERT	BERT (base, large)	512	Japanese Web Corpus (News and blogs, etc) (12GB)	Laboro.AI	CC BY‑NC 4.0	✕
colorfulscoop BERT	BERT (base)	512	Japanese Wikipedia	Colorful Scoop	CC BY‑SA 3.0	◯
UniversityOfTokyoBERT	BERT (small)	512	Japanese Wikipedia (2.9GB)	University of Tokyo Izumi Lab	CC BY‑SA 4.0	◯
chiTra (Sudachi Transformers)	BERT (base)	512	NINJAL Web Japanese Corpus (148GB)	NINJAL, WAP Tokushima Laboratory of AI and NLP	Apache 2.0	△
ACCMS BERT	BERT (base)	512	Japanese Wikipedia (3.3GB)	Kyoto University ACCMS	CC BY‑SA 4.0	◯
HitachiBERT	BERT (base)	512	Japanese Wikipedia, Japanese CC‑100	Hitachi	CC BY‑NC‑SA 4.0	◯¹³
RetrievaBERT	BERT ¹⁴	2,048	Japanese CommonCrawl, RefinedWeb, Chinese Wikipedia, Korean Wikipedia, The Stack	Retrieva	Apache 2.0	◯
Bandai Namco DistilBERT	DistilBERT	512	(Distillation of TohokuUniversityBERT (base))	Bandai Namco Research	MIT	◯
Laboro DistilBERT	DistilBERT	512	(Distillation of Laboro BERT (base))	Laboro.AI	CC BY‑NC 4.0	◯
LINE DistilBERT	DistilBERT	512	(Distillation of LINE internal BERT model)	LINE	Apache 2.0	◯
rinna RoBERTa	RoBERTa (base)	512	Japanese Wikipedia, Japanese CC‑100	rinna	MIT	◯
WasedaRoBERTa	RoBERTa (base, large)	512	Japanese Wikipedia, Japanese CC‑100	Waseda Kawahara Lab	CC BY‑SA 4.0	◯ (base, large, large (seq512))¹⁵
InformatixRoBERTa	RoBERTa (base)	512	Japanese Wikipedia, Web Articles (25GB)	Informatix	Apache 2.0	△
KyotoUniversityRoBERTa	RoBERTa (base, large)	512	Japanese Wikipedia, Japanese CC‑100	Kyoto University Language Media Processing Lab	CC BY‑SA 4.0	◯ (base (char-level), large (char-level))
YokohamaNationalRoBERTa	RoBERTa (base)	512	Japanese Wikipedia (3.45GB)	Yokohama National University Mori Lab	Apache 2.0	◯
Megagon Labs RoBERTa	RoBERTa (base)¹⁶	1,282	Japanese mC4 (200M sentences)	Megagon Labs (Recruit Co.,Ltd.)	MIT	◯
ACCMS RoBERTa	RoBERTa (base)	512	Japanese Wikipedia (3.3GB) + Japanese CC‑100 (70GB)	Kyoto University ACCMS	CC BY‑SA 4.0	◯
CinnamonELECTRA	ELECTRA (small)	512	Japanese Wikipedia	Cinnamon	Apache 2.0	◯
Megagon Labs ELECTRA	ELECTRA (base)	512	Japanese mC4 (200M sentences)	Megagon Labs (Recruit Co.,Ltd.)	MIT	◯
UniversityOfTokyoELECTRA	ELECTRA (small, base)	512	Japanese Wikipedia (2.9GB)	University of Tokyo Izumi Lab	CC BY‑SA 4.0	◯ (small, base)
JapaneseRoFormer	RoFormer (base)	512	Japanese Wikipedia (3.45GB)	Yokohama National University Mori Lab	Apache 2.0	◯
JapaneseLUKE	LUKE (base, large)	512	Japanese Wikipedia	Studio Ousia	Apache 2.0	◯ (base, large)
KyotoUniversityDeBERTaV2	DeBERTaV2 (tiny, base, large)	512	Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR (171GB)	Kyoto University Language Media Processing Lab	CC BY‑SA 4.0	◯ (tiny, tiny (char-level), base, large)
KyotoUniversityDeBERTaV3	DeBERTaV3 (base)	512	llm-jp-corpus	Kyoto University Language Media Processing Lab	Apache 2.0	◯
UniversityOfTokyoDeBERTaV2	DeBERTaV2 (small, base)	512	Japanese Wikipedia, Japanese Wikinews, Japanese CC-100, Japanese mC4, Japanese OSCAR	University of Tokyo Izumi Lab	CC BY-SA 4.0	◯ (small, base)
GLOBIS DeBERTaV3	DeBERTaV3 (xsmall, base, large)	512	Wikipedia, WikiBooks, Aozora Bunko, Japanese CC-100, Japanese mC4, Japanese OSCAR	GLOBIS	CC BY-SA 4.0	◯ (xsmall, base, large)
JapaneseBigBird	BigBird (base)	4,096	Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR	Waseda Kawahara Lab	CC BY‑SA 4.0	◯
JapaneseLayoutLM	LayoutLM (base)	512	Pre-trained on Japanese Wikipedia, initialized with TohokuUniversityBERT	The Japan Research Institute, Limited	CC BY-SA 3.0	◯

Domain Specific

	Domain	Architecture	Training Data	Developer	License	HuggingFace?
JapaneseBlogELECTRA	Colloquial language	ELECTRA (small)	Japanese Blog Corpus (354M sentences)	Kitami Institute of Technology Masui-Ptaszynski Lab	CC BY‑SA 4.0	◯
JapaneseSpokenLanguageBERT	Spoken language	BERT (base)	Additional training for TohokuUniversityBERT using Corpus of Spontaneous Japanese (CSJ) (the DAPT model also uses Diet records)	Retrieva	Apache 2.0	◯
AcademicRoBERTa	Science	RoBERTa (base)	CiNii Japanese Papers (6.3M sentences)	Ehime University AI Lab	Apache 2.0	◯
local-politics-BERT	Politics	BERT (base)	Wikipedia, Minutes of the National Diet, Minutes of the Local Assembly	Japanese Local Assembly Minutes Corpus Project	CC BY-SA 4.0	◯ (SC-min, SC-minwiki, SC-2M-wiki, SC-2M-min, SC-2M-minwiki, FP-min, FP-minwiki) ¹⁷
UBKE-LUKE	Economics	LUKE (base)	Japanese Wikipedia, Securities Reports, Economic News Articles	Uzabase	CC BY-NC	◯
JapaneseFinancialBERT	Finance	BERT (small, base)¹⁸	Japanese Wikipedia, Japanese Financial Corpus (27M sentences/5.2GB)	University of Tokyo Izumi Lab	CC BY‑SA 4.0	◯ (small, base)
JapaneseFinancialELECTRA	Finance	ELECTRA (small)	Japanese Wikipedia (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB)	University of Tokyo Izumi Lab	CC BY‑SA 4.0	◯
JapaneseNewsBERT	Business	BERT (base)	Japanese Business Articles (3M articles)	Stockmark	CC BY 4.0	△
JapaneseNewsXLNet	Business	XLNet (base)	Japanese Business Articles (3M articles)	Stockmark	？	◯ ※ Unofficial release
JapaneseNewsALBERT	Business	ALBERT (base)	Japanese Business Articles (3M articles)	Stockmark	？	△
MinpakuBERT	Cultural Heritage	BERT (base)	Additional training with National Museum of Ethnology's cultural heritage data on top of Tohoku University BERT	University of Hyogo Ohshima Lab	MIT	◯ (minpaku-v1, minpaku-v3, minpaku-v3-no-additional-token)
UTH-BERT	Medicine	BERT (base)	Japanese Medical Records (120M lines)	University of Tokyo Hospital Medical AI Development Course	CC BY‑NC‑SA 4.0	△
medBERTjp	Medicine	BERT (base)	Japanese Wikipedia, Japanese Medical Corpus ("今日の診療プレミアム/Today's Care Premium" Web Version)	Osaka University Hospital Medical Informatics Lab	CC BY‑NC‑SA 4.0	△
JMedRoBERTa	Medicine	RoBERTa (base)	Japanese Medical Papers (11M sentences/1.8GB)	NII Aizawa Lab	CC BY‑NC‑SA 4.0	◯ (ManbyoWordPiece, SentencePiece)¹⁹

Sentence and Document Embeddings ²⁰

Bi-Encoders

Single-representation bi-encoders

	Max Context Length	Developer	License
sbintuitions/sarashina-embedding-v1-1b	8,192	SB Intuitions	Sarashina Model NonCommercial License
RoSEtta (pkshatech/RoSEtta-base-ja)	1,024	PKSHA Technology	Apache 2.0
GLuCoSE v2 (pkshatech/GLuCoSE-base-ja-v2)	512	PKSHA Technology	Apache 2.0
Ruri (cl-nagoya/ruri-pt-small, cl-nagoya/ruri-pt-base, cl-nagoya/ruri-pt-large, cl-nagoya/ruri-small, cl-nagoya/ruri-base, cl-nagoya/ruri-large)	512	Nagoya University Sasano Group	Apache 2.0
Japanese SimCSE (cl-nagoya/unsup-simcse-ja-base, cl-nagoya/unsup-simcse-ja-large, cl-nagoya/sup-simcse-ja-base, cl-nagoya/sup-simcse-ja-large)	512	Nagoya University Sasano Group	CC BY-SA 4.0
GLuCoSE (pkshatech/GLuCoSE-base-ja)	512	PKSHA Technology	Apache 2.0
colorfulscoop/sbert-base-ja		Colorful Scoop	CC BY‑SA 4.0
MU-Kindai/SBERT-JSNLI-base MU-Kindai/SBERT-JSNLI-large		Kindai University	？
MU-Kindai/Japanese-SimCSE-BERT-base-unsup MU-Kindai/Japanese-SimCSE-BERT-large-unsup MU-Kindai/Japanese-SimCSE-RoBERTa-base-unsup MU-Kindai/Japanese-SimCSE-BERT-base-sup MU-Kindai/Japanese-SimCSE-BERT-large-sup		Kindai University	MIT
pkshatech/simcse-ja-bert-base-clcmlp		PKSHA Technology	CC BY‑SA 4.0
MU-Kindai/Japanese-MixCSE-BERT-base MU-Kindai/Japanese-MixCSE-BERT-large		Kindai University	MIT
MU-Kindai/Japanese-DiffCSE-BERT-base		Kindai University	MIT
bclavie/fio-base-japanese-v0.1		Individual (Benjamin Clavié)
cl-nagoya/shioriha-large-pt		Nagoya University Sasano Group

Multi-representation bi-encoders

	Developer	License
JaColBERTv2.5 (JaColBERTv2.4, JaColBERTv2.5)	Answer.AI	MIT
JaColBERTv2 (JaColBERTv2)	Individual (Benjamin Clavié)	MIT
JaColBERT (JaColBERT)	Individual (Benjamin Clavié)	MIT
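
The difference between the two bi-encoder families described in footnote 20 can be sketched as follows. This is a minimal illustration on hand-written toy vectors, not the scoring code of any model listed above: the function names `cosine` and `maxsim` are hypothetical, and the use of cosine similarity inside the ColBERT-style sum is an assumption made for simplicity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction (sketch).

    For each query token vector, take the best-matching document token
    vector, then sum these maxima over all query tokens.
    """
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Single-representation: one vector per text, one similarity per pair.
query_single = [1.0, 0.0]
doc_single = [1.0, 1.0]
single_score = cosine(query_single, doc_single)

# Multi-representation: one vector per token (toy values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
multi_score = maxsim(query_tokens, doc_tokens)

print(f"single-representation score: {single_score:.4f}")
print(f"multi-representation score:  {multi_score:.4f}")
```

In both cases the document vectors can be computed ahead of time, which is what distinguishes bi-encoders from the cross-encoders listed next.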

Cross-Encoders

	Developer	License
Ruri-Reranker (cl-nagoya/ruri-reranker-stage1-small, cl-nagoya/ruri-reranker-stage1-base, cl-nagoya/ruri-reranker-stage1-large, cl-nagoya/ruri-reranker-small, cl-nagoya/ruri-reranker-base, cl-nagoya/ruri-reranker-large)	Nagoya University Sasano Group	Apache 2.0
hotchpotch/japanese-reranker-cross-encoder-xsmall-v1 hotchpotch/japanese-reranker-cross-encoder-small-v1 hotchpotch/japanese-reranker-cross-encoder-base-v1 hotchpotch/japanese-reranker-cross-encoder-large-v1 hotchpotch/japanese-bge-reranker-v2-m3-v1	Individual (Yuichi Tateno)	MIT

Vision-Language Models

Text+Image to Text

Models built from scratch

General purpose

	Architecture	Training Data	Developer	License / Terms of Use
llava-calm2-siglip (llava-calm2-siglip)	LLaVA-1.5	conversational data generated from MS-COCO and VisualGenome	CyberAgent	Apache 2.0
LLM-jp-3 VILA 14B (14b)	LLaVA-1.5	Japanese image text pairs, LLaVA-Pretrain, Japanese interleaved data, coyo (subset), mmc4-core (subset), llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja, LLaVA-1.5 instruction data (subset)	Research and Development Center for Large Language Models	Apache 2.0 & OpenAI Terms of Use
Heron (blip-ja-stablelm-base-7b-v0, blip-ja-stablelm-base-7b-v1, blip-ja-stablelm-base-7b-v1-llava-620k, git-ja-stablelm-base-7b-v0, git-ELYZA-fast-7b-v0, git-ja-stablelm-base-7b-v1)	BLIP-2 / GIT	v1: LLaVA-Instruct-150K-JA or LLaVA-Instruct-620K-JA v0: LLaVA-Instruct-150K-JA, Japanese STAIR Captions, Japanese Visual Genome VQA dataset	Turing	CC BY-NC 4.0
Japanese Stable VLM (japanese-stable-vlm)	LLaVA-1.5	Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA dataset	Stability AI	STABILITY AI JAPANESE STABLE VLM COMMUNITY LICENSE
Japanese InstructBLIP Alpha (japanese-instructblip-alpha)	InstructBLIP	Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA dataset	Stability AI	JAPANESE STABLELM RESEARCH LICENSE
rinna MiniGPT-4 (bilingual-gpt-neox-4b-minigpt4)	MiniGPT-4	CC12M, COCO 2014, Visual Genome, STAIR Captions, Japanese Visual Genome VQA dataset	rinna	MIT

Domain Specific

	Architecture	Domain	Developer	License
watashiha/Watashiha-Llama-2-13B-Ogiri-sft-vlm	LLaVA	Oogiri	Watashiha	Llama 2 Community License

Models built off non-Japanese VLMs

	Base Model	Training Data	Developer	License
AXCXEPT/EZO-InternVL2-26B	InternVL2	-	Axcxept	MIT

Merged models

	Original Models (Japanese LLMs in bold)	Developer	License
Llama-3-EvoVLM-JP-v2 (v2)	Mantis-8B-SigLIP-Llama-3, Llama-3-ELYZA-JP-8B, Bunny-v1.1-Llama-3-8B-V	Sakana AI	Llama 3 Community License
AXCXEPT/Llama-3-EZO-VLM-1	- (trained from Llama-3-EvoVLM-JP-v2)	Axcxept	Llama 3 Community License
EvoVLM-JP (v1-7B)	Shisa Gamma 7B (v1), LLaVA-1.6-Mistral-7B	Sakana AI	Apache 2.0

Text to Image

General Purpose

	Architecture	Training Data	Developer	License
CommonArt β (commonart-beta)	PixArt-Σ	CommonCatalog-cc-by, Megalith-10M, Smithsonian Open Access, ArtBench (CC-0 only)	AI Picasso	Apache 2.0
EvoSDXL-JP (v1)	Stable Diffusion	- (merged from several diffusion models, including Japanese Stable Diffusion XL)	Sakana AI	Apache 2.0²¹
Japanese Stable Diffusion XL (japanese-stable-diffusion-xl)	Stable Diffusion	undisclosed	Stability AI	STABILITY AI JAPANESE STABLE DIFFUSION XL COMMUNITY LICENSE
TohokuUniversity Stable Diffusion (base, refiner)	Stable Diffusion	WMT2023 Shared Task English-Japanese parallel corpus, about 13 million captions from laion2B-multi	Tohoku University NLP Group	CreativeML OpenRAIL-M License
rinna Stable Diffusion (japanese-stable-diffusion)	Stable Diffusion	LAION-5B Japanese Subset (100M images)	rinna	CreativeML OpenRAIL-M License

Domain Specific

	Architecture	Domain	Developer	License
Evo-Nishikie (v1)	Stable Diffusion (ControlNet)	Ukiyo-e	Sakana AI	Apache 2.0²¹
Evo-Ukiyoe (v1)	Stable Diffusion	Ukiyo-e	Sakana AI	Apache 2.0²¹

Text to Video

	Architecture	Training Data	Developer	License
AIdeaLab VideoJP (AIdeaLab-VideoJP)	CogVideoX	Pixabay, FineVideo	AIdeaLab	Apache 2.0

Others

	Architecture	Training Data	Developer	License
LY CLIP (clip-japanese-base)	CLIP	CommonCrawl, CC12M, YFCC100M	LY Corp.	Apache 2.0
Recruit CLIP (japanese-clip-vit-b-32-roberta-base)	CLIP	about 120 million captions from laion2B-multi	Recruit Co.,Ltd.	CC BY 4.0
Japanese Stable CLIP (japanese-stable-clip-vit-l-16)	SigLIP	CC12M translated to Japanese, STAIR Captions	Stability AI	STABILITY AI JAPANESE STABLE CLIP COMMUNITY LICENSE
rinna CLIP (japanese-clip-vit-b-16)	CLIP	CC12M translated to Japanese	rinna	Apache 2.0
rinna CLOOB (japanese-cloob-vit-b-16)	CLOOB	CC12M translated to Japanese	rinna	Apache 2.0
HAKUHODO Technologies CLIP (base, deeper, wider)	CLIP	about 120 million captions from laion2B-multi	HAKUHODO Technologies	CC BY-NC-SA 4.0

Speech-Language Models

Automatic Speech Recognition

	Architecture	Training Data	Developer	License
Kotoba-Whisper (v1.0, v1.0-ggml, v1.0-faster, v1.1, bilingual-v1.0, bilingual-v1.0-ggml, bilingual-v1.0-faster, v2.0, v2.0-ggml, v2.0-faster, v2.1, v2.2)	Distil-Whisper	ReazonSpeech	Kotoba Technologies	Apache 2.0
Nue ASR (nue-asr)	Nue ASR (HuBERT + LLM)	ReazonSpeech	rinna	Apache 2.0
ReazonSpeech (espnet-v1, espnet-next, espnet-v2, nemo-v2)	ESPnet (Conformer-Transducer) / NeMo (FastConformer-RNNT)	ReazonSpeech	Reazon Holdings	Apache 2.0

Others

	Architecture	Training Data	Developer	License
Kotoba-Speech (v0.1)	Transformer	undisclosed	Kotoba Technologies	Apache 2.0
UniversityOfTokyoHuBERT (base-jtube)	HuBERT	JTubeSpeech	University of Tokyo Saruwatari & Takamichi Lab	MIT
rinna HuBERT (base, large)	HuBERT	ReazonSpeech	rinna	Apache 2.0
Reazon wav2vec 2.0 (base, large)	wav2vec 2.0	ReazonSpeech	Reazon Holdings	Apache 2.0
rinna wav2vec 2.0 (base)	wav2vec 2.0	ReazonSpeech	rinna	Apache 2.0

Evaluation Benchmarks for Japanese LLMs

Hybrid Benchmarks

	Description	Developer
Nejumi LLM Leaderboard3	Evaluates the Japanese language capabilities of LLMs from three perspectives: language understanding ability, application ability, and alignment (including controllability and safety). For more details, see this article.	Weights & Biases
Japanese LLM Evaluation	Conducts a comprehensive evaluation of various LLMs based on three types of tasks: Japanese language understanding and generation tasks, Japanese multi-turn dialogue tasks, and English language understanding and generation tasks. Also publishes swallow-evaluation, an evaluation script that integrates and improves existing LLM evaluation tools.	Swallow Project

Traditional Benchmarks based on Natural Language Understanding tasks

	Description	Developer
Open Japanese LLM Leaderboard	Evaluates Japanese language models in 16 different tasks using llm-jp-eval.	LLM-jp, Hugging Face
llm-jp-eval	A tool that evaluates Japanese LLMs automatically across multiple datasets. The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE).	LLM-jp
JP Language Model Evaluation Harness	A fork by Stability AI of EleutherAI/lm-evaluation-harness. It is a tool for automatically evaluating Japanese LLMs across multiple datasets. The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE). There is a detailed summary of the evaluation results by rinna: [rinna] Benchmark of Stability-AI/lm-evaluation-harness	Stability AI
JGLUE	Japanese version of the GLUE benchmark suite, including the MARC-ja, JCoLA, JSTS, JNLI, JSQuAD, and JCommonsenseQA tasks. JCoLA is by the University of Tokyo's Oseki Lab. See here and here (ja only) for further details about each task.	Waseda University Kawahara Lab and Yahoo
JMMLU	A benchmark constructed as a Japanese version of the MMLU Benchmark, consisting of multiple-choice questions from a wide range of academic fields including natural sciences, humanities, and social sciences. In addition to translating the original MMLU, it features newly added problems based on the unique cultural background of Japan (Japan-specific problems).	Waseda University Kawahara Lab

Benchmarks on open-ended generative tasks

	Description	Developer
Japanese MT-bench	The Japanese version of MT-bench, which evaluates multi-turn conversational ability. It includes 80 questions, 10 from each of 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, and Humanities. Some questions were modified to fit Japanese culture when the Japanese version was produced. It also includes a script that has GPT-4 score responses on a 10-point absolute scale.	Stability AI
ELYZA-tasks-100	Ranking based on model responses to 100 complex and diverse tasks, including tasks testing summarization, correction, abstraction, induction, and other skills. Uses humans to score the model responses and then ranks models based on their mean scores.	ELYZA
Preferred Generation Benchmark (pfgen-bench)	A benchmark to measure the Japanese language generation ability of LLMs based on 50 common sense questions unique to the Japanese context. It evaluates along three axes: Fluency, Truthfulness, and Helpfulness. The evaluation is conducted without using LLM-as-a-Judge by calculating n-gram or rule-based metrics.	Preferred Elements (Preferred Networks)
Rakuda Benchmark	Ranking based on model answers to 40 open-ended questions on Japanese geography, history, politics, and society. Uses GPT-4 to judge model outputs pairwise, and then ranks models by fitting a Maximum Likelihood Elo/Bradley-Terry model to GPT-4's preferences.	YuzuAI
Japanese Vicuna QA Benchmark	This is the Japanese version of vicuna-blog-eval, which is the predecessor of MT-Bench. It includes 80 questions on general knowledge, role-playing, common sense, Fermi estimation, counterfactual thinking, coding, mathematics, and writing. It also includes a script for automatic evaluation by GPT-4 (win-rate calculation). The leaderboard can be found here.	Kyoto University Language Media Processing Lab
Tengu-Bench	Includes 120 free-form questions from various categories. Categories of questions: table interpretation, logic puzzles, idea generation, function calling, long document summarization (over a thousand tokens), conversation summarization, long document closed QA (over a thousand tokens), honorifics, project creation, math, translation, extraction, ethical control, cost estimation, Japan, chit-chat, puns, formatting, construction, business, legal judgment, politics, hypothetical questions.	Lightblue
Shaberi	A framework that can collectively evaluate the Japanese MT-bench, Rakuda Benchmark, ELYZA-tasks-100, and Tengu-Bench. There is also a fork by Shisa.AI.	Lightblue
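
The Rakuda Benchmark row above mentions ranking models by fitting a Bradley-Terry model to pairwise preferences. The idea can be sketched as below. This is a minimal illustration using the standard minorization-maximization update on made-up match results, not Rakuda's actual implementation; the function name `fit_bradley_terry` and the model names `A`, `B`, `C` are hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each model to a strength normalised to sum to 1.
    Assumes every model wins at least once; otherwise the maximum
    likelihood estimate does not exist and strengths collapse to zero.
    """
    models = sorted({m for pair in matches for m in pair})
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in matches:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Toy judge preferences: each tuple is (preferred model, other model).
toy_matches = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
strengths = fit_bradley_terry(toy_matches)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under this model, the probability that model i is preferred over model j is strength[i] / (strength[i] + strength[j]), so the fitted strengths give a ranking that accounts for which opponents each model was compared against.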

Benchmarks for measuring performance in specific domains

	Description	Developer
Japanese Language Model Financial Evaluation Harness	A benchmark for Japanese LLMs in the financial sector. It includes tasks such as sentiment analysis in finance (chabsa), basic knowledge tasks in securities analysis (cma_basics), tasks related to audits in certified public accountant examinations (cpa_audit), multiple choice question tasks in financial planner exams (fp2), and mock exam tasks for securities salespeople exams (security_sales_1). For more details, please see here.	Preferred Networks
pfmt-bench-fin-ja	A benchmark for measuring the generation capabilities of Japanese LLMs in the financial domain.	Preferred Networks
Stockmark Business Questions	The collection includes 50 questions that probe knowledge on topics such as market trends, current affairs, social issues, and business trends.	Stockmark
JMED-LLM	A dataset for evaluating LLMs in the Japanese medical domain. It compiles previously developed Japanese medical language processing tasks for LLM benchmarking.	NAIST Social Computing Lab.
JMedBench	A benchmark for LLMs in the Japanese medical field. It includes 20 datasets in 5 types of tasks: multi-choice question-answering, machine translation, named entity recognition, document classification, and semantic textual similarity (some datasets are borrowed from JMMLU and JMED-LLM). A tool called med-eval is developed to facilitate evaluation on JMedBench.	NII Aizawa Lab
Japanese Medical Language Model Evaluation Harness	A benchmark for evaluating Japanese LLMs in the medical domain in both Japanese and English, executable by a single command.	Individual (Issey Sukeda)
karakuri-bench	A dataset for measuring performance of Japanese LLMs in customer support.	KARAKURI

Benchmarks for measuring factuality and safety

	Description	Developer
JTruthfulQA	The Japanese version of TruthfulQA, a dataset for evaluating the factuality of LLMs. It includes questions about superstitions and other beliefs held by some people that are not factual, as well as questions about Japan-specific knowledge, all collected from scratch.	Waseda University Kawahara Lab
JCommonsenseMorality	A dataset on Japanese commonsense morality. Sentences describing actions are labeled with binary values indicating whether they are morally wrong or acceptable.	Hokkaido University Language Media Lab
JBBQ	The Japanese version of the social bias QA dataset BBQ, developed through translation, revision, and addition of questions based on Japanese culture and customs.	University of Tokyo Yanaka Lab

Benchmarks for measuring logical reasoning capabilities

	Description	Developer
JFLD (Japanese Formal Logic Deduction)	A dataset for evaluating the deductive reasoning capabilities of Japanese LLMs (the Japanese version of FLD (Formal Logic Deduction), proposed by the same authors). It is composed of counterfactual samples so that reasoning can be evaluated separately from the knowledge the LLM possesses.	Hitachi
JHumanEval	A Japanese version of the HumanEval benchmark, which assesses the ability to generate Python code from English instructions. In creating the Japanese version, the text was first machine-translated and then manually corrected.	Japan Women's University Kuramitsu Lab

Benchmarks on controlled text generation

	Description	Developer
LCTG Bench	A benchmark for the controllability of Japanese LLMs. It evaluates whether LLMs can adhere to constraints in four aspects: output format, character count, keywords, and forbidden words. The quality of the generated text is also evaluated.	CyberAgent
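
The four constraint types named in the LCTG Bench row (output format, character count, keywords, forbidden words) lend themselves to simple rule-based checks. The following is a minimal sketch of such checks, not LCTG Bench's own evaluation code: the function `check_constraints`, its parameters, and the choice of JSON as the example output format are all assumptions made for illustration.

```python
import json


def check_constraints(
    text,
    max_chars=None,
    required_keywords=(),
    forbidden_words=(),
    require_json=False,
):
    """Return a dict of constraint name -> bool for one generated text.

    Only the constraints that were actually requested appear in the result.
    """
    results = {}
    if require_json:
        # Format constraint (assumed here to mean "output must be valid JSON").
        try:
            json.loads(text)
            results["format"] = True
        except json.JSONDecodeError:
            results["format"] = False
    if max_chars is not None:
        # len() counts Unicode code points, so each kana or kanji counts as 1.
        results["character_count"] = len(text) <= max_chars
    if required_keywords:
        results["keywords"] = all(k in text for k in required_keywords)
    if forbidden_words:
        results["forbidden_words"] = not any(w in text for w in forbidden_words)
    return results


generated = "東京は日本の首都です。"
report = check_constraints(
    generated,
    max_chars=20,
    required_keywords=("東京", "首都"),
    forbidden_words=("大阪",),
)
print(report)
```

Checks like these only cover constraint adherence; the quality of the generated text, which the benchmark also evaluates, needs a separate judgment.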

Benchmarks for embedding models

	Description	Developer
JMTEB	A benchmark developed as the Japanese version of MTEB. It consists of tasks such as document clustering, text classification, sentence similarity, sentence pair labeling prediction, and text retrieval (a reranking task was recently added).	SB Intuitions
JQaRA	A dataset for evaluating Japanese document retrieval and reranking accuracy. Each of the 1,667 questions is assigned 100 candidate documents, of which at least one can answer the question. The questions are taken from JAQKET, and the candidate documents are sourced from Japanese Wikipedia.	Individual (Yuichi Tateno)
JaCWIR	A dataset created for evaluating document retrieval and reranking in domains other than Wikipedia. Each of the 5,000 questions is assigned one Web page that serves as the source of the question and 99 unrelated Web pages.	Individual (Yuichi Tateno)
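
For datasets shaped like JaCWIR, where each question has exactly one relevant candidate among many unrelated ones, a reranker can be scored by how high it places the relevant candidate. The sketch below computes mean reciprocal rank (MRR) on toy scores. The choice of MRR is an assumption for illustration; this document does not state which metrics JQaRA or JaCWIR officially report, and the function names are hypothetical.

```python
def reciprocal_rank(scores, positive_index):
    """Return 1 / rank of the relevant candidate (rank 1 is best).

    `scores` holds one relevance score per candidate; higher is better.
    """
    positive_score = scores[positive_index]
    rank = 1 + sum(1 for s in scores if s > positive_score)
    return 1.0 / rank


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (scores, positive_index) pairs."""
    return sum(reciprocal_rank(s, p) for s, p in queries) / len(queries)


# Two toy questions with three candidates each; index 0 is the relevant one.
toy_queries = [
    ([0.9, 0.2, 0.1], 0),  # relevant candidate ranked 1st
    ([0.3, 0.8, 0.5], 0),  # relevant candidate ranked 3rd
]
mrr = mean_reciprocal_rank(toy_queries)
print(f"MRR = {mrr:.3f}")  # MRR = 0.667
```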

Benchmarks for vision-language models

	Description	Developer
JMMMU	A benchmark constructed as the Japanese version of MMMU Benchmark. It consists of 720 translated MMMU problems and 600 new problems unique to Japanese culture.	University of Tokyo Aizawa Lab
JDocQA	A question-answer dataset based on Japanese documents (pamphlets, slides, reports, websites), consisting of a total of 11,600 questions. It includes various question formats, including unanswerable questions.	NAIST Watanabe Lab
Heron VLM Leaderboard powered by Nejumi/WandB	Summarizes the evaluation results of Japanese-Heron-Bench and LLaVA-Bench-In-the-Wild (Japanese).	Turing, Weights & Biases
Japanese-Heron-Bench	21 images are assigned a total of 102 questions. It is characterized by image-question pairs that require knowledge related to Japan.	Turing
JA-VLM-Bench-In-the-Wild	A dataset independently prepared by Sakana AI to evaluate EvoVLM-JP-v1-7B. It consists of 50 questions assigned to 42 images. It is characterized by images and questions that require knowledge about Japan.	Sakana AI
JA-Multi-Image-VQA	A dataset for evaluating the question-answering ability in Japanese for multiple images.	Sakana AI
LLaVA-Bench-In-the-Wild (Japanese)	This is the Japanese version of LLaVA-Bench-In-the-Wild, translated using DeepL. It consists of 60 questions assigned to 24 images.	Turing
LLaVA-Bench (COCO) Japanese	This is the Japanese version, translated by DeepL, of the LLaVA-Bench (COCO) dataset used to evaluate LLaVA. It consists of 30 images, each with 3 types of questions assigned to them.	Turing
Japanese Visual Genome VQA dataset	A question-and-answer dataset annotated based on images from the Visual Genome dataset. A subset of this dataset, JA-VG-VQA-500, consisting of 500 questions, is sometimes used as a benchmark for evaluating VLMs.	Yahoo

References for Models and Architectures

References for Training Methods

Our Contributors

We love contributors! Feel free to contribute to this project.

Citation

The summary of this repository is also published as a preprint: Exploring Open Large Language Models for the Japanese Language: A Practical Guide

When referencing this repository, please cite as follows:

@article{awesomeJapanese2024,
    title={{Exploring Open Large Language Models for the Japanese Language: A Practical Guide}},
    author={Kaito Sugimoto},
    doi={10.51094/jxiv.682},
    journal={Jxiv preprint},
    year={2024}
}

Footnotes

1. Some architectural changes have been made. For details, refer to: 1,000億パラメータ規模の独自LLM「PLaMo-100B」の事前学習
2. Refer to the following articles: 大規模言語モデルTanuki-8B, 8x8Bの位置づけや開発指針など, 大規模言語モデルを開発するにあたっての事前・事後学習の戦略メモー特に合成データについてー
3. Some performance enhancements have been made to the original Llama model. See here for details.
4. Details have not been made public, but the private dataset includes data from the EleutherAI Polyglot project's Japanese team and from members of Stable Community Japan.
5. This project conducted evaluation research on using right-to-left generation instead of the usual left-to-right generation, releasing both left-to-right and right-to-left models.
6. Before conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added.
7. After conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added.
8. However, if commercial use of KARAKURI LM is desired, direct contact with the developer, KARAKURI Inc., is required.
9. Instruction Tuning uses data generated by OpenAI's models, such as GPT-3.5 and GPT-4, so the model may violate OpenAI's terms.
10. Before conducting Instruction Tuning, a Chat Vector between Gemma 2 Instruct and Gemma 2 Base is added.
11. ◯: The model is on the HuggingFace Model Hub and can be loaded with AutoModel.from_pretrained(). △: The model is not on the Model Hub but can be loaded manually with the HuggingFace transformers library. ✕: The model cannot be loaded directly with HuggingFace.
12. Used as an encoder-type model by removing Causal Attention from Llama.
13. This project conducted evaluation research on pre-tokenization morphological analysis and released their best performing model, which used Juman++ and BPE.
14. However, the maximum sequence length has been extended to 2048, and various architectural changes have been made compared to the original BERT. See the HuggingFace repository README for details.
15. nlp-waseda/roberta-base-japanese and nlp-waseda/roberta-large-japanese were trained with a 128-token context length, while nlp-waseda/roberta-large-japanese-seq512 extends the context length to 512.
16. Extended to a 1282 context length from the usual 512.
17. For details of each model, please refer to Chapter 4 of the authors' paper. Note that the SC-2M-wiki model is, strictly speaking, not a domain-specific model, as it is pre-trained only on Wikipedia.
18. The "small" model trains on Japanese Wikipedia and the Japanese Financial Corpus simultaneously, while the "base" model takes the TohokuUniversityBERT and conducts additional training on the Japanese Financial Corpus.
19. ManbyoWordPiece conducts a pre-tokenization step using MeCab (IPA+Manbyo dictionaries) and uses WordPiece for subword tokenization, while the SentencePiece model tokenizes text directly using a unigram model.
20. The classification of embedding models follows Dense Text Retrieval based on Pretrained Language Models: A Survey (Zhao+, 2022). The Bi-Encoder architecture feeds the two inputs into the model separately and vectorizes each, using their dot product or cosine similarity as a measure of their proximity. In contrast, the Cross-Encoder architecture feeds the combined inputs into the model and computes their proximity directly inside the model. Although Cross-Encoders incur higher computational costs, they are often used as rerankers in information retrieval because they can compute input proximity more precisely. Among Bi-Encoders, some types (e.g., ColBERT) represent the input as multiple vectors (such as one per token) rather than a single vector, hence the further classification into Single-representation bi-encoders and Multi-representation bi-encoders.
21. However, it calls for consideration for use in research and education. Additionally, be aware that some of the licenses for the source models are not Apache 2.0.
