Skip to content

Latest commit

 

History

History
87 lines (58 loc) · 2.71 KB

tokenizer.rst

File metadata and controls

87 lines (58 loc) · 2.71 KB

Tokenizers

Currently, Lance has built-in support for Jieba and Lindera. However, it doesn't come with its own language models. If tokenization is needed, you can download language models by yourself. You can specify the location where the language models are stored by setting the environment variable LANCE_LANGUAGE_MODEL_HOME. If it's not set, the default value is

... code-block::bash
${system data directory}/lance/language_models

It also supports configuring user dictionaries, which makes it convenient for users to expand their own dictionaries without retraining the language models.

Language Models of Jieba

Downloading the Model

... code-block::bash
python -m lance.download jieba

The language model is stored by default in ${LANCE_LANGUAGE_MODEL_HOME}/jieba/default.

Using the Model

... code-block::python
ds.create_scalar_index("text", "INVERTED", base_tokenizer="jieba/default")

User Dictionaries

Create a file named config.json in the root directory of the current model.

... code-block::json
{
"main": "dict.txt", "users": ["path/to/user/dict.txt"]

}

Language Models of Lindera

Downloading the Model

... code-block::bash
python -m lance.download lindera -l [ipadic|ko-dic|unidic]

Note that the language models of Lindera need to be compiled. Please install lindera-cli first. For detailed steps, please refer to https://github.com/lindera/lindera/tree/main/lindera-cli.

The language model is stored by default in ${LANCE_LANGUAGE_MODEL_HOME}/lindera/[ipadic|ko-dic|unidic]

Using the Model

... code-block::python
ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")

User Dictionaries

Create a file named config.json in the root directory of the current model.

... code-block::json
{
"main": "main", "users": "path/to/user/dict.bin", "user_kind": "ipadic|ko-dic|unidic"

}

  • The "main" field is optional. If not filled, the default is the "main" directory.
  • "user" is the path of the user dictionary. The user dictionary can be passed as a CSV file or as a binary file compiled by lindera-cli.
  • The "user_kind" field can be left blank if the user dictionary is in binary format. If it's in CSV format, you need to specify the type of the language model.

Create your own language model

Put your language model into LANCE_LANGUAGE_MODEL_HOME.