Currently, Lance has built-in support for Jieba and Lindera. However, it doesn't come with its own language models. If tokenization is needed, you can download language models by yourself. You can specify the location where the language models are stored by setting the environment variable LANCE_LANGUAGE_MODEL_HOME. If it's not set, the default value is
- ... code-block::bash
- ${system data directory}/lance/language_models
It also supports configuring user dictionaries, which makes it convenient for users to expand their own dictionaries without retraining the language models.
- ... code-block::bash
- python -m lance.download jieba
The language model is stored by default in ${LANCE_LANGUAGE_MODEL_HOME}/jieba/default.
- ... code-block::python
- ds.create_scalar_index("text", "INVERTED", base_tokenizer="jieba/default")
Create a file named config.json in the root directory of the current model.
- ... code-block::json
- {
- "main": "dict.txt", "users": ["path/to/user/dict.txt"]
}
- The "main" field is optional. If not filled, the default is "dict.txt".
- "users" is the path of the user dictionary. For the format of the user dictionary, please refer to https://github.com/messense/jieba-rs/blob/main/src/data/dict.txt.
- ... code-block::bash
- python -m lance.download lindera -l [ipadic|ko-dic|unidic]
Note that the language models of Lindera need to be compiled. Please install lindera-cli first. For detailed steps, please refer to https://github.com/lindera/lindera/tree/main/lindera-cli.
The language model is stored by default in ${LANCE_LANGUAGE_MODEL_HOME}/lindera/[ipadic|ko-dic|unidic]
- ... code-block::python
- ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")
Create a file named config.json in the root directory of the current model.
- ... code-block::json
- {
- "main": "main", "users": "path/to/user/dict.bin", "user_kind": "ipadic|ko-dic|unidic"
}
- The "main" field is optional. If not filled, the default is the "main" directory.
- "user" is the path of the user dictionary. The user dictionary can be passed as a CSV file or as a binary file compiled by lindera-cli.
- The "user_kind" field can be left blank if the user dictionary is in binary format. If it's in CSV format, you need to specify the type of the language model.
Put your language model into LANCE_LANGUAGE_MODEL_HOME.