For models that are not natively supported by the Fastllm framework, you can add support by describing the model structure yourself. A custom Python model requires only a single Python file describing the model structure; you can refer to the QWEN implementation as an example.
When using `ftllm.chat`, `ftllm.webui`, or `ftllm.server`, you can add the `--custom` parameter to specify the custom model file.
Assuming the model is located in the `~/Qwen2-7B-Instruct/` directory and the custom model file is `~/qwen2.py`, you can load the Qwen2 model with the custom model file using:

```bash
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py
```

The usage for `server` and `webui` is similar.
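For example, launching the API server with the same custom model file uses the same flags, just a different entry point:

```bash
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py
```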
When creating a custom model, you need to implement a model description class that inherits from `ftllm.llm.ComputeGraph`.
Refer to the code in QWEN:

```python
from ftllm.llm import ComputeGraph

class Qwen2Model(ComputeGraph):
```
At the end of the file, you need to define the `__model__` variable to specify the class corresponding to the custom model structure:

```python
__model__ = Qwen2Model
```
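Putting these pieces together, the overall layout of a custom model file looks like this (a minimal skeleton; the `build` body is covered in the next example):

```python
from ftllm.llm import ComputeGraph

class Qwen2Model(ComputeGraph):
    def build(self):
        # read config, then describe the computation flow here
        ...

__model__ = Qwen2Model
```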
The model description class needs to implement the `build` method, which obtains the model parameters and describes the computation flow. Here is an example based on the sample code:
```python
class Qwen2Model(ComputeGraph):
    def build(self):
        # 1. Get weight, data, config
        weight, data, config = self.weight, self.data, self.config
        # 2. Set some config values
        config["max_positions"] = 128000
        # 3. Describe the computation flow
        head_dim = config["hidden_size"] // config["num_attention_heads"]
        self.Embedding(data["inputIds"], weight["model.embed_tokens.weight"], data["hiddenStates"])
        # The rest of the computation flow follows; see the example code for details
```
`self.config` holds the model configuration, which by default is read from the `config.json` file in the model folder. You can modify parameters in `config` within the `build` method, such as changing `max_positions` to adjust the context length.
For some models, the variable names used in `config.json` may differ and need to be assigned manually during `build`. For example, the TeleChat7B model configuration has no `max_positions` variable and instead uses `seq_length` to represent the length. In the `build` method, you need to assign it with the following code:

```python
self.config["max_positions"] = self.config["seq_length"]
```
In `config`, the following variables must be assigned (if the variable names in `config.json` already match, no action is needed):

```python
self.config["max_positions"] # Represents the maximum context length
```
`self.weight` represents the weight data. `self.weight[weightName]` refers to the parameter named `weightName` in the model file (corresponding to the parameter names in the `.safetensors` files in the HF model folder).
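For example, with the standard Qwen2 checkpoint layout, lookups inside `build` might be written as follows (a sketch; the layer-0 key below is the usual HF Qwen2 parameter name and is shown for illustration):

```python
# Weight names mirror the keys in the HF .safetensors files.
embed_table = self.weight["model.embed_tokens.weight"]          # token embedding table
q_proj = self.weight["model.layers.0.self_attn.q_proj.weight"]  # layer 0 query projection
```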
`self.data` represents the intermediate variables and input variables of the computation flow. `self.data[dataName]` refers to the intermediate variable named `dataName`, where `dataName` can be any string except the following reserved input variable names (a usage sketch follows the list):
Input variables:

```python
data["inputIds"]        # Input tokens
data["positionIds"]     # Position information
data["attentionMask"]   # Mask information
data["sin"]             # Sin table for rotary position embedding
data["cos"]             # Cos table for rotary position embedding
data["atype"]           # Data type used during inference
data["pastKey."][i]     # Key cache for the i-th block
data["pastValue."][i]   # Value cache for the i-th block
```
Use the functions of the base class `ComputeGraph` to describe the computation flow. The currently supported operators are documented in Custom Model Operators.
(The interface for custom models in C++ is still under modification...)