
Commit c44fb59

Merge pull request #1615 from janhq/chore/model-run-docs-update

chore: add document for function calling

2 parents: 206650f + 611901a

5 files changed: +887 −27 lines

docs/docs/capabilities/models/index.mdx

Lines changed: 10 additions & 4 deletions
@@ -7,17 +7,23 @@ description: The Model section overview
 🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
 :::

+Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.
+
+Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we plan to support the TensorRT-LLM and ONNX engines in the future.
+
+Remote models (such as OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for the OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.
+
 When Cortex.cpp is started, it automatically starts an API server, this is inspired by Docker CLI. This server manages various model endpoints. These endpoints facilitate the following:
 - **Model Operations**: Run and stop models.
 - **Model Management**: Manage your local models.
 :::info
 The model in the API server is automatically loaded/unloaded by using the [`/chat/completions`](/api-reference#tag/inference/post/v1/chat/completions) endpoint.
 :::
 ## Model Formats
-Cortex.cpp supports three model formats:
-- GGUF
-- ONNX
-- TensorRT-LLM
+Cortex.cpp supports three model formats, and each format requires a specific engine to run:
+- GGUF - runs with the `llama-cpp` engine
+- ONNX - runs with the `onnxruntime` engine
+- TensorRT-LLM - runs with the `tensorrt-llm` engine

 :::info
 For details on each format, see the [Model Formats](/docs/capabilities/models/model-yaml#model-formats) page.
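To make the auto load/unload behavior described above concrete, here is a minimal sketch of calling the local API server's `/v1/chat/completions` endpoint. The base URL, port, and model id are placeholders (not values from this commit), and the response is assumed to follow an OpenAI-compatible shape.

```python
# Minimal sketch: one request to a locally running Cortex.cpp API server.
# The host/port and model id below are assumptions -- replace them with the
# values from your own setup.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # placeholder server address
MODEL_ID = "my-local-gguf-model"        # placeholder model id

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
# Assuming an OpenAI-compatible response body:
print(resp.json()["choices"][0]["message"]["content"])
```

Because the endpoint loads and unloads the model automatically, a simple request like this does not need an explicit start call.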

docs/docs/capabilities/models/model-yaml.mdx

Lines changed: 111 additions & 17 deletions
@@ -10,7 +10,7 @@ import TabItem from "@theme/TabItem";
 🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
 :::

-Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
+Cortex.cpp utilizes a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.

 ## Structure of `model.yaml`

@@ -39,6 +39,23 @@ temperature: 0.6 # Ranges: 0 to 1
 frequency_penalty: 0 # Ranges: 0 to 1
 presence_penalty: 0 # Ranges: 0 to 1
 max_tokens: 8192 # Should be default to context length
+seed: -1
+dynatemp_range: 0
+dynatemp_exponent: 1
+top_k: 40
+min_p: 0.05
+tfs_z: 1
+typ_p: 1
+repeat_last_n: 64
+repeat_penalty: 1
+mirostat: false
+mirostat_tau: 5
+mirostat_eta: 0.1
+penalize_nl: false
+ignore_eos: false
+n_probs: 0
+n_parallels: 1
+min_keep: 0
 ## END OPTIONAL
 # END INFERENCE PARAMETERS

@@ -54,6 +71,7 @@ prompt_template: |+ # tokenizer.chat_template
 ## BEGIN OPTIONAL
 ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
 ngl: 33 # Undefined = loaded from model
+engine: llama-cpp
 ## END OPTIONAL
 # END MODEL LOAD PARAMETERS

@@ -84,23 +102,59 @@ stop:
   - <|end_of_text|>
   - <|eot_id|>
   - <|eom_id|>
-stream: true
-top_p: 0.9
-temperature: 0.6
-frequency_penalty: 0
-presence_penalty: 0
-max_tokens: 8192
+stream: true # Default true?
+top_p: 0.9 # Ranges: 0 to 1
+temperature: 0.6 # Ranges: 0 to 1
+frequency_penalty: 0 # Ranges: 0 to 1
+presence_penalty: 0 # Ranges: 0 to 1
+max_tokens: 8192 # Should be default to context length
+seed: -1
+dynatemp_range: 0
+dynatemp_exponent: 1
+top_k: 40
+min_p: 0.05
+tfs_z: 1
+typ_p: 1
+repeat_last_n: 64
+repeat_penalty: 1
+mirostat: false
+mirostat_tau: 5
+mirostat_eta: 0.1
+penalize_nl: false
+ignore_eos: false
+n_probs: 0
+n_parallels: 1
+min_keep: 0
+
 ```
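As a side illustration of the structure shown in the example above, the sketch below reads a `model.yml`-style file and prints a few of the documented fields. The file path is hypothetical, and Cortex.cpp parses these files internally; this is only for local inspection.

```python
# Illustrative only: load a model.yml like the example above and inspect a few
# of the documented fields. The path is a placeholder; adjust it to wherever
# your models folder actually lives.
from pathlib import Path

import yaml  # pip install pyyaml

config_path = Path.home() / "cortex" / "models" / "my-model" / "model.yml"  # assumed location
config = yaml.safe_load(config_path.read_text())

print("engine:", config.get("engine", "llama-cpp"))  # llama-cpp is the default for GGUF models
print("ctx_len:", config.get("ctx_len"))
print("ngl:", config.get("ngl"))
print("stop tokens:", config.get("stop", []))
```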
 Inference parameters define how the results will be produced. The required parameters include:
-| **Parameter** | **Description** | **Required** |
-|------------------------|--------------------------------------------------------------------------------------|--------------|
-| `top_p` | The cumulative probability threshold for token sampling. | No |
-| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. | No |
-| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
-| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. | No |
-| `max_tokens` | Maximum number of tokens in the output. | No |
-| `stream` | Enables or disables streaming mode for the output (true or false). | No |
-| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
+
+| **Parameter** | **Description** | **Required** |
+|---------------|-----------------|--------------|
+| `stream` | Enables or disables streaming mode for the output (true or false). | No |
+| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
+| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
+| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
+| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
+| `max_tokens` | Maximum number of tokens in the output for a single turn. | No |
+| `seed` | Seed for the random number generator. `-1` means no fixed seed. | No |
+| `dynatemp_range` | Dynamic temperature range. | No |
+| `dynatemp_exponent` | Dynamic temperature exponent. | No |
+| `top_k` | The number of most likely tokens to consider at each step. | No |
+| `min_p` | Minimum probability threshold for token sampling. | No |
+| `tfs_z` | The z parameter used for tail-free sampling. | No |
+| `typ_p` | The cumulative probability threshold used for typical sampling. | No |
+| `repeat_last_n` | Number of previous tokens to penalize for repetition. | No |
+| `repeat_penalty` | Penalty applied to repeated tokens. | No |
+| `mirostat` | Enables or disables Mirostat sampling (true or false). | No |
+| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
+| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
+| `penalize_nl` | Penalizes newline tokens (true or false). | No |
+| `ignore_eos` | Ignores the end-of-sequence token (true or false). | No |
+| `n_probs` | Number of token probabilities to return. | No |
+| `min_keep` | Minimum number of tokens to keep. | No |
+| `n_parallels` | Number of parallel sequences to run, which lets you serve multiple chat sessions at the same time. Note that `ctx_len` must be scaled along with `n_parallels` (e.g., n_parallels=1, ctx_len=2048 -> n_parallels=2, ctx_len=4096). | No |
+| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |


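The `n_parallels` note in the table is simple arithmetic; the sketch below (an illustration, not code from this commit) computes the `ctx_len` to configure so that each parallel session keeps the same context budget.

```python
# The total context window is shared across parallel sequences, so ctx_len
# should scale with n_parallels to keep each session's context constant.
def required_ctx_len(n_parallels: int, per_session_ctx: int = 2048) -> int:
    """Return the ctx_len to configure for a given number of parallel chat sessions."""
    return n_parallels * per_session_ctx

print(required_ctx_len(1))  # 2048
print(required_ctx_len(2))  # 4096, matching the example in the table
```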
### Model Load Parameters
@@ -114,14 +168,54 @@ prompt_template: |+

 ctx_len: 0
 ngl: 33
+engine: llama-cpp
+
 ```
 Model load parameters include the options that control how Cortex.cpp runs the model. The required parameters include:
 | **Parameter** | **Description** | **Required** |
 |------------------------|--------------------------------------------------------------------------------------|--------------|
-| `ngl` | Number of attention heads. | No |
+| `ngl` | Number of model layers to offload to the GPU. | No |
 | `ctx_len` | Context length (maximum number of tokens). | No |
 | `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
+| `engine` | The engine that runs the model; defaults to `llama-cpp` for local models in GGUF format. | Yes |
+
+All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
+
+## Runtime parameters
+
+In addition to the predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters that override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).
+
+### Model start params
+
+Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:
+
+```
+cache_enabled: bool
+ngl: int
+n_parallel: int
+cache_type: string
+ctx_len: int
+
+## Support for vision models
+mmproj: string
+llama_model_path: string
+model_path: string
+```
+
+| **Parameter** | **Description** | **Required** |
+|------------------------|--------------------------------------------------------------------------------------|--------------|
+| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
+| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. Default is `false`. | No |
+| `mmproj` | Path to the mmproj GGUF model, which adds support for LLaVA-style vision models. | No |
+| `llama_model_path` | Path to the LLM GGUF model. | No |
+
+These parameters will override the `model.yml` parameters when starting a model through the API.
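As a rough sketch of overriding `model.yml` defaults at start time, the request below posts some of the parameters listed above to the model start API. The base URL, model id, and exact body shape are assumptions; consult the linked API reference for the authoritative schema.

```python
# Hedged sketch: override a few model.yml defaults when starting a model.
# Host/port, model id, and the request schema are assumptions, not taken
# from this commit.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # placeholder server address

resp = requests.post(
    f"{BASE_URL}/models/start",
    json={
        "model": "my-local-gguf-model",  # placeholder model id
        "ngl": 20,             # offload fewer layers to the GPU than model.yml specifies
        "ctx_len": 4096,       # larger context for this session
        "cache_type": "q8_0",  # quantized KV cache instead of the default f16
        "cache_enabled": True,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.status_code, resp.json())
```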
+### Chat completion API parameters
+
+The API is accessible at the `/v1/chat/completions` URL and accepts all parameters from the chat completion API as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).
+
+With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
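To illustrate per-request overrides, the sketch below passes a few of the inference parameters from the table above directly in a `/v1/chat/completions` call. Again, the base URL and model id are placeholders, and an OpenAI-compatible response shape is assumed.

```python
# Hedged sketch: per-request overrides of inference parameters on the
# chat completions endpoint. Base URL and model id are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:39281/v1/chat/completions",  # placeholder server address
    json={
        "model": "my-local-gguf-model",            # placeholder model id
        "messages": [{"role": "user", "content": "Summarize GGUF in one sentence."}],
        "temperature": 0.2,        # lower randomness than the model.yml default
        "top_k": 20,
        "max_tokens": 256,
        "stop": ["<|eot_id|>"],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```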

 :::info
 You can download all the supported model formats from the following:
