:::warning
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
:::
Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.
Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we have plans to support TensorRT-LLM and ONNX engines in the future.
Remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.
When Cortex.cpp starts, it automatically launches an API server (a design inspired by the Docker CLI). This server manages various model endpoints, which facilitate the following:
- **Model Operations**: Run and stop models.
- **Model Management**: Manage your local models.
:::info
The API server automatically loads and unloads models when you use the [`/chat/completions`](/api-reference#tag/inference/post/v1/chat/completions) endpoint.
:::
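
As a sketch of how this works in practice, the request below sends a chat completion to the local API server; if the referenced model is not loaded yet, the server loads it on demand. The base URL, port, and model ID here are illustrative assumptions (adjust them to your installation and to the models you have pulled), not values documented on this page.

```python
import requests

# Assumed local Cortex.cpp API server address; change it to match your setup.
BASE_URL = "http://127.0.0.1:39281/v1"

# Send a chat completion; the server loads the model on demand if needed.
response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3.1:8b-gguf",  # hypothetical model ID for illustration
        "messages": [
            {"role": "user", "content": "Hello! What can you do?"},
        ],
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```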
## Model Formats
Cortex.cpp supports three model formats, and each format requires a specific engine to run:

- GGUF - runs with the `llama-cpp` engine
- ONNX - runs with the `onnxruntime` engine
- TensorRT-LLM - runs with the `tensorrt-llm` engine
:::info
For details on each format, see the [Model Formats](/docs/capabilities/models/model-yaml#model-formats) page.
:::
Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
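
To see which models have been downloaded and parsed into the `models` folder, you can query the model management endpoints mentioned above. The snippet below is a minimal sketch that assumes a `GET /v1/models` listing endpoint and the default local server address; check the API reference for the exact endpoint path and response shape.

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default local server address

# List the models Cortex.cpp currently knows about (sketch; the response
# format may differ, so see the API reference for the authoritative shape).
response = requests.get(f"{BASE_URL}/models", timeout=30)
response.raise_for_status()

payload = response.json()
models = payload.get("data", payload) if isinstance(payload, dict) else payload
for model in models:
    print(model.get("id", model) if isinstance(model, dict) else model)
```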
### Inference Parameters

These parameters control how the model generates output at inference time:

| Parameter | Description | Required |
|-----------|-------------|----------|
| `stream` | Enables or disables streaming mode for the output (true or false). | No |
| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
| `max_tokens` | Maximum number of tokens in the output for one turn. | No |
| `seed` | Seed for the random number generator. `-1` means no seed. | No |
| `dynatemp_range` | Dynamic temperature range. | No |
| `dynatemp_exponent` | Dynamic temperature exponent. | No |
| `top_k` | The number of most likely tokens to consider at each step. | No |
| `min_p` | Minimum probability threshold for token sampling. | No |
| `tfs_z` | The z-score used for Typical token sampling. | No |
| `typ_p` | The cumulative probability threshold used for Typical token sampling. | No |
| `repeat_last_n` | Number of previous tokens to penalize for repeating. | No |
| `repeat_penalty` | Penalty for repeating tokens. | No |
| `mirostat` | Enables or disables Mirostat sampling (true or false). | No |
| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
| `penalize_nl` | Penalizes newline tokens (true or false). | No |
| `ignore_eos` | Ignores the end-of-sequence token (true or false). | No |
| `n_probs` | Number of probabilities to return. | No |
| `min_keep` | Minimum number of tokens to keep. | No |
| `n_parallels` | Number of parallel streams to use, which allows multiple chat sessions at the same time. Note that `ctx_len` must scale with `n_parallels` (e.g., `n_parallels=1, ctx_len=2048` becomes `n_parallels=2, ctx_len=4096`). | No |
| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
### Model Load Parameters
```yaml
prompt_template: |+
  ...
ctx_len: 0
ngl: 33
engine: llama-cpp
```
Model load parameters include the options that control how Cortex.cpp runs the model:

| Parameter | Description | Required |
|-----------|-------------|----------|
| `ngl` | Number of model layers to offload to the GPU. | No |
| `ctx_len` | Context length (maximum number of tokens). | No |
| `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
| `engine` | The engine that runs the model; defaults to `llama-cpp` for local models in GGUF format. | Yes |
All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
## Runtime parameters
In addition to the predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters that override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).
### Model start params
Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:

| Parameter | Description | Required |
|-----------|-------------|----------|
| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. The default is `false`. | No |
| `mmproj` | Path to the mmproj GGUF model, used to support LLaVA models. | No |
| `llama_model_path` | Path to the LLM GGUF model. | No |
These parameters override the `model.yml` parameters when starting a model through the API.
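
For example, the request below starts a model and overrides a few of the `model.yml` values at runtime; omitted fields keep their `model.yml` defaults. The server address, model ID, and chosen values are illustrative assumptions rather than documented defaults.

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed local server address

# Start a model, overriding selected model.yml parameters at runtime.
# Fields left out of the request body fall back to the values in model.yml.
response = requests.post(
    f"{BASE_URL}/models/start",
    json={
        "model": "llama3.1:8b-gguf",  # hypothetical model ID
        "ctx_len": 4096,              # override context length
        "ngl": 33,                    # override number of GPU-offloaded layers
        "cache_type": "q8_0",         # override KV-cache data type
    },
    timeout=600,
)
response.raise_for_status()
print(response.json())
```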
### Chat completion API parameters
The API is accessible at the `/v1/chat/completions` endpoint and accepts all parameters of the chat completion API, as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).
With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
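
As a sketch, the request below combines standard chat completion fields with a few `model.yml` inference parameters from the table above; the server address and model ID are placeholder assumptions.

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed local server address

# Chat completion mixing standard chat-completion fields with model.yml
# inference parameters (llama-cpp engine).
response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3.1:8b-gguf",  # hypothetical model ID
        "messages": [{"role": "user", "content": "Give me one fun fact."}],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 256,
        "repeat_penalty": 1.1,        # model.yml inference parameter
        "stop": ["</s>"],             # stopping condition
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```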
:::info
You can download all the supported model formats from the following: