Merge pull request #1615 from janhq/chore/model-run-docs-update
chore: add document for function calling
gabrielle-ong authored Nov 5, 2024
2 parents 206650f + 611901a commit c44fb59
Showing 5 changed files with 887 additions and 27 deletions.
14 changes: 10 additions & 4 deletions docs/docs/capabilities/models/index.mdx
@@ -7,17 +7,23 @@ description: The Model section overview
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
:::

Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.

Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we have plans to support TensorRT-LLM and ONNX engines in the future.

Remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.

When Cortex.cpp starts, it automatically launches an API server; this design is inspired by the Docker CLI. The server manages various model endpoints, which facilitate the following:
- **Model Operations**: Run and stop models.
- **Model Management**: Manage your local models.
:::info
Models are automatically loaded and unloaded by the API server when you call the [`/chat/completions`](/api-reference#tag/inference/post/v1/chat/completions) endpoint.
:::
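For example, a single chat completion request is enough to load a model that is not yet running. Below is a minimal sketch in Python; the port `39281` and the model id `llama3.1:8b-gguf` are assumptions, so adjust them to your local setup.

```python
import requests

# Assumptions: the Cortex API server listens on localhost:39281 and a model
# with the id "llama3.1:8b-gguf" has already been downloaded locally.
BASE_URL = "http://127.0.0.1:39281/v1"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3.1:8b-gguf",  # loaded on demand if not already running
        "messages": [{"role": "user", "content": "Hello! What can you do?"}],
        "stream": False,
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```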
## Model Formats
Cortex.cpp supports three model formats, and each format requires a specific engine to run:
- GGUF - runs with the `llama-cpp` engine
- ONNX - runs with the `onnxruntime` engine
- TensorRT-LLM - runs with the `tensorrt-llm` engine

:::info
For details on each format, see the [Model Formats](/docs/capabilities/models/model-yaml#model-formats) page.
128 changes: 111 additions & 17 deletions docs/docs/capabilities/models/model-yaml.mdx
Expand Up @@ -10,7 +10,7 @@ import TabItem from "@theme/TabItem";
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
:::

Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.

## Structure of `model.yaml`

@@ -39,6 +39,23 @@ temperature: 0.6 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Defaults to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
## END OPTIONAL
# END INFERENCE PARAMETERS

@@ -54,6 +71,7 @@ prompt_template: |+ # tokenizer.chat_template
## BEGIN OPTIONAL
ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
ngl: 33 # Undefined = loaded from model
engine: llama-cpp
## END OPTIONAL
# END MODEL LOAD PARAMETERS

@@ -84,23 +102,59 @@ stop:
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
stream: true # true or false
top_p: 0.9 # Ranges: 0 to 1
temperature: 0.6 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Defaults to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
```
Inference parameters define how the results will be produced. The supported parameters are listed below:

| **Parameter** | **Description** | **Required** |
|---------------|-----------------|--------------|
| `stream` | Enables or disables streaming mode for the output (true or false). | No |
| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
| `max_tokens` | Maximum number of tokens generated in a single turn. | No |
| `seed` | Seed for the random number generator. `-1` means no fixed seed (a random seed is used). | No |
| `dynatemp_range` | Dynamic temperature range. | No |
| `dynatemp_exponent` | Dynamic temperature exponent. | No |
| `top_k` | The number of most likely tokens to consider at each step. | No |
| `min_p` | Minimum probability threshold for token sampling. | No |
| `tfs_z` | The z parameter for tail-free sampling (`1` disables it). | No |
| `typ_p` | The probability threshold for locally typical sampling (`1` disables it). | No |
| `repeat_last_n` | Number of previous tokens to penalize for repeating. | No |
| `repeat_penalty` | Penalty for repeating tokens. | No |
| `mirostat` | Enables or disables Mirostat sampling (true or false). | No |
| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
| `penalize_nl` | Penalizes newline tokens (true or false). | No |
| `ignore_eos` | Ignores the end-of-sequence token (true or false). | No |
| `n_probs` | Number of probabilities to return. | No |
| `min_keep` | Minimum number of tokens to keep. | No |
| `n_parallels` | Number of parallel sequences to process, which allows multiple chat sessions to run at the same time. Note that `ctx_len` should be scaled with `n_parallels` (e.g., n_parallels=1, ctx_len=2048 → n_parallels=2, ctx_len=4096). | No |
| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |


### Model Load Parameters
@@ -114,14 +168,54 @@ prompt_template: |+
ctx_len: 0
ngl: 33
engine: llama-cpp
```
Model load parameters define how Cortex.cpp loads and runs the model. The supported parameters are listed below:
| **Parameter** | **Description** | **Required** |
|------------------------|--------------------------------------------------------------------------------------|--------------|
| `ngl` | Number of model layers to offload to the GPU. | No |
| `ctx_len` | Context length (maximum number of tokens). | No |
| `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
| `engine` | The engine that runs the model. Defaults to `llama-cpp` for local models in GGUF format. | Yes |

All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
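As a sketch, a model can be started through the API with nothing but its id, in which case every load parameter falls back to the values defined in `model.yml`. The port, model id, and request body shape below are assumptions based on the API reference, so adjust them to your setup.

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default Cortex API address

# Start the model with no overrides: ctx_len, ngl, engine, and the other load
# parameters are taken directly from the model's model.yml file.
resp = requests.post(f"{BASE_URL}/models/start", json={"model": "llama3.1:8b-gguf"})
resp.raise_for_status()
print(resp.json())
```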

## Runtime parameters

In addition to predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters to override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).

### Model start params

Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:

```
cache_enabled: bool
ngl: int
n_parallel: int
cache_type: string
ctx_len: int

## Support for vision models
mmproj: string
llama_model_path: string
model_path: string
```
| **Parameter** | **Description** | **Required** |
|------------------------|--------------------------------------------------------------------------------------|--------------|
| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. Default is `false`. | No |
| `mmproj` | Path to the mmproj GGUF file, used for vision (LLaVA) models. | No |
| `llama_model_path` | Path to the LLM GGUF model file. | No |
These parameters override the corresponding `model.yml` parameters when starting a model through the API.
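For instance, here is a hedged sketch of overriding a few load parameters at start time; the port, model id, and parameter values are illustrative assumptions:

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default Cortex API address

# Each field below overrides the matching entry in model.yml for this run only.
payload = {
    "model": "llama3.1:8b-gguf",  # hypothetical model id
    "ngl": 20,                    # offload fewer layers to the GPU
    "ctx_len": 4096,              # use a smaller context window
    "cache_type": "q8_0",         # quantized KV cache instead of the f16 default
}

resp = requests.post(f"{BASE_URL}/models/start", json=payload)
resp.raise_for_status()
```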
### Chat completion API parameters
The API is accessible at the `/v1/chat/completions` URL and accepts all parameters from the chat completion API as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).
With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
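A sketch of a single request that mixes standard chat-completion fields with llama-cpp sampling parameters; the port and model id are assumptions:

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default Cortex API address

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3.1:8b-gguf",  # hypothetical model id
        "messages": [{"role": "user", "content": "Summarize what Cortex.cpp does."}],
        # Standard chat completion parameters
        "temperature": 0.6,
        "max_tokens": 256,
        "stream": False,
        # llama-cpp parameters from the model.yml inference section
        "top_k": 40,
        "repeat_penalty": 1.1,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```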
:::info
You can download all the supported model formats from the following: