Merge pull request #1615 from janhq/chore/model-run-docs-update
chore: add document for function calling
gabrielle-ong authored Nov 5, 2024
2 parents 206650f + 611901a commit c44fb59
Showing 5 changed files with 887 additions and 27 deletions.
14 changes: 10 additions & 4 deletions docs/docs/capabilities/models/index.mdx
@@ -7,17 +7,23 @@ description: The Model section overview
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
:::

Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.

Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we have plans to support TensorRT-LLM and ONNX engines in the future.

Remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.

When Cortex.cpp starts, it automatically launches an API server; this design is inspired by the Docker CLI. The server manages various model endpoints, which facilitate the following:
- **Model Operations**: Run and stop models.
- **Model Management**: Manage your local models.
:::info
Models are automatically loaded and unloaded by the API server when you call the [`/chat/completions`](/api-reference#tag/inference/post/v1/chat/completions) endpoint.
:::
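For example, a single chat completion request is enough to load a model that is not yet running. Below is a minimal sketch in Python; the port `39281` and the model id `llama3.1:8b-gguf` are assumptions, so adjust them to your local setup.

```python
import requests

# Assumptions: the Cortex API server listens on localhost:39281 and a model
# with the id "llama3.1:8b-gguf" has already been downloaded locally.
BASE_URL = "http://127.0.0.1:39281/v1"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3.1:8b-gguf",  # loaded on demand if not already running
        "messages": [{"role": "user", "content": "Hello! What can you do?"}],
        "stream": False,
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```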
## Model Formats
Cortex.cpp supports three model formats, and each format requires a specific engine to run:
- GGUF - runs with the `llama-cpp` engine
- ONNX - runs with the `onnxruntime` engine
- TensorRT-LLM - runs with the `tensorrt-llm` engine

:::info
For details on each format, see the [Model Formats](/docs/capabilities/models/model-yaml#model-formats) page.
128 changes: 111 additions & 17 deletions docs/docs/capabilities/models/model-yaml.mdx
Expand Up @@ -10,7 +10,7 @@ import TabItem from "@theme/TabItem";
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
:::

Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.

## Structure of `model.yaml`

@@ -39,6 +39,23 @@ temperature: 0.6 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Defaults to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
## END OPTIONAL
# END INFERENCE PARAMETERS

@@ -54,6 +71,7 @@ prompt_template: |+ # tokenizer.chat_template
## BEGIN OPTIONAL
ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
ngl: 33 # Undefined = loaded from model
engine: llama-cpp
## END OPTIONAL
# END MODEL LOAD PARAMETERS

@@ -84,23 +102,59 @@ stop:
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
stream: true # true or false
top_p: 0.9 # Ranges: 0 to 1
temperature: 0.6 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Defaults to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
```
Inference parameters define how the results will be produced. The supported parameters are listed below:

| **Parameter** | **Description** | **Required** |
|---------------|-----------------|--------------|
| `stream` | Enables or disables streaming mode for the output (true or false). | No |
| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
| `max_tokens` | Maximum number of tokens generated in a single turn. | No |
| `seed` | Seed for the random number generator. `-1` means no fixed seed (a random seed is used). | No |
| `dynatemp_range` | Dynamic temperature range. | No |
| `dynatemp_exponent` | Dynamic temperature exponent. | No |
| `top_k` | The number of most likely tokens to consider at each step. | No |
| `min_p` | Minimum probability threshold for token sampling. | No |
| `tfs_z` | The z parameter for tail-free sampling (`1` disables it). | No |
| `typ_p` | The probability threshold for locally typical sampling (`1` disables it). | No |
| `repeat_last_n` | Number of previous tokens to penalize for repeating. | No |
| `repeat_penalty` | Penalty for repeating tokens. | No |
| `mirostat` | Enables or disables Mirostat sampling (true or false). | No |
| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
| `penalize_nl` | Penalizes newline tokens (true or false). | No |
| `ignore_eos` | Ignores the end-of-sequence token (true or false). | No |
| `n_probs` | Number of probabilities to return. | No |
| `min_keep` | Minimum number of tokens to keep. | No |
| `n_parallels` | Number of parallel sequences to process, which allows multiple chat sessions to run at the same time. Note that `ctx_len` should be scaled with `n_parallels` (e.g., n_parallels=1, ctx_len=2048 → n_parallels=2, ctx_len=4096). | No |
| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |


### Model Load Parameters
@@ -114,14 +168,54 @@ prompt_template: |+
ctx_len: 0
ngl: 33
engine: llama-cpp
```
Model load parameters define how Cortex.cpp loads and runs the model. The supported parameters are listed below:
| **Parameter** | **Description** | **Required** |
|------------------------|--------------------------------------------------------------------------------------|--------------|
| `ngl` | Number of model layers to offload to the GPU. | No |
| `ctx_len` | Context length (maximum number of tokens). | No |
| `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
| `engine` | The engine that runs the model. Defaults to `llama-cpp` for local models in GGUF format. | Yes |

All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
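As a sketch, a model can be started through the API with nothing but its id, in which case every load parameter falls back to the values defined in `model.yml`. The port, model id, and request body shape below are assumptions based on the API reference, so adjust them to your setup.

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default Cortex API address

# Start the model with no overrides: ctx_len, ngl, engine, and the other load
# parameters are taken directly from the model's model.yml file.
resp = requests.post(f"{BASE_URL}/models/start", json={"model": "llama3.1:8b-gguf"})
resp.raise_for_status()
print(resp.json())
```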

## Runtime parameters

In addition to predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters to override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).

### Model start params

Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:

```
cache_enabled: bool
ngl: int
n_parallel: int
cache_type: string
ctx_len: int

## Support for vision models
mmproj: string
llama_model_path: string
model_path: string
```
| **Parameter** | **Description** | **Required** |
|------------------------|--------------------------------------------------------------------------------------|--------------|
| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. Default is `false`. | No |
| `mmproj` | Path to the mmproj GGUF file, used for vision (LLaVA) models. | No |
| `llama_model_path` | Path to the LLM GGUF model file. | No |
These parameters override the corresponding `model.yml` parameters when starting a model through the API.
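For instance, here is a hedged sketch of overriding a few load parameters at start time; the port, model id, and parameter values are illustrative assumptions:

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default Cortex API address

# Each field below overrides the matching entry in model.yml for this run only.
payload = {
    "model": "llama3.1:8b-gguf",  # hypothetical model id
    "ngl": 20,                    # offload fewer layers to the GPU
    "ctx_len": 4096,              # use a smaller context window
    "cache_type": "q8_0",         # quantized KV cache instead of the f16 default
}

resp = requests.post(f"{BASE_URL}/models/start", json=payload)
resp.raise_for_status()
```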
### Chat completion API parameters
The API is accessible at the `/v1/chat/completions` URL and accepts all parameters from the chat completion API as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).
With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
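A sketch of a single request that mixes standard chat-completion fields with llama-cpp sampling parameters; the port and model id are assumptions:

```python
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default Cortex API address

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3.1:8b-gguf",  # hypothetical model id
        "messages": [{"role": "user", "content": "Summarize what Cortex.cpp does."}],
        # Standard chat completion parameters
        "temperature": 0.6,
        "max_tokens": 256,
        "stream": False,
        # llama-cpp parameters from the model.yml inference section
        "top_k": 40,
        "repeat_penalty": 1.1,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```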
:::info
You can download all the supported model formats from the following: