Update docs for website launch
hahuyhoang411 committed Nov 17, 2023
1 parent e5548d5 commit a132c44
Showing 11 changed files with 383 additions and 203 deletions.
154 changes: 88 additions & 66 deletions docs/docs/features/chat.md
title: Chat Completion
---

The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM) through an interface that is fully compatible with the OpenAI API.

## Single Request Example

To send a single query to your chosen LLM, use a request like the following:

```bash title="Single Turn"
<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```bash title="Nitro"
curl http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'
```
</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```bash title="OpenAI"
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'
```
</div>

These requests send a simple greeting to the model; note that the Nitro and OpenAI calls differ only in the endpoint and the API key.

## Dialog Request Example

For ongoing conversations or multiple queries, the dialog request feature is ideal. Here's how to structure a multi-turn conversation in which the user and the assistant build on each other's responses:
<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```bash title="Multi-turn"
```bash title="Nitro"
curl http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      },
      {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      },
      {
        "role": "user",
        "content": "Where was it played?"
      }
    ]
  }'
```

</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```bash title="OpenAI dialog"
```bash title="OpenAI"
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      },
      {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      },
      {
        "role": "user",
        "content": "Where was it played?"
      }
    ]
  }'
```
</div>

## Chat Completion Response

Below are examples of responses from both the Nitro server and OpenAI:

<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```js title="Nitro"
{
  "choices": [
    {
      "finish_reason": null,
      "index": 0,
      "message": {
        "content": "Hello, how may I assist you this evening?",
        "role": "assistant"
      }
    }
  ],
  "created": 1700215278,
  "id": "sofpJrnBGUnchO8QhA0s",
  "model": "_",
  "object": "chat.completion",
  "system_fingerprint": "_",
  "usage": {
    "completion_tokens": 13,
    "prompt_tokens": 90,
    "total_tokens": 103
  }
}
```
</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```js title="OpenAI"
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello there, how may I assist you today?"
      }
    }
  ],
  "created": 1677652288,
  "id": "chatcmpl-123",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "system_fingerprint": "fp_44709d6fcb",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 9,
    "total_tokens": 21
  }
}
```
</div>
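Because the request format mirrors OpenAI's, other OpenAI-style parameters can be passed the same way. The following is a minimal sketch, assuming your Nitro build honors the OpenAI-style `stream` flag; remove it to get a single JSON body instead of a streamed response:

```bash title="Streaming (sketch)"
# Hypothetical example: ask for a streamed response, OpenAI-style.
curl http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'
```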


The chat completion feature in Nitro showcases compatibility with OpenAI, making the transition between using OpenAI and local AI models more straightforward. For further details and advanced usage, please refer to the [API reference](https://nitro.jan.ai/api).
26 changes: 20 additions & 6 deletions docs/docs/features/cont-batch.md
title: Continuous Batching
---

## What is Continuous Batching?

Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.

## Why Continuous Batching?

Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.

## Benefits of Continuous Batching

- **Increased Throughput:** Improvement over traditional batching methods.
- **Reduced Latency:** Lower p50 latency, leading to faster response times.
- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.

## How to Use Continuous Batching
Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.

```bash title="Enable Batching" {6,7}
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "cont_batching": true,
    "n_parallel": 4
  }'
```

For optimal performance, ensure that the `n_parallel` value matches the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation. In this example, `n_parallel = 4` means the server can process up to 4 requests concurrently.
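As a sketch of how the two settings line up (assuming the `nitro [thread_num] [host] [port]` launch syntax described in the Multithreading guide):

```bash title="Match n_parallel to thread_num (sketch)"
# Hypothetical pairing: start Nitro with 4 threads so that the
# n_parallel = 4 setting in the loadmodel call above has a thread per slot.
./nitro 4 127.0.0.1 3928   # nitro [thread_num] [host] [port]
```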

## Benchmark and Compare

To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
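As a rough starting point, here is a sketch, not a rigorous benchmark; it assumes a model is already loaded and uses `xargs` for concurrency. Time a burst of parallel requests, then reload the model with `n_parallel = 1` and compare the wall-clock times:

```bash title="Quick throughput check (sketch)"
# Hypothetical benchmark: send 16 identical requests, 4 at a time,
# and time the whole run. Repeat with n_parallel = 1 to compare.
time seq 16 | xargs -P 4 -I{} \
  curl -s http://localhost:3928/inferences/llamacpp/chat_completion \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}' \
    -o /dev/null
```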
81 changes: 50 additions & 31 deletions docs/docs/features/embed.md
title: Embedding
---

## What are embeddings?

Embeddings are lists of numbers (floats). To find how similar two embeddings are, we measure the [distance](https://en.wikipedia.org/wiki/Cosine_similarity) between them. Shorter distances mean they're more similar; longer distances mean less similarity.
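For illustration, here is a minimal sketch of that comparison, assuming a running Nitro server with an embedding-enabled model (set up as described below) and `jq` installed; the helper name `emb` and the sample inputs are made up for the example:

```bash title="Cosine similarity (sketch)"
# Hypothetical helper: fetch the embedding vector for a string.
emb() {
  curl -s http://localhost:3928/inferences/llamacpp/embedding \
    -H 'Content-Type: application/json' \
    -d "{\"input\": \"$1\", \"model\": \"Llama-2-7B-Chat-GGUF\", \"encoding_format\": \"float\"}" \
  | jq '.data[0].embedding'
}

a=$(emb "king"); b=$(emb "queen")

# cosine(a, b) = dot(a, b) / (|a| * |b|); values near 1 mean "similar".
jq -n --argjson a "$a" --argjson b "$b" '
  ([range($a | length)] | map($a[.] * $b[.]) | add) as $dot
  | ($a | map(. * .) | add | sqrt) as $na
  | ($b | map(. * .) | add | sqrt) as $nb
  | $dot / ($na * $nb)'
```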

## Activating Embedding Feature

To utilize the embedding feature, include the JSON parameter `"embedding": true` in your [load model request](features/load-unload.md). This action enables Nitro to process inferences with embedding capabilities.
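For example (a sketch; the model path and `ctx_len` value are placeholders):

```bash title="Load model with embedding (sketch)"
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "embedding": true
  }'
```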

### Embedding Request

Here’s an example showing how to get the embedding result from the model:

```bash title="Embedding" {1}
<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```bash title="Nitro" {1}
curl http://localhost:3928/inferences/llamacpp/embedding \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Hello",
    "model": "Llama-2-7B-Chat-GGUF",
    "encoding_format": "float"
  }'
```
</div>
<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```bash title="OpenAI request" {1}
curl https://api.openai.com/v1/embeddings \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": "Hello",
"model": "text-embedding-ada-002",
"encoding_format": "float"
}'
```
```js title"Nitro embeding output"
</div>

## Embedding Response

The example responses below use output from the model [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) loaded into the Nitro server.

<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```js title="Nitro"
{
  "data": [
    {
      "embedding": [
        -0.9874749,
        0.2965493,
        ...
        -0.253227
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "_",
  "object": "list",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
```
</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```js title="Example Output from OAI"
```js title="OpenAI"
{
  "embedding": [
    0.0023064255,
    -0.009327292,
    ... (1536 floats total for ada-002)
    -0.0028842222
  ],
  "index": 0,
  "object": "embedding"
}
```
</div>


The embedding feature in Nitro demonstrates a high level of compatibility with OpenAI, simplifying the transition between using OpenAI and local AI models. For more detailed information and advanced use cases, refer to the comprehensive [API Reference](https://nitro.jan.ai/api).
