Support streaming. Update docs. (#133)
* update README: both inference tools now support streaming. Bump submodules too.

* also update README_CN
tybalex authored Jul 15, 2024
1 parent 330a0ea commit 447803c
Showing 5 changed files with 25 additions and 15 deletions.
26 changes: 22 additions & 4 deletions README_CN.md
@@ -28,12 +28,30 @@ Rubra enhances today's most popular open-weight large language models (LLMs) with tool…

You can try all of the models above for free on our [Huggingface Spaces](https://huggingface.co/spaces/sanjay920/rubra-v0.1-dev), no login required!

## Deploy and Run Rubra Models Locally

We extended the following deployment tools to run Rubra models locally behind an OpenAI-style API:
See our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
We extended the following deployment tools to run Rubra models locally with OpenAI's tool-calling format (see the request sketch after the list below):

- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
- [vLLM](https://github.com/rubra-ai/vllm)
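
To make the request shape concrete, here is a minimal sketch using the OpenAI Python client against a locally served model. The `base_url` and `api_key` are placeholder assumptions; the `rubra-model` name and `get_current_weather` tool mirror the examples in our inference docs.

```python
# Hedged sketch: an OpenAI-style tool-calling request to a local Rubra server.
# base_url/api_key are assumptions; point them at your llama.cpp or vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="rubra-model",  # model name as served; mirrors the docs' examples
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```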

**Note**: The Llama3 models, including the 8B and 70B GGUF versions, show increased perplexity and degraded function-calling performance after quantization. We recommend serving them with vLLM, or at fp16 or higher precision (bf16, fp32).
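
To illustrate the recommendation, a hedged sketch of loading the 8B model at bf16 precision with vLLM's offline Python API, assuming our vLLM fork keeps upstream's `LLM` constructor:

```python
# Hedged sketch: load Rubra Llama-3 8B in vLLM at bf16, avoiding the
# quantized GGUF weights whose function-calling quality degrades.
from vllm import LLM, SamplingParams

llm = LLM(model="rubra-ai/Meta-Llama-3-8B-Instruct", dtype="bfloat16")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```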

## Benchmarks

See the full benchmark results for Rubra models and other models: https://docs.rubra.ai/benchmark

| Model | Function Calling | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-bench |
|-----------------------------------------------------------|------------------|---------------|---------------|----------------------|--------------------|----------|
| [**Rubra Llama-3 70B Instruct**](https://huggingface.co/rubra-ai/Meta-Llama-3-70B-Instruct) | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
| [**Rubra Llama-3 8B Instruct**](https://huggingface.co/rubra-ai/Meta-Llama-3-8B-Instruct) | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
| [**Rubra Qwen2 7B Instruct**](https://huggingface.co/rubra-ai/Qwen2-7B-Instruct) | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
| [**Rubra Mistral 7B Instruct v0.3**](https://huggingface.co/rubra-ai/Mistral-7B-Instruct-v0.3) | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
| [**Rubra Phi-3 Mini 128k Instruct**](https://huggingface.co/rubra-ai/Phi-3-mini-128k-instruct) | 70.00% | 67.87 | 29.69 | 79.45 | 30.80 | 8.21 |
| [**Rubra Mistral 7B Instruct v0.2**](https://huggingface.co/rubra-ai/Mistral-7B-Instruct-v0.2) | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
| [**Rubra Gemma-1.1 2B Instruct**](https://huggingface.co/rubra-ai/gemma-1.1-2b-it) | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |

- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [vllm](https://github.com/vllm-project/vllm)

## Contributing

6 changes: 1 addition & 5 deletions docs/docs/inference/llamacpp.mdx
@@ -137,11 +137,7 @@ The output should look like this:
```
ChatCompletion(id='chatcmpl-EmHd8kai4DVwBUOyim054GmfcyUbjiLf', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='e885974b', function=Function(arguments='{"location":"Boston"}', name='get_current_weather'), type='function')]))], created=1719528056, model='rubra-model', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=29, prompt_tokens=241, total_tokens=270))
```

That's it! For more function calling examples, you can check out the [test_llamacpp.ipynb](https://github.com/rubra-ai/tools.cpp/blob/010f4d282e86babe216af6e037ab10bf078415e7/test_llamacpp.ipynb) notebook.

:::info
Make sure you turn `stream` off when making API calls to the server, as the streaming feature is not supported yet. We will support streaming soon.
:::
That's it! For more function calling examples, you can check out the [test_llamacpp.ipynb](https://github.com/rubra-ai/tools.cpp/blob/010f4d282e86babe216af6e037ab10bf078415e7/test_llamacpp.ipynb) or [test_llamacpp_streaming.ipynb](https://github.com/rubra-ai/tools.cpp/blob/master/test_llamacpp_streaming.ipynb) notebooks.
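
With streaming now supported, the same request can be made with `stream=True`. A minimal sketch, assuming an OpenAI-compatible tools.cpp server at a placeholder `base_url` (adjust host and port to your deployment):

```python
# Hedged sketch: stream a tool call from a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{"type": "function", "function": {
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]},
}}]

stream = client.chat.completions.create(
    model="rubra-model",
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:  # tool-call name/arguments arrive as incremental deltas
        for tc in delta.tool_calls:
            if tc.function and tc.function.arguments:
                print(tc.function.arguments, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```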

## Choosing a Chat Template for Different Models

4 changes: 0 additions & 4 deletions docs/docs/inference/vllm.md
@@ -105,7 +105,3 @@ The output should look like this:
```
ChatCompletion(id='chatcmpl-EmHd8kai4DVwBUOyim054GmfcyUbjiLf', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='e885974b', function=Function(arguments='{"location":"Boston"}', name='get_current_weather'), type='function')]))], created=1719528056, model='rubra-model', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=29, prompt_tokens=241, total_tokens=270))
```
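
To act on a response like this, the client parses and dispatches the tool call. A hedged sketch, continuing from the `response` returned by `client.chat.completions.create(...)` above; `get_current_weather` here is a hypothetical stand-in implementation:

```python
# Hedged sketch: dispatch the tool call from the ChatCompletion shown above.
import json

def get_current_weather(location: str) -> str:
    # Stand-in implementation; replace with a real weather lookup.
    return f"It is 72F and sunny in {location}."

tool_call = response.choices[0].message.tool_calls[0]  # response: ChatCompletion from above
if tool_call.function.name == "get_current_weather":
    args = json.loads(tool_call.function.arguments)
    print(get_current_weather(**args))
```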

:::info
Make sure you turn `stream` off when making API calls to the server, as the streaming feature is not supported yet. We will support streaming soon.
:::
2 changes: 1 addition & 1 deletion tools.cpp
2 changes: 1 addition & 1 deletion vllm
