# Basic Demo

Chinese README (中文版README)

## Minimum Requirements

- Python: 3.10.12 or later
- OS: Linux with an NVIDIA GPU is recommended, to avoid installation issues with the xformers library.

GPU requirements are shown in the table below:

| Mode | 19B Series Model (GPU memory) | Remarks |
|------|-------------------------------|---------|
| BF16 inference | 42 GB | Tested with 2K dialogue text |
| Int4 inference | 16 GB | Tested with 2K dialogue text |
| BF16 LoRA tuning (with vision expert part) | 73 GB (8 × A100 80G GPUs, ZeRO-2) | Trained with 2K dialogue text |

Before running any code, make sure you have all dependencies installed. You can install all dependency packages with the following command:

```bash
pip install -r requirements.txt
```

## Using the CLI Demo

Run the following command to start a conversation on the command line. Note that the model must be loaded on a GPU:

```bash
CUDA_VISIBLE_DEVICES=0 python cli_demo.py
```
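For orientation, here is a minimal sketch of what a CLI demo like this typically does when loading the model; the model id and dtype below are assumptions for illustration, not the repository's exact code:

```python
# Minimal loading sketch; MODEL_PATH is an assumed Hugging Face id, replace with your checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"  # assumption: adjust to the checkpoint you use

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # BF16 inference, ~42 GB per the table above
    trust_remote_code=True,
).to("cuda").eval()
```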

If you want to use int4 (or int8) quantization, use:

```bash
CUDA_VISIBLE_DEVICES=0 python cli_demo.py --quant 4
```
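As a rough sketch, 4-bit loading with the transformers/bitsandbytes stack can look like the following; this is an assumption about how the `--quant` flag might be implemented, not the demo's verbatim code:

```python
# Sketch of 4-bit loading via bitsandbytes; the demo's --quant flag may be implemented differently.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # use load_in_8bit=True for int8 instead
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-llama3-chat-19B",        # assumed model id, as above
    quantization_config=quant_config,
    trust_remote_code=True,
).eval()
```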

If you have multiple GPUs, the following script launches the model and distributes its layers across the available GPUs:

```bash
python cli_demo_multi_gpus.py
```

In `cli_demo_multi_gpus.py`, we use the `infer_auto_device_map` function to automatically assign different layers of the model to different GPUs. You need to set the `max_memory` parameter to specify the maximum memory available on each GPU. For example, if you have two GPUs with 23GiB of memory each, you can set it like this:

```python
import torch
from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model=model,
    # 23GiB per GPU; adjust this value to your GPUs' actual memory
    max_memory={i: "23GiB" for i in range(torch.cuda.device_count())},
    # keep each decoder layer intact on a single GPU
    no_split_module_classes=["CogVLMDecoderLayer"],
)
```
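Once the map is computed, one way to apply it is accelerate's `dispatch_model`; this is a sketch under the assumption that `model` has already been instantiated (e.g. on CPU or the meta device):

```python
from accelerate import dispatch_model

# Move each layer to the GPU chosen by infer_auto_device_map above
model = dispatch_model(model, device_map=device_map)
```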

## Using the Web Demo

Run the following command to start a conversation in the web UI:

```bash
chainlit run web_demo.py
```

If you want to use int4 (or int8) quantization, you can launch it like this:

```bash
CUDA_VISIBLE_DEVICES=0 QUANT=4 chainlit run web_demo.py
```
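For orientation, a minimal Chainlit app has roughly this shape; this is a sketch, not the repository's `web_demo.py`, and only the `QUANT` variable name comes from the command above:

```python
# Minimal Chainlit sketch; web_demo.py in this repo is more elaborate.
import os
import chainlit as cl

QUANT = int(os.environ.get("QUANT", "0"))  # 4 or 8 selects quantized loading, 0 means BF16


@cl.on_message
async def on_message(message: cl.Message):
    # Run model inference on message.content here and send the reply back.
    reply = f"(model reply to: {message.content})"
    await cl.Message(content=reply).send()
```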

After starting the conversation, you will be able to interact with the model, as shown below:

*(Screenshot: web_demo)*

## Using the OpenAI API Format

We provide a simple example that starts the model as a server (optionally with `--quant 4`):

```bash
python openai_api_demo.py
```

After that, you can request a conversation with the model using the OpenAI API format.

Developers can call the model through the following code:

```bash
python openai_api_request.py
```
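For example, with the official openai Python client; the base URL and model name here are assumptions, so match them to whatever `openai_api_demo.py` actually serves:

```python
from openai import OpenAI

# Assumed local endpoint; adjust host/port to your openai_api_demo.py server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cogvlm2",  # illustrative name; use the model name the server registers
    messages=[{"role": "user", "content": "Hello! Please introduce yourself."}],
)
print(response.choices[0].message.content)
```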