
Commit dfd8f8b

Update docs
1 parent 30f286e commit dfd8f8b

5 files changed: +121 -15 lines changed


README.md

Lines changed: 10 additions & 2 deletions
@@ -10,6 +10,7 @@ In this repo, you'll find code for:

# Contents

+- [Try it out](#try-it-out)
- [Getting Started](#getting-started)
* [Requirements](#requirements)
* [Chatting with Pythia-Chat-Base-7B](#chatting-with-pythia-chat-base-7b)

@@ -28,9 +29,16 @@ In this repo, you'll find code for:
- [Citing OpenChatKit](#citing-openchatkit)
- [Acknowledgements](#acknowledgements)

+# Try it out
+- [OpenChatKit Feedback App](https://huggingface.co/spaces/togethercomputer/OpenChatKit)
+Feedback helps improve the bot and open-source AI research.
+
+- [Run it on Google Colab](inference/README.md#running-on-google-colab) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/OpenChatKit/blob/main/inference/example/example.ipynb)
+- Continue reading to run it on your own!
+
# Getting Started

-In this tutorial, you will download Pythia-Chat-Base-7B, an instruction-tuned language model, and run some some inference requests against it using a command-line tool.
+In this tutorial, you will download Pythia-Chat-Base-7B, an instruction-tuned language model, and run some inference requests against it using a command-line tool.

Pythia-Chat-Base-7B is a 7B-parameter fine-tuned variant of Pythia-6.9B-deduped from Eleuther AI. Pre-trained weights for this model are available on Huggingface as [togethercomputer/Pythia-Chat-Base-7B](https://huggingface.co/togethercomputer/Pythia-Chat-Base-7B) under an Apache 2.0 license.

@@ -73,7 +81,7 @@ conda activate OpenChatKit

## Chatting with Pythia-Chat-Base-7B

-To help you try the model, [`inference/bot.py`](inference/bot.py) is a simple command-line test harness that provides a shell inferface enabling you to chat with the model. Simply enter text at the prompt and the model replies. The test harness also maintains conversation history to provide the model with context.
+To help you try the model, [`inference/bot.py`](inference/bot.py) is a simple command-line test harness that provides a shell interface enabling you to chat with the model. Simply enter text at the prompt and the model replies. The test harness also maintains conversation history to provide the model with context.


Start the bot by calling `bot.py` from the root for the repo.
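
For a concrete picture of what such a harness does, here is a minimal sketch of a chat loop that keeps conversation history as context. It is illustrative only, not the actual `inference/bot.py`: it assumes the Hugging Face `transformers` API, the sampling defaults documented in the inference arguments further down this page, and a `<human>:`/`<bot>:` turn format that is an assumption rather than something specified here.

```python
# Minimal sketch of a chat-with-history loop -- illustrative, not inference/bot.py.
# Assumes a GPU, `pip install transformers accelerate`, and a <human>:/<bot>: turn format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "togethercomputer/Pythia-Chat-Base-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

history = ""
while True:
    user = input(">>> ")
    if user.strip() == "/quit":          # documented exit command
        break
    history += f"<human>: {user}\n<bot>:"
    inputs = tokenizer(history, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=128,              # --max-tokens default
        do_sample=True,                  # --sample default
        temperature=0.6,                 # --temperature default
        top_k=40,                        # --top-k default
        pad_token_id=tokenizer.eos_token_id,
    )
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    reply = reply.split("<human>:")[0].strip()   # stop at the next turn marker
    print(reply)
    history += f" {reply}\n"
```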

docs/GPT-NeoXT-Chat-Base-20B.md

Lines changed: 1 addition & 1 deletion
@@ -184,7 +184,7 @@ Hello human.

Commands are prefixed with a `/`, and the `/quit` command exits.

-Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.
+Please see [the inference README](GPT-NeoXT-Chat-Base-Inference.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.

# Monitoring

docs/GPT-NeoXT-Chat-Base-Inference.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# GPT-NeoXT-Chat-Base-20B Inference

## Contents

- [Arguments](#arguments)
- [Hardware requirements](#hardware-requirements-for-inference)
- [Running on multiple GPUs](#running-on-multiple-gpus)
- [Running on specific GPUs](#running-on-specific-gpus)
- [Running on consumer hardware](#running-on-consumer-hardware)

## Arguments
- `--gpu-id`: primary GPU device to load inputs onto for inference. Default: `0`
- `--model`: name/path of the model. Default = `../huggingface_models/GPT-NeoXT-Chat-Base-20B`
- `--max-tokens`: the maximum number of tokens to generate. Default: `128`
- `--sample`: indicates whether to sample. Default: `True`
- `--temperature`: temperature for the LM. Default: `0.6`
- `--top-k`: top-k for the LM. Default: `40`
- `--retrieval`: augment queries with context from the retrieval index. Default `False`
- `-g` `--gpu-vram`: GPU ID and VRAM to allocate to loading the model, separated by a `:` in the format `ID:RAM` where ID is the CUDA ID and RAM is in GiB. `gpu-id` must be present in this list to avoid errors. Accepts multiple values, for example, `-g ID_0:RAM_0 ID_1:RAM_1 ID_N:RAM_N`
- `-r` `--cpu-ram`: CPU RAM overflow allocation for loading the model. Optional, and only used if the model does not fit onto the GPUs given.
- `--load-in-8bit`: load model in 8-bit. Requires `pip install bitsandbytes`. No effect when used with `-g`.
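
To make the `ID:RAM` format concrete, here is a hypothetical helper (not taken from `bot.py`) that turns `-g`/`-r` style values into the kind of per-device memory budget used by Hugging Face accelerate-style loading:

```python
# Hypothetical helper for illustration only -- bot.py may implement this differently.
def build_max_memory(gpu_vram, cpu_ram=None):
    """gpu_vram: specs like ["0:10", "1:12"] (CUDA_ID:GiB); cpu_ram: GiB of RAM or None."""
    max_memory = {}
    for spec in gpu_vram:
        device_id, gib = spec.split(":")
        max_memory[int(device_id)] = f"{int(gib)}GiB"
    if cpu_ram is not None:
        max_memory["cpu"] = f"{int(cpu_ram)}GiB"   # overflow budget requested with -r
    return max_memory

print(build_max_memory(["0:10", "1:12"], cpu_ram=20))
# {0: '10GiB', 1: '12GiB', 'cpu': '20GiB'}
```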

## Hardware requirements for inference
The GPT-NeoXT-Chat-Base-20B model requires at least 41GB of free VRAM. Used VRAM also goes up by ~100-200 MB per prompt.

- A **minimum of 80 GB is recommended**

- A **minimum of 48 GB in VRAM is recommended** for fast responses.

If you'd like to run inference on a GPU with <48 GB VRAM, refer to this section on [running on consumer hardware](#running-on-consumer-hardware).

By default, inference uses only CUDA Device 0.

**NOTE: Inference currently requires at least 1x GPU.**

## Running on multiple GPUs
Add the argument

```-g ID0:MAX_VRAM ID1:MAX_VRAM ID2:MAX_VRAM ...```

where IDx is the CUDA ID of the device and MAX_VRAM is the amount of VRAM you'd like to allocate to the device.

For example, if you are running this on 4x 48 GB GPUs and want to distribute the model across all devices, add ```-g 0:10 1:12 2:12 3:12```. In this example, the first device gets loaded to a max of 10 GiB while the others are loaded with a max of 12 GiB.

How it works: The model fills up the max available VRAM on the first device passed and then overflows into the next until the whole model is loaded.

**IMPORTANT: This MAX_VRAM is only for loading the model. It does not account for the additional inputs that are added to the device. It is recommended to set the MAX_VRAM to be at least 1 or 2 GiB less than the max available VRAM on each device, and at least 3 GiB less than the max available VRAM on the primary device (set by `gpu-id`, default `0`).**

**Decrease MAX_VRAM if you run into CUDA OOM. This happens because each input takes up additional space on the device.**

**NOTE: Total MAX_VRAM across all devices must be > size of the model in GB. If not, `bot.py` automatically offloads the rest of the model to RAM and disk. It will use up all available RAM. To allocate a specified amount of RAM, [refer to this section on running on consumer hardware](#running-on-consumer-hardware).**
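
The fill-then-overflow behavior described above matches how Hugging Face accelerate places weights when given per-device memory caps. The sketch below is an illustration under that assumption, not the repo's actual loading code; it uses the documented `--model` default path.

```python
# Illustrative sketch: roughly what `-g 0:10 1:12 2:12 3:12` corresponds to when loading
# with transformers/accelerate. Not the actual bot.py implementation.
import torch
from transformers import AutoModelForCausalLM

max_memory = {0: "10GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB"}  # per-GPU caps for loading

model = AutoModelForCausalLM.from_pretrained(
    "../huggingface_models/GPT-NeoXT-Chat-Base-20B",  # documented --model default
    device_map="auto",          # fill device 0 first, then overflow into 1, 2, ...
    max_memory=max_memory,
    torch_dtype=torch.float16,
)
print(model.hf_device_map)      # shows which layers landed on which device
```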

## Running on specific GPUs
If you have multiple GPUs but would like to use only specific devices, [use the same steps as in this section on running on multiple GPUs](#running-on-multiple-gpus) and specify only the devices you'd like to use.

Also, if needed, add the argument `--gpu-id ID` where ID is the CUDA ID of the device you'd like to make the primary device. NOTE: The device specified in `--gpu-id` must be present as one of the IDs in the `-g` argument to avoid errors.

- **Example #1**: to run inference on devices 2 and 5 with a max of 25 GiB on each, and make device 5 the primary device, add: `--gpu-id 5 -g 2:25 5:25`. In this example, not adding `--gpu-id 5` will give you an error.
- **Example #2**: to run inference on devices 0 and 3 with a max of 10 GiB on 0 and 40 GiB on 3, with device 0 as the primary device, add: `-g 0:10 3:40`. In this example, `--gpu-id` is not required because device 0 is specified in `-g`.
- **Example #3**: to run inference only on device 1 with a max of 75 GiB, add: `--gpu-id 1 -g 1:75`
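
Read in terms of the same memory map, Example #1 names only devices 2 and 5 and places the prompt tensors on the primary device. The sketch below assumes the `transformers`/`accelerate` loading path and the documented default model path; it is not `bot.py` itself.

```python
# Rough sketch of Example #1 (`--gpu-id 5 -g 2:25 5:25`) -- illustration, not bot.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "../huggingface_models/GPT-NeoXT-Chat-Base-20B"   # documented --model default
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",
    max_memory={2: "25GiB", 5: "25GiB"},   # only devices 2 and 5 receive weights
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
inputs = tokenizer("Hello!", return_tensors="pt").to("cuda:5")  # --gpu-id 5: primary device
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```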


## Running on consumer hardware
If you have multiple GPUs, each with <48 GB VRAM, [the steps mentioned in this section on running on multiple GPUs](#running-on-multiple-gpus) still apply, unless any of the following applies:
- Running on just 1x GPU with <48 GB VRAM,
- <48 GB VRAM combined across multiple GPUs
- Running into Out-Of-Memory (OOM) issues

In that case, add the flag `-r CPU_RAM` where CPU_RAM is the maximum amount of RAM you'd like to allocate to loading the model. Note: This significantly reduces inference speeds.

The model will load without specifying `-r`; however, this is not recommended because it will allocate all available RAM to the model. To limit how much RAM the model can use, add `-r`.

If the total VRAM + CPU_RAM < the size of the model in GiB, the rest of the model will be offloaded to a folder "offload" at the root of the directory. Note: This significantly reduces inference speeds.

- Example: `-g 0:12 -r 20` will first load up to 12 GiB of the model into CUDA device 0, then load up to 20 GiB into RAM, and load the rest into the "offload" directory.

How it works:
- https://github.com/huggingface/blog/blob/main/accelerate-large-models.md
- https://www.youtube.com/embed/MWCSGj9jEAo
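
The accelerate write-up linked above explains the GPU, then CPU RAM, then disk offload order. As a sketch of roughly what `-g 0:12 -r 20` requests, again assuming the `transformers`/`accelerate` loading path rather than the repo's exact code:

```python
# Illustration of the GPU -> CPU RAM -> disk "offload" order; not the actual bot.py code.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "../huggingface_models/GPT-NeoXT-Chat-Base-20B",   # documented --model default
    device_map="auto",
    max_memory={0: "12GiB", "cpu": "20GiB"},  # roughly the budget `-g 0:12 -r 20` describes
    offload_folder="offload",                 # weights that fit nowhere else are read from disk
    torch_dtype=torch.float16,
)
```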

inference/README.md

Lines changed: 30 additions & 11 deletions
@@ -1,25 +1,37 @@
# OpenChatKit Inference
This directory contains code for OpenChatKit's inference.

+## Contents
+
+- [Arguments](#arguments)
+- [Hardware requirements](#hardware-requirements-for-inference)
+- [Running on multiple GPUs](#running-on-multiple-gpus)
+- [Running on specific GPUs](#running-on-specific-gpus)
+- [Running on consumer hardware](#running-on-consumer-hardware)
+- [Running on Google Colab](#running-on-google-colab)
+
## Arguments
-- `--gpu-id`: Primary GPU device to load inputs onto for inference. Default: `0`
-- `--model`: name/path of the model. Default = `../huggingface_models/GPT-NeoXT-Chat-Base-20B`
+- `--gpu-id`: primary GPU device to load inputs onto for inference. Default: `0`
+- `--model`: name/path of the model. Default = `../huggingface_models/Pythia-Chat-Base-7B`
- `--max-tokens`: the maximum number of tokens to generate. Default: `128`
- `--sample`: indicates whether to sample. Default: `True`
- `--temperature`: temperature for the LM. Default: `0.6`
- `--top-k`: top-k for the LM. Default: `40`
- `--retrieval`: augment queries with context from the retrieval index. Default `False`
- `-g` `--gpu-vram`: GPU ID and VRAM to allocate to loading the model, separated by a `:` in the format `ID:RAM` where ID is the CUDA ID and RAM is in GiB. `gpu-id` must be present in this list to avoid errors. Accepts multiple values, for example, `-g ID_0:RAM_0 ID_1:RAM_1 ID_N:RAM_N`
- `-r` `--cpu-ram`: CPU RAM overflow allocation for loading the model. Optional, and only used if the model does not fit onto the GPUs given.
+- `--load-in-8bit`: load model in 8-bit. Requires `pip install bitsandbytes`. No effect when used with `-g`.

## Hardware requirements for inference
-The GPT-NeoXT-Chat-Base-20B model requires at least 41GB of free VRAM. Used VRAM also goes up by ~100-200 MB per prompt.
+The Pythia-Chat-Base-7B model requires:
+
+- **18 GB of GPU memory for the base model**

-- A **minimum of 80 GB is recommended**
+- **9 GB of GPU memory for the 8-bit quantized model**

-- A **minimum of 48 GB in VRAM is recommended** for fast responses.
+Used VRAM also goes up by ~100-200 MB per prompt.

-If you'd like to run inference on a GPU with <48 GB VRAM, refer to this section on [running on consumer hardware](#running-on-consumer-hardware).
+If you'd like to run inference on a GPU with less VRAM than the size of the model, refer to this section on [running on consumer hardware](#running-on-consumer-hardware).

By default, inference uses only CUDA Device 0.
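
The 8-bit figure above comes from quantizing the weights at load time. A minimal sketch of what the `--load-in-8bit` option corresponds to, assuming the `transformers` + `bitsandbytes` route (the repo's own wiring may differ):

```python
# Minimal 8-bit loading sketch (requires `pip install bitsandbytes accelerate`).
# Illustrative only; bot.py's --load-in-8bit flag may wire this up differently.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "togethercomputer/Pythia-Chat-Base-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",   # required so the quantized weights are placed on the GPU
    load_in_8bit=True,   # int8 weights: roughly 9 GB instead of ~18 GB
)
```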

@@ -32,7 +44,7 @@ Add the argument

where IDx is the CUDA ID of the device and MAX_VRAM is the amount of VRAM you'd like to allocate to the device.

-For example, if you are running this on 4x 48 GB GPUs and want to distribute the model across all devices, add ```-g 0:10 1:12 2:12 3:12 4:12```. In this example, the first device gets loaded to a max of 10 GiB while the others are loaded with a max of 12 GiB.
+For example, if you are running this on 4x 8 GB GPUs and want to distribute the model across all devices, add ```-g 0:4 1:4 2:6 3:6```. In this example, the first two devices get loaded to a max of 4 GiB while the other two are loaded with a max of 6 GiB.

How it works: The model fills up the max available VRAM on the first device passed and then overflows into the next until the whole model is loaded.

@@ -53,9 +65,9 @@ Also, if needed, add the argument `--gpu-id ID` where ID is the CUDA ID of the d


## Running on consumer hardware
-If you have multiple GPUs, each <48 GB VRAM, [the steps mentioned in this section on running on multiple GPUs](#running-on-multiple-gpus) still apply, unless, any of these apply:
-- Running on just 1x GPU with <48 GB VRAM,
-- <48 GB VRAM combined across multiple GPUs
+If you have multiple GPUs, [the steps mentioned in this section on running on multiple GPUs](#running-on-multiple-gpus) still apply unless any of the following applies:
+- Running on just 1x GPU with VRAM < size of the model,
+- Less combined VRAM across multiple GPUs than the size of the model,
- Running into Out-Of-Memory (OOM) issues

In which case, add the flag `-r CPU_RAM` where CPU_RAM is the maximum amount of RAM you'd like to allocate to loading model. Note: This significantly reduces inference speeds.
@@ -64,8 +76,15 @@ The model will load without specifying `-r`, however, it is not recommended beca

If the total VRAM + CPU_RAM < the size of the model in GiB, the rest of the model will be offloaded to a folder "offload" at the root of the directory. Note: This significantly reduces inference speeds.

-- Example: `-g 0:12 -r 20` will first load up to 12 GiB of the model into the CUDA device 0, then load up to 20 GiB into RAM, and load the rest into the "offload" directory.
+- Example: `-g 0:3 -r 4` will first load up to 3 GiB of the model into the CUDA device 0, then load up to 4 GiB into RAM, and load the rest into the "offload" directory.

How it works:
- https://github.com/huggingface/blog/blob/main/accelerate-large-models.md
- https://www.youtube.com/embed/MWCSGj9jEAo
+
+## Running on Google Colab
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/OpenChatKit/blob/main/inference/example/example.ipynb)
+
+In the [example notebook](example/example.ipynb), you will find code to run the Pythia-Chat-Base-7B 8-bit quantized model. This is recommended for the free version of Colab. If you'd like to disable quantization, simply remove the `--load-in-8bit` flag from the last cell.
+
+Or, simply click on the "Open In Colab" badge to run the example notebook.

inference/example.ipynb renamed to inference/example/example.ipynb

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
"colab_type": "text"
},
"source": [
-"<a href=\"https://colab.research.google.com/github/togethercomputer/OpenChatKit/blob/main/inference/example.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+"<a href=\"https://colab.research.google.com/github/togethercomputer/OpenChatKit/blob/main/inference/example/example.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
