FAQ.md

Frequently asked questions

API key access

h2oGPT API key access for API and UI and persistence of state via login (auth enabled or not)

python generate.py --base_model=h2oai/h2ogpt-4096-llama2-70b-chat --auth_filename=auth.json --enforce_h2ogpt_api_key=True --enforce_h2ogpt_ui_key=True --h2ogpt_api_keys="['<API_KEY>']"

for some API key <API_KEY> and some auth file auth.json where h2oGPT will store login and persistence information. This enforces keyed access for both API and UI; one can also enable either one alone. For public cases (Hugging Face or GPT_H2O_AI env set), enforcing the API key is the default.

One can also use a json key file:

python generate.py --base_model=h2oai/h2ogpt-4096-llama2-70b-chat --auth_filename=auth.json --enforce_h2ogpt_api_key=True --enforce_h2ogpt_ui_key=True --h2ogpt_api_keys="h2ogpt_api_keys.json"

for some file h2ogpt_api_keys.json which is a JSON file that is a list of strings of keys allowed.
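
Since the key file is just a JSON list of strings, it can be generated with Python's standard library. This is a minimal sketch; the helper name write_api_keys and the key values are placeholders, not part of h2oGPT:

```python
import json


def write_api_keys(path, keys):
    """Write the allowed keys as a JSON list of strings and return that list."""
    keys = [str(k) for k in keys]
    with open(path, "w") as f:
        json.dump(keys, f, indent=2)
    return keys

# Usage (placeholder keys, substitute your own):
# write_api_keys("h2ogpt_api_keys.json", ["<API_KEY_1>", "<API_KEY_2>"])
```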

If UI keyed access is enabled, one has to enter the key in the Login tab of the UI before accessing LLMs or uploading files.

If API keyed access is enabled, one has to pass the API key along with the other arguments to access LLMs or upload files.

See src/gen.py file for details:

  • :param enforce_h2ogpt_api_key: Whether to enforce h2oGPT token usage for API
  • :param enforce_h2ogpt_ui_key: Whether to enforce h2oGPT token usage for UI (same keys as API assumed)
  • :param h2ogpt_api_keys: list of tokens allowed for API access or file accessed on demand for json of list of keys
  • :param h2ogpt_key: E.g. can be set when accessing gradio h2oGPT server from local gradio h2oGPT server that acts as client to that inference server

As with any option, one can set the environment variable H2OGPT_x for an upper-case main() argument to control the above.
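
With API keyed access enforced, a client has to include the key in its request. The sketch below assumes the key is accepted as an h2ogpt_key entry in the same string-of-dict payload used by the /submit_nochat_api endpoint shown later in this FAQ. Treat that entry name and the helpers build_payload and ask as assumptions, and check the readme or test code for the exact client usage.

```python
import ast


def build_payload(prompt, api_key):
    """Return the string-of-dict payload, including the API key."""
    # Assumption: the server reads the key from the 'h2ogpt_key' entry.
    return str(dict(instruction_nochat=prompt, h2ogpt_key=api_key))


def ask(host_url, prompt, api_key):
    """Send the prompt to a running h2oGPT server and return the response text."""
    # Imported lazily so build_payload works without gradio_client installed.
    from gradio_client import Client

    client = Client(host_url)
    res = client.predict(build_payload(prompt, api_key), api_name="/submit_nochat_api")
    return ast.literal_eval(res)["response"]

# Usage: print(ask("http://localhost:7860", "Who are you?", "<API_KEY>"))
```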

Auth Access

As listed in the src/gen.py file, there are many ways to control authorization:

  • :param auth: gradio auth for launcher in form [(user1, pass1), (user2, pass2), ...]
    • e.g. --auth=[('jon','password')] with no spaces
    • e.g. --auth="[('jon', 'password)())(')]" so any special characters can be used
    • e.g. --auth=auth.json to specify persisted state file with name auth.json (auth_filename then not required)
    • e.g. --auth='' will use default auth.json as file name for persisted state file (auth_filename then not required)
    • e.g. --auth=None will use no auth, but still keep track of auth state, just not from logins
  • :param auth_filename: Set auth filename, used only if --auth= was passed list of user/passwords
  • :param auth_access:
    • 'open': Allow new users to be added
    • 'closed': Stick to existing users
  • :param auth_freeze: whether to freeze authentication based upon the current file and no longer update the file
  • :param auth_message: Message to show if having users login, fixed if passed, else dynamic internally
  • :param guest_name: guest name if using auth and have open access.
    • If '', then no guest allowed even if open access, then all databases for each user always persisted
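
Hand-quoting the --auth list is error-prone when passwords contain quotes or parentheses. One way to avoid mistakes is to let Python produce the literal and the shell quoting. This sketch assumes the value is read as a Python literal list of tuples, as the examples above suggest; auth_flag is a hypothetical helper:

```python
import ast
import shlex


def auth_flag(users):
    """Build a shell-safe --auth argument from (user, password) pairs."""
    literal = repr([(str(u), str(p)) for u, p in users])
    return "--auth=" + shlex.quote(literal)


flag = auth_flag([("jon", "password)())(")])
print(flag)

# Sanity check: after shell parsing, the value is still a valid Python literal.
value = shlex.split(flag)[0][len("--auth="):]
print(ast.literal_eval(value))
```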

HTTPS access for server and client

Use the files private_key.pem and cert.pem from your own SSL setup, or if you do not have such files, generate them with:

openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 3650 -nodes -subj '/O=H2OGPT'

Consider this server (gradio-based, not h2oGPT) as an end-to-end example:

import gradio as gr
import random
import time

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.ClearButton([msg, chatbot])

    def respond(message, chat_history):
        bot_message = random.choice(["How are you?", "I love you", "I'm very hungry"])
        chat_history.append((message, bot_message))
        time.sleep(2)
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot], api_name='chat')

demo.launch(ssl_verify=False, ssl_keyfile='private_key.pem', ssl_certfile='cert.pem', share=False)

The key and cert files are passed to the server, with ssl_verify=False to avoid asking a known source to verify. This is required to have https but still allow the server to talk to itself and to the UI in the browser. The browser will warn that the SSL certificate is not verified; proceed anyway.

Then the client needs to also not verify when talking to the server running https, which gradio client does not handle itself. One can use a context manager as follows:

import contextlib
import warnings
import requests
from urllib3.exceptions import InsecureRequestWarning

old_merge_environment_settings = requests.Session.merge_environment_settings


@contextlib.contextmanager
def no_ssl_verification():
    opened_adapters = set()

    def merge_environment_settings(self, url, proxies, stream, verify, cert):
        # Verification happens only once per connection so we need to close
        # all the opened adapters once we're done. Otherwise, the effects of
        # verify=False persist beyond the end of this context manager.
        opened_adapters.add(self.get_adapter(url))

        settings = old_merge_environment_settings(self, url, proxies, stream, verify, cert)
        settings['verify'] = False

        return settings

    requests.Session.merge_environment_settings = merge_environment_settings

    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore', InsecureRequestWarning)
            yield
    finally:
        requests.Session.merge_environment_settings = old_merge_environment_settings

        for adapter in opened_adapters:
            try:
                adapter.close()
            except Exception:
                # best-effort cleanup: ignore adapters that fail to close
                pass

Then with this one is able to talk to the server using https:

from gradio_client import Client

HOST_URL = "https://localhost:7860"

with no_ssl_verification():
    client = Client(HOST_URL, serialize=False)
    chatbot = [['foo', 'doo']]
    res = client.predict('Hello', chatbot, api_name='/chat')
    print(res)

which prints out something like:

Loaded as API: https://localhost:7860/ ✔
('', [['foo', 'doo'], ['Hello', 'I love you']])

For h2oGPT, run the server as python generate.py --ssl_verify=False --ssl_keyfile=<KEYFILE> --ssl_certfile=<CERTFILE> --share=False for key file <KEYFILE> and cert file <CERTFILE>. Then use the gradio client code with the context manager as above, but with the gradio client endpoints documented in the readme or test code.

RoPE scaling and Long Context Models

For long context models that have been tuned for a specific size, ensure that you set the --rope_scaling configuration to match that exact size. For example:

python generate.py --rope_scaling="{'type':'linear','factor':4}" --base_model=lmsys/vicuna-13b-v1.5-16k --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --load_8bit=True --langchain_mode=UserData --user_path=user_path --prompt_type=vicuna11 --h2ocolors=False

If the model is Hugging Face-based and already has a config.json entry with rope_scaling in it, we will use that if you do not pass --rope_scaling.
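
For linear scaling, the factor is the ratio of the tuned context length to the base model's native context length. A small sketch, assuming the 16k Vicuna v1.5 model builds on a base model with a 4096-token context (an assumption worth checking against the model card); rope_scaling_arg is a hypothetical helper:

```python
def rope_scaling_arg(native_ctx, tuned_ctx, kind="linear"):
    """Return a --rope_scaling value for a model tuned to tuned_ctx tokens."""
    factor = tuned_ctx / native_ctx
    return str({"type": kind, "factor": factor})


# 16384 / 4096 gives a factor of 4
print(rope_scaling_arg(4096, 16384))  # {'type': 'linear', 'factor': 4.0}
```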

Migration from Chroma < 0.4 to > 0.4

Option 1: Use old Chroma for old DBs

No action is required from the user. By default, h2oGPT does not migrate old databases. This is managed internally through requirements added in requirements_optional_langchain.txt, which adds special wheels for old versions of chromadb and hnswlib, so old databases are handled more smoothly than by chromadb itself.

Option 2: Automatically Migrate

By default, h2oGPT does not migrate automatically with --auto_migrate_db=False for generate.py. You can set this to True for auto-migration, which may take some time for larger databases. This will occur on-demand when accessing a database. This takes about 0.03s per chunk.
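
To get a feel for what "some time" means, the per-chunk figure can be turned into a rough estimate. This is only back-of-the-envelope arithmetic based on the 0.03s number above, which will vary by machine; the function name is hypothetical:

```python
SECONDS_PER_CHUNK = 0.03  # rough figure quoted above; varies by machine


def estimate_migration_seconds(n_chunks, seconds_per_chunk=SECONDS_PER_CHUNK):
    """Estimate on-demand migration time for a database with n_chunks chunks."""
    return n_chunks * seconds_per_chunk


for n in (10_000, 1_000_000):
    secs = estimate_migration_seconds(n)
    print(f"{n} chunks: about {secs / 60:.0f} minutes")
```

So a database with 10,000 chunks migrates in about 5 minutes, while one with a million chunks needs over 8 hours.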

Option 3: Manually Migrate

You can set --auto_migrate_db=False and manually migrate databases by doing the following.

  • Install and run migration tool
    pip install chroma-migrate
    chroma-migrate
    
  • Choose DuckDB
  • Choose "Files I can use ..."
  • Choose your collection path, e.g. db_dir_UserData for collection name UserData

Model Usage Notes

  • amazon/MistralLite
    • Use --max_seq_len=16384 or smaller; larger values fail when the context is heavily used, e.g. for summarization
    • pip install flash-attn==2.3.1.post1 --no-build-isolation
      python generate.py --hf_model_dict="{'use_flash_attention_2': True}" --base_model=amazon/MistralLite --max_seq_len=16384
  • mistralai/Mistral-7B-Instruct-v0.1
    • Use --max_seq_len=4096 or smaller, although it does well even with 32k in some cases, e.g. a query with many chunks in context

Many newer models have large embedding sizes and can handle going beyond the context a bit. However, some models like distilgpt2 critically fail, so one needs to pass

python generate.py --base_model=distilgpt2 --truncation_generation=True

otherwise one will hit:

../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [4,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

We take care of this for distilgpt2, but other similar models might fail in the same way.

Adding Models

You can choose any Hugging Face model or quantized GGML model file in h2oGPT. Hugging Face models are automatically downloaded to the Hugging Face .cache folder (in home folder).

Hugging Face

Hugging Face models are passed via --base_model in all cases, with an extra --load_gptq for GPTQ models or an extra --load_awq for AWQ models, e.g., by TheBloke. For example, for AutoGPTQ:

python generate.py --base_model=TheBloke/Nous-Hermes-13B-GPTQ --load_gptq=model --use_safetensors=True --prompt_type=instruct

and in some cases one has to disable certain features that are not automatically handled by AutoGPTQ package, e.g.

CUDA_VISIBLE_DEVICES=0 python generate.py --base_model=TheBloke/Xwin-LM-13B-v0.2-GPTQ --load_gptq=model --use_safetensors=True --prompt_type=xwin --langchain_mode=UserData --score_model=None --share=False --gradio_offline_level=1 --gptq_dict="{'disable_exllama': True}"

Attention sinks are supported, e.g.:

pip install git+https://github.com/tomaarsen/attention_sinks.git
python generate.py --base_model=mistralai/Mistral-7B-Instruct-v0.1 --score_model=None --attention_sinks=True --max_new_tokens=100000 --max_max_new_tokens=100000 --top_k_docs=-1 --use_gpu_id=False --max_seq_len=4096 --sink_dict="{'attention_sink_size': 4, 'attention_sink_window_size': 4096}"

where the attention sink window has to be larger than any prompt input else failures will occur. If one sets max_input_tokens then this will restrict the input tokens and that can be set to same value as attention_sink_window_size.

One can increase --max_seq_len=4096 for Mistral up to maximum of 32768 if GPU has enough memory, or reduce to lower memory needs from input itself, but still get efficient generation of new tokens "without limit". E.g.

--base_model=mistralai/Mistral-7B-Instruct-v0.1 --score_model=None --attention_sinks=True --max_new_tokens=100000 --max_max_new_tokens=100000 --top_k_docs=-1 --use_gpu_id=False --max_seq_len=8192 --sink_dict="{'attention_sink_size': 4, 'attention_sink_window_size': 8192}"

One can also set --min_new_tokens on CLI or in UI to some larger value, but this is risky as it ignores the end of sentence token and the model may do poorly afterwards. It is better to improve the prompt; this option is most useful when the context is already consumed by input from documents (e.g. top_k_docs=-1) and one still wants long generation. Attention sinks are not yet supported for llama.cpp type models or vLLM/TGI inference servers.

AWQ

Newer AWQ-quantized models keep good quality, e.g. 70B LLaMa-2 in 16-bit and in AWQ perform comparably on many retrieval tasks.

python generate.py --base_model=TheBloke/Llama-2-13B-chat-AWQ --load_awq=model --use_safetensors=True --prompt_type=llama2

GGML

GGML v3 quantized models are supported, and TheBloke also has many of those, e.g.

python generate.py --base_model=llama --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin --max_seq_len=4096

For GGML models, passing --max_seq_len directly is always recommended. When you pass a filename as in the preceding example, we assume you have already downloaded the model to that local path. If you pass a URL instead, we download the file for you (without re-downloading if the file already exists):

python generate.py --base_model=llama --model_path_llama=https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin --max_seq_len=4096

for any TheBloke GGML v3 models.
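
The "download only if missing" behavior can be pictured roughly as follows. This is an illustrative sketch, not h2oGPT's actual implementation; the helper name resolve_model_path and the choice of download directory are hypothetical:

```python
import os
import urllib.request
from urllib.parse import urlparse


def resolve_model_path(path_or_url, download_dir="."):
    """Return a local file path, downloading only when the file is not already present."""
    if not path_or_url.startswith(("http://", "https://")):
        # Plain filename: assumed to have been downloaded already.
        return path_or_url
    filename = os.path.basename(urlparse(path_or_url).path)
    local_path = os.path.join(download_dir, filename)
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(path_or_url, local_path)
    return local_path
```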

GGUF

For GGUF model support or CPU llama.cpp support, see README_LINUX.md or README_WINDOWS.md for uninstalling the GGML package in favor of GGUF. As a complete example, here are the GPU and CPU cases using a GGUF model.

GGUF using GPU on x86_64 linux:

pip uninstall -y llama-cpp-python llama-cpp-python-cuda
pip install https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.83+cu117-cp310-cp310-linux_x86_64.whl
python generate.py --base_model=llama --prompt_type=mistral --model_path_llama=https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf --max_seq_len=4096 --score_model=None

That is, currently, for GPU case, the latest llama_cpp_python only uses GGUF, so version number selects GGML vs. GGUF just like for llama.cpp itself.

GGUF using AVX2 on x86_64 linux:

pip uninstall -y llama-cpp-python llama-cpp-python-cuda
pip install https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.1.83+cpuavx2-cp310-cp310-linux_x86_64.whl
CUDA_VISIBLE_DEVICES= python generate.py --base_model=llama --prompt_type=mistral --model_path_llama=https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf --max_seq_len=4096 --score_model=None

Similarly, the version of the llama_cpp_python package selects support for GGMLv3 vs. GGUF. Versions of llama_cpp_python later than those shown here are untested and may not be supported in h2oGPT.

Similar versions of this package also give support for Windows, AMD, Metal, CPU with various AVX choices, GPU, etc.

GPT4All

GPT4All models are supported, which are automatically downloaded to a GPT4All cache folder (in the home folder). For example:

python generate.py --base_model=gptj --model_name_gptj=ggml-gpt4all-j-v1.3-groovy.bin

for GPTJ models (also downloaded automatically):

python generate.py --base_model=gpt4all_llama --model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin

for GPT4All LLaMa models.

For more information on controlling these parameters, see README_CPU.md and README_GPU.md.

Adding Prompt Templates

After specifying a model, you need to consider if an existing prompt_type will work or if a new one is required. For example, for Vicuna models, a well-defined prompt_type is used, which we support automatically for specific model names. If the model is in prompter.py as associated with some prompt_type name, then we added it already. You can view the models that are currently supported in this automatic way in prompter.py and enums.py.

If we do not list the model in prompter.py, then if you find a prompt_type by name that works for your new model, you can pass --prompt_type=<NAME> for some prompt_type <NAME>, and we will use that for the new model.

However, in some cases, you need to add a new prompt structure because the model does not conform at all (or exactly enough) to the template given in, e.g., the Hugging Face model card or elsewhere. In that case, you have two options:

  • Option 1: Use custom prompt

    In CLI you can pass --prompt_type=custom --prompt_dict="{....}" for some dict {....}. The dictionary doesn't need to contain all of the keys mentioned below, but should contain the primary ones.

    You can also choose prompt_type=custom in expert settings and change prompt_dict in the UI under Models tab. Not all of these dictionary keys need to be set:

    promptA
    promptB
    PreInstruct
    PreInput
    PreResponse
    terminate_response
    chat_sep
    chat_turn_sep
    humanstr
    botstr
    

    To see how these are consumed, see: https://github.com/h2oai/h2ogpt/blob/a51576cd174e9fda61f00c3889a26888a604172c/src/prompter.py#L130-L142

    The following are the most crucial items:

    PreInstruct
    PreResponse
    humanstr
    botstr
    

    Note that it is often the case that humanstr equals PreInstruct and botstr equals PreResponse. If this is the case, then you only have to set two keys.

  • Option 2: Tweak or Edit code

    The following steps describe how you can edit the code itself if you don't want to use the CLI or UI:

    1. In prompter.py, add new key (prompt_type name) and value (model name) into prompt_type_to_model_name
    2. In enums.py, add a new name and value for the new prompt_type
    3. In prompter.py, add new block in get_prompt()

    A simple example to follow is vicuna11, with this block:

    elif prompt_type in [PromptType.vicuna11.value, str(PromptType.vicuna11.value),
                         PromptType.vicuna11.name]:
        preprompt = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. """ if not (
                chat and reduced) else ''
        start = ''
        promptB = promptA = '%s%s' % (preprompt, start)
        eos = '</s>'
        PreInstruct = """USER: """
        PreInput = None
        PreResponse = """ASSISTANT:"""
        terminate_response = [PreResponse]
        chat_sep = ' '
        chat_turn_sep = eos
        humanstr = PreInstruct
        botstr = PreResponse
    
        if making_context:
            # when making context, want it to appear as-if LLM generated, which starts with space after :
            PreResponse = PreResponse + ' '
        else:
            # normally LLM adds space after this, because was how trained.
            # if add space here, non-unique tokenization will often make LLM produce wrong output
            PreResponse = PreResponse
    

    You can start by changing each thing that appears in the model card that tells about the prompting. You can always ask for help in a GitHub issue or Discord.

In either case, if the model card doesn't have that information, you'll need to ask around. In some cases, prompt information is included in the model's pipeline file or in a GitHub repository associated with the model that contains training or inference code. It may also be the case that the model builds upon another, and you should look at the original model card. You can also ask in the community section on Hugging Face for that model card.
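
For Option 1, the value passed to --prompt_dict is the string form of a Python dict. A minimal sketch that assembles one from the vicuna11 strings shown above; reusing that template for a custom model is only an illustration, so adjust the strings to your model card. <MODEL> is a placeholder.

```python
import ast

# The four most important keys; here humanstr/botstr equal PreInstruct/PreResponse.
prompt_dict = {
    "PreInstruct": "USER: ",
    "PreResponse": "ASSISTANT:",
    "humanstr": "USER: ",
    "botstr": "ASSISTANT:",
}

cli_value = str(prompt_dict)

# Sanity check: the string parses back into the same dict.
assert ast.literal_eval(cli_value) == prompt_dict

print(f'python generate.py --base_model=<MODEL> --prompt_type=custom --prompt_dict="{cli_value}"')
```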

Add new Embedding Model

This section describes how to add a new embedding model.

  • The --use_openai_embedding option set to True or False controls whether to use OpenAI embedding.

  • --hf_embedding_model set to some HuggingFace model name sets that as embedding model if not using OpenAI

  • The setting --migrate_embedding_model set to True or False specifies whether to migrate to new chosen embeddings or stick with existing/original embedding for a given database

  • The option --cut_distance as float specifies the distance above which to avoid using document sources. The default is 1.64, tuned for Mini and instructor-large. You can pass --cut_distance=100000 to avoid any filter. For example:

    python generate.py --base_model=h2oai/h2ogpt-4096-llama2-13b-chat  --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --hf_embedding_model=BAAI/bge-large-en --cut_distance=1000000
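
Conceptually, --cut_distance acts as a threshold on the embedding distance of each retrieved chunk. The following is a simplified illustration of that idea, not h2oGPT's actual retrieval code; the function name and the sample scores are made up:

```python
def filter_sources(scored_docs, cut_distance=1.64):
    """Keep (doc, distance) pairs whose distance is below cut_distance."""
    return [(doc, dist) for doc, dist in scored_docs if dist < cut_distance]


scored = [("close match", 0.8), ("weak match", 1.6), ("unrelated", 2.3)]
print(filter_sources(scored))                       # drops "unrelated"
print(filter_sources(scored, cut_distance=100000))  # keeps everything
```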

In-Context learning via Prompt Engineering

For arbitrary tasks, using uncensored models like Falcon 40 GM is recommended. If a censored model is acceptable, then LLaMa-2 Chat models are fine. Choose the model size according to your system specs.

For the UI, CLI, or EVAL, this means editing the System Pre-Context text box in expert settings. When starting h2oGPT, you can pass --system_prompt to give a model a system prompt if it supports that, --context to pre-append some raw context, --chat_conversation to pre-append a conversation for instruct/chat models, --text_context_list to fill context up to possible allowed max_seq_len with strings, with first most relevant to appear near prompt, or --iinput for a default input (to instruction for pure instruct models) choice.

Or for API, you can pass the context variable. This can be filled with arbitrary things, including actual conversations to prime the model, although if a conversation then you need to put in prompts as follows:

from gradio_client import Client
import ast

HOST_URL = "http://localhost:7860"
client = Client(HOST_URL)

# string of dict for input
prompt = 'Who are you?'
# falcon, but falcon7B is not good at this:
#context = """<|answer|>I am a pixie filled with fairy dust<|endoftext|><|prompt|>What kind of pixie are you?<|endoftext|><|answer|>Magical<|endoftext|>"""
# LLama2 7B handles this well:
context = """[/INST] I am a pixie filled with fairy dust </s><s>[INST] What kind of pixie are you? [/INST] Magical"""
kwargs = dict(instruction_nochat=prompt, context=context)
res = client.predict(str(dict(kwargs)), api_name='/submit_nochat_api')

# string of dict for output
response = ast.literal_eval(res)['response']
print(response)

For example, see: https://github.com/h2oai/h2ogpt/blob/d3334233ca6de6a778707feadcadfef4249240ad/tests/test_prompter.py#L47 .

Note that even if the prompting is not perfect or does not match the model, smarter models will still do quite well, as long as you give their answers as part of context.

If you just want to pre-append a conversation, then use chat_conversation instead and h2oGPT will generate the context for the given instruct/chat model:

from gradio_client import Client
import ast

HOST_URL = "http://localhost:7860"
client = Client(HOST_URL)

# string of dict for input
prompt = 'Who are you?'
chat_conversation = [("Who are you?", "I am a pixie filled with fairy dust"), ("What kind of pixie are you?", "Magical")]
kwargs = dict(instruction_nochat=prompt, chat_conversation=chat_conversation)
res = client.predict(str(dict(kwargs)), api_name='/submit_nochat_api')

# string of dict for output
response = ast.literal_eval(res)['response']
print(response)

Note that when providing context, chat_conversation, and text_context_list, the order in which they are integrated into the document Q/A prompting is: context first, followed by chat_conversation, and finally text_context_list. A system_prompt can also be passed, which can overpower any context or chat_conversation depending upon details.

Token access to Hugging Face models:

Related to transformers. There are two independent ways to do this (choose one):

  • Use ENV:
    export HUGGING_FACE_HUB_TOKEN=<token goes here>
    
    token starts with hf_ usually. Then start h2oGPT like normal. See Hugging Face ENV documentation for other environment variables.
  • Use cli tool:
    huggingface-cli login
    in repo. Then add to generate.py:
    python generate.py --use_auth_token=True ...
    
    See Hugging Face Access Tokens for more details.

Low-memory mode

For the GPU case, a reasonable low-memory setup is:

python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None --load_8bit=True --langchain_mode='UserData'

which uses a good but smaller base model, a small embedding model, and no response score model to save GPU memory. If you can use 4-bit, then run:

python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None --load_4bit=True --langchain_mode='UserData'

This uses 5800MB at startup, then soon drops to 5075MB after the torch cache is cleared. Asking a simple question uses up to 6050MB. Adding a document uses no additional GPU memory. Asking a question about documents uses up to 6312MB for a few chunks (default), then drops back down to 5600MB.

For some models, you can restrict the use of context to use less memory. This does not work for long context models trained with static/linear RoPE scaling, for which the full static scaling should be used. Otherwise, e.g. for LLaMa-2 you can use

python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --model_path_llama=https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin --max_seq_len=2048

even though the normal value, inferred from the model config.json when the option is not passed, is --max_seq_len=4096.

For the CPU case, a good model that still uses little memory is:

python generate.py --base_model='llama' --prompt_type=llama2 --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --langchain_mode=UserData --user_path=user_path

Be sure to lower n_gpu_layers at CLI or in UI to reduce offloading on GPUs with less memory.

ValueError: ...offload....

The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers
the weights in this format.

If you see this error, then you either have insufficient GPU memory or insufficient CPU memory. E.g. for 6.9B model one needs minimum of 27GB free memory.
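
The 27GB figure is consistent with holding the weights as 32-bit floats at 4 bytes per parameter. That precision is an assumption here; a rough estimate that ignores activations and other overhead can be sketched as:

```python
def weight_memory_gb(n_params, bytes_per_param=4):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


print(f"{weight_memory_gb(6.9e9):.1f} GB")     # 32-bit floats: 27.6 GB
print(f"{weight_memory_gb(6.9e9, 2):.1f} GB")  # 16-bit floats: 13.8 GB
```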

TypeError: Chroma.__init__() got an unexpected keyword argument 'anonymized_telemetry'

Please check your version of langchain vs. the one in requirements.txt. Somehow the wrong version is installed. Try to install the correct one.

bitsandbytes CUDA error

CUDA Setup failed despite GPU being available. Please run the following command to get more information:
E               
E                       python -m bitsandbytes
E               
E                       Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
E                       to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
E                       and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

Ensure you have cuda version supported by bitsandbytes, e.g. in Ubuntu:

sudo update-alternatives --display cuda
sudo update-alternatives --config cuda

and ensure you choose CUDA 12.1 if using bitsandbytes 0.39.0 since that is the last version it supports. Or upgrade bitsandbytes if that works. Uninstalling bitsandbytes also avoids the error, but removes 4-bit and 8-bit support.

Multiple GPUs

Automatic sharding can be enabled with --use_gpu_id=False. This is disabled by default, as in rare cases torch hits a bug with cuda:x cuda:y mismatch. E.g. to use GPU IDs 0 and 3, one can run:

export HUGGING_FACE_HUB_TOKEN=<hf_...>
export CUDA_VISIBLE_DEVICES="0,3"
export GRADIO_SERVER_PORT=7860
python generate.py \
          --base_model=meta-llama/Llama-2-7b-chat-hf \
          --prompt_type=llama2 \
          --max_max_new_tokens=4096 \
          --max_new_tokens=1024 \
          --use_gpu_id=False \
          --save_dir=save7b \
          --score_model=None \
          --use_auth_token="$HUGGING_FACE_HUB_TOKEN"

where use_auth_token has been set as required for LLaMa2.

Larger models require more GPU memory

Depending on available GPU memory, you can load differently sized models. For multiple GPUs, automatic sharding can be enabled with --use_gpu_id=False, but this is disabled by default since cuda:x cuda:y mismatches can occur.

For GPUs with at least 24GB of memory, we recommend:

python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True

or

python generate.py --base_model=h2oai/h2ogpt-oasst1-512-20b --load_8bit=True

For GPUs with at least 48GB of memory, we recommend:

python generate.py --base_model=h2oai/h2ogpt-oasst1-512-20b --load_8bit=True

etc.

CPU with no AVX2 or using LLaMa.cpp

GPT4All-based models require AVX2, unless you recompile that project on your system. Until then, use llama.cpp models instead.

So we recommend downloading models from TheBloke that are version 3 quantized ggml files to work with latest llama.cpp. See main README.md.

The following example is for the base LLaMa model, not instruct-tuned, so it is not recommended for chatting. It just gives an example of how to quantize if you are an expert.

Compile the llama model on your system by following the instructions and llama-cpp-python, e.g. for Linux:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean
make LLAMA_OPENBLAS=1

on CPU, or for GPU:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean
make LLAMA_CUBLAS=1

etc. following different scenarios.

Then:

# obtain the original LLaMA model weights and place them in ./models, i.e. models should contain:
# 65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
conda create -n llamacpp -y
conda activate llamacpp
conda install python=3.10 -y
pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python convert.py models/7B/

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# test by running the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128

then run like (assumes version 3 quantization):

python generate.py --base_model=llama --model_path_llama=./models/7B/ggml-model-q4_0.bin

or wherever you placed the model with the path pointing to wherever the files are located (e.g. link from h2oGPT repo to llama.cpp repo folder), e.g.

cd ~/h2ogpt/
ln -s ~/llama.cpp/models/* .

then run h2oGPT like:

python generate.py --base_model='llama' --langchain_mode=UserData --user_path=user_path

Is this really a GGML file? Or Using version 2 quantization files from GPT4All that are LLaMa based

If you hit the error:

Found model file.
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
llama_init_from_file: failed to load model
LLAMA ERROR: failed to load model from ./models/7B/ggml-model-q4_0.bin

then note that llama.cpp upgraded to version 3, and we use llama-cpp-python version that supports only that latest version 3. GPT4All does not support version 3 yet. If you want to support older version 2 llama quantized models, then do:

pip install --force-reinstall --ignore-installed --no-cache-dir llama-cpp-python==0.1.73

to go back to the prior version. Or specify the model using GPT4All by running:

python generate.py --base_model=gpt4all_llama  --model_path_gpt4all_llama=./models/7B/ggml-model-q4_0.bin

assuming that file is from version 2 quantization.
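
To see which format a file actually uses before choosing a llama-cpp-python version, you can inspect its header. The sketch below assumes the layout implied by the error message: a little-endian 32-bit magic followed by a 32-bit version. read_magic_version is a hypothetical helper, and for unversioned legacy files the second field is not meaningful.

```python
import struct


def read_magic_version(path):
    """Return (magic as hex string, version) from the first 8 bytes of a model file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        raise ValueError("file too short to contain a header")
    magic, version = struct.unpack("<II", header)
    return f"{magic:08x}", version

# Usage: print(read_magic_version("./models/7B/ggml-model-q4_0.bin"))
```

A result of ('67676a74', 3) corresponds to the (magic, version) combination reported in the error above.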

not enough memory: you tried to allocate 590938112 bytes.

If one sees: 
```
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 590938112 bytes.
```
then the CPU probably has insufficient memory to handle the model. Try GGML.

WARNING: failed to allocate 258.00 MB of pinned memory: out of memory

If you see:
```
Warning: failed to VirtualLock 17825792-byte buffer (after previously locking 1407303680 bytes): The paging file is too small for this operation to complete.

WARNING: failed to allocate 258.00 MB of pinned memory: out of memory
Traceback (most recent call last):
```
then you have insufficient pinned memory on your GPU.  You can disable pinning by setting this env before launching h2oGPT:
  • Linux:
    export GGML_CUDA_NO_PINNED=1
    
  • Windows:
    set GGML_CUDA_NO_PINNED=1
    

I get the error: The model 'OptimizedModule' is not supported for . Supported models are ...

This warning can be safely ignored.

What ENVs can I pass to control h2oGPT?

  • SAVE_DIR: Local directory to save logs to,
  • ADMIN_PASS: Password to access system info, logs, or push to aws s3 bucket,
  • AWS_BUCKET: AWS bucket name to push logs to when have admin access,
  • AWS_SERVER_PUBLIC_KEY: AWS public key for pushing logs to when have admin access,
  • AWS_SERVER_SECRET_KEY: AWS secret key for pushing logs to when have admin access,
  • HUGGING_FACE_HUB_TOKEN: Read or write HF token for accessing private models,
  • LANGCHAIN_MODE: LangChain mode, overrides CLI,
  • SCORE_MODEL: HF model to use for scoring prompt-response pairs, None for no scoring of responses,
  • HEIGHT: Height of Chat window,
  • allow_upload_to_user_data: Whether to allow uploading to Shared UserData,
  • allow_upload_to_my_data: Whether to allow uploading to Personal MyData,
  • HUGGINGFACE_SPACES: Whether on public A10G 24GB HF spaces, sets some low-GPU-memory defaults for public access to avoid GPU memory abuse by model switching, etc.
  • HF_HOSTNAME: Name of HF spaces for purpose of naming log files,
  • GPT_H2O_AI: Whether on public 48GB+ GPU instance, sets some defaults for public access to avoid GPU memory abuse by model switching, etc.,
  • CONCURRENCY_COUNT: Number of concurrent users of the gradio server (1 is fastest since LLMs tend to consume all GPU cores, but 2-4 is best to avoid any single user waiting too long to get a response)
  • API_OPEN: Whether API access is visible,
  • ALLOW_API: Whether to allow API access,
  • CUDA_VISIBLE_DEVICES: Standard list of CUDA devices to make visible.
  • PING_GPU: ping GPU every few minutes for full GPU memory usage by torch, useful for debugging OOMs or memory leaks
  • GET_GITHASH: get git hash on startup for system info. Avoided normally as can fail with extra messages in output for CLI mode
  • H2OGPT_BASE_PATH: Choose base folder for all files except personal/scratch files

These can be useful on HuggingFace spaces, where one sets secret tokens because CLI options cannot be used.

NOTE: Scripts can accept different environment variables to control query arguments. For instance, if a Python script takes an argument like --load_8bit=True, the corresponding ENV variable would follow this format: H2OGPT_LOAD_8BIT=True (regardless of capitalization). It is important to ensure that the environment variable is assigned the exact value that would have been used for the script's query argument.
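The naming rule in the note above can be illustrated with a small sketch. The helper `args_from_env` is hypothetical and is not h2oGPT's actual implementation; it only shows how a variable such as `H2OGPT_LOAD_8BIT` would correspond to the `load_8bit` argument, ignoring capitalization.

```python
def args_from_env(environ, prefix="H2OGPT_"):
    """Map prefixed environment variables to lower-case argument names.

    Hypothetical helper for illustration only, not h2oGPT's own code.
    """
    found = {}
    for name, value in environ.items():
        # Compare in upper case so that h2ogpt_load_8bit also matches.
        if name.upper().startswith(prefix):
            found[name[len(prefix):].lower()] = value
    return found


# A plain dict stands in for os.environ so the example has no side effects.
example_env = {
    "H2OGPT_LOAD_8BIT": "True",
    "h2ogpt_base_model": "h2oai/h2ogpt-oasst1-512-20b",
    "HOME": "/home/user",
}
overrides = args_from_env(example_env)
print(overrides)
```

The values stay as strings here, which matches the note's advice to give the variable exactly the value that would have been passed on the command line.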

How to run functions in src from Python interpreter

E.g.

import sys
sys.path.append('src')
from src.gpt_langchain import get_supported_types
non_image_types, image_types, video_types = get_supported_types()
print(non_image_types)
print(image_types)
for x in image_types:
    print('   - `.%s` : %s Image (optional),' % (x.lower(), x.upper()))
# unused in h2oGPT:
print(video_types)

GPT4All not producing output.

Please contact the GPT4All team. Even a basic test can give an empty result.

>>> from gpt4all import GPT4All as GPT4AllModel
>>> m = GPT4AllModel('ggml-gpt4all-j-v1.3-groovy.bin')
Found model file.
gptj_model_load: loading model from '/home/jon/.cache/gpt4all/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
>>> m.generate('Was Avogadro a  professor at the University of Turin?')

''
>>>

Also, the model tends not to do well when the input contains newlines; spaces or <br> work better. This does not seem to be an issue with h2oGPT.

Commercial viability

Open-source means the models are not proprietary and are available to download. In addition, the license for all of our non-research models is Apache V2, which is a fully permissive license. Some licenses for other open-source models are not fully permissive, such as StabilityAI's models, which are licensed under CC-BY-SA and require derivatives to be shared too.

We post model, license, and data origin details on our huggingface page: https://huggingface.co/h2oai (all models, except research ones, are fully permissive). The foundational models we fine-tuned on, e.g. Pythia 6.9B, Pythia 12B, NeoX 20B, or Open-LLaMa checkpoints, are fully commercially viable. These foundational models are also listed on the huggingface page for each fine-tuned model. Full training logs, source data, etc. are all provided for all models. GPT4All GPT_J is commercially viable, but other models may not be. Any models based on Meta's LLaMa weights are not commercially viable.

Data used to fine-tune are provided on the huggingface pages for each model. Data for foundational models are provided on their huggingface pages. Any models trained on GPT3.5 data like ShareGPT, Vicuna, Alpaca, etc. are not commercially viable due to ToS violations w.r.t. building competitive models. Any research-based h2oGPT models based upon Meta's weights for LLaMa are not commercially viable.

Overall, we have done a significant amount of due diligence regarding data and model licenses to carefully select only fully permissive data and models for the models we license as Apache V2. Outside our models, some "open-source" models like Vicuna, Koala, WizardLM, etc. are based upon Meta's weights for LLaMa, which is not commercially usable due to ToS violations w.r.t. non-competitive clauses as well as research-only clauses. Such models tend to also use data from GPT3.5 (ChatGPT), which is also not commercially usable due to ToS violations w.r.t. non-competitive clauses. E.g. Alpaca data, ShareGPT data, WizardLM data, etc. all fall under that category. All open-source foundational models consume data from the internet, including the Pile or C4 (web crawl) that may contain objectionable material. Future licenses w.r.t. new web crawls may change, but it is our understanding that existing data crawls would not be affected by any new licenses. However, some web crawl data may contain pirated books.

AMD support

Untested AMD support: Download and install bitsandbytes on AMD

Disclaimers

Disclaimers and a ToS link are displayed to protect the app creators.

What are the different prompt types? How does prompt engineering work for h2oGPT?

In general, all LLMs use strings as inputs for training/fine-tuning and generation/inference. To manage a variety of possible language task types, we divide any such string into the following three parts:

  • Instruction
  • Input
  • Response

Each of these three parts can be empty or non-empty strings, such as titles or newlines. In the end, all of these prompt parts are concatenated into one string. The magic is in the content of those substrings. This is called prompt engineering.

Summarization

For training a summarization task, we concatenate these three parts together:

  • Instruction = <INSTRUCTION>
  • Input = '## Main Text\n\n' + <INPUT>
  • Response = '\n\n## Summary\n\n' + <OUTPUT>

For each training record, we take <INPUT> and <OUTPUT> from the summarization dataset (typically two fields/columns), place them into the appropriate position, and turn that record into one long string that the model can be trained with: '## Main Text\n\nLarge Language Models are Useful.\n\n## Summary\n\nLLMs rock.'

At inference time, we will take the <INPUT> only and stop right after '\n\n## Summary\n\n' and the model will generate the summary as the continuation of the prompt.
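As a minimal sketch of the concatenation described above, the two functions below build the training string and the inference prompt for the summarization format. The function names are hypothetical and not part of h2oGPT; only the literal separators come from the text.

```python
MAIN_TEXT_HEADER = "## Main Text\n\n"
SUMMARY_HEADER = "\n\n## Summary\n\n"


def build_summarize_prompt(input_text, instruction=""):
    """Inference prompt: stop right after the summary header."""
    return instruction + MAIN_TEXT_HEADER + input_text + SUMMARY_HEADER


def build_summarize_record(input_text, output_text, instruction=""):
    """Training string: the inference prompt followed by the target summary."""
    return build_summarize_prompt(input_text, instruction) + output_text


training_string = build_summarize_record(
    "Large Language Models are Useful.", "LLMs rock."
)
inference_prompt = build_summarize_prompt("Large Language Models are Useful.")
print(repr(training_string))
print(repr(inference_prompt))
```

Because the training string is the inference prompt plus the summary, the model sees at inference time exactly the prefix it was trained to continue.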

ChatBot

For a conversational chatbot use case, we use the following three parts:

  • Instruction = <INSTRUCTION>
  • Input = '<human>: ' + <INPUT>
  • Response = '<bot>: ' + <OUTPUT>

And a training string could look like this: '<human>: hi, how are you?<bot>: Hi, I am doing great. How can I help you?'. At inference time, the model input would be like this: '<human>: Tell me a joke about snow flakes.<bot>: ', and the model would generate the bot part.
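The chatbot format can be sketched the same way. The helper names below are hypothetical; the `<human>: ` and `<bot>: ` prefixes are taken from the list above.

```python
HUMAN_PREFIX = "<human>: "
BOT_PREFIX = "<bot>: "


def build_chat_prompt(user_text, instruction=""):
    """Inference prompt: the human turn followed by an open bot turn."""
    return instruction + HUMAN_PREFIX + user_text + BOT_PREFIX


def build_chat_record(user_text, bot_text, instruction=""):
    """Training string: the inference prompt followed by the bot reply."""
    return build_chat_prompt(user_text, instruction) + bot_text


chat_training = build_chat_record(
    "hi, how are you?", "Hi, I am doing great. How can I help you?"
)
chat_prompt = build_chat_prompt("Tell me a joke about snow flakes.")
print(chat_training)
print(chat_prompt)
```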

How should training data be prepared?

Training data (in JSON format) must contain at least one column that maps to instruction, input or output. Their content will be placed into the <INSTRUCTION>, <INPUT>, and <OUTPUT> placeholders mentioned above. The chosen prompt_type will fill in the strings in between to form the actual input into the model. Any missing columns will lead to empty strings. The optional --data_col_dict={'A': 'input', 'B': 'output'} argument can be used to map different column names into the required ones.
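To make the column mapping concrete, here is a small sketch of what renaming columns and filling in missing ones could look like. `normalize_record` is a hypothetical helper, not the code h2oGPT uses, and for brevity it keeps only the three prompt columns.

```python
REQUIRED_COLUMNS = ("instruction", "input", "output")


def normalize_record(record, data_col_dict=None):
    """Rename columns per data_col_dict and default missing ones to ''.

    Hypothetical illustration of the behavior described in the text.
    """
    mapping = data_col_dict or {}
    # Rename any column listed in the mapping; leave other names untouched.
    renamed = {mapping.get(key, key): value for key, value in record.items()}
    # Missing columns become empty strings.
    return {col: renamed.get(col, "") for col in REQUIRED_COLUMNS}


raw = {"A": "Who are you?", "B": "My name is h2oGPT."}
normalized = normalize_record(raw, {"A": "input", "B": "output"})
print(normalized)
```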

Examples

The following are examples of training records in JSON format.

  • human_bot prompt type
{
  "input": "Who are you?",
  "output": "My name is h2oGPT.",
  "prompt_type": "human_bot"
}
  • plain version of human_bot, useful for longer conversations
{
  "input": "<human>: Who are you?\n<bot>: My name is h2oGPT.\n<human>: Can you write a poem about horses?\n<bot>: Yes, of course. Here it goes...",
  "prompt_type": "plain"
}
  • summarize prompt type
{
  "instruction": "",
  "input": "Long long long text.",
  "output": "text.",
  "prompt_type": "summarize"
}

Context length

Note that the total length of the text (that is, the input and output) the LLM can handle is limited by the so-called context length. For our current models, the context length is 2048 tokens. Longer context lengths are computationally more expensive due to the interactions between all tokens in the sequence. A context length of 2048 means that for an input of, for example, 1900 tokens, the model will be able to create no more than 148 new tokens as part of the output.
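The arithmetic in the example above is simply the context length minus the prompt length. A tiny sketch, with a hypothetical function name:

```python
def max_new_tokens(context_length, prompt_tokens):
    """Tokens left for generation once the prompt is in the context."""
    return max(context_length - prompt_tokens, 0)


remaining = max_new_tokens(2048, 1900)
print(remaining)  # 148
```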

For fine-tuning, if the average length of inputs is less than the context length, one can provide a cutoff_len of less than the context length to truncate inputs to this amount of tokens. For most instruction-type datasets, a cutoff length of 512 seems reasonable and provides nice memory and time savings. For example, the h2oai/h2ogpt-oasst1-512-20b model was trained with a cutoff length of 512.

Tokens

The following are some example tokens (from a total of ~50k), each of which is assigned a number:

"osed": 1700,
"ised": 1701,
"================": 1702,
"ED": 1703,
"sec": 1704,
"Ġcome": 1705,
"34": 1706,
"ĠThere": 1707,
"Ġlight": 1708,
"Ġassoci": 1709,
"gram": 1710,
"Ġold": 1711,
"Ġ{#": 1712,

The model is trained with these specific numbers, so the tokenizer must be kept the same for training and inference/generation. The input format doesn't change whether the model is in pretraining, fine-tuning, or inference mode, but the text itself can change slightly for better results, and that's called prompt engineering.
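The following sketch illustrates why the tokenizer must stay the same. It uses only the tokens listed above and assumes a GPT-2-style byte-level BPE vocabulary, in which `Ġ` marks a leading space. The `decode` function and the shifted vocabulary are made up for illustration and are not a real tokenizer.

```python
vocab = {
    "osed": 1700, "ised": 1701, "================": 1702, "ED": 1703,
    "sec": 1704, "Ġcome": 1705, "34": 1706, "ĠThere": 1707,
    "Ġlight": 1708, "Ġassoci": 1709, "gram": 1710, "Ġold": 1711,
    "Ġ{#": 1712,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def decode(ids, table):
    """Join tokens and turn the Ġ marker back into a space."""
    return "".join(table[i] for i in ids).replace("Ġ", " ")


ids = [vocab["ĠThere"], vocab["Ġcome"], vocab["Ġold"], vocab["Ġlight"]]
text = decode(ids, id_to_token)

# A different tokenizer: same tokens, but every id shifted by one.
shifted = {token_id + 1: token for token_id, token in id_to_token.items()}
garbled = decode(ids, shifted)

print(repr(text))     # ' There come old light'
print(repr(garbled))  # '34secgram There'
```

The same ids produce unrelated text under the shifted table, which is what a model would effectively see if training and inference used different tokenizers.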

Is h2oGPT multilingual?

Yes. Try it in your preferred language.

What does 512 mean in the model name?

The number 512 in the model names indicates the cutoff length (in tokens) used for fine-tuning. Shorter values generally result in faster training and more focus on the last part of the provided input text (consisting of prompt and answer).

Throttle GPUs in case of reset/reboot

(h2ogpt) jon@gpu:~$ sudo nvidia-smi -pl 250
Power limit for GPU 00000000:3B:00.0 was set to 250.00 W from 300.00 W.
Power limit for GPU 00000000:5E:00.0 was set to 250.00 W from 300.00 W.
Power limit for GPU 00000000:86:00.0 was set to 250.00 W from 300.00 W.
Power limit for GPU 00000000:AF:00.0 was set to 250.00 W from 300.00 W.
All done.

Heterogeneous GPU systems

In case you get peer-to-peer related errors on non-homogeneous GPU systems, set this env var:

export NCCL_P2P_LEVEL=LOC

Use Wiki data

The following example demonstrates how to use Wiki data:

>>> from datasets import load_dataset
>>> wk = load_dataset("wikipedia", "20220301.en")
>>> wk
DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 6458670
    })
})
>>> sentences = ".".join(wk['train'][0]['text'].split('.')[0:2])
>>> sentences
'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful'
>>>

CentOS with llama-cpp-python

The following may help to get llama-cpp-python to install:

# remove old gcc
yum remove gcc
yum remove gdb
# install scl-utils
sudo yum install scl-utils
sudo yum install centos-release-scl
# find devtoolset-11
yum list all --enablerepo='centos-sclo-rh' | grep "devtoolset"
# install devtoolset-11-toolchain
yum install -y devtoolset-11-toolchain
# add gcc 11 to PATH by adding following script to /etc/profile
PATH=$PATH:/opt/rh/devtoolset-11/root/usr/bin
export PATH
sudo scl enable devtoolset-11 bash
# show gcc version and gcc11 is installed successfully.
gcc --version
export FORCE_CMAKE=1
export CMAKE_ARGS=-DLLAMA_OPENBLAS=on
pip install llama-cpp-python --no-cache-dir