KoboldAI Horde
KoboldAI is part of the AI Horde, but it generates text rather than images. It allows you to use many text generation models for free.
A text worker can easily be run on services like [Vast.ai](https://cloud.vast.ai) using [koboldcpp](https://github.com/LostRuins/koboldcpp).
Once your account is set up, create a custom template. In the image path, put `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel` and make sure `cuda12.1-cudnn8-devel` is selected as the version tag.
Select "Run a jupyter-python notebook" and check the two options.
As the on-start script, input:

```sh
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make -B LLAMA_CUBLAS=1 -j8
pip install -r requirements.txt
wget -nc https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q4_K_M.gguf
python koboldcpp.py mythomax-l2-13b.Q4_K_M.gguf --port 5001 --usecublas normal 0 mmq --gpulayers 43 --threads 1 --contextsize 4096 --hordeconfig MythoMax-L2-13b 1024 4096 APIKEY WORKERNAME
```
Replace "APIKEY" and "WORKERNAME" with your Horde API key and the worker name for this instance. Be aware that this can make your API key visible to Vast.ai staff and to the owner of the instance. You can also swap the model for a different one, but a bigger model may need more disk space or VRAM, as well as an adjusted number of gpulayers.
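One way to keep the key out of the script text itself is to read it from environment variables set on the instance and substitute them into the launch line. This is only a sketch; the variable names `HORDE_API_KEY` and `WORKER_NAME` are my own, and the key is still visible to whoever controls the machine:

```sh
# Fall back to the placeholders if the variables are not set on the instance.
HORDE_API_KEY="${HORDE_API_KEY:-APIKEY}"
WORKER_NAME="${WORKER_NAME:-WORKERNAME}"
CMD="python koboldcpp.py mythomax-l2-13b.Q4_K_M.gguf --port 5001 \
--usecublas normal 0 mmq --gpulayers 43 --threads 1 --contextsize 4096 \
--hordeconfig MythoMax-L2-13b 1024 4096 $HORDE_API_KEY $WORKER_NAME"
echo "$CMD"   # inspect the final command before running it
# eval "$CMD"
```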
Name your template, and do not enable "Public Template".
In the "search" section of the site, select your template under "Instance configuration". Set disk space to 8 GB or more.
As filter options, you will want to pick:
- GPU RAM: minimum 12 GB
- Inet Down: as much as you want, but at least 200 Mbps to avoid spending hours on the model download
- GPU count: 1 (it is unclear whether the script works with more than one)
- Unverified instances (cheaper)
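As a rough sanity check on the 200 Mbps figure, here is the download-time arithmetic; the ~8 GB model size is an assumption, in the right ballpark for a 13B Q4_K_M file:

```sh
SIZE_GB=8   # approximate model file size (assumption)
MBPS=200    # minimum recommended download speed
# 1 GB = 8000 megabits, so seconds = GB * 8000 / Mbps (integer arithmetic)
SECONDS_NEEDED=$(( SIZE_GB * 8000 / MBPS ))
echo "~$SECONDS_NEEDED seconds (~$(( SECONDS_NEEDED / 60 )) minutes)"
```

At 200 Mbps the download takes around five minutes; at 20 Mbps it would take nearly an hour.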
Feel free to play with the sorting options to see which instance you might prefer, as well as interruptible instances. Upon clicking "rent", Vast.ai will start the image and run the script. You can monitor it using the "request logs" button.
As a quick way of checking that the instance is working, look at:
- Disk used: if this is low, the model wasn't downloaded
- VRAM used: if this isn't around 12 GB, the process didn't load the model into VRAM
- GPU usage %: this sometimes drops to 0 between jobs, but should usually be above 50%
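The same checks can be run from a terminal inside the instance (for example via the Jupyter interface). This assumes `nvidia-smi` is available, which the CUDA devel image normally provides:

```sh
# Report VRAM used and GPU utilization, falling back to a message
# when nvidia-smi is unavailable (e.g. not a GPU machine).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv,noheader
else
  echo "nvidia-smi not found"
fi
# Disk usage for the filesystem holding the current directory.
df -h . | tail -n 1
```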
Remember to delete instances after use, as a stopped instance is still billed for disk space.