Releases: LostRuins/koboldcpp

koboldcpp-1.35

12 Jul 06:28

Note: This build contains significant CUDA changes and may be less stable than usual. Please report any performance regressions or bugs you encounter; if this version is slower for you, please use the previous release for now.

  • Enabled the CUDA 8-bit MMV mode (see ggml-org#2067) now that it appears stable and works correctly. This approach uses quantized dot products instead of the traditional DMMV approach for the formats q4_0, q4_1, q5_0 and q5_1. If you can do a full GPU offload, CUDA for such models will likely be significantly faster than before. K-quants and CL are not affected.
  • Exposed performance information (prompt processing and generation timing) through the API; access it at /api/extra/perf (a usage sketch follows this list).
  • Added support for linear RoPE as an alternative to NTK-aware RoPE (similar to 1.33, but using a base of 2048). It is triggered by the launcher parameter --linearrope, and the RoPE scale is determined by --contextsize. For best results on SuperHOT models, launch with --linearrope --contextsize 8192, which provides the 0.25 linear scale that the SuperHOT finetune recommends. If --linearrope is not specified, NTK-aware RoPE is used by default.
  • Added a Save and Load settings option to the GUI launcher.
  • Added the ability to select "All Devices" in the GUI for CUDA. You are still recommended to select a specific device - split GPU is usually slower.
  • Displays a warning if poor sampler orders are used, as the default configuration will give much better results.
  • Updated Kobold Lite, pulled other upstream fixes and optimizations.
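
As a rough illustration of the new performance endpoint, here is a minimal Python sketch that queries /api/extra/perf on a locally running server. The endpoint path comes from the notes above; the exact JSON fields returned are not documented here, so the script simply prints whatever the server reports.

    # Minimal sketch: query KoboldCpp's performance endpoint.
    # Assumes a server is already running on the default port 5001;
    # the response is printed as-is rather than assuming specific field names.
    import json
    import urllib.request

    def fetch_perf(base_url="http://localhost:5001"):
        with urllib.request.urlopen(f"{base_url}/api/extra/perf") as resp:
            return json.loads(resp.read().decode("utf-8"))

    if __name__ == "__main__":
        perf = fetch_perf()
        # Prompt processing and generation timings, per the release notes.
        print(json.dumps(perf, indent=2))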

1.35.H Henk-Cuda Hotfix: This is an alternative version from Henk that you can try if you encounter speed reductions. Please let me know if it's better for you.

Henk may have newer versions at https://github.com/henk717/koboldcpp/releases/tag/1.35, so please check there for now. I will only be able to upstream any fixes in a few days.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.34.2

07 Jul 11:41

This is a BIG update. Changes:

  • Added a brand-new customtkinter GUI which contains many more configurable settings. To use this new UI on Linux and OSX, the Python module customtkinter is required (it is already included with the Windows .exe builds); the old GUI remains available otherwise. (Thanks: @Vali-98)
  • Switched to NTK-aware scaling for RoPE, set based on the --contextsize parameter, with support for up to 8K context. This seems to perform much better than the previous dynamic linear method, even on untuned models. It still won't work perfectly for SuperHOT 8K, since that model requires a fixed 0.25 linear RoPE scale, but this approach seems better in general. Note that the chosen alpha value is applied when you select --contextsize, so for best results only set a large --contextsize if you actually need it; otherwise there will be a minor perplexity loss.
  • Enabled NTK-aware scaled RoPE for GPT-NeoX and GPT-J too! Surprisingly, long context works decently even with older models, so you can enjoy something like Pyg6B or Pythia with 4K context if you like.
  • Added /generate API support for sampler_order and the mirostat/tau/eta parameters, which you can now set per generation (see the sketch after this list). (Thanks: @ycros)
  • Added --bantokens, which lets you specify a list of token substrings that the AI cannot use. For example, --bantokens [ a ooo prevents the AI from using any left square brackets, the letter a, or any token containing ooo. This bans all instances of matching tokens!
  • Added more granular context size options, now you can select 3k and 6k context sizes as well.
  • Added the ability to select Main GPU to use when using CUDA. For example, --usecublas lowvram 2 will use the third Nvidia GPU if it exists.
  • Pulled updates from RWKV.cpp, minor speedup for prompt processing.
  • Fixed build issues on certain older platforms and on OSX; GCC 7 should now be supported. Please report any issues you find.
  • Pulled fixes and updates from upstream, and updated Kobold Lite. Kobold Lite now allows you to view submitted contexts after each generation, and also includes two new scenarios and limited support for Tavern v2 cards.
  • Adjusted scratch buffer sizes for big contexts, so unexpected segfaults/OOM errors should be less common (please report any you find). CUDA scratch buffers should also work better now (upstream fix).
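
As a minimal sketch of the new per-generation sampler parameters, the request below sends sampler_order and mirostat settings to the KoboldAI-compatible /api/v1/generate endpoint. The key names follow the release note's "sampler_order and mirostat/tau/eta", but the specific values shown are illustrative assumptions, not recommendations.

    # Minimal sketch: per-request sampler_order and mirostat settings.
    # The endpoint is the standard /api/v1/generate on the default port 5001;
    # the values below (sampler order, tau, eta) are illustrative only.
    import json
    import urllib.request

    payload = {
        "prompt": "Once upon a time,",
        "max_length": 64,
        "sampler_order": [6, 0, 1, 3, 4, 2, 5],  # example order only
        "mirostat": 2,                            # assumed: 0 disables mirostat
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1,
    }
    req = urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read().decode("utf-8")))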

1.34.1a Hotfix (CUDA): CUDA was completely broken, so I did a quick revert to get it working. A proper build will be uploaded later.
1.34.2 Hotfix: CUDA kernels are now updated to the latest version; Python is used to handle the GPU selection instead.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.33 Ultimate Edition

29 Jun 13:44

A.K.A The "We CUDA had it all edition"

  • The KoboldCpp Ultimate edition is an All-In-One release with previously missing CUDA features added in, with options to support both CL and CUDA properly in a single distributable. You can now select CUDA mode with --usecublas, and optionally low VRAM using --usecublas lowvram. This release also contains support for OpenBLAS, CLBlast (via --useclblast), and CPU-only (No BLAS) inference.
  • Backported CUDA support for all prior GGML file formats. CUDA mode now correctly supports every earlier version of GGML files: older quants from GGML, GGMF, and GGJT v1, v2 and v3 should load and work correctly, with the feature sets they had at the time of release.
  • Ported over the memory optimizations I added for OpenCL to CUDA, now CUDA will use less VRAM, and you may be able to use even more layers than upstream in llama.cpp (testing needed).
  • Ported over CUDA GPU acceleration via layer offloading for MPT, GPT-2, GPT-J and GPT-NeoX.
  • Updated Lite, pulled updates from upstream, various minor bugfixes. Also, instruct mode now allows any number of newlines in the start and end tag, configurable by user.
  • Added long context support using scaled RoPE for LLAMA, which you can use by setting --contextsize greater than 2048. It is based on the PR at ggml-org#2019 and should work reasonably well up to over 3K context, possibly higher (a launch sketch follows this list).
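
For reference, here is a small, hedged sketch of launching KoboldCpp with the flags described above from a Python script; the executable name, model path, and chosen context size are placeholders rather than recommendations.

    # Minimal sketch: launch KoboldCpp in CUDA mode with low VRAM and a
    # larger context (which enables scaled RoPE per the notes above).
    # The executable name and model path are placeholders.
    import subprocess

    cmd = [
        "koboldcpp.exe",           # on Linux/OSX: ["python", "koboldcpp.py", ...]
        "--usecublas", "lowvram",  # CUDA mode with the low-VRAM option
        "--contextsize", "4096",   # >2048 triggers scaled RoPE for LLAMA
        "model.ggml.bin",          # placeholder model file
    ]
    subprocess.run(cmd, check=True)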

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.32.3

22 Jun 09:18

  • Ported the optimized K-Quant CUDA kernels to OpenCL! This speeds up K-quant generation by about 15% with CL. (Special thanks: @0cc4m)
  • Implemented basic GPU offloading for MPT, GPT-2, GPT-J and GPT-NeoX via OpenCL! It still keeps a copy of the weights in RAM, but generation speed for these models should now be much faster! (50% speedup for GPT-J, and even WizardCoder is now 30% faster for me.)
  • Implemented scratch buffers for the latest versions of all non-llama architectures except RWKV (MPT, GPT-2, NeoX, GPT-J), BLAS memory usage should be much lower on average, and larger BLAS batch sizes will be usable on these models.
  • Merged GPT tokenizer improvements for non-llama models and added support for Starcoder's special added tokens. Coherence for non-llama models should be improved.
  • Updated Lite, pulled updates from upstream, various minor bugfixes.

1.32.1 Hotfix:

  • A number of bugs were fixed. These include memory allocation errors with OpenBLAS and errors in correctly recognizing the new MPT-30B model.

1.32.2 Hotfix:

  • Solves an issue with the MPT-30B vocab having missing words due to a problem with wide-string tokenization.
  • Solves an issue with LLAMA WizardLM-30B running out of memory near 2048 context at larger k-quants.

1.32.3 Hotfix:

  • Reverted the wstring changes, as they negatively affected model coherence.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.31.2

17 Jun 15:37

This is mostly a bugfix build, with some new features to Lite.

  • Better EOS token handling for Starcoder models.
  • Major Kobold Lite update, including new scenarios, a variety of bug fixes, italics chat text, customized idle message counts, and improved sentence trimming behavior.
  • Disabled RWKV sequence mode. Unfortunately, the speedups were too situational, and some users experienced speed regressions. Additionally, it was not compatible without modifying the ggml library to increase the max node counts, which had adverse impacts on other model architectures. Sequence mode will be disabled until it has been sufficiently improved upstream.
  • Displays the token generation rate in the console.

Update 1.31.1:

  • Cleaned up debug output, now only shows the server endpoint debugs if --debugmode is set. Also, no longer shows incoming horde prompts if --hordeconfig is set unless --debugmode is also enabled.
  • Fixed markdown rendering in Lite.

Update 1.31.2:

  • Allowed --hordeconfig to also specify the max context length advertised on the horde, which is separate from the real context length used to allocate memory.

koboldcpp-1.30.3

13 Jun 15:12

A.K.A The "Back from the dead" edition.

KoboldCpp Changes:

  • Added full OpenCL / CLBlast support for K-Quants, both prompt processing and GPU offloading for all K-quant formats (credits: @0cc4m)
  • Added RWKV Sequence Mode enhancements for over 3X FASTER prompt processing in RWKV (credits: @LoganDark)
  • Added support for the RWKV World Tokenizer and associated RWKV-World models. It will be automatically detected and selected as necessary.
  • Added a true SSE-streaming endpoint (Agnaistic compatible) that can stream tokens in realtime while generating. Integrators can find it at /api/extra/generate/stream. (Credits @SammCheese)
  • Added an enhanced polled-streaming endpoint that fetches in-progress results without disrupting generation; it is now the default for Kobold Lite when streaming in KoboldCpp. Integrators can find it at /api/extra/generate/check (see the polling sketch after this list). The old 8-token chunked streaming can still be enabled by setting the parameter streamamount=8 in the URL, and the original KoboldAI United compatible /api/v1/generate endpoint is still available.
  • Added a new abort endpoint at /api/extra/abort which aborts any in-progress generation without stopping the server. It has been integrated into Lite, by pressing the "abort" button below the Submit button.
  • Added support for lora base, which is now added as an optional second parameter e.g. --lora [lora_file] [base_model]
  • Updated to latest Kobold Lite (required for new endpoints).
  • Pulled other various enhancements from upstream, plus a few RWKV bugfixes.
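
The sketch below exercises the new polled-streaming and abort endpoints from Python. The endpoint paths come from the notes above; the payload fields and the background-thread approach are illustrative assumptions.

    # Minimal sketch: start a generation, poll the in-progress result,
    # then abort it. Endpoint paths are from the release notes; the JSON
    # fields and threading approach here are illustrative.
    import json
    import threading
    import time
    import urllib.request

    BASE = "http://localhost:5001"

    def post(path, payload=None):
        req = urllib.request.Request(
            BASE + path,
            data=json.dumps(payload or {}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    # Kick off a generation in the background via the standard endpoint.
    gen = threading.Thread(
        target=post,
        args=("/api/v1/generate", {"prompt": "Write a long story.", "max_length": 200}),
    )
    gen.start()

    # Poll the in-progress text a few times without disrupting generation.
    for _ in range(3):
        time.sleep(1)
        print(post("/api/extra/generate/check"))

    # Abort whatever is still generating, then wait for the request to return.
    print(post("/api/extra/abort"))
    gen.join()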

1.30.2 Hotfix - Added a fix for RWKV crashing in sequence mode, pulled upstream bugfixes, and rebuilt the CUDA version. For those wondering why a CUDA exe is not always included: apart from its size, extra dependencies, and Nvidia-only support, it is also a pain for me to build, since it can only be done in a dev environment with the CUDA toolkit and Visual Studio on Windows.

1.30.3 Hotfix - Disabled RWKV seq mode for now, due to multiple complaints about speed and memory issues with bigger quantized models. I will keep a copy of 1.30.2 here in case anyone still wants it.

CUDA Bonus

Bonus: An alternative CUDA build has also been provided for this version, capable of running all latest formats including K-Quants. To use, download and run the koboldcpp_CUDA_only.exe, which is a one-file pyinstaller.

Extra Bonus: CUDA now also supports the older ggjtv2 models as well, as support has been back ported in! Note that CUDA builds will still not be generated by default, and support for them will be limited.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.29

07 Jun 08:32

KoboldCpp Changes:

  • Added BLAS batch size to the KoboldCpp Easy Launcher GUI.
  • Merged the upstream K-quantization implementations for OpenBLAS. Note that the new K-quants are not yet supported in CLBlast; please remain on the regular quantization formats to use CLBlast for now.
  • Fixed LLAMA 3B OOM errors and a few other OOMs.
  • Multiple bugfixes and improvements in Lite, including streaming for aesthetic chat mode.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.28

04 Jun 14:45

KoboldCpp Changes:

  • NEW: Added support for MPT models! Note that to use larger context lengths, remember to set it with --contextsize. Values up to around 5000 context tokens have been tested successfully.
  • The KoboldCpp Easy Launcher GUI has been enhanced! You can now set the number of CLBlast GPU layers in the GUI, as well as the number of threads to use. Additional toggles have also been added.
  • Added a more efficient memory allocation to CLBlast! You should be able to offload more layers than before.
  • The flag --renamemodel has been renamed (lol) to --hordeconfig and now accepts 2 parameters, the horde name to display, and the advertised max generation length on horde.
  • Fixed memory issues with Starcoder models. They still don't work very well with BLAS, especially on lower-RAM devices, so you might want to use a smaller --blasbatchsize with them, such as 64 or 128.
  • Added the option to use --blasbatchsize -1, which disables BLAS but still allows you to use GPU layer offloading in CLBlast. This means that if you don't use BLAS, you can offload EVEN MORE LAYERS and generate even faster (at the expense of slow prompt processing); a launch sketch follows this list.
  • Minor tweaks and adjustments to defaults settings.
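
As with the launch example in the 1.33 notes, here is a hedged sketch combining the flags mentioned above for an MPT model. The model path and the specific values are placeholders, and the --gpulayers/--useclblast flags are assumed to be the usual layer-offload options for this version.

    # Minimal sketch: launch an MPT model with a larger context, GPU layer
    # offloading via CLBlast, and BLAS disabled (--blasbatchsize -1).
    # The model path, device indices, and layer count are placeholders.
    import subprocess

    cmd = [
        "koboldcpp.exe",           # on Linux/OSX: ["python", "koboldcpp.py", ...]
        "--contextsize", "4096",   # MPT tested up to around 5000 tokens
        "--useclblast", "0", "0",  # CLBlast platform/device indices (assumed)
        "--gpulayers", "20",       # placeholder layer count (assumed flag)
        "--blasbatchsize", "-1",   # disable BLAS, keep layer offloading
        "mpt-model.ggml.bin",      # placeholder model file
    ]
    subprocess.run(cmd, check=True)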

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.27

01 Jun 03:13

KoboldCpp Changes:

  • Integrated the CLBlast GPU offloading improvements from @0cc4m, which allow a layer to be fully stored in VRAM instead of keeping a duplicate copy in RAM. As a result, offloading GPU layers will reduce overall RAM usage.
  • Pulled upstream support for OpenLlama 3B models.
  • Added support for the new version of RWKV.cpp models (v101) from @saharNooby, which uses the updated GGML library and is smaller and faster. Both the older and newer quantization formats are still supported automatically and remain backwards compatible.
  • Added support for EOS tokens in RWKV
  • Updated Kobold Lite. One new and exciting feature is AutoGenerated Memory, which performs a text summary on your story to generate a short memory with a single click. Works best on instruct models.
  • Allowed users to rename their displayed model name, intended for use on the horde. Using --renamemodel lets you change the default name to any string, with an added koboldcpp/ prefix as suggested by Henky.
  • Fixed some build errors on some versions of OSX and Linux

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.26

27 May 10:06

KoboldCpp Changes:

  • NEW! Now, you can view Token Probabilities when using --debugmode. When enabled, for every generated token, the console will display the probabilities of up to 4 alternative possible tokens. Good way to know how biased/confident/overtrained a model is. The probability percentage values shown are after all the samplers have been applied, so it's also a great way to test your sampler configurations to see how good they are. --debugmode also displays the contents of your input and context, as well as their token IDs. Note that using --debugmode has a slight performance hit, so it is off by default.
  • NEW! The Top-A sampler has been added! This is my own implementation of a special Kobold-exclusive sampler that does not exist in the upstream llama.cpp repo. This sampler reduces the randomness of the AI whenever the probability of one token is much higher than all the others, using a threshold proportional to the squared softmax probability of the most probable token. Higher values have a stronger effect (set this value to 0 to disable it). A rough sketch of the idea follows this list.
  • Added support for the Starcoder and Starcoder Chat models.
  • Cleaned up and slightly refactored the sampler code; EOS stop tokens should now work for all model types (use --unbantokens to enable them). Additionally, the left square bracket [ token is no longer banned by default, as modern models don't really need it and the token IDs were inconsistent across architectures.
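
To make the Top-A behaviour concrete, here is a rough Python sketch of the idea (not KoboldCpp's actual implementation): tokens whose softmax probability falls below top_a times the squared probability of the most likely token are discarded before sampling.

    # Illustrative sketch of Top-A sampling (not KoboldCpp's actual code):
    # discard tokens with probability below top_a * (p_max ** 2), then
    # renormalize and sample. top_a = 0 keeps every token (filter disabled).
    import math
    import random

    def top_a_sample(logits, top_a=0.2):
        # Softmax over the raw logits.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Threshold scales with the squared probability of the top token,
        # so a very confident distribution prunes more aggressively.
        threshold = top_a * max(probs) ** 2
        kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]

        # Sample from the renormalized survivors.
        r = random.random() * sum(p for _, p in kept)
        for i, p in kept:
            r -= p
            if r <= 0:
                return i
        return kept[-1][0]

    # A confident distribution collapses toward the dominant token.
    print(top_a_sample([5.0, 1.0, 0.5, 0.1]))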

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.