Releases: LostRuins/koboldcpp
koboldcpp-1.39.1
- Fix SSE streaming to handle headers correctly during abort (Credits: @duncannah)
- Bugfix for `--blasbatchsize -1` and `1024` (fix alloc blocks error)
- Added experimental support for `--blasbatchsize 2048` (note: buffers are doubled if that is selected, using much more memory)
- Added support for 12k and 16k `--contextsize` options (see the example launch command after this list). Please let me know if you encounter issues.
- Pulled upstream improvements, further CUDA speedups for MMQ mode for all quant types.
- Fix for some LLAMA 65B models being detected as LLAMA2 70B models.
- Revert to upstream approach for CUDA pool malloc (1.39.1 - done only for MMQ).
- Updated Lite, includes adding support for importing Tavern V2 card formats, with world info (character book) and clearer settings edit boxes.
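As a reference, here is a hypothetical launch command combining the new options. The model filename is a placeholder, and the 16k preset is assumed to correspond to 16384 tokens; check `--help` for the exact values and for how your build expects the model to be passed:

```
koboldcpp.exe mymodel.ggml --contextsize 16384 --blasbatchsize 2048
```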
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.38
- Added upstream support for Quantized MatMul (MMQ) prompt processing, a new option for CUDA (enabled by adding `--usecublas mmq` or toggling it in the GUI); see the example launch command after this list. This uses slightly less memory, and is slightly faster for Q4_0 but slower for K-quants.
- Fixed SSE streaming for multibyte characters (for Tavern compatibility)
- `--noavx2` mode now does not use OpenBLAS (same as Failsafe); this is due to numerous compatibility complaints.
- GUI dropdown preset only displays built platforms (Credit: @YellowRoseCx)
- Added a Help button in the GUI
- Fixed an issue with mirostat not reading correct value from GUI
- Fixed an issue with context size slider being limited to 4096 in the GUI
- Displays a terminal warning if received context exceeds max launcher allocated context
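For reference, a hypothetical launch command enabling the new MMQ mode (the model filename is a placeholder):

```
koboldcpp.exe mymodel.ggml --usecublas mmq
```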
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.37.1
- NEW: KoboldCpp now comes with an embedded Horde Worker, which allows anyone to share their ggml models with the AI Horde without downloading additional dependencies. `--hordeconfig` now accepts 5 parameters, `[hordemodelname] [hordegenlength] [hordemaxctx] [hordeapikey] [hordeworkername]`; filling in all 5 will start a Horde worker for you that serves horde requests automatically in the background (see the example after this list). For the previous behavior, exclude the last 2 parameters to continue using your own Horde worker (e.g. HaidraScribe/KAIHordeBridge). This feature can also be enabled via the GUI.
- Added support for LLAMA2 70B models. This should work automatically; GQA will be set to 8 if it's detected.
- Fixed a bug with mirostat v2 that was causing overly deterministic results. Please try it again. (Credit: @ycros)
- Added additional information to `/api/extra/perf` for the last generation, including the stopping reason as well as generated token counts.
- Exposed the parameter for `--tensor_split`, which works exactly like it does upstream. Only for CUDA.
- Try to support Kepler as a target for CUDA as well, on henky's suggestion. Can't guarantee it will work as I don't have a K80, but it might.
- Retained support for `--blasbatchsize 1024` after it was removed upstream. Scratch & KV buffer sizes will be larger when using this.
- Minor bugfixes, pulled other upstream fixes and optimizations, updated Kobold Lite (chat mode improvements)
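For illustration, a hypothetical embedded Horde worker launch. Every value here (model path, advertised model name, gen length, max context, API key, worker name) is a placeholder to be replaced with your own:

```
koboldcpp.exe mymodel.ggml --hordeconfig MyModel/13B 256 2048 YOUR_HORDE_API_KEY MyWorkerName
```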
Hotfix 1.37.1
- Fixed clblast to work correctly for LLAMA2 70B
- Fixed sending Client-Agent for embedded horde worker in addition to Bridge Agent and User Agent
- Changed `rms_norm_eps` to `5e-6` for better results for both llama1 and 2
- Fixed some streaming bugs in Lite
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.36
- Reverted an upstream change to `sched_yield()` that caused slowdowns for certain systems. This should fix speed regressions in 1.35. If you're still experiencing poorer speeds compared to earlier versions, please raise an issue with details.
- Reworked command line args on RoPE for extended context to be similar to upstream. Thus, `--linearrope` has been removed. Instead, you can now use `--ropeconfig` to customize both the RoPE frequency scale (Linear) and RoPE frequency base (NTK-Aware) values, e.g. `--ropeconfig 0.5 10000` for a 2x linear scale. By default, long context NTK-Aware RoPE will be automatically configured based on your `--contextsize` parameter, similar to previously. If you're using LLAMA2 at 4K context, you'd probably want to use `--ropeconfig 1.0 10000` to take advantage of the native 4K tuning without scaling (see the examples after this list). For ease of use, this can be set in the GUI too.
- Expose additional token counter information through the API `/api/extra/perf`
- The warning for poor sampler orders has been limited to show only once per session, and excludes mirostat. I've heard some people have issues with it, so please let me know if it's still causing problems, though it's only a text warning and should not affect actual operation.
- Model busy flag replaced by Thread Lock, credits @ycros.
- Tweaked scratch and KV buffer allocation sizes for extended context.
- Updated Kobold Lite, with better whitespace trim support and a new toggle for partial chat responses.
- Pulled other upstream fixes and optimizations.
- Downgraded CUDA windows libraries to 11.4 for smaller exe filesizes, same version previously tried by @henk717. Please do report any issues or regressions encountered with this version.
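A few hypothetical launch commands showing `--ropeconfig`; the model filenames are placeholders, and the values are taken from the notes above:

```
# rely on the automatic NTK-aware RoPE configuration for 8k context
koboldcpp.exe mymodel.ggml --contextsize 8192
# SuperHOT 8k finetune: fixed 0.25 linear scale as the finetune suggests
koboldcpp.exe superhot-8k.ggml --contextsize 8192 --ropeconfig 0.25 10000
# LLAMA2 at its native 4k context: no scaling
koboldcpp.exe llama2-model.ggml --contextsize 4096 --ropeconfig 1.0 10000
```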
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.35
Note: This build adds significant changes for CUDA and may be less stable than normal - please report any performance regressions or bugs you encounter. It may be slower than usual. If that is the case, please use the previous version for now.
- Enabled the CUDA 8Bit MMV mode (see ggerganov#2067), now that it seems stable enough and works correctly. This approach uses quantized dot products instead of the traditional DMMV approach for the formats `q4_0`, `q4_1`, `q5_0` and `q5_1`. If you're able to do a full GPU offload, then CUDA for such models will likely be significantly faster than before. K-quants and CL are not affected.
- Exposed performance information through the API (prompt processing and generation timing), access it with `/api/extra/perf` (see the example request after this list).
- Added support for linear RoPE as an alternative to NTK-Aware RoPE (similar to in 1.33, but using 2048 as a base). This is triggered by the launcher parameter `--linearrope`. The RoPE scale is determined by the `--contextsize` parameter, thus for best results on SuperHOT models, you should launch with `--linearrope --contextsize 8192`, which provides a `0.25` linear scale as the SuperHOT finetune suggests. If `--linearrope` is not specified, then NTK-aware RoPE is used by default.
- Added a Save and Load settings option to the GUI launcher.
- Added the ability to select "All Devices" in the GUI for CUDA. You are still recommended to select a specific device - split GPU is usually slower.
- Displays a warning if poor sampler orders are used, as the default configuration will give much better results.
- Updated Kobold Lite, pulled other upstream fixes and optimizations.
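A minimal sketch of querying the new performance endpoint with curl, assuming the default port. Treating it as a plain GET is an assumption; inspect the returned JSON for the actual field names (prompt processing time, generation timing, etc. per the notes above):

```
curl http://localhost:5001/api/extra/perf
```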
1.35.H Henk-Cuda Hotfix: This is an alternative version from Henk that you can try if you encounter speed reductions. Please let me know if it's better for you.
Henk may have newer versions at https://github.com/henk717/koboldcpp/releases/tag/1.35 please check that out for now. I will be able to upstream any fixes only in a few days.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.34.2
This is a BIG update. Changes:
- Added a brand new `customtkinter` GUI which contains many more configurable settings. To use this new UI, the python module `customtkinter` is required for Linux and OSX (already included with windows .exe builds). The old GUI is still available otherwise. (Thanks: @Vali-98)
- Switched to NTK-aware scaling for RoPE, set based on the `--contextsize` parameter, with support up to 8K context. This seems to perform much better than the previous dynamic linear method, even on untuned models. It still won't work perfectly for SuperHOT 8K, as that model requires a fixed 0.25 linear rope scale, but I think this approach is better in general. Note that the alpha value chosen is applied when you select the `--contextsize`, so for best results, only set a big `--contextsize` if you need it, since there will be minor perplexity loss otherwise.
- Enabled support for NTK-Aware scaled RoPE for GPT-NeoX and GPT-J too! And surprisingly, long context does work decently with older models, so you can enjoy something like Pyg6B or Pythia with 4K context if you like.
- Added `/generate` API support for sampler_order and mirostat/tau/eta parameters, which you can now set per-generation (see the example request after this list). (Thanks: @ycros)
- Added `--bantokens`, which allows you to specify a list of token substrings that the AI cannot use. For example, `--bantokens [ a ooo` prevents the AI from using any left square brackets, the letter `a`, or any token containing `ooo`. This bans all instances of matching tokens!
- Added more granular context size options, now you can select 3k and 6k context sizes as well.
- Added the ability to select the Main GPU to use when using CUDA. For example, `--usecublas lowvram 2` will use the third Nvidia GPU if it exists.
- Pulled updates from RWKV.cpp, minor speedup for prompt processing.
- Fixed build issues on certain older and OSX platforms, GCC 7 should now be supported. Please report any that you find.
- Pulled fixes and updates from upstream, updated Kobold Lite. Kobold Lite now allows you to view submitted contexts after each generation. Also includes two new scenarios and limited support for Tavern v2 cards.
- Adjusted scratch buffer sizes for big contexts, so unexpected segfaults/OOM errors should be less common (please report any you find). CUDA scratch buffers should also work better now (upstream fix).
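A sketch of a per-request generation call using the new sampler parameters, assuming the default port. The JSON field names (`prompt`, `max_length`, `sampler_order`, `mirostat`, `mirostat_tau`, `mirostat_eta`) follow my reading of the KoboldAI United API and the sampler order shown is only an illustrative value, so verify both against the API documentation:

```
curl -X POST http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Once upon a time,",
        "max_length": 80,
        "sampler_order": [6, 0, 1, 3, 4, 2, 5],
        "mirostat": 2,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1
      }'
```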
1.34.1a Hotfix (CUDA): Cuda was completely broken, did a quick revert to get it working. Will upload a proper build later.
1.34.2 Hotfix: CUDA kernels now updated to latest version, used python to handle the GPU selection instead.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.33 Ultimate Edition
A.K.A The "We CUDA had it all edition"
- The KoboldCpp Ultimate edition is an All-In-One release with previously missing CUDA features added in, with options to support both CL and CUDA properly in a single distributable. You can now select CUDA mode with `--usecublas`, and optionally low VRAM using `--usecublas lowvram` (see the example after this list). This release also contains support for OpenBLAS, CLBlast (via `--useclblast`), and CPU-only (No BLAS) inference.
- Back-ported CUDA support for all prior versions of GGML file formats. CUDA mode now correctly supports every single earlier version of GGML files (earlier quants from GGML, GGMF, GGJT v1, v2 and v3, with the respective feature sets at the time they were released, should load and work correctly).
- Ported over the memory optimizations I added for OpenCL to CUDA, now CUDA will use less VRAM, and you may be able to use even more layers than upstream in llama.cpp (testing needed).
- Ported over CUDA GPU acceleration via layer offloading for MPT, GPT-2, GPT-J and GPT-NeoX.
- Updated Lite, pulled updates from upstream, various minor bugfixes. Also, instruct mode now allows any number of newlines in the start and end tag, configurable by user.
- Added long context support using Scaled RoPE for LLAMA, which you can use by setting `--contextsize` greater than 2048. It is based off the PR here ggerganov#2019 and should work reasonably well up to over 3k context, possibly higher.
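For reference, a hypothetical launch command selecting CUDA in low-VRAM mode together with an extended context (the model filename is a placeholder):

```
koboldcpp.exe mymodel.ggml --usecublas lowvram --contextsize 4096
```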
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the `--help` flag.
koboldcpp-1.32.3
- Ported the optimized K-Quant CUDA kernels to OpenCL! This speeds up K-Quants generation speed by about 15% with CL (Special thanks: @0cc4m)
- Implemented basic GPU offloading for MPT, GPT-2, GPT-J and GPT-NeoX via OpenCL! It still keeps a copy of the weights in RAM, but generation speed for these models should now be much faster (50% speedup for GPT-J, and even WizardCoder is now 30% faster for me); see the example after this list.
- Implemented scratch buffers for the latest versions of all non-llama architectures except RWKV (MPT, GPT-2, NeoX, GPT-J). BLAS memory usage should be much lower on average, and larger BLAS batch sizes will be usable on these models.
- Merged GPT-Tokenizer improvements for non-llama models. Supports Starcoder special added tokens. Coherence for non-llama models should be improved.
- Updated Lite, pulled updates from upstream, various minor bugfixes.
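A hypothetical OpenCL launch for one of these models. The model filename is a placeholder, the two numbers after `--useclblast` are assumed to be the platform and device IDs, and the layer-offload flag shown is an assumption, so confirm the exact flags with `--help`:

```
koboldcpp.exe gpt-j-6b.ggml --useclblast 0 0 --gpulayers 20
```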
1.32.1 Hotfix:
- A number of bugs were fixed. These include memory allocation errors with OpenBLAS, and errors recognizing the new MPT-30B model correctly.
1.32.2 Hotfix:
- Solves an issue with the MPT-30B vocab having missing words due to a problem with wide-string tokenization.
- Solves an issue with LLAMA WizardLM-30B running out of memory near 2048 context at larger k-quants.
1.32.3 Hotfix:
- Reverted wstring changes, they negatively affected model coherency.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the `--help` flag.
koboldcpp-1.31.2
This is mostly a bugfix build, with some new features to Lite.
- Better EOS token handling for Starcoder models.
- Major Kobold Lite update, including new scenarios, a variety of bug fixes, italics chat text, customized idle message counts, and improved sentence trimming behavior.
- Disabled RWKV sequence mode. Unfortunately, the speedups were too situational, and some users experienced speed regressions. Additionally, it was not compatible without modifying the ggml library to increase the max node counts, which had adverse impacts on other model architectures. Sequence mode will be disabled until it has been sufficiently improved upstream.
- Display token generation rate in console
Update 1.31.1:
- Cleaned up debug output, now only shows the server endpoint debugs if `--debugmode` is set. Also, no longer shows incoming horde prompts if `--hordeconfig` is set, unless `--debugmode` is also enabled.
- Fixed markdown in Lite
Update 1.31.2:
- Allowed `--hordeconfig` to specify the max context length allowed in horde too, which is separate from the real context length used to allocate memory.
koboldcpp-1.30.3
A.K.A The "Back from the dead" edition.
KoboldCpp Changes:
- Added full OpenCL / CLBlast support for K-Quants, both prompt processing and GPU offloading for all K-quant formats (credits: @0cc4m)
- Added RWKV Sequence Mode enhancements for over 3X FASTER prompt processing in RWKV (credits: @LoganDark)
- Added support for the RWKV World Tokenizer and associated RWKV-World models. It will be automatically detected and selected as necessary.
- Added a true SSE-streaming endpoint (Agnaistic compatible) that can stream tokens in realtime while generating. Integrators can find it at `/api/extra/generate/stream`. (Credits: @SammCheese)
- Added an enhanced polled-streaming endpoint to fetch in-progress results without disrupting generation, which is now the default for Kobold Lite when using streaming in KoboldCpp. Integrators can find it at `/api/extra/generate/check`. The old 8-token-chunked-streaming can still be enabled by setting the parameter `streamamount=8` in the URL. Also, the original KoboldAI United compatible `/api/v1/generate` endpoint is still available. (See the curl sketches after this list.)
- Added a new abort endpoint at `/api/extra/abort`, which aborts any in-progress generation without stopping the server. It has been integrated into Lite, by pressing the "abort" button below the Submit button.
- Added support for lora base, which is now added as an optional second parameter, e.g. `--lora [lora_file] [base_model]`
- Updated to latest Kobold Lite (required for new endpoints).
- Pulled other various enhancements from upstream, plus a few RWKV bugfixes .
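A minimal sketch of how an integrator might exercise the new endpoints with curl, assuming the default port. The HTTP methods and JSON field names follow my reading of the notes above and the KoboldAI United conventions, so verify them against the actual API before relying on them:

```
# start a generation and stream tokens back over SSE (-N disables curl buffering)
curl -N -X POST http://localhost:5001/api/extra/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time,", "max_length": 80}'

# from another terminal: poll the in-progress text without disrupting generation
curl -X POST http://localhost:5001/api/extra/generate/check

# abort the in-progress generation without stopping the server
curl -X POST http://localhost:5001/api/extra/abort
```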
1.30.2 Hotfix - Added a fix for RWKV crashing in seq mode, pulled upstream bugfixes, rebuilt the CUDA version. For those wondering why a CUDA exe version is not always included: apart from size, dependencies, and only supporting Nvidia, it's partially also because it's a pain for me to build, since it can only be done in a dev environment with the CUDA toolkit and Visual Studio on Windows.
1.30.3 Hotfix - Disabled RWKV seq mode for now, due to multiple complaints about speed and memory issues with bigger quantized models. I will keep a copy of 1.30.2 here in case anyone still wants it.
CUDA Bonus
Bonus: An alternative CUDA build has also been provided for this version, capable of running all latest formats including K-Quants. To use, download and run the koboldcpp_CUDA_only.exe, which is a one-file pyinstaller.
Extra Bonus: CUDA now also supports the older ggjtv2 models as well, as support has been back ported in! Note that CUDA builds will still not be generated by default, and support for them will be limited.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the `--help` flag.