Releases: LostRuins/koboldcpp
koboldcpp-1.49
- New API feature: Split Memory - The generation payload also supports a new field `memory` in addition to the usual `prompt` field. If set, forcefully appends this string to the beginning of any submitted prompt text. If the resulting context exceeds the limit, forcefully overwrites text from the beginning of the main prompt until it can fit. Useful to guarantee full memory insertion even when you cannot determine the exact token count. Automatically used in Lite. (See the request sketch after this list.)
- New API feature: `trim_stop` can be added to the generate payload. If true, removes detected stop_sequences from the output and truncates all text after them. Does not work with SSE streaming.
- New API feature: `--preloadstory` now allows you to specify a JSON file (such as a story savefile) when launching the server. This file will be hosted on the server at `/api/extra/preloadstory`, which frontends (such as Kobold Lite) can access over the API.
- Pulled various improvements and fixes from upstream llama.cpp.
- Updated Kobold Lite, added new TTS options and fixed some bugs with the Retry button when Aborting. Added support for World Info inject position, split memory and preloaded stories. Also added support for optional image generation using DALL-E 3 (OAI API).
- Fixed KoboldCpp Colab prebuilts crashing on some older Colab CPUs. It should now also work on A100 and V100 GPUs in addition to the free tier T4s. If it fails, try enabling the ForceRebuild checkbox. The `LLAMA_PORTABLE=1` makefile flag can now be used when making builds that target Colab or Docker.
- Various other minor fixes.
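The two new payload fields can be combined in an ordinary generate request. A minimal sketch (not taken verbatim from the docs), assuming the standard Kobold generate endpoint at `/api/v1/generate` and the usual `results[0].text` response shape:

```python
# Minimal sketch: a generate call using the new `memory` and `trim_stop` fields.
# Assumes a local koboldcpp server on the default port 5001.
import requests

payload = {
    "prompt": "The knight drew his sword and",
    # Always inserted at the very start of the context; if the total would exceed the
    # context limit, the beginning of `prompt` is overwritten instead.
    "memory": "[Setting: a ruined castle at midnight.]\n",
    "max_length": 80,
    "stop_sequence": ["\nYou:"],
    # Strip any detected stop sequence (and everything after it) from the returned text.
    "trim_stop": True,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```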
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.48.1
Harder Better Faster Stronger Edition
- NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.
- Note: Context Shifting is enabled by default, and will override smartcontext if both are enabled. Context Shifting still needs more testing. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag `--noshift`. If you observe a bug, please report an issue or send a PR fix.
- 'Tensor Core' Changes: KoboldCpp now handles MMQ/Tensor Cores differently from upstream. Here's a breakdown:
- old approach (everybody): if mmq is enabled, just use mmq. If cublas is enabled, just use cublas. MMQ dimensions set to "FAVOR BIG"
- new approach (upstream llama.cpp): you cannot toggle mmq anymore. It is always enabled. MMQ dimensions set to "FAVOR SMALL". CuBLAS always kicks in if batch > 32.
- new approach (koboldcpp): you CAN toggle MMQ. It is always enabled, until batch > 32, then CuBLAS only kicks in if MMQ flag is false, otherwise it still uses MMQ for all batches. MMQ dimensions set to "FAVOR BIG".
- Added GPU Info Display and Auto GPU Layer Selection For Newbies - Uses a combination of `clinfo` and `nvidia-smi` queries to automatically determine and display the user's GPU name in the GUI, and suggests the number of GPU layers to use when first choosing a model, based on available VRAM and model file sizes. Not optimal, but it should give usable defaults and be even more newbie friendly now. You can thereafter edit the actual GPU layers to use. (Credit: Original concept adapted from @YellowRoseCx)
- Added Min-P sampler - It is now available over the API, and can also be set in Lite from the Advanced settings tab. (Credit: @kalomaze) A request sketch follows at the end of this release's notes.
- Added `--remotetunnel` flag, which downloads and creates a TryCloudFlare remote tunnel, allowing you to access koboldcpp remotely over the internet even behind a firewall. Note: This downloads a tool called Cloudflared to the same directory.
- Added a new build target for Windows exe users, `koboldcpp_clblast_noavx2`, now providing a "CLBlast NoAVX2 (Old CPU)" option that finally supports CLBlast acceleration for Windows devices without AVX2 intrinsics.
- Include `Content-Length` header in responses.
- Fixed some crashes with other uncommon models in CUDA mode.
- Retained support for GGUFv1, but you're encouraged to update as upstream has removed support.
- Minor tweaks and optimizations to streaming timings. Fixed a segfault that occurred when streaming in multiuser mode and aborting the connection halfway.
- `freq_base_train` is now taken into account when setting the automatic RoPE scale, which should handle CodeLlama correctly now.
- Updated Kobold Lite, added support for selecting Min-P and Sampler Seeds (for proper deterministic generation).
- Improved KoboldCpp Colab, now with prebuilt CUDA binaries. Time to load after launch is less than a minute, excluding model downloads. Added a few more default model options, you can also use any custom GGUF model URL. (Try it here!)
Hotfix 1.48.1 - Fixed issues with Multi-GPU setups. GUI defaults to CuBLAS if available. Other minor fixes
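As a rough illustration of the new sampler over the API - the `min_p` and `sampler_seed` field names below are assumptions based on the usual Kobold generate payload, not spelled out in these notes:

```python
# Minimal sketch: requesting Min-P sampling (and a fixed sampler seed) over the API.
import requests

payload = {
    "prompt": "Write a haiku about autumn.\n",
    "max_length": 60,
    "min_p": 0.1,        # discard tokens whose probability is below 10% of the top token's
    "temperature": 1.0,
    "sampler_seed": 42,  # fixed seed for reproducible sampling (assumed field name)
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```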
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.47.2
- Added OpenAI optional adapter from #466 (thanks @lofcz). This is an unofficial extension of the v1 OpenAI Chat Completions endpoint that allows customization of the instruct tags over the API. The Kobold API still provides better functionality and flexibility overall. (An illustrative request sketch follows these notes.)
- Pulled upstream support for ChatML added-token merges (they have to be from a correctly converted GGUF model though; overall, ChatML is still an inferior prompt template compared to Alpaca/Vicuna/LLAMA2).
- Embedded Horde Worker improvements: Added auto-recovery pause timeout on too many errors, instead of halting the worker outright. The worker will still be halted if the total error count exceeds a high enough threshold.
- Bug fixes for a multiuser race condition in polled streaming and for Top-K values being clamped (thanks @raefu @kalomaze)
- Improved server CORS and content-type handling.
- Added GUI input for tensor_split fields (thanks @AAbushady)
- Fixed support for GGUFv1 Falcon models, which was broken due to the upstream rewrite of the BPE tokenizer.
- Pulled other fixes and optimizations from upstream
- Updated KoboldCpp Colab, now with the new Tiefighter model (try it here)
Hotfix 1.47.1 - Fixed a race condition with SSE streaming. Tavern streaming should be reliable now.
Hotfix 1.47.2 - Fixed an issue with older multilingual GGUFs needing an alternate BPE tokenizer.
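For illustration only, a sketch of a request against the OpenAI-compatible endpoint; the `adapter` object and its key names are hypothetical placeholders for the instruct-tag customization from #466, so check the PR or the `/api` docs for the real field names:

```python
# Illustrative sketch only: Chat Completions request against koboldcpp's OpenAI-compatible
# endpoint. The "adapter" object below is a hypothetical placeholder, not a documented schema.
import requests

body = {
    "model": "koboldcpp",  # model name is not used for routing by koboldcpp
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Name three moons of Jupiter."},
    ],
    "max_tokens": 64,
    # Hypothetical instruct-tag overrides (key names are assumptions):
    "adapter": {"user_start": "### Instruction:\n", "assistant_start": "### Response:\n"},
}
resp = requests.post("http://localhost:5001/v1/chat/completions", json=body)
print(resp.json()["choices"][0]["message"]["content"])
```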
Updates for Embedded Kobold Lite:
- SSE streaming for Kobold Lite has been implemented! It requires a relatively recent browser. Toggle it on in settings.
- Added Browser Storage Save Slots! You can now directly save stories within the browser session itself. This is intended to be a temporary storage allowing you to swap between and try multiple stories - the browser storage is wiped when the browser cache/history is cleared!
- Added World Info Search Depth
- Added Group Chat Management Panel (You can temporarily toggle the participants in a group chat)
- Added AUTOMATIC1111 integration! It's finally here: you can now generate images from a local A1111 install, as an alternative to Horde.
- Lots of miscellaneous fixes and improvements. If you encounter any issues, do report them here.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.46.1
Important: Deprecation Notice for KoboldCpp 1.46
- The following command line arguments are deprecated and have been removed from this version on.
--psutil_set_threads - parameter will be removed as it's now generally unhelpful, the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat mirostat_tau and mirostat_eta json fields.
- Removed the original deprecated tkinter GUI, now only the new customtkinter GUI remains.
- Improved embedded horde worker: added even more session stats; job pulls and job submits are now done in parallel, so it should run about 20% faster for horde requests.
- Changed the default model name from `concedo/koboldcpp` to `koboldcpp/[model_filename]`. This does prevent old "Kobold AI-Client" users from connecting via the API, so if you're still using that, either switch to a newer client or connect via the Basic/OpenAI API instead of the Kobold API.
- Added proper API documentation, which can be found by navigating to `/api`, or the web one at https://lite.koboldai.net/koboldcpp_api
- Allow .kcpps files to be drag & dropped, as well as working via OpenWith in Windows.
- Added a new OpenAI Chat Completions compatible endpoint at `/v1/chat/completions` (credit: @teddybear082).
- `--onready` processes are now started with subprocess.run instead of Popen (#462)
- Both `/check` and `/abort` can now function together with multiuser mode, provided the correct `genkey` is used by the client (automatically handled in Lite). A sketch of genkey usage follows these notes.
- Allow 64k `--contextsize` (for GGUF only, still 16k otherwise).
- Minor UI fixes and enhancements.
- Updated Lite, pulled fixes and improvements from upstream.
v1.46.1 hotfix: fixed an issue where blasthreads was used for values between 1 and 32 tokens.
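A minimal sketch of how a client can tie `/check` and `/abort` to its own request via `genkey`; the endpoint paths `/api/extra/generate/check` and `/api/extra/abort` are assumed from the existing extra API:

```python
# Minimal sketch: using a genkey so /check and /abort only affect your own request
# in multiuser mode.
import threading, requests

BASE = "http://localhost:5001"
genkey = "KCPP1234"  # any unique key; Lite generates one automatically

def generate():
    requests.post(f"{BASE}/api/v1/generate",
                  json={"prompt": "Once upon a time", "max_length": 200, "genkey": genkey})

t = threading.Thread(target=generate)
t.start()

# Poll the partial text for *this* request only, then abort it.
partial = requests.post(f"{BASE}/api/extra/generate/check", json={"genkey": genkey}).json()
print(partial)
requests.post(f"{BASE}/api/extra/abort", json={"genkey": genkey})
t.join()
```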
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.45.2
- Improved embedded horde worker: more responsive, and added Session Stats (Total Kudos Earned, EarnRate, Timings)
- Added a new parameter to the grammar sampler API, `grammar_retain_state`, which lets you persist the grammar state across multiple requests. (See the grammar sketch after this list.)
- Allow launching by picking a .kcpps file in the file selector GUI combined with `--skiplauncher`. That settings file must already have a model selected. (Similar to `--config`, but that one doesn't use the GUI at all.)
- Added a new flag toggle `--foreground` for Windows users. This sends the console terminal to the foreground every time a new prompt is generated, to avoid some idling slowdown issues.
- Increased max supported context with `--contextsize` to 32k, but only for GGUF models. It's still limited to 16k for older model versions. GGUF now actually has no hard limit on max context since it switched to using allocators, but this does not apply to older model formats. Additionally, models not trained with extended context are unlikely to work when RoPE-scaled beyond 32k.
- Added a simple OpenAI compatible completions API, which you can access at `/v1/completions`. You're still recommended to use the Kobold API as it has many more settings.
- Increased stop_sequence limit to 16.
- Improved SSE streaming by batching pending tokens between events.
- Upgraded Lite polled-streaming to work even in multiuser mode. This works by sending a unique key for each request.
- Improved Makefile to reduce unnecessary builds, added flag for skipping K-quants.
- Enhanced Remote-Link.cmd to also work on Linux; simply run it to create a Cloudflare tunnel to access koboldcpp anywhere.
- Improved the default colab notebook to use mmq.
- Updated Lite and pulled other fixes and improvements from upstream llama.cpp.
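A minimal sketch of grammar-constrained generation; the `grammar` field name is assumed from the Kobold generate API, while `grammar_retain_state` is the new parameter described above:

```python
# Minimal sketch: constrained generation with a GBNF grammar string.
import requests

payload = {
    "prompt": "Is water wet? Answer:",
    "max_length": 3,
    "grammar": 'root ::= " yes" | " no"',  # trivial grammar: the model may only answer yes or no
    "grammar_retain_state": True,          # keep the grammar state alive for follow-up requests
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```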
Important: Deprecation Notice for KoboldCpp 1.45.1
The following command line arguments are considered deprecated and will be removed soon, in a future version.
--psutil_set_threads - parameter will be removed as it's now generally unhelpful, the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat mirostat_tau and mirostat_eta json fields.
Hotfix for 1.45.2 - Fixed a bug with reading thread counts in 1.45 and 1.45.1; also moved the OpenAI endpoint from /api/extra/oai/v1/completions to just /v1/completions
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.44.2
A.K.A The "Mom: we have SillyTavern at home edition"
- Added multi-user mode with `--multiuser`, which allows up to 5 concurrent incoming `/generate` requests from multiple clients to be queued up and processed in sequence, instead of rejecting other requests while busy. Note that the `/check` and `/abort` endpoints are inactive while multiple requests are in-queue; this is to prevent one user from accidentally reading or cancelling a different user's request.
- Added a new launcher argument `--onready`, which allows you to pass a terminal command (e.g. start a python script) to be executed after KoboldCpp has finished loading. This runs as a subprocess, and can be useful for starting cloudflare tunnels, displaying URLs, etc.
- Added Grammar Sampling for all architectures, which can be accessed via the web API (also in Lite). Older models are also supported.
- Added a new API endpoint `/api/extra/true_max_context_length`, which allows fetching the true max context limit, separate from the horde-friendly value. (See the sketch after this list.)
- Added support for selecting a 4th GPU from the UI and command line (was max 3 before).
- Tweaked automatic RoPE scaling
- Pulled other fixes and improvements from upstream.
- Note: Using `--usecublas` with the prebuilt Windows executables here is only intended for Nvidia devices. For AMD users, please check out @YellowRoseCx's koboldcpp-rocm fork instead.
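A minimal sketch of querying the new endpoint; the `{"value": N}` response shape is assumed to match the existing max_context_length endpoint:

```python
# Minimal sketch: reading the real context limit versus the horde-advertised one.
import requests

BASE = "http://localhost:5001"
true_ctx = requests.get(f"{BASE}/api/extra/true_max_context_length").json()["value"]
horde_ctx = requests.get(f"{BASE}/api/v1/config/max_context_length").json()["value"]
print(f"true context limit: {true_ctx}, horde-friendly limit: {horde_ctx}")
```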
Major Update for Kobold Lite:
- Kobold Lite has undergone a massive overhaul, renamed and rearranged elements for a cleaner UI.
- Added Aesthetic UI for chat mode, which is now automatically selected when importing Tavern cards. You can easily switch between the different UIs for chat and instruct modes from the settings panel.
- Added Mirostat UI configs to settings panel.
- Allowed Idle Responses in all modes; it is now a global setting. Also fixed an idle response detection bug.
- Smarter group chats: mentioning a specific name inside a group chat will cause that character to respond, instead of a random one.
- Added support for automagically increasing the max context size slider limit, if a larger context is detected.
- Added scenario for importing characters from Chub.Ai
- Added a settings checkbox to enable streaming whenever applicable without requiring messing with URLs. Streaming can be easily toggled from the settings UI now, similar to EOS unbanning, although the `--stream` flag is still kept for compatibility.
- Added a few Instruct Tag Presets in a dropdown.
- Supports instruct placeholders, allowing easy switching between instruct formats without rewriting the text. Added a toggle option to use "Raw Instruct Tags" (the old method) as an alternative to placeholder tags like `{{[INPUT]}}` and `{{[OUTPUT]}}`.
- Added a toggle for "Newline After Memory" which can be set in the memory panel.
- Added a toggle for "Show Rename Save File" which shows a popup the user can use to rename the json save file before saving.
- You can specify a GBNF grammar string in settings to use when generating; this controls grammar sampling.
- Various minor bugfixes; also fixed stop_sequences still appearing in the AI outputs - they should be correctly truncated now.
v1.44.1 update - added queue number to perf endpoint, and updated lite to fix a few formatting bugs.
v1.44.2 update - fixed a speed regression from sched_yield again.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.43
- Re-added support for automatic rope scale calculations based on a model's training context (n_ctx_train); this triggers if you do not explicitly specify a `--ropeconfig`. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified `--contextsize`. Setting `--ropeconfig` will override this. This was bugged and removed in the previous release, but it should be working fine now.
- HIP and CUDA visible devices are now set to only the selected GPU, if a GPU number is provided and tensor split is not specified.
- Fixed RWKV models being broken after recent upgrades.
- Tweaked `--unbantokens` to decrease the banned token logit values further, as very rarely they could still appear. Still not using `-inf` as that causes issues with typical sampling.
- Integrated SSE streaming improvements from @kalomaze
- Added mutex for thread-safe polled-streaming from @Elbios
- Added support for older GGML (ggjt_v3) for 34B llama2 models by @vxiiduu; note that this may still have issues if n_gqa is not 1, in which case using GGUF would be better.
- Fixed support for Windows 7, which should work in noavx2 and failsafe modes again. Also, SSE3 flags are now enabled for failsafe mode.
- Updated Kobold Lite, now uses placeholders for instruct tags that get swapped during generation.
- Tab navigation order improved in GUI launcher, though some elements like checkboxes still require mouse to toggle.
- Pulled other fixes and improvements from upstream.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
Of Note:
- Reminder that HIPBLAS requires self compilation, and is not included by default in the prebuilt executables.
- Remember that token unbans can now be set via API (and Lite) in addition to the command line.
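For reference, a minimal sketch of doing that per request; `use_default_badwordsids` is the field named in the deprecation notes, and setting it to false leaves EOS unbanned:

```python
# Minimal sketch: unbanning the EOS token for a single request via the generate API,
# instead of the old --unbantokens command line flag.
import requests

payload = {
    "prompt": "Write one short sentence about the sea.",
    "max_length": 40,
    "use_default_badwordsids": False,  # False = do not ban EOS, so generation can stop naturally
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```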
koboldcpp-1.42.1
- Added support for LLAMA GGUFv2 models, handled automatically. All older models will still continue to work normally.
- Fixed a problem with certain logit values that were causing segfaults when using the Typical sampler. Please let me know if it happens again.
- Merged rocm support from @YellowRoseCx so you should now be able to build AMD compatible GPU builds with HIPBLAS, which should be faster than using CLBlast.
- Merged upstream support for GGUF Falcon models. Note that GPU layer offload for Falcon is unavailable with `--useclblast` but works with CUDA. Older pre-gguf Falcon models are not supported.
- Added support for unbanning EOS tokens directly from the API, and by extension it can now be triggered from Lite UI settings. Note: Your command line `--unbantokens` flag will force override this.
- Added support for automatic rope scale calculations based on a model's training context (n_ctx_train); this triggers if you do not explicitly specify a `--ropeconfig`. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified `--contextsize`. Setting `--ropeconfig` will override this. (Reverted in 1.42.1 for now, it was not set up correctly.)
- Updated Kobold Lite, now with tavern style portraits in Aesthetic Instruct mode.
- Pulled other fixes and improvements from upstream.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.41 (beta)
It's been a while since the last release and quite a lot upstream has changed under the hood, so consider this release a beta.
- Added support for LLAMA GGUF models, handled automatically. All older models will still continue to work normally. Note that GGUF format support for other non-llama architectures has not been added yet.
- Added `--config` flag to load a `.kcpps` settings file when launching from command line (Credits: @poppeman); these files can also be imported/exported from the GUI.
- Added a new endpoint `/api/extra/tokencount` which can be used to tokenize and accurately measure how many tokens any string has. (See the sketch after this list.)
- Fix for bell characters occasionally causing the terminal to beep in debug mode.
- Fix for incorrect list of backends & missing backends displayed in the GUI.
- Set MMQ to be the default for CUDA when running from GUI.
- Updated Lite, and merged all the improvements and fixes from upstream.
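A minimal sketch of calling the new endpoint; the `{"prompt": ...}` request body and `{"value": N}` response shape are assumptions based on the other extra endpoints:

```python
# Minimal sketch: counting tokens for an arbitrary string with /api/extra/tokencount.
import requests

text = "Niagara Falls is located on the border of Ontario and New York."
resp = requests.post("http://localhost:5001/api/extra/tokencount", json={"prompt": text})
print("token count:", resp.json()["value"])
```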
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.40.1
This release is mostly for bugfixes to the previous one, but enough small stuff has changed that I chose to make it a new version instead of a patch for the previous one.
- Fixed a regression in format detection for LLAMA 70B.
- Converted the embedded horde worker into daemon mode, hopefully solves the occasional exceptions
- Fixed some OOMs for blasbatchsize 2048, adjusted buffer sizes
- Slight modification to the look ahead (2 to 5%) for the cuda pool malloc.
- Pulled some bugfixes from upstream
- Added a new field `idle` to the `/api/extra/perf` endpoint, which allows checking if a generation is in progress without sending one. (See the sketch after this release's notes.)
- Fixed cmake compilation for cudatoolkit 12.
- Updated Lite, includes option for aesthetic instruct UI (early beta by @Lyrcaxis, please send them your feedback)
hotfix 1.40.1:
- handle stablecode-completion-alpha-3b
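A minimal sketch of polling the new field; the exact type of `idle` is not documented here, so it is simply treated as truthy/falsy:

```python
# Minimal sketch: checking whether the server is currently generating, using the new
# `idle` field on the perf endpoint, without submitting a generation of our own.
import requests

perf = requests.get("http://localhost:5001/api/extra/perf").json()
if perf.get("idle"):  # truthy when no generation is in progress
    print("server is idle")
else:
    print("a generation is in progress:", perf)
```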
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.