Releases: LostRuins/koboldcpp
koboldcpp-1.59.1
This is mostly a bugfix release to resolve multiple minor issues.
- Added `--nocertify` mode, which allows you to disable SSL certificate checking on your embedded Horde worker. This can help bypass some SSL certificate errors.
- Fixed pre-gguf models loading with incorrect thread counts. This issue affected the past 2 versions.
- Added build target for Old CPU (NoAVX2) Vulkan support.
- Fixed cloudflare remotetunnel URLs not displaying on runpod.
- Reverted CLBlast back to 1.6.0, pending CNugteren/CLBlast#533 and other correctness fixes.
- Smartcontext toggle is now hidden when contextshift toggle is on.
- Various improvements and bugfixes merged from upstream, which includes Google Gemma support.
- Bugfixes and updates for Kobold Lite
Fix for 1.59.1: Changed makefile build flags, fix for tooltips, merged IQ3_S support
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.58
- Added a toggle for row split mode with CUDA multigpu. Split mode changed to layer split by default. If using the command line, add `rowsplit` to `--usecublas` to enable row split mode. With the GUI launcher, it's a checkbox toggle.
- Multiple bugfixes: fixed benchmark command, fixed SSL streaming issues, fixed some SSE formatting with OAI endpoints.
- Make context shifting more forgiving when determining eligibility.
- Upgraded CLBlast to latest version, should result in a modest prompt processing speedup when using CL.
- Various improvements and bugfixes merged from upstream.
- Updated Kobold Lite with many improvements and new features:
- New: Integrated 'AI Vision' for images, this uses AI Horde or a local A1111 endpoint to perform image interrogation, allowing the AI to recognize and interpret uploaded or generated images. This should provide an option for multimodality similar to llava, although not as precise. Click on any image and you can enable it within Lite. This functionality is not provided by KCPP itself.
- New: Importing characters from Pygmalion.Chat is now supported in Lite, select it from scenarios.
- Added an option to run Lite in the background. It plays a dynamically generated silent audio track, which should prevent the browser tab from hibernating.
- Fixed printable view, persist streaming text on error, fixed instruct timestamps
- Added "Auto" option for idle responses.
- Allow importing images into story from local disk
- Multiple minor formatting and bug fixes.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.57.1
- Added a benchmarking feature with `--benchmark`, which automatically runs a benchmark with your provided settings, outputting run parameters, timing and speed information as well as testing for coherence, and exiting on completion. You can provide a filename, e.g. `--benchmark result.csv`, and it will append CSV-formatted data to that file.
- Added temperature Quad-Sampling (set via API with the parameter `smoothing_factor`), PR from @AAbushady (credits @kalomaze). A request sketch follows these notes.
- Improved timing displays. Also displays the seed used, and shows llama.cpp styled timings when run in `--debugmode`. The timings will appear faster as they do not include overheads, measuring only specific eval functions.
- Improved abort generation behavior (allows a second user to abort while in queue).
- Vulkan enhancements from @0cc4m merged: APU memory handling and multigpu. To use multigpu, you can now specify additional IDs, for example `--usevulkan 0 2 3`, which will use GPUs with IDs `0`, `2`, and `3`. Allocation is determined by `--tensor_split`. Multigpu for Vulkan is currently configurable via command line only; the GUI launcher does not allow selecting multiple devices for Vulkan.
- Various improvements and bugfixes merged from upstream.
- Updated Kobold Lite with many improvements and new features:
- NEW: The Aesthetic UI is now available for Story and Adventure modes as well!
- Added "AI Impersonate" feature for Instruct mode.
- Smoothing factor added, can be configured in dynamic temperature panel.
- Added a toggle to enable printable view (unlock vertical scrolling).
- Added a toggle to inject timestamps, allowing the AI to be aware of time passing.
- Persist API info for A1111 and XTTS, allows specifying custom negative prompts for image gen, allows specifying custom horde keys in KCPP mode.
- Fixes for XTTS to handle devices with over 100 voices, and also adds an option to narrate dialogue only.
- Toggle to request A1111 backend to save generated images to disk.
- Fix for chub.ai card fetching.
Hotfix 1.57.1: Fixed some crashes and fixed multigpu for Vulkan.
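A minimal request sketch for the `smoothing_factor` parameter, assuming a local instance on the default port and the standard KoboldAI `/api/v1/generate` endpoint (only the parameter name itself comes from these notes; the prompt, value and other fields are illustrative):

```python
# Sketch: pass smoothing_factor (temperature Quad-Sampling) in a generation
# request to a local KoboldCpp instance. The endpoint path and surrounding
# fields follow the usual KoboldAI API conventions; the value is illustrative.
import json
import urllib.request

payload = {
    "prompt": "Once upon a time,",
    "max_length": 64,
    "smoothing_factor": 0.3,  # example value
}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
    print(result["results"][0]["text"])  # KoboldAI-style response shape
```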
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.56
- NEW: Added early support for the new Vulkan GPU backend by @0cc4m. You can try it out with the command `--usevulkan (gpu id)` or via the GUI launcher (see the launch sketch after these notes). Now included with the Windows and Linux prebuilt binaries. (Note: Mixtral on Vulkan is not fully supported.)
- Updated and merged the new GGML backend rework from upstream. This update includes many extensive fixes, improvements and changes across over a hundred commits. Support for earlier non-gguf models has been preserved via a fossilized earlier version of the library. Please open an issue if you encounter problems. The Wiki and Readme have been updated too.
- Added support for setting `dynatemp_exponent`, which previously defaulted to 1.0. Support added over the API and in Lite.
- Fixed issues with Linux CUDA on Pascal, added more flags to handle conda and colab builds correctly.
- Added support for Old CPU fallbacks (NoAVX2 and Failsafe modes) in build targets in the Linux prebuilt binary (and koboldcpp.sh)
- Added missing 48k context option, fixed clearing file selection, better abort handling support, fixed aarch64 termux builds, various other fixes.
- Updated Kobold Lite with many improvements and new features:
- NEW: Added XTTS API Server support (Local AI powered text-to-speech).
- Added option to let AI impersonate you for a turn in a chat.
- HD image generation options.
- Added popup-on-complete browser notification options.
- Improved DynaTemp wizard, added options to set exponent
- Bugfixes, padding adjustments, A1111 parameter fixes, image color fixes for invert color mode.
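A minimal launch sketch for the new Vulkan backend, assuming a local model file; the binary name, model path and `--gpulayers` count are placeholders not taken from these notes:

```python
# Sketch: launch KoboldCpp with the early Vulkan backend on GPU id 0.
# Binary name, model file and offload count are placeholders.
import subprocess

subprocess.run([
    "koboldcpp.exe",            # use the prebuilt binary for your platform
    "--model", "mymodel.gguf",  # placeholder model file
    "--usevulkan", "0",         # Vulkan backend, GPU id 0
    "--gpulayers", "20",        # placeholder layer offload count
])
```

As noted in the 1.57.1 release above, later versions also accept multiple GPU IDs for `--usevulkan` from the command line, with allocation controlled by `--tensor_split`.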
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.55.1
- Added Dynamic Temperature (DynaTemp), which is specified by a Temperature Value and a Temperature Range (credits: @kalomaze). When used, the actual temperature is allowed to be adjusted dynamically between `DynaTemp ± DynaTempRange`. For example, setting `temperature=0.4` and `dynatemp_range=0.1` will result in a minimum temp of 0.3 and a maximum of 0.5 (see the request sketch after these notes). For ease of use, a UI to select min and max temperature for DynaTemp directly is also provided in Lite; both inputs work and auto-update each other.
- Try to reuse the cloudflared file when running the remote tunnel, but also handle the case where cloudflared fails to download correctly.
- Added a field to show the most recently used seed in the perf endpoint
- Switched cuda pool malloc back to the old implementation
- Updated Lite, added support for DynaTemp
- Merged new improvements and fixes from upstream llama.cpp
- Various minor fixes.
v1.55.1 - Trying to fix some CUDA issues on Pascal cards. As I don't have a Pascal card I cannot verify it myself, but try this if 1.55 didn't work.
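A minimal request sketch for the worked example above (temperature 0.4 ± 0.1), assuming a local instance and the standard KoboldAI `/api/v1/generate` endpoint; only the `temperature` and `dynatemp_range` values come from these notes:

```python
# Sketch: DynaTemp request reproducing the example above. With temperature 0.4
# and dynatemp_range 0.1, the effective temperature varies between 0.3 and 0.5.
import json
import urllib.request

temperature, dynatemp_range = 0.4, 0.1
print("effective temp range:",
      round(temperature - dynatemp_range, 2), "to",
      round(temperature + dynatemp_range, 2))

payload = {
    "prompt": "Once upon a time,",
    "max_length": 64,
    "temperature": temperature,
    "dynatemp_range": dynatemp_range,
}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["results"][0]["text"])
```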
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.54
welcome to 2024 edition
- Added `logit_bias` support (for both the OpenAI and Kobold APIs). Accepts a dictionary of key-value pairs, which indicate the token IDs (int) and logit bias (float) to apply for that token. The object format is the same as, and compatible with, the official OpenAI implementation, though token IDs are model specific (thanks @DebuggingLife46). A request sketch follows these notes.
- Updated Lite, added support for custom background images (thanks @Ar57m), and added customizable settings for stepcount and cfgscale for Horde/A1111 image generation.
- Added mouseover tooltips for all labels in the GUI launcher.
- Cleaned up and simplified the UI of the quick launch tab in the GUI launcher, some advanced options moved to other tabs.
- Bug fixes for garbled output in Termux with q5k Phi
- Fixed paged memory fallback when pinned memory alloc fails while not using mmap.
- Attempt to fix on-exit segfault on some Linux systems.
- Updated KAI United `class.py`, added new parameters.
- Makefile fix for Linux CI build using conda (thanks @henk717)
- Merged new improvements and fixes from upstream llama.cpp (includes VMM pool support)
- Included prebuilt binary for no-cuda Linux as well.
- Various minor fixes.
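A minimal request sketch for `logit_bias` over the OpenAI-compatible `/v1/completions` endpoint of a local instance; the token ID shown is arbitrary (token IDs are model specific), and the rest of the payload is just the usual completions format:

```python
# Sketch: apply a logit bias via the OpenAI-compatible endpoint. Keys are token
# IDs (model specific; the ID below is arbitrary), values are the bias to add
# to that token's logit, in the same object format as the official OpenAI API.
import json
import urllib.request

payload = {
    "prompt": "My favourite colour is",
    "max_tokens": 16,
    "logit_bias": {"2112": -100.0},  # strongly discourage example token id 2112
}
req = urllib.request.Request(
    "http://localhost:5001/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])
```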
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.53
- Added support for SSL. You can now import your own SSL cert to use with KoboldCpp and serve it over HTTPS with `--ssl [cert.pem] [key.pem]` or via the GUI. The `.pem` files must be unencrypted; you can also generate them with OpenSSL, e.g. `openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -config openssl.cnf -nodes`, for your own self-signed certificate. A client-side connection sketch follows these notes.
- Added support for presence penalty (alternative rep pen) over the KAI API and in Lite. If Presence Penalty is set over the OpenAI API and `rep_pen` is not set, then `rep_pen` will be set to a default of 1.0 instead of 1.1. Both penalties can be used together, although this is probably not a good idea.
- Added fixes for Broken Pipe error, thanks @mahou-shoujo.
- Added fixes for aborting ongoing connections while streaming in SillyTavern.
- Merged upstream support for Phi models and speedups for Mixtral
- The default non-blas batch size for GGUF models is now increased from 8 to 32.
- Merged HIPBlas fixes from @YellowRoseCx
- Fixed an issue with building convert tools in 1.52
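Once the server is running with `--ssl`, clients need to trust the self-signed certificate. A minimal client-side sketch, assuming a local instance on the default port; the standard KoboldAI model-info route used here is not something from these notes:

```python
# Sketch: connect to a KoboldCpp instance served over HTTPS with a self-signed
# certificate. Either trust the generated cert.pem explicitly, or (for local
# testing only) disable verification entirely.
import json
import ssl
import urllib.request

# Preferred: trust the self-signed certificate the server was started with.
ctx = ssl.create_default_context(cafile="cert.pem")

# Local-testing fallback: skip verification entirely (not recommended otherwise).
# ctx = ssl.create_default_context()
# ctx.check_hostname = False
# ctx.verify_mode = ssl.CERT_NONE

with urllib.request.urlopen("https://localhost:5001/api/v1/model", context=ctx) as resp:
    print(json.loads(resp.read()))
```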
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.52.2
something old, something new edition
- NEW: Added a new bare-bones KoboldCpp NoScript WebUI, which does not require Javascript to work. It should be W3C HTML compliant and should run on every browser from the last 20 years, even text-based ones like Lynx (e.g. in the terminal over SSH). It is accessible by default at `/noscript`, e.g. http://localhost:5001/noscript . This can be helpful when running KoboldCpp from systems which do not support a modern browser with Javascript.
- Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the `--usecublas lowvram` option (equivalent to `-nkvo` in llama.cpp). Fully offloaded models should behave the same as before.
- The `/api/extra/tokencount` endpoint now also returns an array of token ids from the tokenizer in the response body (see the request sketch after these notes).
- Merged support for QWEN and Mixtral from upstream. Note: Mixtral seems to perform large-batch prompt processing extremely slowly. This is probably an implementation issue. For now, you might have better luck using `--noblas` or setting `--blasbatchsize -1` when using Mixtral.
- Selecting a .kcpps file in the GUI when choosing a model will load the model specified inside that config file instead.
- Added the Mamba Multitool script (from @henk717). This is a shell script that can be used in Linux to setup an environment with all dependencies required for building and running KoboldCpp on Linux.
- Improved KCPP Embedded Horde Worker fault tolerance; it should now gracefully back off for increasing durations whenever it encounters errors polling from AI Horde, and will automatically recover from up to 24 hours of Horde downtime.
- Added a new parameter that shows the number of Horde Worker errors in the `/api/extra/perf` endpoint; this can be used to monitor your embedded Horde worker if it goes down.
- Pulled other fixes and improvements from upstream, updated Kobold Lite, added asynchronous file autosaves (thanks @aleksusklim), various other improvements.
Hotfix 1.52.1: Fixed 'not enough memory' loading errors for large (20B+) models. See #563
NEW: Added Linux PyInstaller binaries
Hotfix 1.52.2: Merged fixes for Mixtral prompt processing
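A minimal request sketch for the tokenizer endpoint on a local instance; the request body key and response field names are not specified in these notes, so they are assumptions here and the raw JSON is printed as-is:

```python
# Sketch: ask a local KoboldCpp instance to tokenize some text. The "prompt"
# request key is an assumption; the response is printed raw so no exact field
# names (count, token id array) need to be guessed.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:5001/api/extra/tokencount",
    data=json.dumps({"prompt": "Hello world"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # token count plus the new token id array
```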
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.51.1
all quiet on the kobold front edition
- Added a new flag `--quiet`, which allows you to suppress inputs and outputs from appearing in the console.
- When context shift is enabled, allocate a small amount (about 80 tokens) of reserved space to reduce the `Failed to predict` errors that occur when running out of KV cache space due to KV cache fragmentation when shifting.
- Auto rope scaling will not be automatically applied if the model already overrides the RoPE freq scale with a value below 1.
- Increased the graph node limit for older models to fix AiDungeon GPT2 not working.
- Display the available KAI and OAI endpoint URLs in the terminal on startup.
- Updated some API examples in the documentation.
- `--multiuser` now accepts an extra optional parameter that indicates how many concurrent requests to allow to queue. If unset or set to 1, it defaults to 5 (see the launch sketch after these notes).
- Pulled fixes and improvements from upstream, updated Kobold Lite, fixed Chub imports, optimized for Firefox, added multiline input in aesthetic mode, various other improvements.
1.51.1 Hotfix:
- Reverted an upstream change that caused a CLBlast segfault that occurred when context size exceeded 2k.
- Stripped out the OAI SSE carriage return after end message that was causing issues in Janitor.
- Moved the 80 extra tokens allocated for handling KV fragmentation so they are added on top of the specified max context length instead of subtracted from it at runtime, which could cause padding issues when counting tokens in Tavern. This means that loading `--contextsize 2048` will actually allocate a size of 2128 behind the scenes, for example.
- Changed the API URL printouts to include the tunnel URL when using `--remotetunnel`.
Added a linux test build provided by @henk717
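A minimal launch sketch tying the flags above together; the binary name and model file are placeholders, and the queue size is just an example:

```python
# Sketch: launch flags mentioned in the 1.51.1 notes. --multiuser takes an
# optional queue size, --quiet suppresses console input/output printing, and
# --contextsize 2048 now reserves 2128 tokens (2048 + 80) behind the scenes.
import subprocess

subprocess.run([
    "koboldcpp.exe",            # use the prebuilt binary for your platform
    "--model", "mymodel.gguf",  # placeholder model file
    "--contextsize", "2048",
    "--multiuser", "10",        # allow up to 10 queued concurrent requests
    "--quiet",
])
```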
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.50.1
- Improved automatic GPU layer selection: In the GUI launcher with CuBLAS, it will now automatically select all layers to do a full GPU offload if it thinks you have enough VRAM to support it.
- Added a short delay to the Abort function in Lite, hopefully fixes the glitches with retry and abort.
- Fixed automatic RoPE values for Yi and Deepseek. If no `--ropeconfig` is set, the preconfigured rope values in the model now take priority over the automatic context rope scale.
- The above fix should also allow YaRN RoPE scaled models to work correctly by default, assuming the model has been correctly converted. Note: customized YaRN configuration flags are not yet available.
- The OpenAI-compatible `/v1/completions` endpoint has been enhanced, adding extra unofficial parameters that Aphrodite uses, such as Min-P, Top-A and Mirostat. However, OpenAI does not support separate `memory` fields or sampler order, so the Kobold API will still give better results there.
- SSE streaming support has been added for the OpenAI `/v1/completions` endpoint (tested working in SillyTavern). A streaming sketch follows these notes.
- Custom DALL-E endpoints are now supported, for use with OAI proxies.
- Pulled fixes and improvements from upstream, updated Kobold Lite
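A minimal streaming sketch against a local instance's OpenAI-compatible endpoint; the exact key name for the unofficial sampler (`min_p`) and the chunk format are assumptions following the usual OpenAI SSE conventions:

```python
# Sketch: stream a completion from the OpenAI-compatible endpoint via SSE.
# The min_p key spelling and the "data: {...}" chunk format are assumptions
# based on common OpenAI-style conventions.
import json
import urllib.request

payload = {
    "prompt": "Once upon a time,",
    "max_tokens": 64,
    "stream": True,
    "min_p": 0.05,  # one of the unofficial extra samplers mentioned above
}
req = urllib.request.Request(
    "http://localhost:5001/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode("utf-8").strip()
        if line.startswith("data: ") and line != "data: [DONE]":
            chunk = json.loads(line[len("data: "):])
            print(chunk["choices"][0]["text"], end="", flush=True)
print()
```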
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Hotfix 1.50.1:
- Fixed a regression with older RWKV/GPT-2/GPT-J/GPT-NeoX models that caused a segfault.
- If ropeconfig is not set, apply auto linear rope scaling multiplier for rope-tuned models such as Yi when used outside their original context limit.
- Fixed another bug in Lite with the retry/abort button.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.