
Releases: LostRuins/koboldcpp

koboldcpp-1.6

13 Apr 06:44

  • This is a bugfix release intended to resolve the recently reported crashing issues.
  • Merged the recent CLBlast fixes; the GPU name is now displayed.
  • Batch size reduced from 1024 back to 512 due to reported crashes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001
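If you want to test the API directly instead of the embedded web UI, a minimal request should look something like the line below. This is only a sketch, assuming the standard KoboldAI-compatible /api/v1/generate route is exposed; the prompt and length values are placeholders:

    curl -X POST http://localhost:5001/api/v1/generate -H "Content-Type: application/json" -d "{\"prompt\": \"Hello, my name is\", \"max_length\": 32}"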

For more information, be sure to run the program with the --help flag.

Alternative Options:
A non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
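For example, on an older CPU you could launch it from the command line like this (the model filename is just a placeholder; passing the model path as an argument works the same way as drag-and-drop):

    koboldcpp.exe ggml-model-q4_0.bin --noavx2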

koboldcpp-1.5

12 Apr 16:03

  • This release consolidates a lot of upstream bug fixes and improvements; if you had issues with earlier versions, please try this one. The upstreamed GPTJ changes should also make GPT-J-6B inference another 20% or so faster.
  • Integrated AVX2 and non-AVX2 support into the same binary for Windows. If your CPU is very old and doesn't support AVX2 instructions, you can switch to compatibility mode with --noavx2, but it will be slower.
  • Now has integrated experimental CLBlast support thanks to @0cc4m, which uses your GPU to speed up prompt processing. Enable it with --useclblast [platform_id] [device_id] (see the example after this list).
  • To quantize various fp16 models, you can use the quantizers in the tools.zip. Remember to convert them from Pytorch/Huggingface format first with the relevant Python conversion scripts.
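As an example of the CLBlast option above, selecting the first OpenCL platform and device would look something like this (the model filename is a placeholder, and your platform/device IDs may differ):

    koboldcpp.exe ggml-model-q4_0.bin --useclblast 0 0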

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

Alternative Options:
A non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag.
If you prefer, you can download the zip file, extract it, and run the Python script manually, e.g. koboldcpp.py [ggml_model.bin]
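For instance, a manual run from the extracted zip might look like this (the model filename is a placeholder):

    python koboldcpp.py ggml-model-q4_0.bin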

koboldcpp-1.4

10 Apr 16:36

  • This is an expedited bugfix release because the new model formats were breaking on large contexts.
  • People have also requested that mmap be the default, so now it is; you can disable it with --nommap
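For example, to load without mmap (the model filename is a placeholder; passing the model path as an argument works the same way as drag-and-drop):

    koboldcpp.exe ggml-model-q4_0.bin --nommap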

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

Alternative Options:
None are provided for this release as it is a temporary one.

koboldcpp-1.3

10 Apr 04:14

  • Bug fixes for various issues (missing endpoints, malformed URL)
  • Merged upstream file loading enhancements. mmap is now disabled by default; enable it with --usemmap
  • Can now automatically distinguish between older and newer GPTJ and GPT2 quantized files.
  • Version numbers are now displayed at start

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

Alternative Options:
If your CPU is very old and doesn't support AVX2 instructions, you can try running the noavx2 version. It will be slower.
If you prefer, you can download the zip file, extract it, and run the Python script manually, e.g. koboldcpp.py [ggml_model.bin]
To quantize an fp16 model, you can use the quantize.exe in the tools.zip
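A typical invocation might look like the one below. This is only a sketch, assuming the bundled quantize.exe follows the usual ggml argument order of input file, output file, and quantization type; the filenames are placeholders, and 2 would select the 4-bit q4_0 format if the tool matches the upstream llama.cpp quantizer:

    quantize.exe ggml-model-f16.bin ggml-model-q4_0.bin 2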

koboldcpp-1.2

08 Apr 17:33

This is a checkpoint version which should be relatively stable and includes more release variants.

  • Support for new versions of GPT2 models, for example the Cerebras models on HF.
  • Prevented the TK GUI window from staying open and being annoying.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

Alternative Options:
If your CPU is very old and doesn't support AVX2 instructions, you can try running the noavx2 version. It will be slower.
If you prefer, you can download the zip file, extract it, and run the Python script manually, e.g. koboldcpp.py [ggml_model.bin]
To quantize an fp16 model, you can use the quantize.exe in the tools.zip

koboldcpp-1.1

07 Apr 14:17

  • Simplified the version numbering, as I don't think I really need that granularity
  • Various small tweaks, improvements, and bugfixes
  • Updated the embedded Kobold Lite

To use, download and run the koboldcpp.exe
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

If your CPU is very old and doesn't support AVX2 instructions, you can try running the noavx2 version. It will be slower.

koboldcpp-1.0.10

06 Apr 08:55

  • Updated the embedded Kobold Lite to version 19
  • Merged the various improvements from the parent repo
  • Removed the psutil dependency, reverting the thread calculation to be based on 0.5 x cpu_count
  • Changed makefile to hopefully work on ARM

To use, download and run the koboldcpp.exe
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

koboldcpp-1.0.9beta

05 Apr 08:16

  • Integrated support for GPT2! This should theoretically also work with Cerebras models, but I have not tried those yet. This is a great way to get started, as you can now try models so tiny that even a potato CPU can run them. Here's a good one to start with (I can generate 100 tokens in a second with it; see the example after this list): https://huggingface.co/ggerganov/ggml/resolve/main/ggml-model-gpt-2-117M.bin
  • Upgraded the embedded Kobold Lite to support a Stanford Alpaca compatible Instruct Mode, which can be enabled in settings.
  • Removed all -march=native and -mtune=native flags when building the binary. Compatibility should now be more consistent across different devices.
  • Fixed an incorrect flag name used to trigger the ACCELERATE library on macOS. This should give macOS users greatly increased performance for GPT-J and GPT2 models, assuming you have ACCELERATE support.
  • Added Rep Pen for GPT-J and GPT-2 models, and by extension pyg.cpp; this means that repetition penalty now works similarly to the way it does in llama.cpp.
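As a concrete version of the GPT2 example mentioned in the first point, downloading and running that tiny model could look like this (assuming curl is available, and that passing the model path as an argument works the same way as drag-and-drop):

    curl -L -o ggml-model-gpt-2-117M.bin https://huggingface.co/ggerganov/ggml/resolve/main/ggml-model-gpt-2-117M.bin
    koboldcpp.exe ggml-model-gpt-2-117M.bin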

To use, download and run the koboldcpp.exe
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

koboldcpp-1.0.8beta

03 Apr 03:58

  • Rebranded to koboldcpp (formerly llamacpp-for-kobold). Library file names and references have changed too; please let me know if anything is broken!
  • Added support for the original GPT4ALL.CPP format!
  • Added support for GPT-J formats, including the original 16bit legacy format as well as the 4bit version from Pygmalion.cpp
  • Switched compiler flag from -O3 to -Ofast. This should increase generation speed even more, but I'm not sure if anything will break, so please let me know if it does.
  • Changed default threads to scale according to physical core counts instead of os.cpu_count(). This will generally result in fewer threads being utilized, but it should provide a better default for slower systems. You can override this manually with the --threads parameter.
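For example, to override the automatic thread count (the model filename and thread count are placeholders; passing the model path as an argument works the same way as drag-and-drop):

    koboldcpp.exe ggml-model-q4_0.bin --threads 4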

To use, download and run the koboldcpp.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

llamacpp-for-kobold-1.0.7

01 Apr 01:11

  • Added support for the new version of the ggml llamacpp model format (magic=ggjt, version 3). All old versions will continue to be supported.
  • Integrated speed improvements from the parent repo.
  • Fixed a utf-8 encoding issue in the outputs.
  • Improved console debug information during generation; it now shows token progress and time taken directly.
  • Set non-streaming as the default mode. You can enable streaming with --stream
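For example, to launch with streaming re-enabled (the model filename is a placeholder; passing the model path as an argument works the same way as drag-and-drop):

    llamacpp-for-kobold.exe ggml-model-q4_0.bin --stream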

To use, download and run the llamacpp-for-kobold.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001