
Comparing changes

base repository: ggml-org/llama.cpp
base: dae06c06e5c6232ae2be4d567dd5101e1e96c814
head repository: ggml-org/llama.cpp
compare: 64e64aa2557d97490b2fe1262b313e2f4a1607e3

Commits on Nov 20, 2023

  1. speculative : fix prompt tokenization in speculative example (#4025)

    * Support special tokens and not adding BOS to prompt in speculative
    
    * Adapt to new should_add_bos function
    
    * Ensure tgt and dft have same add_bos setting
    AutonomicPerfectionist authored Nov 20, 2023

    Commit: 40a34fe
  2. ci : add flake8 to github actions (python linting) (#4129)

    Disabled rules:
    
    * E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned
    
    * E211 Whitespace before '(' - disabled because we often use 'C' Style where values are aligned
    
    * E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned
    
    * E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard
    
    * E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned
    
    * E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned
    
    * E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard
    
    * E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard
    
    * E266 Too many leading '#' for block comment - sometimes used as "section" separator
    
    * E501 Line too long - disabled because it's broken so often it seems like a standard
    
    * E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use # noqa instead)
    
    * E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use # noqa instead; see the sketch after this entry)
    Galunid authored Nov 20, 2023
    Commit: f23c035
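
For the E701/E704 cases mentioned above, a per-line `# noqa` suppression is the alternative to disabling the rule globally. A minimal sketch of a one-line abstract method carrying such a suppression (the class and method names are illustrative, not taken from convert.py):

```python
from abc import ABC, abstractmethod


class Converter(ABC):  # illustrative class, not from convert.py
    # E704 flags the one-line abstract method below; suppress it locally
    # instead of turning the rule off project-wide.
    @abstractmethod
    def convert(self) -> None: ...  # noqa: E704
```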
  3. main : Add ChatML functionality to main example (#4046)

    Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com>
    Sebby37 authored Nov 20, 2023
    Commit: 881800d
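
The `-cml`/`--chatml` flag added here formats the conversation with the ChatML template. A minimal sketch of the prompt layout that mode targets (the helper below is illustrative, not part of the example's code):

```python
def chatml_prompt(system_msg: str, user_msg: str) -> str:
    """Illustrative ChatML layout: each turn is wrapped in im_start/im_end markers."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


print(chatml_prompt("You are a helpful assistant.", "Explain speculative decoding in one sentence."))
```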
  4. readme : update ROCm Windows instructions (#4122)

    * Update README.md
    
    * Update README.md
    
    Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
    
    ---------
    
    Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
    jammm and cebtenzzre authored Nov 20, 2023
    Commit: dfc7cd4
  5. Commit: 0b871f1

Commits on Nov 21, 2023

  1. Commit: 8e672ef

Commits on Nov 23, 2023

  1. Commit: ff8238f
  2. examples : fix typo in parallel example doc comment (#4181)

    Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
    danbev authored Nov 23, 2023
    Commit: 9d5949f
  3. readme : update hot topics

    ggerganov authored Nov 23, 2023
    Commit: d103d93
  4. llama : KV cache view API + better KV cache management (#4170)

    * llama : keep track of used KV cells + better KV cache management
    
    * llama : zero KV cache used upon clear
    
    ggml-ci
    
    * llama : allow exporting a view of the KV cache (#4180)
    
    * Allow exporting a view of the KV cache
    
    * Allow dumping the sequences per cell in common
    
    * Track max contiguous cells value and position as well
    
    * Fix max contiguous empty cells index calculation
    
    Make dump functions deal with lengths or sequences counts > 10 better
    
    * Fix off by one error in dump_kv_cache_view
    
    * Add doc comments for KV cache view functions
    
    Eliminate cell sequence struct; use llama_seq_id directly
    
    Minor cleanups
    
    * common : add -dkvc arg for enabling kv cache dumps
    
    ---------
    
    Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
    ggerganov and KerfuffleV2 authored Nov 23, 2023
    Commit: 6b0a742
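
The dump helpers added by this commit (see the common.cpp hunk later in this diff) render the cache as one character per cell, keyed by how many sequences occupy it. A rough Python re-expression of that mapping, assuming only a list of per-cell sequence counts as input:

```python
# Mirrors the slot_chars mapping in dump_kv_cache_view:
# '.' = empty cell, then increasing sequence counts, '+' = more than the table covers.
SLOT_CHARS = ".123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+"


def render_kv_cells(seq_counts: list[int], row_size: int = 80) -> str:
    rows = []
    for start in range(0, len(seq_counts), row_size):
        row = seq_counts[start:start + row_size]
        chars = "".join(SLOT_CHARS[min(len(SLOT_CHARS) - 1, n)] for n in row)
        rows.append(f"{start:5d}: {chars}")
    return "\n".join(rows)


# Six cells, four of them in use, shown four per row.
print(render_kv_cells([0, 1, 1, 2, 0, 3], row_size=4))
```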
  5. Fix incorrect format strings and uninitialized variables. (#4133)

    * Fix incorrect format strings and uninitialized variables.
    
    * Address comments
    
    * Add the missing include statement
    haohui authored Nov 23, 2023
    Commit: 55978ce

Commits on Nov 24, 2023

  1. readme : use PATH for Windows ROCm (#4195)

    * Update README.md to use PATH for Windows ROCm
    
    * Update README.md
    
    * Update README.md
    jammm authored Nov 24, 2023
    Commit: b35f3d0
  2. main.swift : fix eos checking (#4197)

    llama_token_eos(const struct llama_model *) was being passed a variable of type struct llama_context instead of the model.
    eastriverlee authored Nov 24, 2023
    Commit: 2568a4b
  3. Commit: 189d684
  4. ggml-cuda : support stablelm rope (#4156)

    * ggml-cuda : support stablelm rope
    
    * remove unused freq_base kernel parameter
    
    * add n_dims parameter to llm_build_k_shift, default to n_rot via overload
    
    * llama : fix llm_build_k_shift args
    
    ---------
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    slaren and ggerganov authored Nov 24, 2023
    Commit: 8a052c1
  5. Commit: e9c13ff

Commits on Nov 25, 2023

  1. server : OAI API compatibility (#4198)

    * Add openai-compatible POST /v1/chat/completions API endpoint to server example
    
    * fix code style
    
    * Update server README.md
    
    * Improve server README.md
    
    * Fix server.cpp code style according to review
    
    * server : some style changes
    
    * server : indentation
    
    * server : enable special tokens during tokenization by default
    
    * server : minor code style
    
    * server : change random string generator
    
    * straightforward /v1/models endpoint
    
    ---------
    
    Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
    Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
    3 people authored Nov 25, 2023
    Commit: af19d35
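
With this change the `server` example answers OpenAI-style chat requests. A minimal client sketch, assuming a locally running server on its default host and port (the URL, model name, and `choices[0].message.content` response shape are assumptions about a typical OpenAI-compatible setup; adjust to your instance):

```python
import json
import urllib.request

# Host, port, and model name below are placeholders for a local server instance.
payload = {
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about llamas."},
    ],
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```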
  2. readme : update hot topics

    ggerganov authored Nov 25, 2023
    Commit: 04814e7
  3. Commit: 3014b54
  4. llama : grammar reserve space in decode_utf8 (#4210)

    * reserve space for codepoints
    
    * improvement for the appended 0
    MarcusDunn authored Nov 25, 2023
    Commit: f837c3a
  5. scripts : Use mmap in torch load (#4202)

    * Use mmap in torch load, prefer .bin files when loading
    
    * Revert .bin > .safetensors preference
    Galunid authored Nov 25, 2023
    Commit: 1ddb52e
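
The corresponding convert-hf-to-gguf.py hunk later in this diff switches to `torch.load(..., mmap=True, weights_only=True)`, which maps the checkpoint instead of reading it fully into RAM. A small standalone sketch of that call (the file path is a placeholder; `mmap=True` requires a reasonably recent PyTorch):

```python
import torch

# mmap=True maps the checkpoint file rather than copying it into memory;
# weights_only=True restricts unpickling to tensor data.
state_dict = torch.load(
    "pytorch_model.bin",   # placeholder path
    map_location="cpu",
    mmap=True,
    weights_only=True,
)

for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```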

Commits on Nov 26, 2023

  1. metal : fix yarn (#4220)

    get the correct n_orig_ctx in metal
    jxy authored Nov 26, 2023
    Commit: 22da055
  2. lookahead : add example for lookahead decoding (#4207)

    * lookahead : init
    
    * lookahead : generate and store n-grams
    
    * lookahead : use a loop instead of recursion to generate n-grams
    
    * lookahead : initial working implementation
    
    * lookahead : filter repeating n-grams
    
    * lookahead : use deterministic init
    
    * lookahead : add to Makefile
    
    * lookahead : fix a bug in the seq_id of the lookahead tokens
    
    * lookahead : add comments
    
    ---------
    
    Co-authored-by: slaren <slarengh@gmail.com>
    ggerganov and slaren authored Nov 26, 2023
    Commit: 922754a
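
Two of the steps listed in this commit, generating/storing n-grams and filtering repeats, can be illustrated with a toy sketch (this is not the example's actual algorithm, only the bookkeeping idea):

```python
from collections import deque


def collect_ngrams(tokens: list[int], n: int = 3) -> list[tuple[int, ...]]:
    """Slide a window of size n over generated tokens, keeping each n-gram once."""
    seen: set[tuple[int, ...]] = set()
    ngrams: list[tuple[int, ...]] = []
    window: deque[int] = deque(maxlen=n)
    for tok in tokens:
        window.append(tok)
        if len(window) == n:
            ngram = tuple(window)
            if ngram not in seen:  # filter repeating n-grams
                seen.add(ngram)
                ngrams.append(ngram)
    return ngrams


print(collect_ngrams([1, 2, 3, 2, 3, 4, 1, 2, 3], n=3))
```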
  3. readme : update hot topics

    ggerganov authored Nov 26, 2023
    Commit: 9656026
  4. Commit: 3e73d31

Commits on Nov 27, 2023

  1. Commit: f3b2698
  2. examples : iOS example with swift ui (#4159)

    * copy to llama.cpp as subdir
    
    * attempt enabling metal, fails
    
    * ggml metal compiles!
    
    * Update README.md
    
    * initial conversion to new format, utf8 errors?
    
    * bug fixes, but now has an invalid memory access :(
    
    * added O3, now has insufficient memory access
    
    * begin sync with master
    
    * update to match latest code, new errors
    
    * fixed it!
    
    * fix for loop conditionals, increase result size
    
    * fix current workflow errors
    
    * attempt a llama.swiftui workflow
    
    * Update .github/workflows/build.yml
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    
    ---------
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    bachittle and ggerganov authored Nov 27, 2023
    Commit: bb03290
  3. Commit: 0dab8cd
  4. cmake : fix issue with version info not getting baked into LlamaConfig.cmake (#3970)
    
    * Split CPP generation from build-info query
    
    * Remove blank lines
    
    * Add BUILD_SHARED_LIBS option
    bandoti authored Nov 27, 2023
    Commit: b38a16d

Commits on Nov 28, 2023

  1. ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full offload checks in llama.cpp (#4240)
    
    * ggml : use blas even if src0 is not F32
    
    * llama : use n_threads_batch only when n_tokens >= 32
    
    ggml-ci
    
    * llama : revert n_threads_batch logic
    
    ggml-ci
    ggerganov authored Nov 28, 2023
    Commit: 8406b09
  2. Commit: 64e64aa
Showing with 2,365 additions and 282 deletions.
  1. +11 −0 .github/workflows/build.yml
  2. +20 −0 .github/workflows/python-lint.yml
  3. +1 −0 .gitignore
  4. +4 −0 CMakeLists.txt
  5. +4 −1 Makefile
  6. +15 −3 README.md
  7. +1 −1 common/CMakeLists.txt
  8. +82 −0 common/common.cpp
  9. +12 −0 common/common.h
  10. +17 −15 convert-hf-to-gguf.py
  11. +33 −20 convert-llama-ggml-to-gguf.py
  12. +3 −1 convert-persimmon-to-gguf.py
  13. +40 −20 convert.py
  14. BIN docs/llama-star/idea-arch.key
  15. BIN docs/llama-star/idea-arch.pdf
  16. +1 −0 examples/CMakeLists.txt
  17. +1 −1 examples/batched.swift/Sources/main.swift
  18. +1 −1 examples/finetune/README.md
  19. +7 −0 examples/infill/infill.cpp
  20. +1 −0 examples/llama.swiftui/.gitignore
  21. +7 −0 examples/llama.swiftui/README.md
  22. +176 −0 examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
  23. +5 −0 examples/llama.swiftui/llama.cpp.swift/bridging-header.h
  24. +481 −0 examples/llama.swiftui/llama.swiftui.xcodeproj/project.pbxproj
  25. +7 −0 examples/llama.swiftui/llama.swiftui.xcodeproj/project.xcworkspace/contents.xcworkspacedata
  26. +8 −0 ...s/llama.swiftui/llama.swiftui.xcodeproj/project.xcworkspace/xcshareddata/IDEWorkspaceChecks.plist
  27. +11 −0 examples/llama.swiftui/llama.swiftui/Assets.xcassets/AccentColor.colorset/Contents.json
  28. +13 −0 examples/llama.swiftui/llama.swiftui/Assets.xcassets/AppIcon.appiconset/Contents.json
  29. +6 −0 examples/llama.swiftui/llama.swiftui/Assets.xcassets/Contents.json
  30. +45 −0 examples/llama.swiftui/llama.swiftui/Models/LlamaState.swift
  31. +6 −0 examples/llama.swiftui/llama.swiftui/Preview Content/Preview Assets.xcassets/Contents.json
  32. 0 examples/llama.swiftui/llama.swiftui/Resources/models/.gitignore
  33. +42 −0 examples/llama.swiftui/llama.swiftui/UI/ContentView.swift
  34. +10 −0 examples/llama.swiftui/llama.swiftui/llama_swiftuiApp.swift
  35. +5 −0 examples/lookahead/CMakeLists.txt
  36. +487 −0 examples/lookahead/lookahead.cpp
  37. +31 −5 examples/main/main.cpp
  38. +10 −1 examples/parallel/parallel.cpp
  39. +49 −0 examples/server/README.md
  40. +366 −11 examples/server/server.cpp
  41. +15 −2 examples/speculative/speculative.cpp
  42. +24 −16 ggml-cuda.cu
  43. +2 −1 ggml-metal.m
  44. +7 −6 ggml.c
  45. +2 −3 ggml.h
  46. +1 −1 gguf-py/gguf/gguf_writer.py
  47. +162 −93 llama.cpp
  48. +55 −4 llama.h
  49. +0 −22 scripts/build-info.cmake
  50. +24 −0 scripts/gen-build-info-cpp.cmake
  51. +28 −28 tests/test-tokenizer-0-falcon.py
  52. +26 −26 tests/test-tokenizer-0-llama.py
11 changes: 11 additions & 0 deletions .github/workflows/build.yml
@@ -498,6 +498,17 @@ jobs:
path: |
cudart-llama-bin-win-cu${{ matrix.cuda }}-x64.zip
ios-xcode-build:
runs-on: macos-latest

steps:
- name: Checkout code
uses: actions/checkout@v3

- name: Build Xcode project
run: xcodebuild -project examples/llama.swiftui/llama.swiftui.xcodeproj -scheme llama.swiftui -sdk iphoneos CODE_SIGNING_REQUIRED=NO CODE_SIGN_IDENTITY= -destination 'generic/platform=iOS' build


# freeBSD-latest:
# runs-on: macos-12
# steps:
20 changes: 20 additions & 0 deletions .github/workflows/python-lint.yml
@@ -0,0 +1,20 @@
name: flake8 Lint

on: [push, pull_request]

jobs:
flake8-lint:
runs-on: ubuntu-latest
name: Lint
steps:
- name: Check out source repository
uses: actions/checkout@v3
- name: Set up Python environment
uses: actions/setup-python@v4
with:
python-version: "3.11"
- name: flake8 Lint
uses: py-actions/flake8@v2
with:
ignore: "E203,E211,E221,E225,E231,E241,E251,E261,E266,E501,E701,E704"
exclude: "examples/*,examples/*/**,*/**/__init__.py"
1 change: 1 addition & 0 deletions .gitignore
@@ -47,6 +47,7 @@ models-mnt
/libllama.so
/llama-bench
/llava-cli
/lookahead
/main
/metal
/perplexity
4 changes: 4 additions & 0 deletions CMakeLists.txt
@@ -43,6 +43,7 @@ else()
endif()

# general
option(BUILD_SHARED_LIBS "build shared libraries" OFF)
option(LLAMA_STATIC "llama: static link libraries" OFF)
option(LLAMA_NATIVE "llama: enable -march=native flag" ON)
option(LLAMA_LTO "llama: enable link time optimization" OFF)
@@ -100,6 +101,9 @@ option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALO
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_SERVER "llama: build server example" ON)

# Required for relocatable CMake package
include(${CMAKE_CURRENT_SOURCE_DIR}/scripts/build-info.cmake)

#
# Compile flags
#
5 changes: 4 additions & 1 deletion Makefile
@@ -2,7 +2,7 @@
BUILD_TARGETS = \
main quantize quantize-stats perplexity embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml \
simple batched batched-bench save-load-state server gguf llama-bench libllava.a llava-cli baby-llama beam-search \
speculative infill tokenize benchmark-matmult parallel finetune export-lora tests/test-c.o
speculative infill tokenize benchmark-matmult parallel finetune export-lora lookahead tests/test-c.o

# Binaries only useful for tests
TEST_TARGETS = \
@@ -657,6 +657,9 @@ speculative: examples/speculative/speculative.cpp ggml.o llama.o $(COMMON_DEPS)
parallel: examples/parallel/parallel.cpp ggml.o llama.o $(COMMON_DEPS) $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

lookahead: examples/lookahead/lookahead.cpp ggml.o llama.o $(COMMON_DEPS) $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

ifdef LLAMA_METAL
metal: examples/metal/metal.cpp ggml.o $(OBJS)
$(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS)
18 changes: 15 additions & 3 deletions README.md
@@ -10,7 +10,9 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

### Hot topics

- *No hot topics atm. Open to suggestions about what is hot today*
- Using `llama.cpp` with AWS instances: https://github.com/ggerganov/llama.cpp/discussions/4225
- Looking for contributions to improve and maintain the `server` example: https://github.com/ggerganov/llama.cpp/issues/4216
- Collecting Apple Silicon performance stats: https://github.com/ggerganov/llama.cpp/discussions/4167

----

@@ -114,6 +116,7 @@ as the main playground for developing new features for the [ggml](https://github
- [nat/openplayground](https://github.com/nat/openplayground)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- [withcatai/catai](https://github.com/withcatai/catai)
- [semperai/amica](https://github.com/semperai/amica)

---

@@ -410,19 +413,28 @@ Building the program with BLAS support may lead to some performance improvements
This provides BLAS acceleration on HIP-supported AMD GPUs.
Make sure to have ROCm installed.
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html).
Windows support is coming soon...
- Using `make`:
```bash
make LLAMA_HIPBLAS=1
```
- Using `CMake`:
- Using `CMake` for Linux:
```bash
mkdir build
cd build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake .. -DLLAMA_HIPBLAS=ON
cmake --build .
```
- Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS):
```bash
set PATH=%HIP_PATH%\bin;%PATH%
mkdir build
cd build
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
cmake --build .
```
Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 or 11.0.0 on RDNA3.
2 changes: 1 addition & 1 deletion common/CMakeLists.txt
@@ -26,7 +26,7 @@ add_custom_command(
COMMENT "Generating build details from Git"
COMMAND ${CMAKE_COMMAND} -DMSVC=${MSVC} -DCMAKE_C_COMPILER_VERSION=${CMAKE_C_COMPILER_VERSION}
-DCMAKE_C_COMPILER_ID=${CMAKE_C_COMPILER_ID} -DCMAKE_VS_PLATFORM_NAME=${CMAKE_VS_PLATFORM_NAME}
-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -P "${CMAKE_CURRENT_SOURCE_DIR}/../scripts/build-info.cmake"
-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -P "${CMAKE_CURRENT_SOURCE_DIR}/../scripts/gen-build-info-cpp.cmake"
WORKING_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/.."
DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp.in" ${GIT_INDEX}
VERBATIM
82 changes: 82 additions & 0 deletions common/common.cpp
@@ -12,6 +12,7 @@
#include <regex>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <cinttypes>
@@ -491,8 +492,12 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
params.interactive_first = true;
} else if (arg == "-ins" || arg == "--instruct") {
params.instruct = true;
} else if (arg == "-cml" || arg == "--chatml") {
params.chatml = true;
} else if (arg == "--infill") {
params.infill = true;
} else if (arg == "-dkvc" || arg == "--dump-kv-cache") {
params.dump_kv_cache = true;
} else if (arg == "--multiline-input") {
params.multiline_input = true;
} else if (arg == "--simple-io") {
@@ -730,6 +735,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
printf(" -i, --interactive run in interactive mode\n");
printf(" --interactive-first run in interactive mode and wait for input right away\n");
printf(" -ins, --instruct run in instruction mode (use with Alpaca models)\n");
printf(" -cml, --chatml run in chatml mode (use with ChatML-compatible models)\n");
printf(" --multiline-input allows you to write or paste multiple lines without ending each in '\\'\n");
printf(" -r PROMPT, --reverse-prompt PROMPT\n");
printf(" halt generation at PROMPT, return control in interactive mode\n");
@@ -832,6 +838,8 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
#endif // GGML_USE_CUBLAS
#endif
printf(" --verbose-prompt print prompt before generation\n");
printf(" -dkvc, --dump-kv-cache\n");
printf(" verbose print of the KV cache\n");
printf(" --simple-io use basic IO for better compatibility in subprocesses and limited consoles\n");
printf(" --lora FNAME apply LoRA adapter (implies --no-mmap)\n");
printf(" --lora-scaled FNAME S apply LoRA adapter with user defined scaling S (implies --no-mmap)\n");
@@ -1383,3 +1391,77 @@ void dump_non_result_info_yaml(FILE * stream, const gpt_params & params, const l
fprintf(stream, "typical_p: %f # default: 1.0\n", sparams.typical_p);
fprintf(stream, "verbose_prompt: %s # default: false\n", params.verbose_prompt ? "true" : "false");
}

//
// KV cache utils
//

void dump_kv_cache_view(const llama_kv_cache_view & view, int row_size) {
static const char slot_chars[] = ".123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+";

printf("=== Dumping KV cache. total cells %d, max sequences per cell %d, populated cells %d, total tokens in cache %d, largest empty slot=%d @ %d",
view.n_cells, view.n_max_seq, view.used_cells, view.token_count, view.max_contiguous, view.max_contiguous_idx);

llama_kv_cache_view_cell * c_curr = view.cells;
llama_seq_id * cs_curr = view.cells_sequences;

for (int i = 0; i < view.n_cells; i++, c_curr++, cs_curr += view.n_max_seq) {
if (i % row_size == 0) {
printf("\n%5d: ", i);
}
int seq_count = 0;
for (int j = 0; j < view.n_max_seq; j++) {
if (cs_curr[j] >= 0) { seq_count++; }
}
putchar(slot_chars[std::min(sizeof(slot_chars) - 2, size_t(seq_count))]);
}

printf("\n=== Done dumping\n");
}

void dump_kv_cache_view_seqs(const llama_kv_cache_view & view, int row_size) {
static const char slot_chars[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

printf("=== Dumping KV cache. total cells %d, max sequences per cell %d, populated cells %d, total tokens in cache %d, largest empty slot=%d @ %d\n",
view.n_cells, view.n_max_seq, view.used_cells, view.token_count, view.max_contiguous, view.max_contiguous_idx);

std::unordered_map<llama_seq_id, size_t> seqs;
llama_kv_cache_view_cell * c_curr = view.cells;
llama_seq_id * cs_curr = view.cells_sequences;

for (int i = 0; i < view.n_cells; i++, c_curr++, cs_curr += view.n_max_seq) {
for (int j = 0; j < view.n_max_seq; j++) {
if (cs_curr[j] < 0) { continue; }
if (seqs.find(cs_curr[j]) == seqs.end()) {
if (seqs.size() + 1 >= sizeof(slot_chars)) { break; }
seqs[cs_curr[j]] = seqs.size();
}
}
if (seqs.size() + 1 >= sizeof(slot_chars)) { break; }
}

printf("=== Sequence legend: ");
for (const auto & it : seqs) {
printf("%zu=%d, ", it.second, it.first);
}
printf("'+'=other sequence ids");

c_curr = view.cells;
cs_curr = view.cells_sequences;
for (int i = 0; i < view.n_cells; i++, c_curr++, cs_curr += view.n_max_seq) {
if (i % row_size == 0) {
printf("\n%5d: ", i);
}
for (int j = 0; j < view.n_max_seq; j++) {
if (cs_curr[j] >= 0) {
const auto & it = seqs.find(cs_curr[j]);
putchar(it != seqs.end() ? int(slot_chars[it->second]) : '+');
} else {
putchar('.');
}
}
putchar(' ');
}

printf("\n=== Done dumping\n");
}
12 changes: 12 additions & 0 deletions common/common.h
@@ -102,6 +102,7 @@ struct gpt_params {
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool chatml = false; // chatml mode (used for models trained on chatml syntax)
bool prompt_cache_all = false; // save user input and generations to prompt cache
bool prompt_cache_ro = false; // open the prompt cache read-only and do not update it

@@ -121,6 +122,7 @@ struct gpt_params {
bool numa = false; // attempt optimizations that help on some NUMA systems
bool verbose_prompt = false; // print prompt tokens before generation
bool infill = false; // use infill mode
bool dump_kv_cache = false; // dump the KV cache contents for debugging purposes

// multimodal models (see examples/llava)
std::string mmproj = ""; // path to multimodal projector
@@ -217,3 +219,13 @@ std::string get_sortable_timestamp();
void dump_non_result_info_yaml(
FILE * stream, const gpt_params & params, const llama_context * lctx,
const std::string & timestamp, const std::vector<int> & prompt_tokens, const char * model_desc);

//
// KV cache utils
//

// Dump the KV cache view with the number of sequences per cell.
void dump_kv_cache_view(const llama_kv_cache_view & view, int row_size = 80);

// Dump the KV cache view showing individual sequences in each cell (long output).
void dump_kv_cache_view_seqs(const llama_kv_cache_view & view, int row_size = 40);
32 changes: 17 additions & 15 deletions convert-hf-to-gguf.py
@@ -59,7 +59,7 @@ def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
from safetensors import safe_open
ctx = cast(ContextManager[Any], safe_open(self.dir_model / part_name, framework="pt", device="cpu"))
else:
ctx = contextlib.nullcontext(torch.load(self.dir_model / part_name, map_location="cpu"))
ctx = contextlib.nullcontext(torch.load(str(self.dir_model / part_name), map_location="cpu", mmap=True, weights_only=True))

with ctx as model_part:
for name in model_part.keys():
@@ -827,13 +827,14 @@ def set_gguf_parameters(self):
self.gguf_writer.add_embedding_length(hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count)
self.gguf_writer.add_feed_forward_length(hparams["intermediate_size"])
self.gguf_writer.add_rope_dimension_count(int(hparams["rope_pct"]*(hparams["hidden_size"] // hparams["num_attention_heads"])))
self.gguf_writer.add_rope_dimension_count(int(hparams["rope_pct"] * (hparams["hidden_size"] // hparams["num_attention_heads"])))
self.gguf_writer.add_head_count(hparams["num_attention_heads"])
self.gguf_writer.add_parallel_residual(hparams["use_parallel_residual"] if "use_parallel_residual" in hparams else True)
self.gguf_writer.add_layer_norm_eps(1e-5)

###### CONVERSION LOGIC ######


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Convert a huggingface model to a GGML compatible file")
parser.add_argument(
@@ -879,20 +880,21 @@ def parse_args() -> argparse.Namespace:

hparams = Model.load_hparams(dir_model)

model_class = Model.from_model_architecture(hparams["architectures"][0])
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian)
with torch.inference_mode():
model_class = Model.from_model_architecture(hparams["architectures"][0])
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian)

print("Set model parameters")
model_instance.set_gguf_parameters()
print("Set model parameters")
model_instance.set_gguf_parameters()

print("Set model tokenizer")
model_instance.set_vocab()
print("Set model tokenizer")
model_instance.set_vocab()

if args.vocab_only:
print(f"Exporting model vocab to '{fname_out}'")
model_instance.write_vocab()
else:
print(f"Exporting model to '{fname_out}'")
model_instance.write()
if args.vocab_only:
print(f"Exporting model vocab to '{fname_out}'")
model_instance.write_vocab()
else:
print(f"Exporting model to '{fname_out}'")
model_instance.write()

print(f"Model successfully exported to '{fname_out}'")
print(f"Model successfully exported to '{fname_out}'")