Initial version of aarch64 container with Vulkan #270

Open · wants to merge 1 commit into main from add_aarch64_vulkan_container

Conversation

@sroecker commented Oct 9, 2024

Initial version of an aarch64 container with Vulkan support that runs in libkrun containers on macOS.

@ericcurtin (Collaborator)

I think we should merge this... But we do have a Vulkan image based on Kompute also... It's all about naming... Please work with @slp and @rhatdan to agree on names...

@ericcurtin (Collaborator)

@sroecker please sign your commit; it is failing the DCO check. Run

git commit --amend -s

to sign the existing commit.
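For what it's worth, a hedged sketch of the full flow after amending (the branch name is taken from the force-push entry below; the rebase and force-push steps are the usual follow-up, not something spelled out in this thread):

# sign the most recent commit in place, keeping its message
git commit --amend -s --no-edit
# if several commits needed a sign-off, a rebase can add it to all of them:
# git rebase --signoff main
# rewritten history has to be force-pushed to update the PR branch
git push --force-with-lease origin add_aarch64_vulkan_container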

@sroecker force-pushed the add_aarch64_vulkan_container branch from 40672af to 3e41c4e on October 9, 2024, 16:00
@rhatdan (Member) commented Oct 9, 2024

I would prefer these all to be based off a base image containing all of the Python tools required to run ramalama, so that ROCm, Vulkan, ... can all share the lower layer.
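A minimal sketch of that layering, with hypothetical image names and package lists (nothing here is taken from the actual ramalama Containerfiles):

# shared lower layer: the Python tooling needed to run ramalama
FROM registry.fedoraproject.org/fedora:40 AS ramalama-base
RUN dnf install -y python3 python3-pip git cmake gcc-c++ && dnf clean all

# each accelerator image then only adds its own bits on top, e.g.:
# FROM quay.io/example/ramalama-base:latest
# RUN dnf install -y vulkan-loader-devel glslc && <build llama.cpp with GGML_VULKAN>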

@slp (Collaborator) commented Oct 9, 2024

@sroecker is llama.cpp working properly for you with a container generated from this Containerfile? Which models have you tested?

I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

@sroecker (Author) commented Oct 9, 2024

> @sroecker is llama.cpp working properly for you with a container generated from this Containerfile? Which models have you tested?
>
> I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

I had to test a smaller model due to machine constraints:
https://huggingface.co/MaziyarPanahi/SmolLM-1.7B-Instruct-GGUF/blob/main/SmolLM-1.7B-Instruct.Q4_K_M.gguf

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M1 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: Virtio-GPU Venus (Apple M1 Pro) buffer size =  1005.01 MiB
llm_load_tensors:        CPU buffer size =    78.75 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Virtio-GPU Venus (Apple M1 Pro) KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.19 MiB
llama_new_context_with_model: Virtio-GPU Venus (Apple M1 Pro) compute buffer size =   148.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 5

system_info: n_threads = 5 (n_threads_batch = 5) / 5 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 3671259997
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 0

Fibonacci: The Fibonacci sequence is a classic example of a recurrent sequence that can be used to model various phenomena, such as the growth of populations or the stock market.

In summary, recurrences are essential for modeling dynamic systems and capturing the underlying patterns and behaviors of these systems over time. [end of text]


llama_perf_sampler_print:    sampling time =       1.97 ms /    63 runs   (    0.03 ms per token, 32061.07 tokens per second)
llama_perf_context_print:        load time =    1363.18 ms
llama_perf_context_print: prompt eval time =     614.52 ms /     4 tokens (  153.63 ms per token,     6.51 tokens per second)
llama_perf_context_print:        eval time =    1134.46 ms /    58 runs   (   19.56 ms per token,    51.13 tokens per second)
llama_perf_context_print:       total time =    1755.12 ms /    62 tokens

I can check with the Kompute backend tomorrow.
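For context, a hypothetical invocation that would produce output like the above (the binary name, prompt, and -ngl value are assumptions; only the model file comes from the link above):

# run the quantized model with all layers offloaded to the Vulkan device
./llama-cli -m SmolLM-1.7B-Instruct.Q4_K_M.gguf -ngl 99 -p "Explain recurrences"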

@slp (Collaborator) commented Oct 9, 2024

Tested with Mistral-7B and Wizard-Vicuna-13B and got random answers with both of them. Sadly, the Vulkan backend is still broken for Apple Silicon GPUs upstream.

I think we're going to need to stay for a while with the Kompute backend, as implemented in #235.


RUN git clone https://github.com/ggerganov/whisper.cpp.git && \
cd whisper.cpp && \
git reset --hard ${WHISPER_CPP_SHA} && \
Collaborator (review comment on the snippet above):

Is it possible to build whisper.cpp with GGML_VULKAN also?

@sroecker (Author):

I just tried setting the env variable GGML_VULKAN=ON, which failed because ggml-vulkan-shaders.hpp was missing.
It seems additional steps, like running ggml_vk_generate_shaders.py beforehand, are needed.
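A sketch of what was attempted, per the comment above (untested; whether the whisper.cpp build at the pinned SHA honors this flag, and how the shaders get generated, is exactly what is unresolved here):

cd whisper.cpp
# fails as-is: ggml-vulkan-shaders.hpp is not checked in, so a shader-generation
# step (e.g. ggml_vk_generate_shaders.py) would have to run before the compile
GGML_VULKAN=ON make -j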

RUN git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
git reset --hard ${LLAMA_CPP_SHA} && \
#cmake -B build -DCMAKE_INSTALL_PREFIX:PATH=/usr -DGGML_CCACHE=0 && \
@ericcurtin (Collaborator) commented Oct 10, 2024 (on the snippet above):

Might as well remove this commented-out line.
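The snippet above with that line dropped (the rest of the build step, not shown in this excerpt, stays unchanged):

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    ...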

@ericcurtin (Collaborator) commented Oct 10, 2024

I think this image should inherit from the Kompute one, as @rhatdan said; if anything, it makes the Containerfiles easier to maintain and reduces duplication.

I notice this container image was put in an aarch64 directory. This GGML_VULKAN image and the GGML_KOMPUTE one it inherits from should both be x86_64/aarch64 multi-arch images.

@slp do you think the GGML_VULKAN backend might work on other, non-Apple GPUs? I was thinking we merge this anyway, with Kompute being the primary Vulkan backend, but one could switch to this one via a command-line option in ramalama if they wish.
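A hedged sketch of building such a multi-arch image with podman (image name and Containerfile path are hypothetical, and cross-building assumes qemu-user-static is set up on the build host):

# build for both architectures and collect the results under one manifest list
podman build --platform linux/amd64,linux/arm64 \
    --manifest quay.io/example/ramalama-vulkan:latest \
    -f container-images/vulkan/Containerfile .
# push the manifest list, which carries both per-arch images
podman manifest push quay.io/example/ramalama-vulkan:latest docker://quay.io/example/ramalama-vulkan:latest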

@ericcurtin (Collaborator) commented Oct 10, 2024

Pretty much all images that can be x86_64/aarch64 should be. I think the ROCm one will be x86_64 only, but that's because some of the required components aren't built for aarch64.

@rhatdan (Member) commented Oct 10, 2024

Yes, all images should be available in as many arches as makes sense.

@rhatdan (Member) commented Oct 15, 2024

@slp @ericcurtin we need to get this in, so we can run on Mac with Containers.

@rhatdan (Member) commented Oct 18, 2024

@pufferffish this might interest you.
