Initial version of aarch64 container with Vulkan #270

Open · wants to merge 1 commit into main from add_aarch64_vulkan_container

Conversation

@sroecker commented Oct 9, 2024

Initial version of an aarch64 container with Vulkan support that runs in libkrun containers on macOS.

@ericcurtin (Collaborator)

I think we should merge this... But we do have a Vulkan image based on Kompute also... It's all about naming... Please work with @slp and @rhatdan to agree on names...

@ericcurtin (Collaborator)

@sroecker please sign your commit; it is failing the DCO check. Run

git commit --amend -s

to sign the existing commit.
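For what it's worth, a hedged sketch of the full flow after amending (the branch name is taken from the force-push entry below; the rebase and force-push steps are the usual follow-up, not something spelled out in this thread):

# sign the most recent commit in place, keeping its message
git commit --amend -s --no-edit
# if several commits needed a sign-off, a rebase can add it to all of them:
# git rebase --signoff main
# rewritten history has to be force-pushed to update the PR branch
git push --force-with-lease origin add_aarch64_vulkan_container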

@sroecker force-pushed the add_aarch64_vulkan_container branch from 40672af to 3e41c4e on October 9, 2024, 16:00
@rhatdan (Member) commented Oct 9, 2024

I would prefer these all to be based off a base image containing all of the Python tools required to run ramalama, so that ROCm, Vulkan, ... can all share the lower layer.
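A minimal sketch of that layering, with hypothetical image names and package lists (nothing here is taken from the actual ramalama Containerfiles):

# shared lower layer: the Python tooling needed to run ramalama
FROM registry.fedoraproject.org/fedora:40 AS ramalama-base
RUN dnf install -y python3 python3-pip git cmake gcc-c++ && dnf clean all

# each accelerator image then only adds its own bits on top, e.g.:
# FROM quay.io/example/ramalama-base:latest
# RUN dnf install -y vulkan-loader-devel glslc && <build llama.cpp with GGML_VULKAN>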

@slp (Collaborator) commented Oct 9, 2024

@sroecker is llama.cpp working properly for you with a container generated from this Containerfile? Which models have you tested?

I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

@sroecker (Author) commented Oct 9, 2024

> @sroecker is llama.cpp working properly for you with a container generated from this Containerfile? Which models have you tested?
>
> I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

I had to test a smaller model due to machine constraints:
https://huggingface.co/MaziyarPanahi/SmolLM-1.7B-Instruct-GGUF/blob/main/SmolLM-1.7B-Instruct.Q4_K_M.gguf

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M1 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: Virtio-GPU Venus (Apple M1 Pro) buffer size =  1005.01 MiB
llm_load_tensors:        CPU buffer size =    78.75 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Virtio-GPU Venus (Apple M1 Pro) KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.19 MiB
llama_new_context_with_model: Virtio-GPU Venus (Apple M1 Pro) compute buffer size =   148.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 5

system_info: n_threads = 5 (n_threads_batch = 5) / 5 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 3671259997
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 0

Fibonacci: The Fibonacci sequence is a classic example of a recurrent sequence that can be used to model various phenomena, such as the growth of populations or the stock market.

In summary, recurrences are essential for modeling dynamic systems and capturing the underlying patterns and behaviors of these systems over time. [end of text]


llama_perf_sampler_print:    sampling time =       1.97 ms /    63 runs   (    0.03 ms per token, 32061.07 tokens per second)
llama_perf_context_print:        load time =    1363.18 ms
llama_perf_context_print: prompt eval time =     614.52 ms /     4 tokens (  153.63 ms per token,     6.51 tokens per second)
llama_perf_context_print:        eval time =    1134.46 ms /    58 runs   (   19.56 ms per token,    51.13 tokens per second)
llama_perf_context_print:       total time =    1755.12 ms /    62 tokens

I can check with the Kompute backend tomorrow.
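For context, a hypothetical invocation that would produce output like the above (the binary name, prompt, and -ngl value are assumptions; only the model file comes from the link above):

# run the quantized model with all layers offloaded to the Vulkan device
./llama-cli -m SmolLM-1.7B-Instruct.Q4_K_M.gguf -ngl 99 -p "Explain recurrences"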

@slp (Collaborator) commented Oct 9, 2024

Tested with Mistral-7B and Wizard-Vicuna-13B and got random answers with both of them. Sadly, the Vulkan backend is still broken for Apple Silicon GPUs upstream.

I think we're going to need to stay for a while with the Kompute backend, as implemented in #235.


RUN git clone https://github.com/ggerganov/whisper.cpp.git && \
cd whisper.cpp && \
git reset --hard ${WHISPER_CPP_SHA} && \
Collaborator (review comment on the snippet above):

Is it possible to build whisper.cpp with GGML_VULKAN also?

@sroecker (Author):

I just tried setting the env variable GGML_VULKAN=ON, which failed because ggml-vulkan-shaders.hpp was missing.
It seems additional steps, like running ggml_vk_generate_shaders.py beforehand, are needed.
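A sketch of what was attempted, per the comment above (untested; whether the whisper.cpp build at the pinned SHA honors this flag, and how the shaders get generated, is exactly what is unresolved here):

cd whisper.cpp
# fails as-is: ggml-vulkan-shaders.hpp is not checked in, so a shader-generation
# step (e.g. ggml_vk_generate_shaders.py) would have to run before the compile
GGML_VULKAN=ON make -j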

RUN git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
git reset --hard ${LLAMA_CPP_SHA} && \
#cmake -B build -DCMAKE_INSTALL_PREFIX:PATH=/usr -DGGML_CCACHE=0 && \
@ericcurtin (Collaborator) commented Oct 10, 2024 (on the snippet above):

Might as well remove this commented-out line.
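The snippet above with that line dropped (the rest of the build step, not shown in this excerpt, stays unchanged):

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    ...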

@ericcurtin (Collaborator) commented Oct 10, 2024

I think this image should inherit from the Kompute one, as @rhatdan said; if anything, it makes the Containerfiles easier to maintain and reduces duplication.

I notice this container image was put in an aarch64 directory. This GGML_VULKAN image and the GGML_KOMPUTE one it inherits from should both be x86_64/aarch64 multi-arch images.

@slp do you think the GGML_VULKAN backend might work on other, non-Apple GPUs? I was thinking we merge this anyway, with Kompute being the primary Vulkan backend, but one could switch to this one via a command-line option in ramalama if they wish.
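A hedged sketch of building such a multi-arch image with podman (image name and Containerfile path are hypothetical, and cross-building assumes qemu-user-static is set up on the build host):

# build for both architectures and collect the results under one manifest list
podman build --platform linux/amd64,linux/arm64 \
    --manifest quay.io/example/ramalama-vulkan:latest \
    -f container-images/vulkan/Containerfile .
# push the manifest list, which carries both per-arch images
podman manifest push quay.io/example/ramalama-vulkan:latest docker://quay.io/example/ramalama-vulkan:latest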

@ericcurtin (Collaborator) commented Oct 10, 2024

Pretty much all images that can be x86_64/aarch64 should be. I think the ROCm one will be x86_64 only, but that's because some of the required components aren't built for aarch64.

@rhatdan (Member) commented Oct 10, 2024

Yes, all images should be available in as many arches as makes sense.

@rhatdan (Member) commented Oct 15, 2024

@slp @ericcurtin we need to get this in, so we can run on Mac with Containers.

@rhatdan (Member) commented Oct 18, 2024

@pufferffish this might interest you.
