Windows performance in Release profile seems crippled when building dev Cargo profile (RelWithDebInfo is faster) #649
I've observed slower-than-expected performance using this crate compared to the llama.cpp CLI (taking the avx2 binary from their GitHub releases), even when using `--release`.
@babichjacob please share. I'm also observing a ~30% slowdown vs the stock llama.cpp CLI. Is that ballpark what you're seeing as well? Would love to track that down together.
Seems even slower for me. I did a crude comparison with `hyperfine` and saw this crate run at approximately 1/5th the speed for the same task:
Then, I took some code from the `simple` example and put it in `usage` to get fairer measurements, seeing about 1/2.6th the generation speed:
So 3.83 / 10.02 ≈ 0.38: running at about 38% of, or 1/2.6th, the speed of llama.cpp. I'll report back with increasingly sophisticated measurements as I learn how to get them (I'm very new to this), like a flamegraph maybe? Not sure if that'll point to any places time is lost that originate in this codebase rather than in llama.cpp itself.
We noticed the same thing in NobodyWho last week. Not sure if it's a new thing, or if we're only just noticing it. We mostly work on Linux, where we get something like 98% of llama-bench's performance, but a Windows user of our project noticed that it is significantly slower than llama.cpp. FWIW, we're seeing it run at about 23% of the t/s that pure llama.cpp does (only on Windows). I just finished setting up a Windows VM to investigate 😅 Will post results if I find anything noteworthy.
I'm not convinced that the Windows performance reduction was fixed by #651 (or I don't understand it). I'm slowly getting better at profiling and performance analysis, and haven't found anything super scary in the flamegraphs yet. But a weird thing I'm noticing is that the numbers I get from binaries built with `--release` are worse than without it. Gut feel is that something is off in `build.rs`.
@AsbjornOlling even in Rust debug builds, build.rs tries to build llama.cpp itself optimized. However, on Windows only, the underlying cmake crate has a bug where it explicitly turns off optimization flags for an MSVC Release build. This ONLY impacts debug builds (i.e. `cargo build`) and not release (`cargo build --release`). If `cargo build` is faster than `cargo build --release`, then my patch was somehow incomplete, or you're on an old release without that fix. After my patch I saw that debug and release had identical performance, but that number was still 30% off from the llama.cpp CLI. Please let me know if you have any insights.
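For readers following along, here's a minimal build.rs sketch of that kind of workaround, assuming the `cmake` crate's `Config` API. This is an assumed shape for illustration, not the actual patch:

```rust
// build.rs — sketch (assumed shape, not the actual patch) of re-adding
// the MSVC optimization flags that cmake-rs strips when cargo is in the
// debug profile but llama.cpp is configured as a CMake Release build.
fn main() {
    let mut config = cmake::Config::new("llama.cpp");
    config.profile("Release"); // always build llama.cpp itself optimized

    let cargo_is_debug = std::env::var("PROFILE").as_deref() == Ok("debug");
    let target_is_msvc =
        std::env::var("CARGO_CFG_TARGET_ENV").as_deref() == Ok("msvc");
    if target_is_msvc && cargo_is_debug {
        // cmake-rs erroneously drops /O2 from the Release flags in this
        // combination (the upstream bug mentioned above), so force them back.
        config.define("CMAKE_C_FLAGS_RELEASE", "/O2 /Ob2 /DNDEBUG");
        config.define("CMAKE_CXX_FLAGS_RELEASE", "/O2 /Ob2 /DNDEBUG");
    }

    let dst = config.build();
    println!("cargo:rustc-link-search=native={}", dst.join("lib").display());
}
```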
(sorry about the comment right before this that I just deleted; I forgot to ….) I can corroborate what Asbjørn has found. Running commit f4b9657 on Windows 11 using MSVC with a Ryzen 5 5600 (so AVX2 is available):
Where the ✅s are around 11 t/s and the ❌s are around 4 t/s for me. The command I was getting measurements from was …. I modified `examples/usage.rs` as follows:
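A rough sketch of the kind of tokens-per-second measurement involved; `generate_one_token` is a hypothetical stand-in for the real decode/sample/detokenize calls into the library:

```rust
use std::time::Instant;

// Hypothetical stand-in for the real llama-cpp-2 decode/sample/detokenize
// step; in the actual example this is where the library is called.
fn generate_one_token() -> String {
    "tok".to_string()
}

fn main() {
    let n_tokens = 256;
    let start = Instant::now();
    for _ in 0..n_tokens {
        let _token = generate_one_token();
    }
    let elapsed = start.elapsed().as_secs_f64();
    println!("generated {n_tokens} tokens: {:.2} t/s", n_tokens as f64 / elapsed);
}
```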
@vlovich I see the bug workaround was restricted to cargo's debug profile: `llama-cpp-rs/llama-cpp-sys-2/build.rs`, lines 280–281 at f4b9657.
I'm going to test reducing that conditional (to something like the sketch below) to cover all variations of (llama.cpp's) release mode and to lift the restriction to (cargo's) debug profile, and post a new table with my findings. Do you know, or can you guess, how it's somehow not slower to run in release mode with the current commit? And can you check whether my proposed change regresses that on your end?
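A sketch of the relaxed conditional; the surrounding build.rs code is assumed, not shown:

```rust
/// Sketch of the proposed conditional: apply the flag workaround for every
/// optimized CMake build type, no longer gated on cargo's debug profile.
fn needs_msvc_flag_workaround(build_type: &str) -> bool {
    matches!(build_type, "Release" | "RelWithDebInfo" | "MinSizeRel")
}
```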
@babichjacob look at the generated CMakeCache.txt and the CXXFLAGS_RELEASE and CFLAGS_RELEASE variables. The /O2 is only missing when Rust is built debug. In --release, the /O2 is there. I'm not sure what you're seeing, but I linked the upstream bug report where I highlight the relevant code from cmake.rs, which indeed applies their opt-flag filtering erroneously only when building Rust debug while configuring cmake to be release.
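One quick way to check is to dump the relevant cache entries; a small illustrative snippet (the path is hypothetical and depends on your target directory — a plain text search works just as well):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Hypothetical path; substitute your own target/.../out/build directory.
    let path = "target/release/build/llama-cpp-sys-2-<hash>/out/build/CMakeCache.txt";
    for line in fs::read_to_string(path)?.lines() {
        // Print the per-build-type flag variables and the build type so you
        // can check whether /O2 and /DNDEBUG made it into the Release set.
        if line.contains("_FLAGS_RELEASE") || line.contains("CMAKE_BUILD_TYPE") {
            println!("{line}");
        }
    }
    Ok(())
}
```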
Thanks for the hint. Really tricky situation.
Regardless of whether your patch is enabled, and regardless of the cargo/Rust profile (where you're saying it should be there when compiled for release), the /O2 flag is missing from the Release flags on my machine.
I suspect this is how it is for Asbjørn too. I'll also try wrangling a GitHub Actions worker and see what its flags look like. With your patch enabled, I can see that ….
(You can also see here that ….) So, if I understand you correctly, this is what you see in release mode (regardless of whether your patch is applied, because it's conditional on debug):
Which is also the expected arrangement if the bug were resolved and the workaround wasn't needed (and it held regardless of whether cargo is in the release or debug profile). I looked at the linked upstream bug report. Thanks again for pointing me in the right direction!
Look at the flag-filtering logic I pointed to in cmake-rs and how the build type is part of the variable they apply the filtering to. It's really convoluted and weird.
From what I can tell, I am only getting the /O2 flags when using the default cargo profile (debug?). The /O2, /DNDEBUG, and /Ob2 flags are only used when I don't build with --release. Running in this repo at commit 8b11c5c, so with @vlovich's patch. With --release:
Without --release:
If I apply the suggested changes to examples/usage.rs and test with and without --release, I get this (usage.rs benchmarks). Note that I am using the `vulkan` feature. This is all running on a Windows 11 virtual machine, hosted by a Proxmox hypervisor with an Nvidia A6000 GPU and the Nvidia 572.16 driver. With --release:
Without --release:
For comparison, llama-bench.exe from the pre-built vulkan binaries of llama.cpp GitHub release b4745 gives 83.41 t/s. The drop to 60 t/s is significant, but not as bad as I've seen in other tests... Changing the conditional in build.rs as @babichjacob suggested removes the difference in benchmarking. I also tried building for x86_64-pc-windows-gnu, to try to work around the MSVC weirdness by using mingw/gcc instead. I haven't been able to make it work yet.
I have no objection. Accelerators (Vulkan/CUDA) generally experience less of a slowdown, but I'm curious why there's still a slowdown. I wonder if the FFI-boundary UTF-8 validation, converting from CStr into a Rust String, is really adding that much overhead. Do keep in mind that you want to check CXX_FLAGS as well; when building with --release, the /O2 ends up there, if I recall correctly.
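For reference, the kind of conversion being discussed looks roughly like this; the function name and exact code path are illustrative, not the crate's actual implementation:

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Illustrative FFI-boundary conversion; `ptr` would come from llama.cpp's
// C API in the real code.
unsafe fn piece_to_string(ptr: *const c_char) -> String {
    let cstr = CStr::from_ptr(ptr);
    // to_string_lossy() walks the bytes once to validate UTF-8 and copies
    // them out: a real per-token cost, but tiny for short token pieces.
    cstr.to_string_lossy().into_owned()
}
```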
Great. I made a PR.
I am very curious too! I have a hard time believing that the string conversions are that expensive; after all, it would have to account for several milliseconds. I tried an experiment where I removed detokenization from my program and instead just yielded a constant string each time, while still doing the full decode/sample dance. It didn't give me a measurably faster t/s speed. But I am observing a bigger slowdown in nobodywho, which does Rust/C++ string conversion on both ends of the program (actually …).
If you look at the first foldable section in my previous comment, it does include the CMAKE_CXX_FLAGS_RELEASE too.
I'm not sure if this is something specific to my configuration, but I'm seeing this across the board in CI and on my local machine: the Release variant compiler flags are missing /O2 (and /DNDEBUG). I don't think this has anything to do with this project per se, as I suspect the bug lies in the cmake crate, but I wanted to flag it. This results in CPU inference being >2x slower on my AMD machine, with performance equivalent to a debug llama-cli. Switching to RelWithDebInfo speeds it up quite a bit.
It's also worth noting that stock cmake configures /Ob2 as well for Release, but I didn't observe any meaningful performance difference from that.
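For anyone wanting to try the RelWithDebInfo route, here is a minimal build-script sketch, assuming the `cmake` crate; this is not the project's actual build.rs:

```rust
// build.rs sketch: configure llama.cpp as RelWithDebInfo instead of
// Release, sidestepping the broken Release flag handling described above.
fn main() {
    let dst = cmake::Config::new("llama.cpp")
        .profile("RelWithDebInfo") // sets CMAKE_BUILD_TYPE=RelWithDebInfo
        .build();
    println!("cargo:rustc-link-search=native={}", dst.join("lib").display());
}
```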