
Windows performance in Release profile seems crippled when building dev Cargo profile (RelWithDebInfo is faster) #649

Closed
vlovich opened this issue Feb 13, 2025 · 14 comments · Fixed by #651
Labels
🏗 build · 🚀 performance (there is something that should be faster) · 🪟 windows (windows os issues)

Comments

@vlovich
Contributor

vlovich commented Feb 13, 2025

I'm not sure if this is something specific to my configuration, but I'm seeing it across the board in CI and on my local machine: the Release-variant compiler flags are missing /O2 (and /DNDEBUG). I don't think this has anything to do with this project per se, as I suspect the bug lies in the cmake crate, but I wanted to flag it. The result is that CPU inference is more than 2x slower on my AMD machine, with performance equivalent to a debug llama-cli. Switching to RelWithDebInfo speeds it up quite a bit.

//Flags used by the CXX compiler during RELEASE builds.
CMAKE_CXX_FLAGS_RELEASE:STRING= -nologo -MD -Brepro

//Flags used by the CXX compiler during RELWITHDEBINFO builds.
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

//Flags used by the C compiler during RELEASE builds.
CMAKE_C_FLAGS_RELEASE:STRING= -nologo -MD -Brepro

//Flags used by the C compiler during RELWITHDEBINFO builds.
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

It's also worth noting that stock cmake configures /Ob2 as well for Release, but I didn't observe any meaningful performance difference from that.
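For reference, one way to guard against this from a build script is to push the MSVC optimization flags through the cmake crate explicitly. The following is a hedged sketch of such a workaround, not this crate's actual build.rs; the "llama.cpp" source path and the exact flag set are assumptions for illustration:

// Sketch of a build.rs workaround: append the MSVC optimization flags
// ourselves so they survive even if cmake-rs drops them from
// CMAKE_*_FLAGS_RELEASE when the host cargo profile is debug.
fn main() {
    let mut cfg = cmake::Config::new("llama.cpp");
    cfg.profile("Release");

    // Only MSVC needs this; GNU-style toolchains keep their -O flags.
    let target = std::env::var("TARGET").unwrap_or_default();
    if target.contains("msvc") {
        for flag in ["/O2", "/Ob2", "/DNDEBUG"] {
            cfg.cflag(flag);
            cfg.cxxflag(flag);
        }
    }

    let dst = cfg.build();
    println!("cargo:rustc-link-search=native={}", dst.join("lib").display());
}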

vlovich changed the title from "Windows performance in Release profile seems crippled (RelWithDebInfo has faster CPU inference)" to "Windows performance in Release profile seems crippled when building Debug (RelWithDebInfo is faster)" on Feb 13, 2025
vlovich changed the title from "Windows performance in Release profile seems crippled when building Debug (RelWithDebInfo is faster)" to "Windows performance in Release profile seems crippled when building dev Cargo profile (RelWithDebInfo is faster)" on Feb 13, 2025
@babichjacob
Contributor

babichjacob commented Feb 13, 2025

I've observed slower-than-expected performance using this crate in comparison to the llama.cpp CLI (taking the avx2 binary from their GitHub releases) even when using --release.

MarcusDunn added the 🚀 performance and 🏗 build labels on Feb 13, 2025
@vlovich
Contributor Author

vlovich commented Feb 13, 2025

@babichjacob please share. I'm also observing a ~30% slowdown vs stock llama.cpp CLI. Is that ballpark what you're seeing as well? Would love to track that down together.

@babichjacob
Contributor

babichjacob commented Feb 14, 2025

Seems even slower for me.

I did a crude comparison with `hyperfine` and saw this crate run at approximately 1/5th the speed for the same task:
> hyperfine --warmup 2 --runs 4 '.../llama.cpp/build/bin/Release/llama-simple.exe -m ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf" -n 64 "<|im_start|>user\nHello! how are you?<|im_end|>\n<|im_start|>assistant\n"' 'cargo run --release --features native --example usage ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf"'  

Benchmark 1: .../llama.cpp/build/bin/Release/llama-simple.exe -m ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf" -n 64 "<|im_start|>user\nHello! how are you?<|im_end|>\n<|im_start|>assistant\n"
  Time (mean ± σ):      3.850 s ±  0.139 s    [User: 9.374 s, System: 1.404 s]
  Range (min … max):    3.767 s …  4.058 s    4 runs

Benchmark 2: cargo run --release --features native --example usage ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf"
  Time (mean ± σ):     20.843 s ±  0.389 s    [User: 46.550 s, System: 2.572 s]
  Range (min … max):   20.550 s … 21.392 s    4 runs

Summary
  .../llama.cpp/build/bin/Release/llama-simple.exe -m ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf" -n 64 "<|im_start|>user\nHello! how are you?<|im_end|>\n<|im_start|>assistant\n" ran
    5.41 ± 0.22 times faster than cargo run --release --features native --example usage ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf"

(I based the llama-simple command on the usage example so that each program would be doing the same thing. The warmup runs were there to make sure program compilation, shader compilation, and other cache-related work was done before measurement started.)

Then, I took some code from the `simple` example and put it in `usage` to get fairer measurements, and saw about 1/2.6th the generation speed:
> cargo run --release --features native --example usage ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf"

[omitted irrelevant parts. the prompt and generated text is the same as in llama-simple below]
decoded 128 tokens in 33.42 s, speed 3.83 t/s
> .../llama.cpp/build/bin/Release/llama-simple.exe -m ".../Qwen2.5-3B-Instruct-Q6_K_L.gguf" -n 128 "<|im_start|>user\nWrite a 1000 word paragraph about llama-cpp-rs, a way to use llama.cpp from Rust.<|im_end|>\n<|im_start|>assistant\n"

[omitted irrelevant parts. the prompt and generated text is the same as in llama-cpp-rs/usage above]
decoded 128 tokens in 12.77 s, speed: 10.02 t/s

llama_perf_sampler_print:    sampling time =      20.72 ms /   128 runs   (    0.16 ms per token,  6178.20 tokens per second)
llama_perf_context_print:        load time =    1816.34 ms
llama_perf_context_print: prompt eval time =    1395.03 ms /    33 tokens (   42.27 ms per token,    23.66 tokens per second)
llama_perf_context_print:        eval time =   11260.76 ms /   127 runs   (   88.67 ms per token,    11.28 tokens per second)
llama_perf_context_print:       total time =   13190.89 ms /   160 tokens

So 3.83 / 10.02 ≈ 0.38: this crate is running at about 38% of (roughly 1/2.6th) the speed of llama.cpp.

I'll report back with increasingly sophisticated measurements as I learn how to get them (I'm very new to it), like a flamegraph maybe? Not sure if that'll point to any places time is lost that originate from this codebase rather than from llama.cpp itself.

@AsbjornOlling
Contributor

We noticed the same thing in NobodyWho last week. Not sure if it's a new thing, or if we're only just noticing it.

We mostly work on Linux, where we get something like 98% of the performance of llama-bench, but a Windows user of our project noticed that it is significantly slower than llama.cpp.

FWIW, we're seeing it run at something like 23% of the t/s that pure llama.cpp gets (only on Windows).

I just finished setting up a windows VM to investigate 😅 Will post results if I find anything noteworthy.

@AsbjornOlling
Contributor

I'm not convinced that the windows performance reduction was fixed by #651 (or I don't understand it)

I'm slowly getting better at profiling and performance analysis, and haven't found anything super scary in the flamegraphs yet.

But a weird thing I'm noticing is that the numbers I get from binaries built with cargo build --release are significantly slower than the numbers I get from binaries built with just cargo build.

My gut feeling is that something is off in build.rs. I'll have a closer look at it when I get the time.

@vlovich
Contributor Author

vlovich commented Feb 20, 2025

@AsbjornOlling even in Rust debug builds, build.rs tries to build llama.cpp itself optimized. However, on Windows only, the underlying cmake crate has a bug where it explicitly turns off the optimization flags for the MSVC Release configuration. This ONLY impacts debug builds (i.e. cargo build) and not release (cargo build --release).

If cargo build is faster than cargo build --release, then my patch was somehow incomplete or you're on an old release without that fix.

After my patch, I saw identical performance for debug and release, but that number was still ~30% off from the llama.cpp CLI. Please let me know if you have any insights.
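For anyone trying to pin down which side of that split a given build lands on, cargo exposes its own profile to build scripts through standard environment variables. A minimal sketch (nothing here is specific to this crate):

// Minimal build.rs sketch: report which cargo profile is driving the build.
// PROFILE and OPT_LEVEL are standard env vars cargo sets for build scripts.
fn main() {
    let profile = std::env::var("PROFILE").unwrap_or_default(); // "debug" or "release"
    let opt_level = std::env::var("OPT_LEVEL").unwrap_or_default();
    println!("cargo:warning=cargo PROFILE={profile}, OPT_LEVEL={opt_level}");
}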

@babichjacob
Contributor

babichjacob commented Feb 20, 2025

(sorry about the comment right before this that I just deleted; I forgot to pull to actually test with the PR merged in 🤦)

I can corroborate what Asbjørn has found. Running commit f4b9657 on Windows 11 using MSVC with a Ryzen 5 5600 (so AVX2 is available):

| cargo profile | LLAMA_LIB_PROFILE | generation speed in my crude one-shot test is close to llama.cpp native |
| --- | --- | --- |
| release | RelWithDebInfo | |
| release | MinSizeRel | |
| release | Release | |
| release | (unspecified) | |
| debug | RelWithDebInfo | |
| debug | MinSizeRel | |
| debug | Release | |
| debug | (unspecified) | |

Where the ✅s are around 11 t/s and the ❌s are around 4 t/s for me.

The command I took measurements from was `cargo run --features native --example usage .../Qwen2.5-3B-Instruct-Q6_K_L.gguf` (with modifications to the command as reflected in the table). I ran `cargo clean` in between to prevent weirdness.

I modified usage.rs to get measurements like this:
//! # Usage
//!
//! This is just about the smallest possible way to do inference. To fetch a model from hugging face:
//!
//! ```console
//! git clone --recursive https://github.com/utilityai/llama-cpp-rs
//! cd llama-cpp-rs/examples/usage
//! wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-q4_0.gguf
//! cargo run --example usage -- qwen2-1_5b-instruct-q4_0.gguf
//! ```
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::LlamaModel;
use llama_cpp_2::model::{AddBos, Special};
use llama_cpp_2::sampling::LlamaSampler;
use llama_cpp_2::{ggml_time_us, send_logs_to_tracing, LogOptions};
use std::io::Write;
use std::time::Duration;

#[allow(clippy::cast_possible_wrap, clippy::cast_possible_truncation)]
fn main() {
    let model_path = std::env::args().nth(1).expect("Please specify model path");
    tracing_subscriber::fmt::init();
    send_logs_to_tracing(LogOptions::default().with_logs_enabled(true));

    let backend = LlamaBackend::init().unwrap();
    let params = LlamaModelParams::default();

    let prompt =
        "<|im_start|>user\nWrite a 1000 word paragraph about llama-cpp-rs, a way to use llama.cpp from Rust.<|im_end|>\n<|im_start|>assistant\n".to_string();
    LlamaContextParams::default();
    let model =
        LlamaModel::load_from_file(&backend, model_path, &params).expect("unable to load model");
    let ctx_params = LlamaContextParams::default();
    let mut ctx = model
        .new_context(&backend, ctx_params)
        .expect("unable to create the llama_context");
    let tokens_list = model
        .str_to_token(&prompt, AddBos::Always)
        .unwrap_or_else(|_| panic!("failed to tokenize {prompt}"));
    let n_len = 128;

    // create a llama_batch with size 512
    // we use this object to submit token data for decoding
    let mut batch = LlamaBatch::new(512, 1);

    let last_index = tokens_list.len() as i32 - 1;
    for (i, token) in (0_i32..).zip(tokens_list.into_iter()) {
        // llama_decode will output logits only for the last token of the prompt
        let is_last = i == last_index;
        batch.add(token, i, &[0], is_last).unwrap();
    }
    ctx.decode(&mut batch).expect("llama_decode() failed");

    let mut n_cur = batch.n_tokens();
    let mut n_decode = 0;

    let t_main_start = ggml_time_us();

    // The `Decoder`
    let mut decoder = encoding_rs::UTF_8.new_decoder();
    let mut sampler = LlamaSampler::greedy();

    while n_decode < n_len {
        // sample the next token
        {
            let token = sampler.sample(&ctx, batch.n_tokens() - 1);

            sampler.accept(token);

            // is it an end of stream?
            if token == model.token_eos() {
                eprintln!();
                break;
            }

            let output_bytes = model.token_to_bytes(token, Special::Tokenize).unwrap();
            // use `Decoder.decode_to_string()` to avoid the intermediate buffer
            let mut output_string = String::with_capacity(32);
            let _decode_result = decoder.decode_to_string(&output_bytes, &mut output_string, false);
            print!("{output_string}");
            std::io::stdout().flush().unwrap();

            batch.clear();
            batch.add(token, n_cur, &[0], true).unwrap();
        }

        n_cur += 1;

        ctx.decode(&mut batch).expect("failed to eval");

        n_decode += 1;
    }
    eprintln!();

    let t_main_end = ggml_time_us();

    let duration = Duration::from_micros((t_main_end - t_main_start) as u64);

    eprintln!(
        "decoded {} tokens in {:.2} s, speed {:.2} t/s\n",
        n_decode,
        duration.as_secs_f32(),
        n_decode as f32 / duration.as_secs_f32()
    );
}

@babichjacob
Contributor

babichjacob commented Feb 20, 2025

@vlovich I see the bug workaround was restricted to cargo's debug profile:

if (cfg!(debug_assertions)
    || std::env::var("PROFILE").as_ref().map(String::as_str) == Ok("debug"))

I'm going to test reducing that conditional to


    if matches!(target_os, TargetOs::Windows(WindowsVariant::Msvc))
        && matches!(
            profile.as_str(),
            "Release" | "RelWithDebInfo" | "MinSizeRel"
        )
    {

to cover all variations of (llama.cpp's) release mode and to lift the restriction to (cargo's) debug profile, and post a new table with my findings.

Do you know, or have a guess, why it's somehow not slower to run in release mode with the current commit? And could you check on your end whether my proposed change regresses that?
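For reference, putting the two quoted fragments together, the relaxed check would look roughly like the sketch below. It is self-contained for illustration only; the enum and variable names mirror the quoted build.rs snippets and are assumptions about that file's real structure:

// Sketch of the relaxed workaround condition: key off the llama.cpp CMake
// profile alone, rather than also requiring cargo's own profile to be debug.
#[allow(dead_code)]
enum WindowsVariant {
    Msvc,
    Gnu,
}

#[allow(dead_code)]
enum TargetOs {
    Windows(WindowsVariant),
    Other,
}

fn needs_msvc_opt_workaround(target_os: &TargetOs, llama_profile: &str) -> bool {
    matches!(target_os, TargetOs::Windows(WindowsVariant::Msvc))
        && matches!(llama_profile, "Release" | "RelWithDebInfo" | "MinSizeRel")
}

fn main() {
    // Applies for every release-style llama.cpp profile on MSVC...
    assert!(needs_msvc_opt_workaround(
        &TargetOs::Windows(WindowsVariant::Msvc),
        "Release"
    ));
    // ...and never on non-Windows targets.
    assert!(!needs_msvc_opt_workaround(&TargetOs::Other, "Release"));
}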

@vlovich
Contributor Author

vlovich commented Feb 20, 2025

@babichjacob look at the generated CMakeCache.txt and the CXXFLAGS_RELEASE and CFLAGS_RELEASE variables. The /O2 is only missing when Rust is built debug. In --release the /O2 is there. I'm not sure what you're seeing, but I linked the upstream bug report where I highlight the relevant code in cmake-rs, which erroneously applies its opt-flag filtering only when building Rust debug while configuring cmake as release.
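If it helps anyone else reproduce this, here is a throwaway Rust helper for dumping those lines from a generated CMakeCache.txt; it is purely illustrative and not part of this repo:

// Print the C/C++ flag lines from a CMakeCache.txt passed on the command
// line, so the debug and release target dirs can be compared side by side.
use std::{env, fs};

fn main() {
    let path = env::args()
        .nth(1)
        .expect("usage: dump-flags <path to CMakeCache.txt>");
    let cache = fs::read_to_string(&path).expect("could not read CMakeCache.txt");
    for line in cache.lines() {
        if line.starts_with("CMAKE_C_FLAGS") || line.starts_with("CMAKE_CXX_FLAGS") {
            println!("{line}");
        }
    }
}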

@babichjacob
Contributor

Thanks for the hint. Really tricky situation.

look at the generated CMakeCache.txt and the CXXFLAGS_RELEASE and CFLAGS_RELEASE variables. The /O2 is only missing when Rust is built debug. In --release the /O2 is there.

Regardless of whether your patch is enabled, and regardless of the cargo/Rust profile (where you're saying it should be there when compiled for release), /O2 is missing for me in the RELEASE flags:

CMAKE_CXX_FLAGS:STRING= -nologo -MD -Brepro
CMAKE_CXX_FLAGS_DEBUG:STRING=/MDd /Zi /Ob0 /Od /RTC1
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=/MD /O1 /Ob1 /DNDEBUG
vvvv line of interest (optimization flags are missing, especially in comparison to MINSIZEREL and RELWITHDEBINFO)
CMAKE_CXX_FLAGS_RELEASE:STRING= -nologo -MD -Brepro
^^^^ line of interest
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

CMAKE_C_FLAGS:STRING= -nologo -MD -Brepro
CMAKE_C_FLAGS_DEBUG:STRING=/MDd /Zi /Ob0 /Od /RTC1
CMAKE_C_FLAGS_MINSIZEREL:STRING=/MD /O1 /Ob1 /DNDEBUG
vvvv line of interest (optimization flags are missing, especially in comparison to MINSIZEREL and RELWITHDEBINFO)
CMAKE_C_FLAGS_RELEASE:STRING= -nologo -MD -Brepro
^^^^ line of interest
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

I suspect this is how it is for Asbjørn too. I'll also try wrangling a GitHub Actions worker and see what its flags look like.

With your patch enabled, I can see that /O2 was indeed added (in the general flags section), but only in cargo's debug profile (because of the conditional mentioned above):

vvvv line of interest (optimization flags are the ones manually specified in build.rs)
CMAKE_CXX_FLAGS:STRING= /O2 /DNDEBUG /Ob2 -nologo -MD -Brepro 
^^^^ line of interest
CMAKE_CXX_FLAGS_DEBUG:STRING=/MDd /Zi /Ob0 /Od /RTC1
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=/MD /O1
CMAKE_CXX_FLAGS_RELEASE:STRING=
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG                                       

vvvv line of interest (optimization flags are the ones manually specified in build.rs)
CMAKE_C_FLAGS:STRING= /O2 /DNDEBUG /Ob2 -nologo -MD -Brepro
^^^^ line of interest
CMAKE_C_FLAGS_DEBUG:STRING=/MDd /Zi /Ob0 /Od /RTC1                   
CMAKE_C_FLAGS_MINSIZEREL:STRING=/MD /O1 /Ob1 /DNDEBUG            
CMAKE_C_FLAGS_RELEASE:STRING= /O2 /DNDEBUG /Ob2 -nologo
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

(You can also see here that cmake-rs gave /O2 /DNDEBUG /Ob2 to the C RELEASE flags but not C++. Weird!)

So, if I understand you correctly, this is what you see in release mode (regardless of whether your patch is applied, because it's conditional on debug):

CMAKE_CXX_FLAGS:STRING= -nologo -MD -Brepro
CMAKE_CXX_FLAGS_DEBUG:STRING=/MDd /Zi /Ob0 /Od /RTC1
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=/MD /O1 /Ob1 /DNDEBUG
vvvv line of interest (the expected optimization flags are present)
CMAKE_CXX_FLAGS_RELEASE:STRING=/O2 /DNDEBUG /Ob2 -nologo -MD -Brepro
^^^^ line of interest
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

CMAKE_C_FLAGS:STRING= -nologo -MD -Brepro
CMAKE_C_FLAGS_DEBUG:STRING=/MDd /Zi /Ob0 /Od /RTC1
CMAKE_C_FLAGS_MINSIZEREL:STRING=/MD /O1 /Ob1 /DNDEBUG
vvvv line of interest (the expected optimization flags are present)
CMAKE_C_FLAGS_RELEASE:STRING=/O2 /DNDEBUG /Ob2 -nologo -MD -Brepro
^^^^ line of interest
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=/MD /Zi /O2 /Ob1 /DNDEBUG

That is also the arrangement you'd expect if the bug were resolved and the workaround weren't needed (and it would hold regardless of whether cargo uses the release or debug profile).

I looked at the linked cmake-rs issue and couldn't figure out how the optimization flag filtering was restricted to cargo's debug profile (that doesn't mean it isn't; I just couldn't see how). I've cloned the dependency and patched it in locally so I can try to see how any changes will affect the flags and ultimately resolve this...

Thanks again for pointing me in the right direction!

@vlovich
Contributor Author

vlovich commented Feb 23, 2025

Look at the flag filtering logic I pointed to in cmake-rs and how the build type is part of the variable they apply the filtering to. It's really convoluted and weird.

@AsbjornOlling
Contributor

Regardless of if your patch is enabled, and regardless of the cargo / Rust profile (where you're saying it should be there when compiled for release), /O2 is missing for me in the RELEASE flags [...] I suspect this is how it is for Asbjørn too.

look at the generated CMakeCache.txt and the CXXFLAGS_RELEASE and CFLAGS_RELEASE variables. The /O2 is only missing when Rust is built debug. In --release the /O2 is there.

From what I can tell, I am only getting the /O2 flags when using the default cargo profile (debug?).

The /O2, /DNDEBUG, and /Ob2 flags are only applied when I don't build with --release.

Running in this repo at commit 8b11c5c, so with @vlovich's patch.

With --release:

$ cargo clean
$ cargo build --release
$ cat ./target/release/build/llama-cpp-sys-2-9f60549f3cef90cc/out/build/CMakeCache.txt | Select-String CMAKE_CXX_FLAGS_RELEASE:STRING 
CMAKE_CXX_FLAGS_RELEASE:STRING= -nologo -MD -Brepro
$ cat ./target/release/build/llama-cpp-sys-2-9f60549f3cef90cc/out/build/CMakeCache.txt | Select-String CMAKE_C_FLAGS_RELEASE:STRING
CMAKE_C_FLAGS_RELEASE:STRING= -nologo -MD -Brepro

Without --release

$ cargo clean
$ cargo build
$ cat ./target/release/build/llama-cpp-sys-2-6ebad67f70f390a6/out/build/CMakeCache.txt | Select-String CMAKE_CXX_FLAGS_RELEASE:STRING 
CMAKE_CXX_FLAGS_RELEASE:STRING= /O2 /DNDEBUG /Ob2 -nologo -MD -Brepro
$ cat ./target/release/build/llama-cpp-sys-2-6ebad67f70f390a6/out/build/CMakeCache.txt | Select-String CMAKE_C_FLAGS_RELEASE:STRING
CMAKE_C_FLAGS_RELEASE:STRING=  /O2 /DNDEBUG /Ob2 -nologo -MD -Brepro

If I apply the suggested changes to examples/usage.rs and test with and without --release, I get this:

usage.rs benchmarks

Note that I am using --features vulkan.

This is all running on a Windows 11 virtual machine, hosted by a proxmox hypervisor with an Nvidia A6000 GPU and the Nvidia 572.16 driver.

With --release:

$ cargo run --features vulkan --example usage --release -- ./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf
decode 64 tokens in 1.07 s, speed 59.55 t/s

Without --release:

$ cargo run --features vulkan --example usage -- ./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf
decode 64 tokens in 0.89 s, speed 72.26 t/s

For comparison, llama-bench.exe from the pre-built Vulkan binaries of llama.cpp GitHub release b4745 gives 83.41 t/s.

The drop to 60 t/s is significant, but not as bad as what I've seen in other tests...

Changing the conditional in build.rs as @babichjacob suggested removes the difference in benchmarking.
Could we do that? Is there some reason that this improvement should only apply to debug profiles?

I also tried building for x86_64-pc-windows-gnu, to try to work around the MSVC weirdness by using mingw/gcc instead. I haven't been able to make it work yet.

@vlovich
Contributor Author

vlovich commented Feb 24, 2025

I have no objection. Accelerators (Vulkan/CUDA) generally experience less of a slowdown, but I'm curious why there's still a slowdown. I wonder if the UTF-8 validation at the FFI boundary (converting a CStr into a Rust String) is really adding that much overhead.

Do keep in mind that you want to check CXX_FLAGS as well - when building with --release the O2 ends up there if I recall correctly.

@AsbjornOlling
Contributor

I have no objection.

Great. I made a PR.

Accelerators (Vulkan/CUDA) generally experience less of a slowdown, but I'm curious why there's still a slowdown. I wonder if the UTF-8 validation at the FFI boundary (converting a CStr into a Rust String) is really adding that much overhead.

I am very curious too! I have a hard time believing that the string conversions are that expensive; after all, it would have to be several milliseconds.

I tried an experiment where I removed detokenization from my program, and instead just yielded a constant string each time, while still doing the full decode/sample dance. It didn't give me a measurably faster t/s speed.

But I am observing a bigger slowdown in NobodyWho, which does Rust/C++ string conversion at both ends of the program (actually godot::Gstring, but whatever), so that might lend a little bit of credence to the string conversion theory.
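One cheap way to probe the string-conversion theory in the usage.rs example above is to skip the intermediate String entirely and write the raw token bytes to stdout. This sketch only rearranges calls the example already makes (and, per the experiment above, it may well not move the t/s number):

// Inside the generation loop from usage.rs, replace the decode-to-String step
// with a direct byte write: no UTF-8 validation, no per-token String.
let output_bytes = model.token_to_bytes(token, Special::Tokenize).unwrap();
let mut stdout = std::io::stdout().lock();
stdout.write_all(&output_bytes).unwrap();
stdout.flush().unwrap();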

Do keep in mind that you want to check CXX_FLAGS as well - when building with --release the O2 ends up there if I recall correctly.

If you look at the first foldable section in my previous comment, it does include the CMAKE_CXX_FLAGS_RELEASE too.
