Codspeed 2.8 hangs or kills the CI runner #79


Closed
tan3-netapp opened this issue Feb 25, 2025 · 12 comments · Fixed by #80

tan3-netapp commented Feb 25, 2025

Currently, codspeed 2.8 hangs or kills the CI runner when being run as the benchmark check of the shotover-proxy repository.
The latest working version of codspeed for this job is 2.7.2.

To reproduce the issue, please look into this change for more information.

@tan3-netapp tan3-netapp changed the title Codspeed 2.8 hangs or is killed the CI runner Codspeed 2.8 hangs or kills in the CI runner Feb 25, 2025
@art049 art049 added bug Something isn't working Systems Integrations - Rust labels Feb 25, 2025 — with Linear
art049 (Member) commented Feb 25, 2025

We just released v2.8.1, statically linked to avoid further issues with glibc.

On an unrelated note, we also saw that your workflow is using version 2 of the CodSpeed action. We have a temporary backward-compatibility issue that delays those workflows. If you can bump the action to version 3, this will be fixed directly.
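For reference, the bump is a one-line change in the workflow file (a sketch assuming the standard CodSpeedHQ/action setup; your step's inputs may differ):

```yaml
# Hypothetical excerpt of a benchmark workflow, e.g. .github/workflows/benchmarks.yml
- uses: CodSpeedHQ/action@v3   # previously @v2
  with:
    run: cargo codspeed run
```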

@art049 art049 closed this as completed Feb 25, 2025
@art049 art049 reopened this Feb 25, 2025
art049 (Member) commented Feb 25, 2025

@tan3-netapp we're trying to understand the issue. Can you try with 2.8.1 and the latest runner version (v3)?

As mentioned in my previous message, there is a slight backward-compatibility issue on our side that might explain the hanging you're seeing, but I'm just guessing.

rukai commented Feb 25, 2025

Here is an attempt with 2.8.1: shotover/shotover-proxy#1870. You can see that:

  • the musl fix now allows cargo-codspeed to run at all!
  • however, it now hangs towards the end of the compile.
    • although it didn't happen in this run, I've seen the bug manifest as the CI runner being killed immediately at the same point where it hangs.
    • I've seen this behaviour on locally compiled non-musl builds too.
    • I am most suspicious of 6a00c7f: maybe we are blocking cargo's execution by not reading its stderr? Or maybe the logic for copying built artifacts is hitting something weird.

@tan3-netapp tan3-netapp changed the title Codspeed 2.8 hangs or kills in the CI runner Codspeed 2.8 hangs or kills the CI runner Feb 26, 2025
GuillaumeLagrange (Contributor) commented
Thanks for the information. I can reproduce and iterate on my end. The job exits with code 143, i.e. an external SIGTERM. Looks like an OOM kill.

The commit you linked seems to be the root cause, but I cannot pinpoint why. The basic idea of that change was to switch from using cargo as a library in our code to instrumenting the cargo CLI, following the cargo guidelines.

We'll make sure to keep you updated when we find something. In the meantime, you can keep using version 2.7.2.

GuillaumeLagrange (Contributor) commented
Hello @rukai, here is what I've found.

The root issue is that when one runs cargo build --benches -p my_crate, cargo also builds the test targets of the my_crate crate.
This was not an issue with cargo-codspeed versions < 2.8, because we used the cargo crate bindings and extracted only the bench targets, bypassing the default cargo behavior.

However, we want to stick with the change made in 2.8. Your repository happens to be an edge case where building the test targets brings the compilation over the memory/time limit allowed by the CI.

I have filed an issue in the cargo repository to see if we can upstream a better behavior for cargo build --benches rust-lang/cargo#15258

In the meantime, the workaround would be to create a dedicated crate that holds only your benchmarks; see here for an example of how we did it in this repo.
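A dedicated benchmark crate could look like this (illustrative names and paths, not taken from shotover):

```toml
# Hypothetical benches/Cargo.toml for a crate that only holds benchmarks.
[package]
name = "my-benches"
version = "0.1.0"
edition = "2021"
publish = false

[dependencies]
# The crate under benchmark, pulled in as a normal dependency.
my_crate = { path = "../my_crate" }

[[bench]]
name = "main"
harness = false   # required for criterion/codspeed-style harnesses
```

With this layout, cargo build --benches -p my-benches compiles only this crate's bench target plus its dependencies' libraries, sidestepping the heavy test targets of the main crate.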

Then, you can change your CodSpeed CI action to only build the crate holding the benches. In this repo, we do this here

By the way, I noticed that in your current workflow you also build benches for your shotover_proxy crate, which does not seem necessary since you don't have codspeed benchmarks in the proxy crate's benchmarks; you may want to run cargo codspeed build -p shotover instead!

Let me know if the workaround suits you. If we get further reports of this issue, we may yank this release and create a major version.

rukai commented Mar 4, 2025

Thank you for investigating the issue!
Your explanation makes sense, but I'm still puzzled by something.
We already build everything (--all-targets) when running integration tests (both dev and release mode) here.

It is possible that the memory usage of cargo-codspeed itself, on top of the full build, is sending it over the edge.
However, I would expect cargo-codspeed's memory usage to sit around 20MB or so.
GitHub Actions runners have 16GB to work with, so it's very surprising that would send it over the edge, considering I've never seen the test run fail due to OOM.

Does codspeed do anything that might increase memory usage of the build itself? My understanding is that the codspeed stable-bench magic happens at run time, not compile time, so I imagine this wouldn't be the case.

Ah well, I'll just work on the assumption that the little bit of extra memory from cargo-codspeed pushes it over the edge, and that we were about to hit this on the nextest compile as well.

So that leads me to the solution.
I have some issues with the proposed solution:

  • I would really rather not limit codspeed to run on just one package. I want it to run all benches regardless of which crate they come from; otherwise I'm at risk of adding a new bench and not realizing it isn't being picked up due to CI configuration.
  • I think we might rely on #[cfg(test)] functionality in our benches, so moving to a separate crate would require some reworking.

Instead I would rather:

  • short term fix - set -j 2 on the build to avoid using so much RAM.
    • in a quick and dirty test on my 16-core laptop, setting -j2 reduced max RAM usage from 13GB to 5GB.
    • This workflow is not the bottleneck of our CI, as we spend a lot of time running tests, so I expect this not to impact CI runtime.
  • long term fix - investigate why the tests take so much RAM to build; it sounds like this is about to break our test compilation as well, and I don't want to -j 2 the workflow that is our bottleneck.
    • Long ago we combined all our tests/* tests into a single test binary to reduce compilation time, but maybe reverting that could reduce total memory usage.
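As an aside, cargo's --jobs limit also has config and environment equivalents (standard cargo behavior, not CodSpeed-specific), which could serve as a stopgap:

```toml
# .cargo/config.toml — repo-wide equivalent of passing `-j 2`; bounds
# peak RAM at the cost of wall-clock time.
[build]
jobs = 2
```

Since a repo-wide config would also throttle the test workflow, setting the CARGO_BUILD_JOBS=2 environment variable only in the benchmark job is the more targeted variant.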

The ability to set the cargo --jobs flag is something I need codspeed to implement.
The rest is all stuff on our end.
Thanks again for your help here.

GuillaumeLagrange (Contributor) commented Mar 4, 2025

From what I tested, the biggest change seems to come from the debug symbols that we explicitly activate so that codspeed can build callgraphs.

When I build your project with

RUSTFLAGS="-C debuginfo=2 --cfg codspeed" cargo build --all-targets --profile release --package shotover

Memory usage peaks at 6.5GB, while

RUSTFLAGS="--cfg codspeed" cargo build --all-targets --profile release --package shotover

Memory usage peaks at 2GB
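For reference, -C debuginfo=2 corresponds to the debug setting in a Cargo.toml profile (standard cargo; CodSpeed injects it via RUSTFLAGS instead):

```toml
# Cargo.toml — profile-level equivalent of `-C debuginfo=2`.
[profile.release]
debug = 2                      # full debug info, needed for callgraphs
# debug = "line-tables-only"   # cheaper option on newer toolchains
```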

Measurement was made using zsh's time builtin:

export TIMEFMT='
%U user CPU
%S kernel CPU
%E RT total
max memory:  %M MB'

then time cargo build...

Maybe not the most precise measurement, but I don't know a more convenient alternative.

-j definitely seems like something we can add to cargo-codspeed; I'll look into it.

rukai commented Mar 4, 2025

Ah! Well, that explains it!
If that's the case, our tests are not at risk and -j 2 should be fine.

zsh's time looks very handy, thanks; I was just eyeballing it from htop.

rukai commented Mar 21, 2025

Hi, any news on a -j option? We've started hitting issues because codspeed 2.7 doesn't support newer Cargo.toml manifests:

[image attachment]

art049 (Member) commented Mar 21, 2025

Hey, we're on it. It should be released by the beginning of next week. We'll let you know!

art049 (Member) commented Apr 1, 2025

Just released 2.10.0, which supports the -j flag. Don't hesitate to reopen if the problem is still blocking you.

@art049 art049 closed this as completed Apr 1, 2025
rukai commented Apr 1, 2025

Great, thanks for all your help!
