Evaluate using Profile-Guided Optimization (PGO) #107

Open
zamazan4ik opened this issue Oct 17, 2024 · 2 comments
@zamazan4ik

Hi!

A few days ago I found an article about OpenVMM on Reddit, and as far as I can see, the project aims to deliver peak performance. I have recently evaluated Profile-Guided Optimization (PGO) on many projects; the results are collected in the awesome-pgo repo. Since PGO has helped in many cases, I think it would be a good idea to try applying it to OpenVMM as well.

I can suggest the following things to do:

  • Perform PGO benchmarks on OpenVMM. If they show improvements, add a note about the possible gains to the documentation. Providing an easy way (e.g. a build option) to build OpenVMM with PGO would also be useful for end users.
  • Integrate PGO builds into your CI pipelines. That way, users would get already PGO-optimized binaries (if you are going to provide prebuilt binaries, of course).

For Rust projects, I suggest starting with cargo-pgo.
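A typical cargo-pgo loop looks roughly like this (a sketch based on the cargo-pgo README; the binary path and `--run-benchmark` flag are placeholders, not real OpenVMM options):

```shell
# Install the cargo subcommand
cargo install cargo-pgo

# 1. Build an instrumented binary
cargo pgo build

# 2. Run representative workloads to collect profiles
#    (hypothetical binary path and flag - substitute a real OpenVMM scenario)
./target/x86_64-unknown-linux-gnu/release/openvmm --run-benchmark

# 3. Rebuild with the collected profiles applied
cargo pgo optimize
```

The key point is step 2: the profiles are only as good as the workloads that generate them, so they should mirror the hot paths you care about in production.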

Here you can find different materials about PGO: benchmarks in different software, examples of how PGO is already integrated with different projects, PGO support in multiple Rust compilers, and some PGO-related pieces of advice.

After PGO, I suggest evaluating the LLVM BOLT post-link optimizer - it can deliver additional gains even on top of PGO. However, starting with regular PGO is the easier first step.
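cargo-pgo can also drive BOLT, so both steps can share one workflow. Again a sketch, not a verified recipe: this needs an LLVM toolchain with BOLT available, and the binary path and flag below are placeholders:

```shell
# Instrument for BOLT on top of a PGO-optimized build
cargo pgo bolt build --with-pgo

# Collect BOLT profiles on a representative workload (placeholder command)
./target/x86_64-unknown-linux-gnu/release/openvmm-bolt-instrumented --run-benchmark

# Apply both the PGO and BOLT profiles
cargo pgo bolt optimize --with-pgo
```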

I would be happy to answer all your questions about PGO!

P.S. It's just an improvement idea for the project. I created the Issue since Discussions are disabled for the repository.

@jstarks
Member

jstarks commented Oct 17, 2024

Thanks for the pointers!

Yes, we probably could benefit from PGO, since the IO paths are often performance critical and (somewhat surprisingly) CPU bound. I am especially wondering if PGO will allow us to keep the hot paths optimized for speed while letting the colder paths be optimized for size. Binary size is important to us because in a paravisor configuration, each VM has its own copy of the binary in memory, and with lots of small VMs this can add up.

We found that optimizing the full binary for size reduced the size by a few megabytes, but it also reduced networking performance in our Azure Boost compatibility scenarios by a significant amount. So we're still optimizing the full binary for speed at the moment. Just optimizing specific crates for size didn't seem to help, for reasons I don't fully understand yet. Maybe PGO can help us split the difference.

@zamazan4ik
Author

> I am especially wondering if PGO will allow us to keep the hot paths optimized for speed while letting the colder paths be optimized for size. Binary size is important to us because in a paravisor configuration, each VM has its own copy of the binary in memory, and with lots of small VMs this can add up.

Yep, PGO definitely can help with that! In fact, that is the main benefit of PGO: optimizing hot paths for speed (most commonly by inlining them more aggressively) and optimizing cold paths for size (by inlining them less).
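As an illustration only: the effect is roughly what you would get by hand-annotating hot and cold functions with inlining attributes, except that PGO derives these decisions from the profile automatically (the functions below are hypothetical, not OpenVMM code):

```rust
// Sketch: manually expressing the hot/cold split that PGO infers from profiles.

#[inline(always)] // hot path: profile would show this dominates, inline for speed
fn hot_checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

#[cold] // cold path: rarely taken, keep it out-of-line and small
#[inline(never)]
fn cold_error_report(len: usize) -> u32 {
    eprintln!("unexpected empty buffer (len = {len})");
    0
}

fn process(data: &[u8]) -> u32 {
    if data.is_empty() {
        cold_error_report(data.len())
    } else {
        hot_checksum(data)
    }
}

fn main() {
    println!("{}", process(&[1, 2, 3]));
}
```

With PGO you get this layout without littering the code with attributes, and the decisions track the actual measured workload rather than a developer's guess.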

> We found that optimizing the full binary for size reduced the size by a few megabytes, but it also reduced networking performance in our Azure Boost compatibility scenarios by a significant amount. So we're still optimizing the full binary for speed at the moment. Just optimizing specific crates for size didn't seem to help, for reasons I don't fully understand yet. Maybe PGO can help us split the difference.

I am almost sure that the root cause of this behavior is inlining. When you optimize for size, the compiler inlines as little as possible. Yes, that helps with binary size, but each non-inlined function adds a runtime cost at every call site: the call/jump itself, plus a possible I-cache miss if the callee is not currently in the I-cache. PGO helps the compiler inline the "right" functions, and Post-Link Optimization (PLO) with tools like LLVM BOLT helps reduce I-cache misses by reordering code.
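One way to check this hypothesis is to compare I-cache behavior between the size-optimized and speed-optimized builds with Linux perf (a sketch; exact event names vary per CPU and can be listed with `perf list`, and the binary path and flag are placeholders):

```shell
# Compare these counters across builds under the same workload
perf stat -e instructions,L1-icache-load-misses,iTLB-load-misses \
    ./target/release/openvmm --run-benchmark
```

If the size-optimized build shows a markedly higher I-cache miss rate on the networking scenario, that would point at inlining/code-layout as the regression source.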
