build: Build all cargo executables at once #20221
Conversation
Nice! This came out much cleaner than I was expecting.
I defer to @benesch on this one and am removing myself as a reviewer. I think we can remove the Antithesis stuff at this time. When/if we decide to re-engage them, most likely all of that directory will need to be rewritten anyway; e.g., the materialized container does not have CRDB in it at all, so the Dockerfile is invalid for the current Mz.

Another test run to prove that everything still works after the slight refactoring: https://buildkite.com/materialize/tests/builds/58655#01890b9c-c4ef-4862-ae02-1ac9ebc9ffd0
misc/python/materialize/mzbuild.py

```python
def generate_cargo_build_command(
    bins: List[str],
    examples: List[str],
    rd: RepositoryDetails,
```
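For context, a hypothetical sketch of what a helper like this might produce. `--bin` and `--example` are real cargo flags that can be repeated so a single invocation builds many executables (sharing dependency compilation); the function shape and names here are illustrative, not the PR's actual implementation:

```python
from typing import List


def generate_cargo_build_command_sketch(
    bins: List[str], examples: List[str], release: bool
) -> List[str]:
    # One cargo invocation can name many targets; cargo then compiles
    # shared dependencies only once for all of them.
    cmd = ["cargo", "build"]
    if release:
        cmd.append("--release")
    for b in bins:
        cmd += ["--bin", b]
    for e in examples:
        cmd += ["--example", e]
    return cmd
```

Building everything in one `cargo build` is what lets this PR avoid recompiling the dependency graph once per executable.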
nit: I think it would be a bit cleaner to push the `rd` extraction into this function.

Gonna noodle on a refactor, one second.
I pushed up two changes:
I've not tested any of this. Going to test by pushing up some dummy changes to this PR.
I'm worried that the change to build all binaries and examples in parallel causes OOMs. The aarch64 build failed in my test: https://buildkite.com/materialize/tests/builds/58695#01890da6-8704-41e7-bbbd-d0983ebaf30b The presentation of that failure wasn't obviously an OOM, but OOMing while doing parallel links is a known problem: rust-lang/cargo#9157
Unfortunately the "agent lost" failure seems to be a common way OOMs manifest in CI. On a 64 GB system, that seems crazy. Locally even with

```bash
#!/bin/bash
LOCK_FILE="/tmp/linker.lock"
# Spin until we atomically create the lock file (with noclobber set, the
# redirect fails if the file already exists).
while ! (set -o noclobber; echo $$ > "$LOCK_FILE") 2> /dev/null; do
  sleep 1
done
cleanup() {
  rm -f "$LOCK_FILE"
}
trap cleanup EXIT HUP INT QUIT TERM
# Don't `exec` here: replacing the shell would skip the EXIT trap and
# leave the lock file behind.
/usr/local/bin/realmoldbinaryhere "$@"
```
Based on my local experiments, it actually seems to be the rustc processes using the memory, so we probably have to go with (2). I'll put up a change for that and see if that stays OK on CI.
Before you spend too long on this, Dennis, I think we might just be using the wrong EC2 instance type. We're currently using compute-optimized instances. That simply may not be the right CPU:memory ratio (1:2) for compiling Rust. We should consider using general purpose instances instead (1:4). Do you have 32 cores locally? If not,
Only 16 threads locally, so not that realistic. The implementation was not too much effort:

```diff
diff --git a/misc/python/materialize/mzbuild.py b/misc/python/materialize/mzbuild.py
index 73afd171be..9e2b7e816d 100644
--- a/misc/python/materialize/mzbuild.py
+++ b/misc/python/materialize/mzbuild.py
@@ -41,6 +41,7 @@ from typing import (
     Optional,
     Sequence,
     Set,
+    Tuple,
     cast,
 )
@@ -381,9 +382,21 @@ class CargoBuild(CargoPreImage):
         examples.update(build.examples)
         assert rd

-        ui.say(f"Common cargo build for: {', '.join(bins | examples)}")
-        cargo_build = cls.generate_cargo_build_command(rd, list(bins), list(examples))
-        spawn.runv(cargo_build, cwd=rd.root)
+        def chunks(
+            list1: List[str], list2: List[str], n: int
+        ) -> Iterator[Tuple[List[str], List[str]]]:
+            for i in range(0, len(list1) + len(list2), n):
+                chunk1 = list1[i : i + n]
+                chunk2 = list2[max(0, i - len(list1)) : max(0, i - len(list1) + n)]
+                yield (chunk1, chunk2)
+
+        # Workaround for linking OOMs, see https://github.com/rust-lang/cargo/issues/9157
+        for (bins_chunk, examples_chunk) in chunks(list(bins), list(examples), 6):
+            ui.say(f"Common cargo build for: {', '.join(bins_chunk + examples_chunk)}")
+            cargo_build = cls.generate_cargo_build_command(
+                rd, bins_chunk, examples_chunk
+            )
+            spawn.runv(cargo_build, cwd=rd.root)

     def build(self) -> None:
         cargo_profile = "release" if self.rd.release_mode else "debug"
```

My guess is that we are still pretty CPU-constrained during builds, so going down to half the CPUs would hurt us. Doubling memory would be fine of course, but we might be utilizing less than half of it in most cases. The move would be c6g.8xlarge -> m6g.8xlarge for aarch64, which seems about 16% more expensive. If that's acceptable, I have no problem with switching to general purpose for the builders.
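The `chunks` helper from the diff can be exercised on its own; this standalone copy (illustrative, mirroring the diff above) shows how bins and examples are interleaved into fixed-size build batches:

```python
from typing import Iterator, List, Tuple


def chunks(
    list1: List[str], list2: List[str], n: int
) -> Iterator[Tuple[List[str], List[str]]]:
    # Walk the concatenation of both lists in windows of n, yielding the
    # slice of each list that falls inside the current window.
    for i in range(0, len(list1) + len(list2), n):
        chunk1 = list1[i : i + n]
        chunk2 = list2[max(0, i - len(list1)) : max(0, i - len(list1) + n)]
        yield (chunk1, chunk2)


# With n=2, bins ["a", "b", "c"] and examples ["x", "y"] become three
# build batches: (["a", "b"], []), (["c"], ["x"]), ([], ["y"]).
```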
Remove unused antithesis
This should speed up builds substantially.
```python
# We don't use `docker push --quiet`, as that disables progress
# output entirely.
push = subprocess.Popen(
    f"docker push {shlex.quote(image)} | cat",
```
Assuming our docker version is new enough, setting `NO_COLOR=1` in the environment should work too: moby/buildkit#2954
I suspect, though I did not check, that `NO_COLOR=1` disables escape codes related to coloring, but not the ones that render the progress bars updating in place. That's a different set of escape codes.
```python
spawn.runv(["docker", "push", dep.spec()])
images_to_push.append(dep.spec())

# Push all Docker images in parallel to minimize build time.
```
That was easier than expected. I was looking for a way to have a single `docker push` call upload multiple images, went in the wrong direction :)
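The resulting pattern can be sketched as follows. This is illustrative, not the PR's exact code (the function names are hypothetical): one `subprocess.Popen` per image, each push piped through `cat` so docker prints plain progress lines instead of redrawing progress bars, then a wait loop over all of them.

```python
import shlex
import subprocess
from typing import List


def push_command(image: str) -> str:
    # Piping through `cat` makes docker's stdout non-interactive; we
    # avoid `--quiet`, which would suppress progress output entirely.
    return f"docker push {shlex.quote(image)} | cat"


def push_images_in_parallel(images_to_push: List[str]) -> None:
    # Start every push immediately...
    pushes = [
        subprocess.Popen(push_command(image), shell=True)
        for image in images_to_push
    ]
    # ...then wait for all of them, failing if any push failed.
    for push in pushes:
        if push.wait() != 0:
            raise RuntimeError("docker push failed")
```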
The ARM build is stubbornly slower than the x86_64 build. Since we're not actively running Materialize on ARM in production right now, stop running the ARM build on PRs, and switch to running it only on main. This way we'll still quickly catch bugs that cause us to stop compiling on ARM, but we'll spend less money and waste less developer time in doing so.
I pushed up one last change to get the relatively slower ARM builds out of PR builds. Merging now. We'll see how this fares over the next few days.
Thanks so much for turning this around so quickly, @def-!

Thanks for the two additional changes, @benesch!
Proof from the build logs that it works: https://buildkite.com/materialize/tests/builds/58594
Runtimes were 10 min for x86-64 and 16 min for aarch64, which is faster than I expected. But this is not entirely from scratch.
Fixes https://github.com/MaterializeInc/database-issues/issues/6024
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.