build: Build all cargo executables at once #20221

Merged
benesch merged 4 commits into MaterializeInc:main from pr-build-speedup on Jul 1, 2023

Conversation

@def- (Contributor) commented Jun 28, 2023

Proof from the build logs that it works: https://buildkite.com/materialize/tests/builds/58594

[2023-06-29T10:49:58Z] Common cargo-build for: ['environmentd', 'clusterd', 'environmentd', 'clusterd', 'sqllogictest', 'testdrive']
[2023-06-29T10:49:58Z] $ env 'RUSTFLAGS=-Clink-arg=-Wl,--compress-debug-sections=zlib -Csymbol-mangling-version=v0 -Clink-arg=-fuse-ld=lld -L/opt/x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/sysroot/lib' cargo build --target x86_64-unknown-linux-gnu --bin environmentd --bin clusterd --bin environmentd --bin clusterd --bin sqllogictest --bin testdrive --release
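For context, the core of the change is to collect the binaries needed by all images and issue a single cargo build with one --bin flag per target, as the invocation above shows; cargo then compiles the shared dependency graph only once. A minimal, illustrative sketch of that idea (plain Python with made-up names, not the actual mzbuild code):

import subprocess
from typing import List


def common_cargo_build(bins: List[str], target: str, release: bool = True) -> None:
    # One cargo invocation for every requested executable. Duplicate bin
    # names (e.g. environmentd appearing for several images) are harmless,
    # as the log above shows; cargo still builds each binary once.
    cmd = ["cargo", "build", "--target", target]
    for b in bins:
        cmd += ["--bin", b]
    if release:
        cmd.append("--release")
    subprocess.run(cmd, check=True)


common_cargo_build(
    ["environmentd", "clusterd", "sqllogictest", "testdrive"],
    target="x86_64-unknown-linux-gnu",
)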

Runtimes were 10 min for x86-64 and 16 min for aarch64, which is faster than I expected, though these builds were not entirely from scratch.

Fixes https://github.com/MaterializeInc/database-issues/issues/6024

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@def- force-pushed the pr-build-speedup branch 4 times, most recently from 5edea55 to 345f875, on June 29, 2023 08:09
@def- changed the title from "Pr build speedup" to "build: Build all cargo executables at once" on Jun 29, 2023
@def- force-pushed the pr-build-speedup branch 2 times, most recently from f500a91 to c8310af, on June 29, 2023 10:48
@def- requested review from benesch and philip-stoev on June 29, 2023 12:16
@def- marked this pull request as ready for review on June 29, 2023 12:17
@def- requested a review from a team as a code owner on June 29, 2023 12:17
@def- force-pushed the pr-build-speedup branch 2 times, most recently from a8fa067 to 311a874, on June 29, 2023 13:55
@benesch (Contributor) left a comment

Nice! This came out much cleaner than I was expecting.

@philip-stoev (Contributor) commented

I defer to @benesch on this one and am removing myself as a reviewer.

I think we can remove the Antithesis stuff at this time. When/if we decide to re-engage them, most likely all of that directory will need to be rewritten anyway; e.g., the materialized container does not have CRDB in it at all, so the Dockerfile is invalid for the current Mz.

@philip-stoev removed their request for review on June 30, 2023 06:30
@def- force-pushed the pr-build-speedup branch from 311a874 to fea9288 on June 30, 2023 09:11
@def- (Contributor, Author) commented Jun 30, 2023

Another test run to prove that everything still works after the slight refactoring: https://buildkite.com/materialize/tests/builds/58655#01890b9c-c4ef-4862-ae02-1ac9ebc9ffd0

@def- requested review from benesch and danhhz on June 30, 2023 10:06
@def- force-pushed the pr-build-speedup branch from fea9288 to 8246734 on June 30, 2023 15:28
def generate_cargo_build_command(
    bins: List[str],
    examples: List[str],
    rd: RepositoryDetails,

nit: I think it would be a bit cleaner to push the rd extraction into this function.
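To make the nit concrete, the suggestion is roughly that the function derive rd from the batch of builds itself instead of having the caller extract it and pass it alongside. A hypothetical sketch with simplified stand-in types (not the real mzbuild classes or signature):

from typing import List, Set


class RepositoryDetails:
    # Simplified stand-in for mzbuild's RepositoryDetails.
    release_mode: bool = True


class CargoBuild:
    # Simplified stand-in for mzbuild's CargoBuild pre-image.
    def __init__(self, rd: RepositoryDetails, bins: Set[str], examples: Set[str]):
        self.rd = rd
        self.bins = bins
        self.examples = examples

    @classmethod
    def generate_cargo_build_command(cls, builds: List["CargoBuild"]) -> List[str]:
        # Pull the repository details out of the batch here, rather than
        # making the caller do the extraction and pass rd separately.
        rds = {build.rd for build in builds}
        assert len(rds) == 1, "all builds in a batch share one repository"
        rd = next(iter(rds))

        bins = sorted({b for build in builds for b in build.bins})
        examples = sorted({e for build in builds for e in build.examples})

        cmd = ["cargo", "build"]
        for b in bins:
            cmd += ["--bin", b]
        for e in examples:
            cmd += ["--example", e]
        if rd.release_mode:
            cmd.append("--release")
        return cmd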

@benesch (Contributor) commented Jun 30, 2023

Gonna noodle on a refactor, one second.

@benesch force-pushed the pr-build-speedup branch from 8246734 to 1ae4655 on June 30, 2023 18:17
@benesch requested a review from a team as a code owner on June 30, 2023 18:17
@benesch (Contributor) commented Jun 30, 2023

I pushed up two changes:

  • A refactor to the common_cargo_build code to generalize it to apply to any PreImage subclass. Now any other PreImage can choose to declare a prepare_batch in order to perform any operations that are more efficient when performed in bulk (a rough sketch of this hook follows after this comment).
  • Parallelization of docker push commands.

I've not tested any of this. Going to test by pushing up some dummy changes to this PR.
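
A rough sketch of the hook shape described in the first bullet, using simplified stand-ins rather than the actual mzbuild classes:

from typing import Dict, List, Type


class PreImage:
    # Simplified stand-in for mzbuild's PreImage base class.

    @classmethod
    def prepare_batch(cls, instances: List["PreImage"]) -> None:
        # Subclasses may override this to do work that is cheaper when
        # performed once for a whole batch. The default does nothing.
        pass

    def run(self) -> None:
        # Per-image work, unchanged.
        pass


class CargoBuild(PreImage):
    # Simplified stand-in for mzbuild's CargoBuild pre-image.
    def __init__(self, bins: List[str]) -> None:
        self.bins = bins

    @classmethod
    def prepare_batch(cls, instances: List["PreImage"]) -> None:
        # Gather every binary needed across the batch and build them with a
        # single cargo invocation so shared dependencies compile only once.
        bins = sorted({b for i in instances if isinstance(i, cls) for b in i.bins})
        print(f"Common cargo build for: {bins}")
        # ... run one `cargo build --bin a --bin b ...` here ...


def prepare_all(pre_images: List[PreImage]) -> None:
    # The build driver groups pre-images by type and calls prepare_batch
    # once per subclass before running each instance individually.
    by_type: Dict[Type[PreImage], List[PreImage]] = {}
    for p in pre_images:
        by_type.setdefault(type(p), []).append(p)
    for ty, instances in by_type.items():
        ty.prepare_batch(instances)
    for p in pre_images:
        p.run()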

@benesch force-pushed the pr-build-speedup branch from 5548132 to e5c3cf6 on June 30, 2023 19:22
@benesch (Contributor) commented Jun 30, 2023

I'm worried that the change to build all binaries and examples in parallel causes OOMs. The aarch64 build failed in my test: https://buildkite.com/materialize/tests/builds/58695#01890da6-8704-41e7-bbbd-d0983ebaf30b

The presentation of that failure wasn't obviously an OOM, but OOMing while doing parallel links is a known problem: rust-lang/cargo#9157

@def- (Contributor, Author) commented Jun 30, 2023

Unfortunately the "agent lost" seems to be a common way OoMs in CI manifest. On a 64 GB system, that seems crazy. Locally even with cargo build -j32 I haven't been able to go above 40 GB, maybe something is different on aarch64 though and the timing ends up with more link jobs at the same time. I see two potential solutions:

  1. If it's the actual linking step that goes OoM and we use a very parallel linker like mold (ci: Use mold linker to speed up building #20159), then we could wrap it in a small shell script so that only one linker process runs at a time:
#!/bin/bash

LOCK_FILE="/tmp/linker.lock"

# Spin until we can atomically create the lock file (noclobber makes the
# redirect fail if the file already exists).
while ! (set -o noclobber; (echo $$ > "$LOCK_FILE") 2> /dev/null); do
  sleep 1
done

cleanup() {
  rm -f "$LOCK_FILE"
}

trap cleanup EXIT HUP INT QUIT TERM

# Run the real linker (not exec, so the EXIT trap still releases the lock),
# forwarding the arguments we were invoked with.
/usr/local/bin/realmoldbinaryhere "$@"
  2. If it is the rustc processes, then we can split the cargo build into chunks of at most 6 executables, so instead of Common cargo build for: ['metabase-smoketest', 'clusterd', 'environmentd', 'kgen', 'clusterd', 'environmentd', 'mz', 'clusterd', 'sqllogictest', 'testdrive', 'persistcli', 'persistcli'] we'd have
Common cargo build for: ['metabase-smoketest', 'clusterd', 'environmentd', 'kgen', 'clusterd', 'environmentd']
Common cargo build for: ['mz', 'clusterd', 'sqllogictest', 'testdrive', 'persistcli', 'persistcli']

Based on my local experiments it actually seems to be the rustc processes that use the memory, so we'll probably have to go with (2). I'll put up a change for that and see whether it stays OK in CI.

@benesch (Contributor) commented Jun 30, 2023

Before you spend too long on this, Dennis, I think we might just be using the wrong EC2 instance type. We're currently using compute-optimized instances, which may simply not have the right CPU:memory ratio (1:2) for compiling Rust. We should consider using general-purpose instances (1:4) instead.

Do you have 32 cores locally? If not, -j 32 is probably not representative of what compiles look like on the CI machines.

@def- (Contributor, Author) commented Jun 30, 2023

Only 16 threads locally, so not that realistic.

The implementation was not too much effort:

diff --git a/misc/python/materialize/mzbuild.py b/misc/python/materialize/mzbuild.py
index 73afd171be..9e2b7e816d 100644
--- a/misc/python/materialize/mzbuild.py
+++ b/misc/python/materialize/mzbuild.py
@@ -41,6 +41,7 @@ from typing import (
     Optional,
     Sequence,
     Set,
+    Tuple,
     cast,
 )

@@ -381,9 +382,21 @@ class CargoBuild(CargoPreImage):
             examples.update(build.examples)
         assert rd

-        ui.say(f"Common cargo build for: {', '.join(bins | examples)}")
-        cargo_build = cls.generate_cargo_build_command(rd, list(bins), list(examples))
-        spawn.runv(cargo_build, cwd=rd.root)
+        def chunks(
+            list1: List[str], list2: List[str], n: int
+        ) -> Iterator[Tuple[List[str], List[str]]]:
+            for i in range(0, len(list1) + len(list2), n):
+                chunk1 = list1[i : i + n]
+                chunk2 = list2[max(0, i - len(list1)) : max(0, i - len(list1) + n)]
+                yield (chunk1, chunk2)
+
+        # Workaround for linking OoMs, see https://github.com/rust-lang/cargo/issues/9157
+        for (bins_chunk, examples_chunk) in chunks(list(bins), list(examples), 6):
+            ui.say(f"Common cargo build for: {', '.join(bins_chunk + examples_chunk)}")
+            cargo_build = cls.generate_cargo_build_command(
+                rd, bins_chunk, examples_chunk
+            )
+            spawn.runv(cargo_build, cwd=rd.root)

     def build(self) -> None:
         cargo_profile = "release" if self.rd.release_mode else "debug"

My guess is that we are still pretty CPU-constrained during builds, so going down to half the CPUs would hurt us. Doubling memory would of course be fine, but we'd probably be utilizing less than half of it in most cases. The move would be c6g.8xlarge -> m6g.8xlarge for aarch64, which seems about 16% more expensive. If that's acceptable, I have no problem with switching to general-purpose instances for the builders.

@def- force-pushed the pr-build-speedup branch 2 times, most recently from 7829100 to 20c719f, on June 30, 2023 23:29
@def- (Contributor, Author) left a comment

# We don't use `docker push --quiet`, as that disables progress
# output entirely.
push = subprocess.Popen(
    f"docker push {shlex.quote(image)} | cat",

Assuming our Docker version is new enough, NO_COLOR=1 in the environment should work too: moby/buildkit#2954

Reply from a Contributor:

I suspect, though I did not check, that NO_COLOR=1 disables escape codes related to coloring, but not to rendering the progress bars that update in place. That's a different set of escape codes.

spawn.runv(["docker", "push", dep.spec()])
images_to_push.append(dep.spec())

# Push all Docker images in parallel to minimize build time.
@def- (Contributor, Author) commented:

That was easier than expected. I was looking for a way to have a single docker push call upload multiple images and went in the wrong direction :)
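
For reference, the shape of the parallel push being discussed is roughly the following (a sketch, not the exact mzbuild code; error handling is simplified):

import shlex
import subprocess
from typing import List


def push_images(images: List[str]) -> None:
    # Start one `docker push` per image so the uploads overlap. Piping
    # through `cat` makes docker's stdout a non-TTY, which suppresses the
    # in-place progress bars while keeping plain progress lines; pipefail
    # keeps a failed push from being masked by `cat`.
    pushes = [
        subprocess.Popen(
            ["/bin/bash", "-o", "pipefail", "-c",
             f"docker push {shlex.quote(image)} | cat"]
        )
        for image in images
    ]
    failures = [image for image, p in zip(images, pushes) if p.wait() != 0]
    if failures:
        raise RuntimeError(f"docker push failed for: {failures}")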

The ARM build is stubbornly slower than the x86_64 build. Since we're
not actively running Materialize on ARM in production right now, stop
running the ARM build on PRs, and switch to running it only on main.
This way we'll still quickly catch bugs that cause us to stop compiling
on ARM, but we'll spend less money and waste less developer time in
doing so.
@benesch (Contributor) commented Jul 1, 2023

I pushed up one last change to get the relatively slower ARM builds out of PR builds. Merging now. We'll see how this fares over the next few days.

@benesch merged commit 1d08b9b into MaterializeInc:main on Jul 1, 2023
@benesch (Contributor) commented Jul 1, 2023

Thanks so much for turning this around so quickly, @def-!

@def- deleted the pr-build-speedup branch on July 1, 2023 07:07
@def- (Contributor, Author) commented Jul 1, 2023

Thanks for the two additional changes, @benesch!
