As part of improving the Vector installation process, I went down the rabbit hole of improving our release / build / cross-compile process as I knew we'd want to support more platforms and architectures in the future. This, unfortunately, was a very deep rabbit hole full of nightmares :). While I didn't come out of this process successful, I feel that I would have been if I had a little more time.
Why?
A core premise of Vector is to unify data collection across multiple clouds, systems, architectures, devices, etc. Vector is successful when a user can leverage it as a single unified data collector, offering a single workflow for collecting that data. This includes everything from generic Linux servers on x86_64 and ARM architectures, to Windows servers, to IoT devices running on Raspberry Pis, and more.
The Problem
Currently, we use CircleCI to build and release Vector. There are a number of problems with this process:
1. It depends on CircleCI's "mac" instances, since we were unable to cross compile for `x86_64-apple-darwin` within a Docker container (more on this below). This further locks us into Circle. Ideally we would perform this process in a Docker container, since that is portable and does not depend on any specific provider.
2. We use the Rust `cross` library in hopes that we can easily cross compile, as its README claims. This has a number of problems:
   - `cross` is basically a wrapper around a collection of Docker images. Outside of handling the installation of the cross-compilation libraries, it doesn't appear to do anything else fancy.
   - Because it launches a new Docker container, we cannot execute the `cross build --target <target>` command inside an already-running Docker container (which Circle generally recommends). Circle offers a "system step" called `setup_remote_docker` that is designed to solve this: it proxies `docker` commands through a port that can launch new Docker instances. We failed to get this to work with `cross`, though.
3. Given 2, we were forced to use Circle's "machine" executor type. This solved the problem of launching `cross`'s Docker images, but it also moved us even further away from Docker and from portable, reproducible builds. In addition, it further locks us into Circle.
4. Given 3, our builds are now incredibly slow, since Circle does not let you specify a `resource_class` for "machine" executors. You get what you get, and apparently the machines are resource constrained, making the builds within `cross`'s Docker images take upwards of 30 minutes when they should take ~3 minutes.
5. From everything I can see, `cross` is poorly maintained. Releases are not made often. For example, the `master` branch of `cross` appears to have over 2 years' worth of changes beyond what is available on cargo and Docker Hub. This wasted an incredible amount of time and forced us to install `cross` from `master` via `cargo install --git https://github.com/rust-embedded/cross`. We then needed to rebuild their Docker images as well, since they were also over 2 years old.

All of this made debugging incredibly frustrating. For example, because we are not using Docker, it is not possible to simply test the build process locally. Instead, we have to trigger a Circle build and wait for the results. Given point 4 above, this can take up to 30 minutes. And if you want to SSH onto a Circle machine, you must wait for the first build to complete and then re-run the build via the "Rerun with SSH" option that Circle provides. This means, in some cases, that it takes an hour simply to SSH onto a Circle machine to debug. This is not conducive to testing other targets in the future.
The Solution
I believe the solution starts with moving off of CircleCI. The solution I've been eyeing is Buildkite, which I think is the right choice to solve all of the problems above. Why?
1. We have not been happy with Circle. There are a slew of reasons, but needless to say we've been waiting for the opportunity to move to another provider.
2. Buildkite lets you host your own agents. These could be AWS EC2 instances, a physical Mac mini, a physical Raspberry Pi, or anything else the Buildkite agent can be installed on. Setting aside the fact that this is cheaper, it's much more flexible.
3. Owning, controlling, and having direct access to our own machines drastically reduces debug time. The simple fact that we can immediately SSH onto a machine saves the ~30 minutes described in the CircleCI problems above.
4. As mentioned in 2, we have more options for solving cross-compilation issues. For example, if we find that cross compiling for macOS is too difficult, we can get access to a macOS machine and install the Buildkite agent on it. This could be a Mac mini in our office or a hosted Mac offered by a separate provider. The same could be said for building on the ARM architecture, where we could use AWS' `a1.*` instances to simplify things. Basically, this decoupling means that we are not limited in our hardware choices.
5. We can use much more powerful machines to reduce the test and build times.
Ideal Solution

In my opinion the ideal solution is:

1. Move to Buildkite.
2. Set up the Buildkite CloudFormation stack. I already did this (see the BuildKite section below).
3. Define a `Makefile` in the root of the Vector project that serves as a top-level interface to all maintenance functions. Ex: `make build x86_64-apple-darwin`, `make test`, etc.
4. Contain the entire test, build, and deploy process in a collection of Docker images with scripts as the `ENTRYPOINT`. These Docker images and scripts should be checked into the Vector repo, making it easy to run, test, and debug locally. We should extend or use the `cross` Docker images as starting points. If we want to stay in sync with upstream changes, we can simply build their images and then extend them in our own, where `$target` is the target triple (e.g. `x86_64-apple-darwin`).
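A minimal sketch of what such an extension could look like (the image name/tag and the package list here are assumptions, not the final implementation):

```dockerfile
# Build cross's image for the target locally first, e.g.:
#   docker build -t rustembedded/cross:$target -f docker/Dockerfile.$target docker/
# Then layer Vector's extra native build dependencies on top of it:
FROM rustembedded/cross:x86_64-unknown-linux-gnu

# leveldb and snappy headers/libs for the disk-buffer bindings
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      libleveldb-dev \
      libsnappy-dev && \
    rm -rf /var/lib/apt/lists/*
```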
Nuggets of Wisdom
Cross compiling OSX
I attempted to cross compile Vector to macOS (`x86_64-apple-darwin`) in a Docker container via osxcross. Surprisingly, I got this to work and was able to successfully compile the toolchain via:
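A sketch of the typical osxcross toolchain build, assuming the packaged macOS SDK tarball has been obtained separately (the SDK version shown is an assumption):

```shell
git clone https://github.com/tpoechtrager/osxcross /osxcross
cd /osxcross
# The packaged macOS SDK (e.g. MacOSX10.14.sdk.tar.xz) must be copied
# into ./tarballs before building.
UNATTENDED=1 ./build.sh
```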
This placed the target toolchain binaries (ex: `x86_64-apple-darwin18-clang`) in the `/osxcross/target/bin` directory. I then told Rust about it by setting:
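This is presumably a Cargo target-specific linker configuration along these lines (a sketch; the `darwin18` suffix follows the toolchain binary names above, and `/osxcross/target/bin` must be on the `PATH`):

```toml
# .cargo/config
[target.x86_64-apple-darwin]
linker = "x86_64-apple-darwin18-clang"
ar = "x86_64-apple-darwin18-ar"
```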
I ran into problems when I tried to actually compile Vector via:

```shell
cargo build --target x86_64-apple-darwin
```
Specifically, the build failed when it tried to build the leveldb libraries (more on this below).
At this point I gave up. I think the correct solution would be to build `leveldb` and `rdkafka` against the target architecture and somehow link those into the Vector Rust build.
Leveldb
Vector uses leveldb for its on-disk buffering. We use this library, which provides leveldb bindings for Rust. You'll notice it requires `libleveldb-dev` and `libsnappy-dev`, which makes sense. But when cross compiling, these need to be compiled against the target architecture's toolchain.
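One plausible way to get those native libraries compiled with the target toolchain is via the `cc` crate's per-target environment variables, which many `-sys` build scripts honor (a sketch; whether this particular sys crate respects them is an assumption):

```shell
# Point the C/C++ compilers used by build scripts at the osxcross toolchain
# so leveldb/snappy are built for the target rather than the host.
export PATH="/osxcross/target/bin:$PATH"
export CC_x86_64_apple_darwin=x86_64-apple-darwin18-clang
export CXX_x86_64_apple_darwin=x86_64-apple-darwin18-clang++
cargo build --target x86_64-apple-darwin
```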
Rdkafka
Similar to leveldb, we use this library for Rust bindings to `librdkafka`. This library will attempt to compile from source upon installation, as noted in the README:
> By default a submodule with the librdkafka sources pinned to a specific commit will be used to compile and statically link the library.
This process is only supported for select targets, making cross compilation difficult. Fortunately, it offers a `dynamic_linking` option that allows you to link to your own binaries:
> The `dynamic_linking` feature can be used to link rdkafka to a locally installed version of librdkafka: if the feature is enabled, the build script will use `pkg-config` to check the version of the library installed in the system, and it will configure the compiler to use dynamic linking.
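Enabling that in `Cargo.toml` would look something like this (the version number is illustrative):

```toml
[dependencies.rdkafka]
version = "0.20"
features = ["dynamic_linking"]
```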
BuildKite
I was able to successfully set up Buildkite using their CloudFormation stack. A few things to note:

1. I defined various AWS resources via the `vector-management` repo: https://github.com/timberio/vector-management/blob/master/terraform/ci.tf. These resources should be used when configuring and launching the Buildkite CloudFormation stack. Specifically:
   - The `vector-ci-artifacts` bucket should be used to store artifacts.
   - The `vector-ci-secrets` bucket should be used for secrets, as outlined in their README.
   - The `vector/private_ssh_key` in the `vector-ci-secrets` bucket should be used as a deploy key to clone the Vector code.
   - The `vector-ci` AWS keypair should be used as the EC2 instance keypair.
2. A `vector` Buildkite pipeline should be defined (the pipeline slug should be `vector`) with a bootstrap step that uploads the pipeline. This tells Buildkite to use the `.buildkite/pipeline.yml` file within the Vector project. This is where we should define steps.
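A minimal sketch of that bootstrap step, following Buildkite's standard pipeline-upload pattern:

```yaml
steps:
  - label: ":pipeline: upload"
    command: buildkite-agent pipeline upload
```

`buildkite-agent pipeline upload` reads `.buildkite/pipeline.yml` from the checked-out repo and replaces the pipeline's steps with its contents.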
The Buildkite steps should simply call the `Makefile` commands as outlined above. Ex: `make build x86_64-apple-darwin`. These should handle launching a Docker container, etc.
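A sketch of what that top-level `Makefile` interface could look like (the image name is hypothetical, and a `target=` variable is used here in place of the positional argument shown above, since plain `make` doesn't take positional arguments):

```make
.PHONY: build test

# Usage: make build target=x86_64-apple-darwin
build:
	docker run --rm -v "$(PWD)":/vector vector-build-env build $(target)

test:
	docker run --rm -v "$(PWD)":/vector vector-build-env test
```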
Conclusion
The purpose of this issue is to hopefully save time when we decide to pick this back up again.
I'm closing this since I do not think this represents any new actionable work. I don't think the strategy outlined here is the correct one to support all of the targets we want as demonstrated by #689.
) with these steps defined:This tells Buildkite to use the
.buildkite/pipeline.yml
file within the Vector project. This is where we should define steps.The Buildkite steps should simply call the
Makefile
commands as outlined above. Ex:Make build x86_64-apple-darwin
. This should handle launching a docker container, etc.Conclusion
The purpose of this issues to hopefully save time when we decide to pick this back up again.
The text was updated successfully, but these errors were encountered: