As part of improving the Vector installation process, I went down the rabbit hole of improving our release / build / cross-compile process as I knew we'd want to support more platforms and architectures in the future. This, unfortunately, was a very deep rabbit hole full of nightmares :). While I didn't come out of this process successful, I feel that I would have been if I had a little more time.
Why?
A core premise of Vector is to unify data collection across multiple clouds, systems, architectures, devices, etc. Vector is successful when a user can leverage it as a single unified data collector, offering a single workflow for collecting that data. This includes everything from generic Linux servers on x86_64 and ARM architectures, to Windows servers, to IoT devices running on Raspberry Pis, and more.
The Problem
Currently, we use CircleCI to build and release Vector. There are a number of problems with this process:
1. It depends on CircleCI's "mac" instances, since we were unable to cross compile for `x86_64-apple-darwin` within a Docker container (more on this below). This further locks us into Circle. Ideally we would perform this process in a Docker container, since that is portable and does not depend on any specific provider.
2. We use the Rust `cross` library in hopes that we can easily cross compile, as its README claims. This has a number of problems:
   - `cross` is basically a wrapper around a collection of Docker images. Outside of handling the installation of the cross-compilation libraries, it doesn't appear to do anything else fancy.
   - Because it launches a new Docker container, we cannot execute the `cross build --target <target>` command inside an already-running Docker container (which Circle generally recommends). Circle offers a "system step" called `setup_remote_docker` that is designed to solve this: it proxies `docker` commands through a port that can launch new Docker instances. We failed to get this to work with `cross`, though.
3. Given 2, we were forced to use Circle's "machine" executor type. This solved the problem of launching `cross`'s Docker images, but it also moved us even further away from Docker and from portable, reproducible builds. In addition, it further locks us into Circle.
4. Given 3, our builds are now incredibly slow, since Circle does not let you specify a `resource_class` for "machine" executors. You get what you get, and apparently the machines are resource constrained, making the builds within `cross`'s Docker images take upwards of 30 minutes when they should take ~3 minutes.
5. From everything I can see, `cross` is poorly maintained. Releases are not made often. For example, the `master` branch of `cross` appears to have over 2 years' worth of changes beyond what is available on cargo and Docker Hub. This wasted an incredible amount of time and forced us to install `cross` from `master` via `cargo install --git https://github.com/rust-embedded/cross`. We then needed to rebuild their Docker images as well, since they were also over 2 years old.

All of this made debugging incredibly frustrating. For example, because we are not using Docker, it is not possible to simply test the build process locally. Instead, we have to trigger a Circle build and wait for the results. Given point 4 above, this can take up to 30 minutes. And if you want to SSH onto a Circle machine, you must wait for the first build to complete and then re-run the build via the "Rerun with SSH" option that Circle provides. This means, in some cases, that it takes an hour simply to SSH onto a Circle machine to debug. This is not conducive to testing other targets in the future.
The Solution
I believe the solution starts with moving off of CircleCI. The solution I've been eyeing is Buildkite, which I think is the right choice to solve all of the problems above. Why?
1. We have not been happy with Circle. There are a slew of reasons, but needless to say we've been waiting for the opportunity to move to another provider.
2. Buildkite lets you host your own agents. These could be AWS EC2 instances, a physical Mac mini, a physical Raspberry Pi, or anything else the Buildkite agent can be installed on. Setting aside the fact that this is cheaper, it's much more flexible.
3. Owning, controlling, and having direct access to our own machines drastically reduces debug time. The simple fact that we can immediately SSH onto a machine saves the ~30 minutes described in the CircleCI problems above.
4. As mentioned in 2, we have more options for solving cross-compilation issues. For example, if we find that cross compiling for macOS is too difficult, we can get access to a macOS machine and install the Buildkite agent on it. This could be a Mac mini in our office or a hosted Mac offered by a separate provider. The same could be said for building on the ARM architecture, where we could use AWS' `a1.*` instances to simplify things. Basically, this decoupling means that we are not limited in our hardware choices.
5. We can use much more powerful machines to reduce the test and build times.
Ideal Solution

In my opinion the ideal solution is:

1. Move to Buildkite.
2. Set up the Buildkite CloudFormation stack. I already did this (see the BuildKite section below).
3. Define a `Makefile` in the root of the Vector project that serves as a top-level interface to all maintenance functions. Ex: `make build x86_64-apple-darwin`, `make test`, etc.
4. Contain the entire test, build, and deploy process in a collection of Docker images with scripts as the `ENTRYPOINT`. These Docker images and scripts should be checked into the Vector repo, making it easy to run, test, and debug locally. We should extend or use the `cross` Docker images as starting points. If we want to stay in sync with upstream changes, we can simply build their images and then extend them in our own, where `$target` is the target triple (e.g. `x86_64-apple-darwin`).
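A minimal sketch of what such an extension could look like (the image name/tag and the package list here are assumptions, not the final implementation):

```dockerfile
# Build cross's image for the target locally first, e.g.:
#   docker build -t rustembedded/cross:$target -f docker/Dockerfile.$target docker/
# Then layer Vector's extra native build dependencies on top of it:
FROM rustembedded/cross:x86_64-unknown-linux-gnu

# leveldb and snappy headers/libs for the disk-buffer bindings
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      libleveldb-dev \
      libsnappy-dev && \
    rm -rf /var/lib/apt/lists/*
```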
Nuggets of Wisdom
Cross compiling OSX
I attempted to cross compile Vector to macOS (`x86_64-apple-darwin`) in a Docker container via osxcross. Surprisingly, I got this to work and was able to successfully compile the toolchain via:
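A sketch of the typical osxcross toolchain build, assuming the packaged macOS SDK tarball has been obtained separately (the SDK version shown is an assumption):

```shell
git clone https://github.com/tpoechtrager/osxcross /osxcross
cd /osxcross
# The packaged macOS SDK (e.g. MacOSX10.14.sdk.tar.xz) must be copied
# into ./tarballs before building.
UNATTENDED=1 ./build.sh
```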
This placed the target toolchain binaries (ex: `x86_64-apple-darwin18-clang`) in the `/osxcross/target/bin` directory. I then told Rust about it by setting:
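This is presumably a Cargo target-specific linker configuration along these lines (a sketch; the `darwin18` suffix follows the toolchain binary names above, and `/osxcross/target/bin` must be on the `PATH`):

```toml
# .cargo/config
[target.x86_64-apple-darwin]
linker = "x86_64-apple-darwin18-clang"
ar = "x86_64-apple-darwin18-ar"
```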
I ran into problems when I tried to actually compile Vector via:

```shell
cargo build --target x86_64-apple-darwin
```
Specifically, the build failed when it tried to build the leveldb libraries (more on this below).
At this point I gave up. I think the correct solution would be to build `leveldb` and `rdkafka` against the target architecture and somehow link those into the Vector Rust build.
Leveldb
Vector uses leveldb for its on-disk buffering. We use this library, which provides leveldb bindings for Rust. You'll notice it requires `libleveldb-dev` and `libsnappy-dev`, which makes sense. But when cross compiling, these need to be compiled against the target architecture's toolchain.
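One plausible way to get those native libraries compiled with the target toolchain is via the `cc` crate's per-target environment variables, which many `-sys` build scripts honor (a sketch; whether this particular sys crate respects them is an assumption):

```shell
# Point the C/C++ compilers used by build scripts at the osxcross toolchain
# so leveldb/snappy are built for the target rather than the host.
export PATH="/osxcross/target/bin:$PATH"
export CC_x86_64_apple_darwin=x86_64-apple-darwin18-clang
export CXX_x86_64_apple_darwin=x86_64-apple-darwin18-clang++
cargo build --target x86_64-apple-darwin
```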
Rdkafka
Similar to leveldb, we use this library for Rust bindings to `librdkafka`. This library will attempt to compile from source upon installation, as noted in the README:
> By default a submodule with the librdkafka sources pinned to a specific commit will be used to compile and statically link the library.
This process is only supported for select targets, making cross compilation difficult. Fortunately, it offers a `dynamic_linking` option that allows you to link to your own binaries:
> The `dynamic_linking` feature can be used to link rdkafka to a locally installed version of librdkafka: if the feature is enabled, the build script will use `pkg-config` to check the version of the library installed in the system, and it will configure the compiler to use dynamic linking.
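Enabling that in `Cargo.toml` would look something like this (the version number is illustrative):

```toml
[dependencies.rdkafka]
version = "0.20"
features = ["dynamic_linking"]
```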
BuildKite
I was able to successfully set up Buildkite using their CloudFormation stack. A few things to note:

1. I defined various AWS resources via the `vector-management` repo: https://github.com/timberio/vector-management/blob/master/terraform/ci.tf. These resources should be used when configuring and launching the Buildkite CloudFormation stack. Specifically:
   - The `vector-ci-artifacts` bucket should be used to store artifacts.
   - The `vector-ci-secrets` bucket should be used for secrets, as outlined in their README.
   - The `vector/private_ssh_key` in the `vector-ci-secrets` bucket should be used as a deploy key to clone the Vector code.
   - The `vector-ci` AWS keypair should be used as the EC2 instance keypair.
2. A `vector` Buildkite pipeline should be defined (the pipeline slug should be `vector`) with a bootstrap step that uploads the pipeline. This tells Buildkite to use the `.buildkite/pipeline.yml` file within the Vector project. This is where we should define steps.
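A minimal sketch of that bootstrap step, following Buildkite's standard pipeline-upload pattern:

```yaml
steps:
  - label: ":pipeline: upload"
    command: buildkite-agent pipeline upload
```

`buildkite-agent pipeline upload` reads `.buildkite/pipeline.yml` from the checked-out repo and replaces the pipeline's steps with its contents.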
The Buildkite steps should simply call the `Makefile` commands as outlined above. Ex: `make build x86_64-apple-darwin`. These should handle launching a Docker container, etc.
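A sketch of what that top-level `Makefile` interface could look like (the image name is hypothetical, and a `target=` variable is used here in place of the positional argument shown above, since plain `make` doesn't take positional arguments):

```make
.PHONY: build test

# Usage: make build target=x86_64-apple-darwin
build:
	docker run --rm -v "$(PWD)":/vector vector-build-env build $(target)

test:
	docker run --rm -v "$(PWD)":/vector vector-build-env test
```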
Conclusion
The purpose of this issue is to hopefully save time when we decide to pick this back up again.
I'm closing this since I do not think this represents any new actionable work. I don't think the strategy outlined here is the correct one to support all of the targets we want as demonstrated by #689.
) with these steps defined:This tells Buildkite to use the
.buildkite/pipeline.yml
file within the Vector project. This is where we should define steps.The Buildkite steps should simply call the
Makefile
commands as outlined above. Ex:Make build x86_64-apple-darwin
. This should handle launching a docker container, etc.Conclusion
The purpose of this issues to hopefully save time when we decide to pick this back up again.
The text was updated successfully, but these errors were encountered: