ci: make docker caching on self-hosted runners work #3925

Closed
thomaseizinger opened this issue May 14, 2023 · 5 comments · Fixed by #4593

Comments

@thomaseizinger
Contributor

With self-hosted runners, I thought that caching via docker's `RUN --mount=type=cache,target=./target` would work because the machines are persistent, but for some reason it doesn't. See https://github.com/libp2p/rust-libp2p/actions/runs/4973759342/jobs/8899807653#step:4:93.
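
For reference, a cache mount of that kind looks roughly like this (a minimal sketch; the image tag, paths and binary handling are illustrative, not taken from the actual Dockerfile):

# Minimal sketch of a BuildKit cache mount for a Rust build; image tag and
# paths are illustrative. The cached ./target only survives as long as the
# BuildKit instance's local state does, which is exactly the problem here.
docker buildx build -t interop-test:local -f - . <<'EOF'
FROM rust:1.69
WORKDIR /app
COPY . .
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/target \
    cargo build --release
# Note: anything needed from ./target in later layers has to be copied out
# within the same RUN, because cache mounts are not part of the image.
EOF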

@galargh Do you have any idea why? Can we make that work at all?

@galargh
Contributor

galargh commented May 15, 2023

The machines are ephemeral and that's by design. You could use the pl-strflt/tf-aws-gh-runner/.github/actions/docker-cache@main action. In its outputs, you'll find `to` and `from` variables, which are proposed values for the `--cache-to` and `--cache-from` `docker buildx build` params. This setup uses a shared S3 bucket for caching.
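
The buildx invocation that those outputs feed into looks roughly like this (bucket, region and cache name are placeholders, not the values the action actually emits; it also needs a BuildKit builder such as the docker-container driver plus AWS credentials in the environment):

# Illustrative only: placeholder bucket/region/name, not the action's real outputs.
docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=ci-docker-cache,name=rust-libp2p \
  --cache-to type=s3,region=us-east-1,bucket=ci-docker-cache,name=rust-libp2p,mode=max \
  -t interop-test:local .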

Another option would be to use the libp2p S3 bucket. There's also the GHA actions cache, but that won't work here because of its 10 GB size limit.

@thomaseizinger
Contributor Author

thomaseizinger commented May 15, 2023

The machines are ephemeral and that's by design.

Got it.

You could use the pl-strflt/tf-aws-gh-runner/.github/actions/docker-cache@main action. In its outputs, you'll find `to` and `from` variables, which are proposed values for the `--cache-to` and `--cache-from` `docker buildx build` params. This setup uses a shared S3 bucket for caching.

That caches layers though, right? I am looking for ways to share the build cache¹ because the layer will be invalidated every time, since we are building from the latest HEAD.

It would also be nice if we could keep the Dockerfile clean like that and not mess around with copying the lockfile and a dummy `main` first, etc.

Reading https://stackoverflow.com/a/66890439, it sounds like these caches use the same storage driver, which I am assuming is overlay2?
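
The storage driver in use can be checked directly on a runner, e.g.:

docker info --format '{{.Driver}}'    # expected to print "overlay2" if the assumption above holds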

Could we somehow mount those directories on a persistent disk (keyed by a unique label on the self-hosted runner, maybe)?

Can you attach a persistent disk to multiple EC2 machines?

Or perhaps switch to a different storage driver like ZFS and have it deal with sharing the data?

Or mount S3 via a FUSE implementation and continue to use the overlay2 driver?
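
A very rough sketch of that FUSE idea, assuming s3fs-fuse and a hypothetical bucket; whether overlay2 behaves correctly on top of a FUSE filesystem is exactly the open question:

# Untested sketch; bucket name is made up and IAM access is assumed.
sudo apt-get install -y s3fs            # s3fs-fuse package
sudo systemctl stop docker
sudo s3fs ci-docker-cache /var/lib/docker/overlay2 -o iam_role=auto -o allow_other
sudo systemctl start docker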

It would be amazing if we could get this to work. We are currently spending ~5 minutes rebuilding almost the same artifacts over and over again on each CI run.

Footnotes

  1. This is a fairly recent feature of Docker: https://docs.docker.com/engine/reference/builder/#run---mounttypecache

@galargh
Contributor

galargh commented May 15, 2023

That caches layers though, right? I am looking for ways to share the build cache¹ because the layer will be invalidated every time, since we are building from the latest HEAD.

Yes, that's for layers. Cool, I didn't know they did that. Thanks for sharing.

Can you attach a persistent disk to multiple EC2 machines?

You can use EFS for that. There's also EBS multi-attach but it's more limited.
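
A sketch of what mounting a shared EFS filesystem over the Docker storage directory could look like (the filesystem ID is a placeholder; this only shows the mounting mechanics, and it is unclear whether overlay2 is happy with an NFS-backed /var/lib/docker):

# Untested sketch; requires the amazon-efs-utils package (or nfs-common and a plain NFSv4 mount).
sudo apt-get install -y amazon-efs-utils
sudo systemctl stop docker
sudo mount -t efs -o tls fs-0123456789abcdef0:/ /var/lib/docker/overlay2
sudo systemctl start docker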

It would be amazing if we could get this to work.

We also have https://github.com/pl-strflt/tf-aws-gh-runner/tree/main/.github/actions/upload-artifact and https://github.com/pl-strflt/tf-aws-gh-runner/tree/main/.github/actions/download-artifact which you can use to upload/download stuff to/from S3 without needing to configure access. I think it might be easiest to generalise these actions so that they can use shared paths instead of run-attempt-specific ones, and then use them to handle the target cache dir.
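
In shell terms, a generalised version of those actions would boil down to something like this for the target dir (bucket and prefix are made up for illustration):

aws s3 sync "s3://libp2p-ci-cache/rust-libp2p/target/" ./target   # restore before the build
# ... build ...
aws s3 sync ./target "s3://libp2p-ci-cache/rust-libp2p/target/"   # save after the build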

@thomaseizinger
Contributor Author

I played around with this a bit:

  1. docker buildx du --filter Type=exec.cachemount lists all build caches
❯ docker buildx du --filter Type=exec.cachemount
ID                                              RECLAIMABLE     SIZE            LAST ACCESSED
gtzqdyd3b8y2fxfhma2wqxsyk*                      true            2.886GB         8 minutes ago
3rprl8btsurb1xnyjxrjhxkfh*                      true            333.5MB         8 minutes ago
  2. The ID is a directory under /var/lib/docker/overlay2
❯ l /var/lib/docker/overlay2/ -l
drwx--x--- - root 14 May 20:28 1c6f7ed86c9875eb6bd7376d120a9cc0495eeb51ae86383aa6b30e3721e441cb
drwx--x--- - root 14 May 22:17 2e0unsy8wmgfvsq6dxj54hs4l
drwx--x--- - root 14 May 22:20 3ec1da38c9f472f48c49f6ceeb7e399721f3a82bfb7a368122e2eeec2b390d6d
drwx--x--- - root 14 May 22:17 3h8q5v4cbyr34wyb434fxd38x
drwx--x--- - root 14 May 22:22 3rprl8btsurb1xnyjxrjhxkfh
drwx--x--- - root 14 May 20:28 6fa01eed0487484085d6ff01d7f8508aa9100cc0034aa1632a263e1c5968e5ba
drwx--x--- - root 14 May 22:17 52f411f544223b2e8c0c45246d2ec5cd423925e7ce81034dd38bd1edd4f25127
drwx--x--- - root 14 May 20:28 67ecae26a3eacba5e087dca7951e95d05010cc390da95b091da546d1007d8b4a
drwx--x--- - root 14 May 20:28 69dcb4c2234d25c8da573c988fa438876c43d3d343906518f9bb44c0f04869df
drwx--x--- - root 14 May 22:28 74hftykdfcarzy1wlpuppuv0c
drwx--x--- - root 14 May 20:28 40097b7a42ee7f5452431c8a4ab18b025ccdab103fea8d5256fc04285dfe7942
drwx--x--- - root 14 May 22:20 269509b45590fe1c927361aca11be82a20c758ec99ac0733eab860ad2d8341b9
drwx--x--- - root 14 May 22:28 6274096ce49139ff0e5ded0d2068c49a422d55abe779b54749f78c798f3910d8
drwx--x--- - root 14 May 22:21 a05a51933fdb3ba16d0fce52805ebe7231e8edf08b44cc62080b2196d01b74e3
drwx--x--- - root 14 May 22:20 b25094cadd0b4daf9627c2dcf275adfb4334e36c586e0aea5fa43fd5e13abe49
drwx--x--- - root 14 May 22:21 c8d874225ad9917c52b91f186191230e46add5758320eb306ecf6956035b45a7
drwx--x--- - root 14 May 22:28 cdigd163j7otehodcd11ra9t9
drwx--x--- - root 14 May 20:28 f7f08531090d4adbaa4d792cc84a343deb777283d1238106901861977c903d98
drwx--x--- - root 14 May 22:32 f15752aa40e1291d6aba37ea73bb42162eda94ba4a86f9b08f05d51334e3ac73
drwx--x--- - root 14 May 21:25 f15752aa40e1291d6aba37ea73bb42162eda94ba4a86f9b08f05d51334e3ac73-init
drwx--x--- - root 14 May 22:28 ggtim3523df4uf31x5x10vuon
drwx--x--- - root 14 May 22:22 gtzqdyd3b8y2fxfhma2wqxsyk
drwx--x--- - root 14 May 22:17 ijhhoc77up9hw5h5vb5adfy3t
drwx------ - root 14 May 22:28 l
drwx--x--- - root 14 May 22:28 ra2g6syusbpmtyholkpg918mr

Those are the directories we would have to mount / save & restore.
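
An untested sketch of what saving those directories could look like, using the IDs reported by `docker buildx du` (the bucket is a placeholder, and whether BuildKit would still recognise the cache after restoring them is an open question):

# Untested sketch: archive each cache-mount directory and push it to S3.
ids=$(docker buildx du --filter Type=exec.cachemount | awk 'NR>1 && /\*/ {print $1}' | tr -d '*')
for id in $ids; do
  sudo tar -C /var/lib/docker/overlay2 -czf "/tmp/${id}.tar.gz" "$id"
  aws s3 cp "/tmp/${id}.tar.gz" "s3://libp2p-ci-cache/buildkit-mounts/${id}.tar.gz"
done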

Can you attach a persistent disk to multiple EC2 machines?

You can use EFS for that. There's also EBS multi-attach but it's more limited.

Have you used EFS before? How difficult would it be to do something like:

  • Have a label "efs-docker-storage" for self-hosted runners
  • When applied, everything within /var/lib/docker/overlay2 is mounted as an EFS

mergify bot pushed a commit that referenced this issue May 15, 2023
By using a multi-stage Docker build, a distroless base image and a release build, we can get the size of the Rust interop test down to 50 MB. Previously, the image was around 500 MB. A debug build image would still be ~400 MB. The release build slows down our interop build step by about 1 min 20 s. That, however, is only because we don't currently seem to utilize the caches that, from what I understand, should work on self-hosted runners. I opened #3925 for that.

Resolves: #3881.

Pull-Request: #3926.
@galargh
Contributor

galargh commented May 15, 2023

Have you used EFS before? How difficult would it be to do something like:

  • Have a label "efs-docker-storage" for self-hosted runners
  • When applied, everything within /var/lib/docker/overlay2 is mounted as an EFS

I did, but it was quite a while back. I remember it being quite smooth, but the details are really vague in my head. I don't think it'd be a terribly complicated thing to put together. I added an issue for this in the self-hosted runners repo so that we don't forget about it - ipdxco/custom-github-runners#26. Unfortunately, I don't think we'll be able to pick it up any time soon due to other commitments. I could let you into our AWS account if you wanted to experiment with it yourself. But I think falling back to S3 upload/download might be quicker to put together.

mergify bot closed this as completed in #4593 on Oct 5, 2023
mergify bot pushed a commit that referenced this issue Oct 5, 2023
Currently, the Docker images for the HEAD branch of a pull request get rebuilt completely every time we push a new commit to the branch. That is because the RUN caches use the local disk of the host system, but those are ephemeral in GitHub Actions.

To fix this, we rewrite the Dockerfiles to use `cargo chef`, a tool developed to create a cached layer of built dependencies that doesn't get invalidated as the application source changes.

Normally, these layers are also cached on the local filesystem. To have them available across pull requests and branches, we instruct BuildKit to use the same S3 cache that we already use for Docker layers in the interop tests. As a result, this should greatly speed up our CI.

Resolves: #3925.

Pull-Request: #4593.
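
For reference, the `cargo chef` pattern described above follows roughly this shape (a sketch with illustrative image tags and paths, not the actual Dockerfile from #4593):

# Sketch only; image tag and binary layout are assumptions, not the real Dockerfile.
docker buildx build -t rust-libp2p-server:local -f - . <<'EOF'
FROM rust:1.73 AS chef
RUN cargo install cargo-chef
WORKDIR /app

FROM chef AS planner
COPY . .
# The recipe captures only the dependency graph, not the application sources.
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
# This layer is only invalidated when dependencies change, so it caches well.
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
RUN cargo build --release
EOF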