ci: make docker caching on self-hosted runners work #3925
The machines are ephemeral and that's by design. You could use the pl-strflt/tf-aws-gh-runner/.github/actions/docker-cache@main action; in its outputs, you'll find … Another option would be to use the libp2p S3 bucket. There's also the GHA actions cache, but that won't work here because of its 10GB size limit.
Got it.
That caches layers though, right? I am looking for ways to share the build cache, because the layer will be invalidated every time since we are building from the latest HEAD. It would also be nice if we could keep the Dockerfile clean like that and not mess around with copying the lockfile and a dummy main first, etc.

Reading https://stackoverflow.com/a/66890439, it sounds like these caches use the same storage driver, which is overlay2 I am assuming? Can we somehow mount those directories on a persistent disk (with a unique label on the hosted runner, maybe)? Can you attach a persistent disk to multiple EC2 machines? Or perhaps switch to a different storage driver like zfs and have it deal with sharing the data? Or mount S3 via a FUSE implementation and continue to use the overlay2 driver?

It would be amazing if we could get this to work. We are currently spending ~5 min rebuilding almost the same artifacts over and over again on each CI run.
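For reference, the cache-mount setup under discussion looks roughly like this (a minimal sketch, assuming a plain cargo project; the Rust version and the `my-app` binary name are placeholders, not the actual interop-tests Dockerfile):

```Dockerfile
# syntax=docker/dockerfile:1
FROM rust:1.70 AS builder
WORKDIR /app
COPY . .
# BuildKit cache mounts: the cargo registry and target dir live in a
# host-side cache rather than in an image layer, so `cargo build` can
# reuse previous build artifacts -- but only if the host disk survives
# between runs, which it doesn't on ephemeral runners.
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/target \
    cargo build --release && \
    # the cache mount is gone after this RUN step finishes, so the
    # binary has to be copied out while the mount is still present
    cp target/release/my-app /usr/local/bin/my-app
```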
Yes, that's for layers. Cool, I didn't know they did that. Thanks for sharing.
You can use EFS for that. There's also EBS multi-attach but it's more limited.
We also have https://github.com/pl-strflt/tf-aws-gh-runner/tree/main/.github/actions/upload-artifact and https://github.com/pl-strflt/tf-aws-gh-runner/tree/main/.github/actions/download-artifact, which you can use to upload/download stuff to/from S3 without needing to configure access. I think it might be easiest to generalise these actions so that they can use shared paths instead of run-attempt-specific ones, and then use them to handle the target cache dir.
I played around with this a bit:
Those are the directories we would have to mount / save & restore.
Have you used EFS before? How difficult would it be to do something like:
By using a multi-stage docker build, a distroless base image and a release build, we can get the size of the Rust interop test image down to 50MB. Previously, the image would be around 500MB; a debug build image would still be ~400MB. The release build slows down our interop build step by about 1min 20s, but only because we don't currently seem to utilize the caches that, from what I understand, should work on self-hosted runners. I opened #3925 for that. Resolves: #3881. Pull-Request: #3926.
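The layout described here is roughly the following (a minimal sketch; the Rust version, the `interop-tests` binary name and the final paths are assumptions for illustration, not the actual Dockerfile from the PR):

```Dockerfile
# Stage 1: full Rust toolchain, release build.
FROM rust:1.70 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release --bin interop-tests

# Stage 2: distroless base -- no shell, no package manager, just the C
# runtime a dynamically linked Rust binary needs. Only the binary is
# copied over, which is what shrinks the image from ~500MB to ~50MB.
FROM gcr.io/distroless/cc
COPY --from=builder /app/target/release/interop-tests /usr/local/bin/interop-tests
ENTRYPOINT ["/usr/local/bin/interop-tests"]
```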
I did, but it was quite a while back. I remember it being quite smooth, but the details are really vague in my head. I don't think it'd be a terribly complicated thing to put together. I added an issue for this in the self-hosted runners repo so that we don't forget about it - ipdxco/custom-github-runners#26. Unfortunately, I don't think we'll be able to pick it up any time soon due to other commitments. I could let you into our AWS account if you wanted to experiment with it yourself, but I think falling back to S3 upload/download might be quicker to put together.
Currently, the Docker images for the HEAD of a pull request get rebuilt completely every time we push a new commit to a branch. That is because the RUN caches use the local disk of the host system, but those disks are ephemeral in GitHub Actions. To fix this, we rewrite the Dockerfiles to use `cargo chef`, a tool developed to create a cached layer of built dependencies that doesn't get invalidated as the application source changes. Normally, these layers are also cached on the local filesystem; to have them available across pull requests and branches, we instruct BuildKit to use the same S3 cache that we already use for Docker layers in the interop tests. As a result, this should greatly speed up our CI. Resolves: #3925. Pull-Request: #4593.
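The `cargo chef` layout looks roughly like this (a minimal sketch under assumed versions and flags, not the exact Dockerfile from the PR):

```Dockerfile
FROM rust:1.70 AS chef
RUN cargo install cargo-chef
WORKDIR /app

# Compute a "recipe" of the dependency tree. The recipe only changes
# when Cargo.toml/Cargo.lock change, not when application code does.
FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
# Build dependencies only. This layer stays valid across source-only
# commits, so it is the one worth persisting in a remote cache.
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
RUN cargo build --release
```

The S3 part is configured on the build invocation rather than in the Dockerfile: BuildKit's S3 cache backend is selected via `--cache-from`/`--cache-to type=s3,...` on `docker buildx build`.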
With self-hosted runners, I thought that caching via docker's
RUN --mount=type=cache,target=./target
would work because the machines are persistent, but for some reason it doesn't. See https://github.com/libp2p/rust-libp2p/actions/runs/4973759342/jobs/8899807653#step:4:93. @galargh Do you have any idea why? Can we make that work at all?