Make our ship_it.yml GHA workflow resilient #476
Merged
TL;DR
As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines): https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1
As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391
While running on the free GHA runners can be 3x-8x slower, it's a good fallback. You have heard us say on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io, K8s), let's make our GHA workflow resilient by:
secrets)
While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.
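The primary-plus-fallback wiring could look roughly like the sketch below. This is not the actual contents of ship_it.yml: the job names are taken from this PR, while the trigger, the runner labels, and the placeholder `echo` commands are assumptions for illustration. The fallback job starts only when the primary job fails, using `needs` together with `if: failure()`.

```yaml
# Hypothetical sketch; not the real ship_it.yml.
name: Ship It (sketch)

on:
  push:
    branches: [master]  # assumed trigger

jobs:
  # Primary: Dagger against the Docker engine hosted on Fly.io.
  dagger-on-fly-docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run pipeline via Fly.io
        run: echo "placeholder - run the Dagger pipeline against Fly.io"

  # Fallback: the same pipeline on the free GitHub runner,
  # started only if the primary job failed.
  dagger-on-github-fallback:
    needs: dagger-on-fly-docker
    if: ${{ failure() }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run pipeline on the GitHub runner
        run: echo "placeholder - run the Dagger pipeline locally"
```

With this kind of wiring, a pipeline that fails for genuine reasons runs once per job, which is the "fails twice" trade-off accepted above.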
This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.
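A CI-only job on the self-hosted K8s runner might be declared along these lines. This is a sketch under assumptions: the `self-hosted` label is the generic one GitHub assigns to self-hosted runners (the real runner may use more specific labels), the `echo` command is a placeholder, and `continue-on-error` is one possible way to keep an experimental job from failing the whole workflow, not necessarily what this PR does.

```yaml
# Hypothetical fragment; labels and commands are placeholders.
jobs:
  dagger-on-k8s:
    runs-on: self-hosted      # assumed label for the K8s-hosted runner
    continue-on-error: true   # assumed: a failure here does not fail the workflow
    steps:
      - uses: actions/checkout@v4
      - name: Run CI only (no deploy)
        run: echo "placeholder - run the CI part of the Dagger pipeline"
```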
As part of this, we also improved how we check for Fly.io connectivity.
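One way to check connectivity before handing work to a remote engine is a short TCP probe with a timeout. The steps below are an illustrative sketch, not the check this PR implements: the host name, the port, and the step names are all hypothetical. The probe uses bash's `/dev/tcp` redirection, so it fails on DNS resolution errors as well as on refused or timed-out connections.

```yaml
# Hypothetical steps; host, port, and commands are placeholders.
steps:
  - name: Check Fly.io connectivity
    env:
      FLY_DOCKER_HOST: example.internal  # assumed host name
      FLY_DOCKER_PORT: "2375"            # assumed port
    run: |
      # Open a TCP connection and give up after 5 seconds.
      # A non-zero exit fails this step, and with it the job.
      timeout 5 bash -c "exec 3<>/dev/tcp/${FLY_DOCKER_HOST}/${FLY_DOCKER_PORT}"

  - name: Run pipeline via Fly.io
    run: echo "placeholder - run the Dagger pipeline against Fly.io"
```

Failing fast at the probe means a job that depends on this one with `if: failure()` can take over within seconds instead of waiting for the pipeline itself to time out.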
Things that could be improved in follow-ups:
dagger-on-github-fallback job succeeds
dagger-on-fly-docker fails
dagger-on-k8s job as secondary fallback