Make our ship_it.yml GHA workflow resilient #476

gerhard · 2023-07-30T15:06:05Z

TL;DR

As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines): https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1

As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391

While running on the free GHA runners can be 3x-8x slower, it's a good fallback. You have heard us say on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io, K8s), let's make our GHA workflow resilient by:

- Run on our preferred back-end by default (Dagger on Fly.io)
  - ✅ If it succeeds, we are done
  - ❌ If it fails, fall back to running on the free GitHub runners
- In forks, use free GitHub runners by default (we cannot share `secrets`)

While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.
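The strategy above could be wired up roughly as follows. This is a minimal sketch, not the actual `ship_it.yml`: the job names `dagger-on-fly-docker` and `dagger-on-github-fallback` come from this PR, while the triggers, the `runs-on` values, the `dagger-on-github-fork` job name and the placeholder `run` steps are assumptions made for illustration.

```yaml
# Sketch only: triggers, runners, step bodies and the fork job name are assumptions.
name: Ship It
on:
  push:
    branches: [master]
  pull_request:

jobs:
  # Preferred back-end. Skipped in forked repos and for PRs coming from forks,
  # because neither has access to this repo's secrets.
  dagger-on-fly-docker:
    if: ${{ github.repository == 'thechangelog/changelog.com' && !github.event.pull_request.head.repo.fork }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run the pipeline on Fly.io
        run: echo "placeholder - connect to the engine on Fly.io and run Dagger"

  # Fallback. failure() is true only when a job listed in `needs` has failed,
  # so this job is skipped when the preferred job succeeds or is skipped.
  dagger-on-github-fallback:
    needs: dagger-on-fly-docker
    if: ${{ failure() }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run the pipeline on the free GitHub runner
        run: echo "placeholder - run Dagger locally on this runner"

  # Forks (hypothetical job name): go straight to the free GitHub runners.
  dagger-on-github-fork:
    if: ${{ github.repository != 'thechangelog/changelog.com' || github.event.pull_request.head.repo.fork }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run the pipeline without secrets
        run: echo "placeholder - run Dagger locally on this runner"
```

With this shape, a genuine failure still fails twice, once per back-end, which matches the trade-off described above.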

This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.

As part of this, we also improved how we check for Fly.io connectivity.

Things that could be improved in follow-ups:

- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
  - currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
  - GitHub Actions is currently missing "Add ability to prioritize GitHub Action runners" (actions/runner#1665)
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker) #471
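For the first follow-up, one possible approach (an assumption, not something this PR implements) is to let the primary step fail without failing its job, expose the step's `outcome` as a job output, and trigger the fallback from that output. The job names are the ones used in this PR; the step id `pipeline`, the output name `outcome`, the runners and the placeholder commands are hypothetical.

```yaml
# Sketch only: step id, output name, runners and step bodies are assumptions.
jobs:
  dagger-on-fly-docker:
    runs-on: ubuntu-latest
    outputs:
      # `outcome` is the step result before continue-on-error is applied.
      outcome: ${{ steps.pipeline.outcome }}
    steps:
      - uses: actions/checkout@v3
      - id: pipeline
        name: Run the pipeline on Fly.io
        # A failure here no longer fails the job, and so no longer fails the workflow.
        continue-on-error: true
        run: echo "placeholder - connect to the engine on Fly.io and run Dagger"

  dagger-on-github-fallback:
    needs: dagger-on-fly-docker
    # Runs only when the primary step actually failed.
    if: ${{ needs.dagger-on-fly-docker.outputs.outcome == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run the pipeline on the free GitHub runner
        run: echo "placeholder - run Dagger locally on this runner"
```

The trade-off is that the primary job is reported as successful even when its pipeline step failed, so the failure is only visible in the step status and in whether the fallback job ran.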

gerhard · 2023-07-30T15:20:11Z

Key takeaway: using our own runners is ~7-8x quicker (regardless of whether they run on Fly.io or K8s).

Here is the job that ran on my forked repo: https://github.com/gerhard/changelog.com/actions/runs/5706915296

Note
PRs from forks do not have access to this repo's vars or secrets, or the ones defined in the forked repo.

And this is the PR job:

We may want to consider:

- allowing public repositories in Runner groups / Default
- requiring approval for all outside collaborators at the org level
- replacing default GitHub runners with self-hosted ones (currently running on my production K8s cluster) - FWIW:

Let's discuss!


gerhard · 2023-07-31T07:41:02Z

FWIW: https://community.fly.io/t/wireguard-tunnel-is-not-working-linux-mint-20-3/7949/9

Related to #476 Signed-off-by: Gerhard Lazu <[email protected]>

gerhard force-pushed the make-our-ship_it-workflow-resilient branch from 921a242 to 5c13070 Compare July 30, 2023 15:06

gerhard force-pushed the make-our-ship_it-workflow-resilient branch 4 times, most recently from cea42b8 to eda1014 Compare July 31, 2023 07:08

gerhard force-pushed the make-our-ship_it-workflow-resilient branch from eda1014 to 2fcffd9 Compare July 31, 2023 07:11

gerhard merged commit 271286c into thechangelog:master Jul 31, 2023

gerhard deleted the make-our-ship_it-workflow-resilient branch July 31, 2023 07:30

gerhard added a commit that referenced this pull request Jul 31, 2023

Bump flyctl & capture most recent wireguard new peer instructions

c7b8a57

Related to #476 Signed-off-by: Gerhard Lazu <[email protected]>

gerhard mentioned this pull request Oct 30, 2023

ci: Add DAGGER_CLOUD_TOKEN client-side dagger/dagger#6019

Closed
