Make our ship_it.yml GHA workflow resilient #476

Merged

@gerhard gerhard commented Jul 30, 2023

As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines): https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1

As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391

While running on the free GHA runners can be 3x-8x slower, it's a good fallback. You have heard us mention on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io, K8s), let's make our GHA workflow resilient:

  • Run on our preferred back-end by default (Dagger on Fly.io)
    • ✅ If it succeeds, we are done
    • ❌ If it fails, fall back to running on the free GitHub runners
  • In forks, use free GitHub runners by default (we cannot share secrets)
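
The fallback behaviour in this list could be wired up roughly as follows. This is a minimal sketch, not the actual `ship_it.yml`: the job names `dagger-on-fly-docker` and `dagger-on-github-fallback` come from this PR, while the triggers, the repository check used to detect forks, the runner labels and the `./ci.sh` placeholder command are assumptions made for illustration.

```yaml
# Hypothetical sketch of the fallback logic; NOT the real ship_it.yml.
name: ship_it (sketch)

on:
  push:
    branches: [master]
  workflow_dispatch:

jobs:
  # Preferred back-end: Dagger on Fly.io.
  # Assumption: forks are detected by repository name and skip this job,
  # because they cannot use this repo's secrets.
  dagger-on-fly-docker:
    if: ${{ github.repository == 'thechangelog/changelog.com' }}
    # Assumption: a GitHub-hosted runner that talks to a remote Dagger Engine.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the pipeline against the Dagger Engine on Fly.io
        run: ./ci.sh # placeholder for the real pipeline command

  # Fallback: Dagger on a free GitHub runner.
  # Runs whenever the preferred job did not succeed, i.e. it failed
  # or was skipped (as it is in a fork).
  dagger-on-github-fallback:
    needs: dagger-on-fly-docker
    if: ${{ !cancelled() && needs.dagger-on-fly-docker.result != 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the pipeline with a local Dagger Engine
        run: ./ci.sh # placeholder for the real pipeline command
```

The explicit `!cancelled()` matters here: without a status check function in `if`, GitHub Actions skips any job whose `needs` dependency failed or was skipped. In this sketch the run as a whole is still reported as failed when the first job fails, even if the fallback job succeeds.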

While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.

This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.

As part of this, we also improved how we check for Fly.io connectivity.



@gerhard gerhard force-pushed the make-our-ship_it-workflow-resilient branch from 921a242 to 5c13070 Compare July 30, 2023 15:06
gerhard commented Jul 30, 2023

Key takeaway: using our own runners is ~7-8x quicker (regardless of whether they run on Fly.io or K8s).


Here is the job that ran on my forked repo: https://github.com/gerhard/changelog.com/actions/runs/5706915296

Note
PRs from forks do not have access to this repo's vars or secrets, or the ones defined in the forked repo.

And this is the PR job: [screenshot]


We may want to consider:

  • allowing public repositories in Runner groups / Default
  • requiring approval for all outside collaborators at the org level
  • replacing the default GitHub runners with self-hosted ones (currently running on my production K8s cluster); FWIW: [screenshot]

Let's discuss!

@gerhard gerhard force-pushed the make-our-ship_it-workflow-resilient branch 4 times, most recently from cea42b8 to eda1014 Compare July 31, 2023 07:08
Things that could be improved in follow-ups:
- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
  - currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
  - GitHub Actions is currently missing actions/runner#1665
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker)
  - thechangelog#471

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard force-pushed the make-our-ship_it-workflow-resilient branch from eda1014 to 2fcffd9 Compare July 31, 2023 07:11
@gerhard gerhard merged commit 271286c into thechangelog:master Jul 31, 2023
@gerhard gerhard deleted the make-our-ship_it-workflow-resilient branch July 31, 2023 07:30
gerhard commented Jul 31, 2023

gerhard added a commit that referenced this pull request Jul 31, 2023