Commit 46e9f0b

Merge pull request #751 from Kobzol/rustc-ci
Add a link to CI documentation in rustc-dev-guide
2 parents d12ceb9 + aa5a929 commit 46e9f0b

1 file changed, with 4 additions and 272 deletions

src/infra/docs/rustc-ci.md

@@ -1,276 +1,8 @@
11
# How the Rust CI works
22

3-
Rust CI ensures that the master branch of rust-lang/rust is always in a valid state.
3+
Continuous integration (CI) workflows on the `rust-lang/rust` repository ensure that the `master` branch
4+
is always in a valid state.
45

5-
A developer submitting a pull request to rust-lang/rust experiences the following:
6+
The CI infrastructure is described in detail in the [rustc-dev-guide][rustc-dev-guide].
67

7-
- A small subset of tests and checks are run on each commit to catch common errors.
8-
- When the PR is ready and approved, the "bors" tool enqueues a full CI run.
9-
- The full run either queues the specific PR or the PR is "rolled up" with other changes.
10-
- Eventually a CI run containing the changes from the PR is performed and either passes or fails with an error the developer must address.
11-
12-
## Which jobs we run
13-
14-
The `rust-lang/rust` repository uses GitHub Actions to test [all the
15-
platforms][platforms] we support. We currently have two kinds of jobs running
16-
for each commit we want to merge to master:
17-
18-
- Dist jobs build a full release of the compiler for that platform, including
19-
all the tools we ship through rustup. Those builds are then uploaded to the
20-
`rust-lang-ci2` S3 bucket and are available to be locally installed with the
21-
[rustup-toolchain-install-master] tool. The same builds are also used for
22-
actual releases: our release process basically consists of copying those
23-
artifacts from `rust-lang-ci2` to the production endpoint and signing them.
24-
- Non-dist jobs run our full test suite on the platform, and the test suite of
25-
all the tools we ship through rustup. The amount of testing depends on
26-
the platform (for example some tests are run only on Tier 1 platforms), and
27-
some quicker platforms are grouped together on the same builder to avoid
28-
wasting CI resources.
29-
30-
All the builds except those on macOS and Windows are executed inside that
31-
platform’s custom [Docker container]. This has a lot of advantages for us:
32-
33-
- The build environment is consistent regardless of the changes of the
34-
underlying image (switching from the trusty image to xenial was painless for
35-
us).
36-
- We can use ancient build environments to ensure maximum binary compatibility,
37-
for example [using older CentOS releases][dist-x86_64-linux] on our Linux builders.
38-
- We can avoid reinstalling tools (like QEMU or the Android emulator) every
39-
time thanks to Docker image caching.
40-
- Users can run the same tests in the same environment locally by just running
41-
`src/ci/docker/run.sh image-name`, which is very useful for debugging failures.
42-
43-
The docker images prefixed with `dist-` are used for building artifacts while those without that prefix run tests and checks.
44-
45-
We also run tests for less common architectures (mainly Tier 2 and Tier 3
46-
platforms) in CI. Since those platforms are not x86 we either run
47-
everything inside QEMU or just cross-compile if we don’t want to run the tests
48-
for that platform.
49-
50-
These builders are running on a special pool of builders set up and maintained for us by GitHub.
51-
52-
Almost all build steps shell out to separate scripts. This keeps the CI fairly platform independent (i.e., we are not
53-
overly reliant on GitHub Actions). GitHub Actions is only relied on for bootstrapping the CI process and for orchestrating
54-
the scripts that drive the process.
55-
56-
[platforms]: https://doc.rust-lang.org/nightly/rustc/platform-support.html
57-
[rustup-toolchain-install-master]: https://github.com/kennytm/rustup-toolchain-install-master
58-
[Docker container]: https://github.com/rust-lang/rust/tree/master/src/ci/docker
59-
[dist-x86_64-linux]: https://github.com/rust-lang/rust/blob/master/src/ci/docker/host-x86_64/dist-x86_64-linux/Dockerfile
60-
61-
## Merging PRs serially with bors
62-
63-
CI services usually test the last commit of a branch merged with the last
64-
commit in master, and while that’s great to check if the feature works in
65-
isolation it doesn’t provide any guarantee the code is going to work once it’s
66-
merged. Breakages like these usually happen when another, incompatible PR is
67-
merged after the build happened.
68-
69-
To ensure a master that works all the time we forbid manual merges: instead all
70-
PRs have to be approved through our bot, [bors] (the software behind it is
71-
called [homu]). All the approved PRs are put [in a queue][homu-queue] (sorted
72-
by priority and creation date) and are automatically tested one at a time. If
73-
all the builders are green the PR is merged, otherwise the failure is recorded
74-
and the PR will have to be re-approved.
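
As a rough illustration of the queue ordering described above, the following Python sketch sorts approved PRs by priority and then by creation date, and tests them one at a time. The `ApprovedPr` type, its fields, and the `run_full_ci` callback are hypothetical names introduced for this example; this is not how homu is implemented, and it assumes that a higher priority value is tested earlier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ApprovedPr:
    # Hypothetical record; the field names are assumptions, not homu's schema.
    number: int
    priority: int    # assumed: a higher value is tested earlier
    created_at: int  # assumed: Unix timestamp of PR creation


def process_queue(prs: List[ApprovedPr],
                  run_full_ci: Callable[[ApprovedPr], bool]) -> List[int]:
    """Test approved PRs one at a time and return the numbers that merged."""
    merged = []
    # Sort by priority (descending), then by creation date (oldest first).
    for pr in sorted(prs, key=lambda p: (-p.priority, p.created_at)):
        if run_full_ci(pr):
            merged.append(pr.number)
        # On failure the PR is skipped; it would need a new approval
        # before it could be tested again.
    return merged
```

Because only one PR is tested at a time, a single slow or failing run delays everything behind it in the queue.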
75-
76-
Bors doesn’t interact with CI services directly, but it works by pushing the
77-
merge commit it wants to test to a branch called `auto`, and detecting the
78-
outcome of the build by listening for either Commit Statuses or Check Runs.
79-
Since the merge commit is based on the latest master and only one can be tested
80-
at the same time, when the results are green master is fast-forwarded to that
81-
merge commit.
82-
83-
The `auto` branch and other branches used by bors live on a fork of rust-lang/rust:
84-
[rust-lang-ci/rust]. This was originally done due to some security limitations in GitHub
85-
Actions. These limitations have been addressed, but we've not yet done the work of removing
86-
the use of the fork.
87-
88-
Unfortunately testing a single PR at a time, combined with our long CI (~3
89-
hours for a full run)[^1], means we can’t merge too many PRs in a single day, and a
90-
single failure greatly impacts our throughput for the day. The maximum number
91-
of PRs we can merge in a day is around 8.
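
The figure of 8 is consistent with simple arithmetic, assuming full runs of about 3 hours executed strictly back to back with no idle time or failures:

```latex
\frac{24\ \text{hours per day}}{3\ \text{hours per run}} = 8\ \text{runs per day}
```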
92-
93-
The long CI run times and the need for a large builder pool are largely due to the
94-
fact that full release artifacts are built in the `dist-` builders. This is worth it
95-
because these release artifacts:
96-
97-
- allow perf testing even at a later date
98-
- allow bisection when bugs are discovered later
99-
- ensure release quality since if we're always releasing, we can catch problems early
100-
101-
Bors [runs on ecs](https://github.com/rust-lang/simpleinfra/blob/master/terraform/bors/app.tf) and uses a sqlite database running in a volume as storage.
102-
103-
[^1]: As of January 2023, the bottlenecks are the `dist-x86_64-linux` and `dist-x86_64-linux-alt` runners because of their usage of [BOLT] and [PGO] optimization tooling.
104-
105-
[bors]: https://github.com/bors
106-
[homu]: https://github.com/rust-lang/homu
107-
[homu-queue]: https://bors.rust-lang.org/queue/rust
108-
[rust-lang-ci/rust]: https://github.com/rust-lang-ci/rust
109-
[BOLT]: https://github.com/facebookincubator/BOLT
110-
[PGO]: https://en.wikipedia.org/wiki/Profile-guided_optimization
111-
112-
### Rollups
113-
114-
Some PRs don’t need the full test suite to be executed: trivial changes like
115-
typo fixes or README improvements *shouldn’t* break the build, and testing
116-
every single one of them for 2 to 3 hours is a big waste of time. To solve this
117-
we do a "rollup", a PR where we merge all the trivial PRs so they can be tested
118-
together. Rollups are created manually by a team member using the "create a rollup" button on the [bors queue]. The team member uses their
119-
judgment to decide if a PR is risky or not. Rollups are the best tool we have at
120-
the moment to keep the queue in a manageable state.
121-
122-
[bors queue]: https://bors.rust-lang.org/queue/rust
123-
124-
### Try builds
125-
126-
Sometimes we need a working compiler build before approving a PR, usually for
127-
[benchmarking][perf] or [checking the impact of the PR across the
128-
ecosystem][crater]. Bors supports creating them by pushing the merge commit on
129-
a separate branch (`try`), and they basically work the same as normal builds,
130-
without the actual merge at the end. Any number of try builds can happen at the
131-
same time, even if there is a normal PR in progress.
132-
133-
You can see the CI configuration for try builds [here](https://github.com/rust-lang/rust/blob/9d46c7a3e69966782e163877151c1f0cea8b630a/src/ci/github-actions/ci.yml#L728-L741).
134-
135-
If you want to perform a try build with a different configuration (e.g. try to
136-
perform a compiler build for a different architecture), you can temporarily change
137-
the `try` CI job in your PR:
138-
139-
1) Open `src/ci/github-actions/ci.yml`
140-
2) Find the CI job that you want to run (e.g. `dist-aarch64-linux`)
141-
3) Copy-paste the entry of the CI job
142-
4) Find the `try:` job in the file
143-
5) Replace the `dist-x86_64-linux` job in the matrix with the copied entry from step 3)
144-
6) Run `python3 x.py run src/tools/expand-yaml-anchors`
145-
7) Push your changes and start a try build with `@bors try`
146-
147-
[perf]: https://perf.rust-lang.org
148-
[crater]: https://github.com/rust-lang/crater
149-
150-
## Which branches we test
151-
152-
Our builders are defined in [`src/ci/github-actions/ci.yml`].
153-
154-
[`src/ci/github-actions/ci.yml`]: https://github.com/rust-lang/rust/blob/master/src/ci/github-actions/ci.yml
155-
156-
### PR builds
157-
158-
All the commits pushed in a PR run a limited set of tests: a job containing a
159-
bunch of lints plus a cross-compile check build to Windows mingw (without
160-
producing any artifacts) and the `x86_64-gnu-llvm-##` non-dist builder (where
161-
`##` is the *system* LLVM version we are currently testing). Those two
162-
builders are enough to catch most of the common errors introduced in a PR, but
163-
they don’t cover other platforms at all. Unfortunately it would take too many
164-
resources to run the full test suite for each commit on every PR.
165-
166-
Additionally, if the PR changes certain tools (or certain platform-specific
167-
parts of std to check for miri breakage), the `x86_64-gnu-tools` non-dist
168-
builder is run.
169-
170-
### The `try` branch
171-
172-
On the main rust repo, `try` builds produce just a Linux toolchain using the
173-
`dist-x86_64-linux` image.
174-
175-
### The `auto` branch
176-
177-
This branch is used by bors to run all the tests on a PR before merging it, so
178-
all the builders are enabled for it. bors will repeatedly force-push on it
179-
(every time a new commit is tested).
180-
181-
### The `master` branch
182-
183-
Since all the commits to `master` are fast-forwarded from the `auto` branch (if
184-
they pass all the tests there) we don’t need to build or test anything. A quick
185-
job is executed on each push to update toolstate (see the toolstate description
186-
below).
187-
188-
### Other branches
189-
190-
Other branches are just disabled and don’t run any kind of builds, since all
191-
the in-progress branches will eventually be tested in a PR.
192-
193-
## Caching
194-
195-
The main rust repository doesn’t use the native GitHub Actions caching tools.
196-
All our caching is uploaded to an S3 bucket we control
197-
(`rust-lang-ci-sccache2`), and it’s used mainly for two things:
198-
199-
### Docker images caching
200-
201-
The Docker images we use to run most of the Linux-based builders take a *long*
202-
time to fully build. To speed up the build, we cache the exported images on the
203-
S3 bucket (with `docker save`/`docker load`).
204-
205-
Since we test multiple, diverged branches (`master`, `beta` and `stable`) we
206-
can’t rely on a single cache for the images, otherwise builds on a branch would
207-
override the cache for the others. Instead we store the images identifying them
208-
with a custom hash, made from the host’s Docker version and the contents of all
209-
the Dockerfiles and related scripts.
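
A minimal sketch of such a cache key, assuming SHA-256 as the hash and a caller-supplied list of files; the function name `image_cache_key` and the exact inputs are hypothetical, and the real CI scripts may combine these ingredients differently:

```python
import hashlib
from pathlib import Path
from typing import Iterable


def image_cache_key(docker_version: str, files: Iterable[Path]) -> str:
    """Derive a cache key from the Docker version and the files' contents."""
    digest = hashlib.sha256()  # the choice of SHA-256 is an assumption
    digest.update(docker_version.encode("utf-8"))
    # Sort the paths so the key does not depend on listing order.
    for path in sorted(files):
        digest.update(b"\0")  # separator between inputs
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Since the key changes whenever a Dockerfile, a related script, or the Docker version changes, branches that have diverged get different keys and do not overwrite each other's cached images.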
210-
211-
### LLVM caching with sccache
212-
213-
We build some C/C++ stuff during the build and we rely on [sccache] to cache
214-
intermediate LLVM artifacts. Sccache is a distributed ccache developed by
215-
Mozilla, and it can use an object storage bucket as the storage backend, like
216-
we do with our S3 bucket.
217-
218-
[sccache]: https://github.com/mozilla/sccache
219-
220-
## Custom tooling around CI
221-
222-
Over the years we have developed some custom tooling to improve our CI experience.
223-
224-
### Rust Log Analyzer to show the error message in PRs
225-
226-
The build logs for `rust-lang/rust` are huge, and it’s not practical to find
227-
what caused the build to fail by looking at the logs. To improve the
228-
developers’ experience we developed a bot called [Rust Log Analyzer][rla] (RLA)
229-
that receives the build logs on failure and extracts the error message
230-
automatically, posting it on the PR.
231-
232-
The bot is not hardcoded to look for error strings, but was trained with a
233-
bunch of build failures to recognize which lines are common between builds and
234-
which are not. While the generated snippets can be weird sometimes, the bot is
235-
pretty good at identifying the relevant lines even if it’s an error we've never
236-
seen before.
237-
238-
[rla]: https://github.com/rust-lang/rust-log-analyzer
239-
240-
### Toolstate to support allowed failures
241-
242-
The `rust-lang/rust` repo doesn’t only test the compiler on its CI, but also a
243-
variety of tools and documentation. Some documentation is pulled in via git
244-
submodules. If we blocked merging rustc PRs on the documentation being fixed,
245-
we would be stuck in a chicken-and-egg problem, because the documentation's CI
246-
would not pass since updating it would need the not-yet-merged version of
247-
rustc to test against (and we usually require CI to be passing).
248-
249-
To avoid the problem, submodules are allowed to fail, and their status is
250-
recorded in [rust-toolstate]. When a submodule breaks, a bot automatically
251-
pings the maintainers so they know about the breakage, and it records the
252-
failure on the toolstate repository. The release process will then ignore
253-
broken tools on nightly, removing them from the shipped nightlies.
254-
255-
While tool failures are allowed most of the time, they’re automatically
256-
forbidden a week before a release: we don’t care if tools are broken on nightly
257-
but they must work on beta and stable, so they also need to work on nightly a
258-
few days before we promote nightly to beta.
259-
260-
More information is available in the [toolstate documentation].
261-
262-
### GitHub Actions Templating
263-
264-
GitHub Actions does not natively support templating, which can cause configurations to be large and difficult to change. We use YAML anchors for templating and a custom tool, [`expand-yaml-anchors`], to expand [the template] into the CI configuration that [GitHub uses][ci config].
265-
266-
This templating language is fairly straightforward:
267-
268-
- `&` indicates a template section
269-
- `*` expands the indicated template in place
270-
- `<<` merges yaml dictionaries
271-
272-
[rust-toolstate]: https://rust-lang-nursery.github.io/rust-toolstate
273-
[toolstate documentation]: ../toolstate.md
274-
[`expand-yaml-anchors`]: https://github.com/rust-lang/rust/tree/master/src/tools/expand-yaml-anchors
275-
[the template]: https://github.com/rust-lang/rust/blob/736c675d2ab65bcde6554e1b73340c2dbc27c85a/src/ci/github-actions/ci.yml
276-
[ci config]: https://github.com/rust-lang/rust/blob/master/.github/workflows/ci.yml
8+
[rustc-dev-guide]: https://rustc-dev-guide.rust-lang.org/tests/ci.html
