-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
For January. A bit long, but there was a lot of maintenance this week :) Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
- Loading branch information
Showing
1 changed file
with
204 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,204 @@ | ||
--- | ||
layout: post | ||
title: "CI CI CI" | ||
--- | ||
|
||
The [previous post]({% post_url 2024-01-01-Angel-Project %}) was predicting more | ||
work on the CI side, and... That was correct! A new environment causing | ||
unexpected kernel crashes, new CI instances finding new bugs, third party tools | ||
causing issues, etc. Read on to find out why January seems to be the month | ||
dedicated to CI activities! | ||
|
||
<!--more--> | ||
|
||
## CI | ||
|
||
In addition to the work around MPTCP CI, some modifications have also been done | ||
for other CIs validating MPTCP selftests. | ||
|
||
### MPTCP CI | ||
|
||
As mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}), | ||
the migration to Github Actions continued in January: adding email | ||
notifications, display test results and extra metrics, enabling them on new | ||
patches, etc. In short, re-adding many bits we had before with Cirrus-CI, and | ||
needed to keep the workflow closed to what it was before. | ||
|
||
Unfortunately, the migration came with some unexpected surprises! The first one | ||
is a small increase in the number of "flaky" tests. This can partially be | ||
explained by the fact KVM acceleration cannot be used on GitHub Actions when | ||
using free runners. Packetdrill tests seem to suffer more than the others. | ||
Compared to the other tests, they are executed in parallel. It looks like they | ||
take all/most of the available CPU resources. Reducing the limit to have maximum | ||
2 tests in parallel greatly improved the results, but there are still some rare | ||
hiccups. | ||
|
||
The second surprise initially looked bad: some new kernel panics have been | ||
[detected](https://github.com/multipath-tcp/mptcp_net-next/issues/471). They | ||
were not due to MPTCP code -- it was happening while doing some "ping" between | ||
net namespaces -- but it was reported by the CI ~25-30% of the time. The bug has | ||
been [reported](https://lore.kernel.org/netdev/[email protected]/T/) | ||
to the Netdev community, and quickly identified as an issue somewhere in x86 | ||
code. It was not easy to reproduce the issue, but with determination, and hours | ||
of tests, a [patch](https://lore.kernel.org/all/[email protected]/T/) | ||
causing the issue has been identified. It turns out, the patch accidentally | ||
removed some code that was preventing a bug in QEmu's TCG code. Before reporting | ||
this to the QEmu's community, I tried to reproduce the issue with the last | ||
stable version of QEmu -- requiring an update of other dependences, including | ||
a fix for virtme that is no longer maintained -- and I didn't manage to. Still, | ||
the QEmu's version from the last Ubuntu stable release doesn't include this fix. | ||
This has been reported to the [Ubuntu bug tracker](https://bugs.launchpad.net/bugs/2051965). | ||
After another long bisecting session, I just managed to identify the commits | ||
fixing the bug. Sadly, these commits have not been backported on QEmu's side, | ||
and the commit message was lacking explanations about how the bug was initially | ||
found, and when the bug was introduced. This doesn't help to get fixes | ||
backported, and to quickly identify which patch fixed someone else's issue. At | ||
the end, quite some time has been spent on a bug impacting us, but not due to | ||
MPTCP. On the other hand, it helped to improve kernel panic detection, quickly | ||
stopping the VM in case of such critical issues, and the fix will likely be | ||
available to all Ubuntu users. | ||
|
||
Not related to the migration, but some other subtests have started to fail for a | ||
few days in a row. It turns out it was due to the last version of IPRoute2, more | ||
precisely due to a [regression](https://lore.kernel.org/netdev/[email protected]/) | ||
in the `ss` tool we use for some tests. A | ||
[fix](https://lore.kernel.org/netdev/[email protected]/) | ||
has been sent, applied, but not backported. Even if it breaks our selftests, and | ||
it might break scripts using `ss` to monitor connections or to report info, the | ||
[issue is not big enough](https://lore.kernel.org/netdev/[email protected]/) | ||
to justify a new bug-fix version. In other words, if you want to validate MPTCP | ||
selftests, don't use IPRoute 6.7.0. To reduce the number of users running tests | ||
with this buggy version, note that the version on Debian, used by a few CIs, has | ||
been [patched](https://salsa.debian.org/kernel-team/iproute2/-/commit/14cfd95e) | ||
with the fix. | ||
|
||
### LKFT CI | ||
|
||
Linaro's Linux Kernel Functional Test framework, or LKFT, test a very impressive | ||
number of kernel versions. From old stable ones, to linux-next. They have been | ||
validating MPTCP selftests for years, and that's great, thanks guys for doing | ||
that! It is especially useful to find issues due to a bad backport in stable | ||
versions, or to prevent issues by testing linux-next. One important | ||
particularity when validating stable kernel versions, is that they follow stable | ||
team [recommendations](https://lore.kernel.org/stable/[email protected]/): | ||
running the kselftests suite from the last stable version (e.g. 6.7.4) when | ||
validating (old) stable kernel (e.g. 5.15.148). It is an important assumption to | ||
know when writing kselftests, and I don't think this is well known. I don't | ||
think that's easy to support all kernels, especially in the networking area, | ||
where some features/behaviours are not directly exposed to the userspace. Some | ||
MPTCP kselftests have to look at `/proc/kallsyms` or use other (ugly?) | ||
[workarounds](https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net) | ||
to predict what we are supposed to have, depending on the kernel that is being | ||
used. But it is important to do something to support these old kernels, not to | ||
have big kselftests, with many different subtests, always marked as "failed" | ||
when validating new stable releases, removing all their value. | ||
|
||
Regularly, test results from LKFT are analysed, to check if some tests have not | ||
been skipped by accident, and if there are no real issues being ignored there. | ||
Recently, some modifications have been done to support more environments, add | ||
missing kernel configs, increase timeout, and other techniques trying to support | ||
very slow environments, and more. | ||
|
||
### Netdev CI | ||
|
||
Great news for the Networking subsystem: a new | ||
[CI environment](https://netdev.bots.linux.dev/status.html) has been put in | ||
place by Jakub Kicinski, supported by Meta. It currently runs most tests | ||
available in the kernel source code, including ours for MPTCP: kselftests, | ||
KUnit, and more. The initial goal is to validate patches that are sent on the | ||
mailing list, and send reports on [Patchwork](https://patchwork.kernel.org/project/netdevbpf/list/). | ||
That's an important addition, something we have done with MPTCP for a long time, | ||
and is part of the common practice: each new modification has to be validated by | ||
a CI. Certainly due to the same reasons as mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}) | ||
and valid for most kernel subsystems, and Open-Source project in general -- no | ||
infrastructure in place that can be easily used by kernel dev to monitor new | ||
patches sent by email, to automatically apply, then run tests, and send a | ||
report ; who would pay to develop and maintain that, for the VM or the specific | ||
hardware, etc. --, this has been lacking for years. This doesn't mean no tests | ||
were being executed, it's just that this was not done in public: it is easier | ||
for a company to run these tests internally, especially when some specific | ||
hardware is needed, or the test suite is not public. But that's not impossible: | ||
DRM subsystem has been using GitLab for years, with self-hosted runners, for | ||
everybody's benefit. But it needs coordination, and some brave people to push | ||
their manager to start such projects that are not going to be directly | ||
beneficial for this company. Anyway, here, that's great to have Meta's support. | ||
|
||
I'm glad I was able to share some knowledge to help a bit having this in place. | ||
The environment is a bit different, mainly because no "public CIs" are involved. | ||
When I look at the different issues I had in the past, and the workarounds I put | ||
in place to have our different tests supported on such "public CIs", I'm now | ||
thinking I could have probably done a lot more with the same time if I was able | ||
to control everything. But on the other hand, that's great anybody can maintain | ||
it, and it is not tied to a single company -- which is currently not the case | ||
with the Netdev CI, even if a lot of components are in a | ||
[public repository](https://github.com/linux-netdev/nipa/). | ||
|
||
Having CI executed in a new environment, also means new issues specific to this | ||
new environment! This was the case, especially because the tests are running | ||
without KVM acceleration. It is good that some of the fixes needed for LKFT CI | ||
can be re-used here. | ||
|
||
Talking about fixes, Paolo managed to find what was causing some subtests to be | ||
[unstable](https://github.com/multipath-tcp/mptcp_net-next/issues/468) in very | ||
slow environments. It took a bit of time to identify the root cause, but it | ||
appears to be due to the use of a TCP-specific helper on an MPTCP socket. As | ||
mentioned in the [commit message](https://git.kernel.org/netdev/net/c/b6c620dc43cc), | ||
funnily enough, fuzzers and static checkers are happy, as the accessed memory | ||
still belongs to the `mptcp_sock struct`, but completely unrelated. Even from a | ||
functional perspective, the check being done was always skipped before. Until an | ||
unrelated [refactoring](https://git.kernel.org/netdev/net/c/d5fed5addb2b) which | ||
exposed the issue. In short, always be careful when casting a structure into | ||
another one! BTW, that's not the first time such an issue has been introduced | ||
because that's easy to mix-up MPTCP and TCP socket structures. I'm working on | ||
adding new small checks in debug mode to avoid such issues in the future. | ||
|
||
One last thing, not to have entire kselftests, including hundreds of subtests, | ||
being ignored due to a single unstable subtest, I'm adding subtests support. We | ||
would feel better skipping just one or two subtests while we are trying to | ||
improve the situation, than having hundreds of them being ignored for days, | ||
adding unneeded pressure. | ||
|
||
## Misc. Maintenance Work | ||
|
||
Mainly to expose that the maintenance work is always full of unexpected stuff | ||
that needs to be done "quickly", here are a bunch of tasks that needed to be | ||
done this month, on top of the classic reviews, meetings, questions, bugs, etc.: | ||
|
||
- BPF: we have some code in progress in our tree that is not fully ready yet. | ||
But then it means that we continuously have to adapt our code when some | ||
changes are done upstream. It was the case recently, to adapt to | ||
[different](https://git.kernel.org/netdev/net/c/f6be98d19985) `bpf_struct_ops` | ||
[changes](https://git.kernel.org/netdev/net/c/4cbb270e115b). | ||
|
||
- `mptcpd` debian packages: two new versions -- | ||
[0.12-3](https://tracker.debian.org/news/1494263/accepted-mptcpd-012-3-source-into-unstable/) and | ||
[0.12-4](https://tracker.debian.org/news/1499261/accepted-mptcpd-012-4-source-into-unstable/) -- | ||
have been released to fix some issues due to some changes in the test | ||
infrastructure, and how SystemD dependence is handled. | ||
|
||
- `packetdrill`: Packetdrill is a very handy tool to validate networking | ||
features. We maintain an out-of-tree version supporting MPTCP. This version is | ||
not "upstreamable" as it is, because the old experimental code needs a serious | ||
refactoring. But still, it works quite well. Anyway, when trying to find a fix | ||
for the bug we had with QEmu (see above), the build environment for the tests | ||
has been updated, breaking our branches dedicated to different stable versions | ||
containing tests valid for the specific versions: MPTCP support has been added | ||
progressively. To fix compilation issues when using newer compiler versions, | ||
but also to help with the maintenance later, each stable branch has been | ||
synced with modifications from the upstream repository maintained by Google. | ||
|
||
## What's next? | ||
|
||
Even if I would prefer to add new features, there is still some work needed for | ||
the CI: | ||
- using [self-hosted runners](https://github.com/multipath-tcp/mptcp_net-next/issues/474) | ||
to improve the stability of some tests | ||
- switching to [virtme-ng](https://github.com/multipath-tcp/mptcp_net-next/issues/472) | ||
because the `virtme` tool we use is no longer maintained | ||
- validating [MPTCP BPF tests](https://github.com/multipath-tcp/mptcp_net-next/issues/406) | ||
to track regressions in this area | ||
- tracking [flaky tests](https://github.com/multipath-tcp/mptcp_net-next/issues/473) | ||
to be able to easily identify them | ||
|
||
But once done, that will help the whole MPTCP community, and I will be able to | ||
focus on other things! |