---
layout: post
title: "CI CI CI"
---

The [previous post]({% post_url 2024-01-01-Angel-Project %}) predicted more
work on the CI side, and... that was correct! A new environment causing
unexpected kernel crashes, new CI instances finding new bugs, third-party tools
causing issues, etc. Read on to find out why January seems to be the month
dedicated to CI activities!

<!--more-->

## CI

In addition to the work around the MPTCP CI, some modifications have also been
made for other CIs validating the MPTCP selftests.

### MPTCP CI

As mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}),
the migration to GitHub Actions continued in January: adding email
notifications, displaying test results and extra metrics, enabling the
workflows on new patches, etc. In short, re-adding many bits we had before with
Cirrus CI, needed to keep the workflow close to what it was before.

Unfortunately, the migration came with some surprises! The first one is a small
increase in the number of "flaky" tests. This can partially be explained by the
fact that KVM acceleration cannot be used on GitHub Actions free runners.
Packetdrill tests seem to suffer more than the others: unlike the other tests,
they are executed in parallel, and it looks like they take all or most of the
available CPU resources. Reducing the limit to a maximum of 2 tests in parallel
greatly improved the results, but there are still some rare hiccups.
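
The idea behind that limit is a simple throttle. Here is a minimal C sketch of
it -- a hypothetical runner with made-up script names, not the actual
packetdrill integration -- which never keeps more than two test processes
running at the same time:

```c
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PARALLEL 2	/* the limit that stabilised our runs */

int main(void)
{
	/* Hypothetical test commands: the real runner iterates over the
	 * packetdrill scripts instead. */
	const char * const tests[] = { "./test_a.sh", "./test_b.sh",
				       "./test_c.sh", "./test_d.sh" };
	int ntests = sizeof(tests) / sizeof(tests[0]);
	int running = 0;

	for (int i = 0; i < ntests; i++) {
		/* once the limit is reached, wait for one child to finish
		 * before launching the next test */
		if (running == MAX_PARALLEL) {
			wait(NULL);
			running--;
		}

		pid_t pid = fork();

		if (pid == 0) {
			execl("/bin/sh", "sh", "-c", tests[i], (char *)NULL);
			_exit(127); /* exec failed */
		}
		if (pid > 0)
			running++;
	}

	/* reap the remaining children */
	while (running-- > 0)
		wait(NULL);

	return 0;
}
```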

The second surprise initially looked bad: some new kernel panics have been
[detected](https://github.com/multipath-tcp/mptcp_net-next/issues/471). They
were not caused by MPTCP code -- they happened while doing some "ping" between
net namespaces -- but they were reported by the CI ~25-30% of the time. The bug
has been [reported](https://lore.kernel.org/netdev/[email protected]/T/)
to the Netdev community, and quickly identified as an issue somewhere in x86
code. The issue was not easy to reproduce, but with determination, and hours
of tests, a [patch](https://lore.kernel.org/all/[email protected]/T/)
causing it has been identified. It turns out the patch accidentally removed
some code that was preventing a bug in QEMU's TCG code from being triggered.
Before reporting this to the QEMU community, I tried to reproduce the issue
with the latest stable version of QEMU -- requiring an update of other
dependencies, including a fix for virtme, which is no longer maintained -- and
I didn't manage to. Still, the QEMU version from the last Ubuntu stable release
doesn't include this fix. This has been reported to the
[Ubuntu bug tracker](https://bugs.launchpad.net/bugs/2051965).
After another long bisecting session, I managed to identify the commits fixing
the bug. Sadly, these commits have not been backported on QEMU's side, and
their commit messages were lacking explanations about how the bug was initially
found, and when it was introduced. That doesn't help to get fixes backported,
nor to quickly identify which patch fixed someone else's issue. In the end,
quite some time has been spent on a bug impacting us, but not caused by MPTCP.
On the other hand, it helped to improve kernel panic detection, quickly
stopping the VM in case of such critical issues, and the fix will likely become
available to all Ubuntu users.
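
To give an idea of what this detection can look like, here is a minimal C
sketch -- my illustration, not the CI's actual code -- that scans a VM's
console output line by line and stops at the first panic signature:

```c
#include <stdio.h>
#include <string.h>

/* A few signatures of critical kernel errors; this list is an
 * assumption for the example, not a copy of the CI's one. */
static const char * const panic_markers[] = {
	"Kernel panic",
	"BUG:",
	"Oops:",
};

/* Return 1 when the VM should be stopped immediately, instead of
 * letting the tests continue (or hang) after a crash. */
static int is_critical(const char *line)
{
	for (size_t i = 0; i < sizeof(panic_markers) / sizeof(panic_markers[0]); i++)
		if (strstr(line, panic_markers[i]))
			return 1;
	return 0;
}

int main(void)
{
	char line[4096];

	/* e.g. piped from the QEMU serial console */
	while (fgets(line, sizeof(line), stdin)) {
		fputs(line, stdout);
		if (is_critical(line)) {
			fputs("critical error detected: stopping the VM\n", stderr);
			return 1;
		}
	}
	return 0;
}
```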

Not related to the migration, but some other subtests started to fail a few
days in a row. It turns out it was due to the latest version of IPRoute2, more
precisely to a [regression](https://lore.kernel.org/netdev/[email protected]/)
in the `ss` tool we use for some tests. A
[fix](https://lore.kernel.org/netdev/[email protected]/)
has been sent and applied, but not backported. Even if it breaks our selftests,
and might break scripts using `ss` to monitor connections or to report info,
the [issue is not big enough](https://lore.kernel.org/netdev/[email protected]/)
to justify a new bug-fix version. In other words, if you want to validate MPTCP
selftests, don't use IPRoute2 6.7.0. To reduce the number of users running
tests with this buggy version, note that the Debian package, used by a few CIs,
has been [patched](https://salsa.debian.org/kernel-team/iproute2/-/commit/14cfd95e)
with the fix.
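
To avoid being bitten silently, a test environment could check the tool's
version first. A small sketch in C, assuming `ss -V` prints a string containing
the iproute2 version number (the exact output format is an assumption here):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[256] = "";

	/* `ss -V` prints the iproute2 version, e.g. something like
	 * "ss utility, iproute2-6.7.0"; the format is assumed here. */
	FILE *fp = popen("ss -V", "r");

	if (!fp)
		return 1;
	if (!fgets(buf, sizeof(buf), fp))
		buf[0] = '\0';
	pclose(fp);

	/* a naive substring match is enough for this sketch */
	if (strstr(buf, "6.7.0")) {
		fprintf(stderr, "warning: iproute2 6.7.0 detected: 'ss' has "
			"a known regression breaking MPTCP selftests\n");
		return 1;
	}
	return 0;
}
```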

### LKFT CI

Linaro's Linux Kernel Functional Test framework, or LKFT, tests an impressive
number of kernel versions, from old stable ones to linux-next. They have been
validating MPTCP selftests for years, and that's great -- thanks for doing
that! It is especially useful to find issues due to a bad backport in stable
versions, or to prevent issues by testing linux-next. One important
particularity when validating stable kernel versions is that they follow the
stable team's [recommendations](https://lore.kernel.org/stable/[email protected]/):
running the kselftests suite from the latest stable version (e.g. 6.7.4) when
validating an (old) stable kernel (e.g. 5.15.148). That is an important
assumption to keep in mind when writing kselftests, and I don't think it is
well known. Supporting all kernels is not easy, especially in the networking
area, where some features/behaviours are not directly exposed to userspace.
Some MPTCP kselftests have to look at `/proc/kallsyms` or use other (ugly?)
[workarounds](https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net)
to predict what behaviour is expected, depending on the kernel being used. But
it is important to do something to support these old kernels: otherwise, big
kselftests with many different subtests end up always marked as "failed" when
validating new stable releases, removing all their value.
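
As an illustration, such a probe can be as simple as looking up a symbol in
`/proc/kallsyms`. A C sketch, where the symbol name is hypothetical, not one
actually used by the real selftests:

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the running kernel exposes the given symbol in
 * /proc/kallsyms, 0 otherwise: a cheap way to guess whether a feature
 * exists when it is not directly visible from userspace. */
static int kernel_has_symbol(const char *sym)
{
	char line[512];
	int found = 0;
	FILE *fp = fopen("/proc/kallsyms", "r");

	if (!fp)
		return 0;

	while (fgets(line, sizeof(line), fp)) {
		/* each line looks like: "<addr> <type> <name> [module]" */
		char name[256];

		if (sscanf(line, "%*s %*s %255s", name) == 1 &&
		    !strcmp(name, sym)) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found;
}

int main(void)
{
	/* hypothetical symbol name, for illustration only */
	if (!kernel_has_symbol("mptcp_pm_add_addr_echo"))
		printf("feature not available, skipping the related subtests\n");
	return 0;
}
```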

Test results from LKFT are regularly analysed, to check that no tests have been
skipped by accident, and that no real issues are being ignored there. Recently,
some modifications have been done to support more environments: adding missing
kernel configs, increasing timeouts, and other techniques to support very slow
environments.

### Netdev CI

Great news for the Networking subsystem: a new
[CI environment](https://netdev.bots.linux.dev/status.html) has been put in
place by Jakub Kicinski, supported by Meta. It currently runs most tests
available in the kernel source code, including ours for MPTCP: kselftests,
KUnit, and more. The initial goal is to validate patches sent to the mailing
list, and to publish reports on [Patchwork](https://patchwork.kernel.org/project/netdevbpf/list/).
That's an important addition, something we have done with MPTCP for a long
time, and part of common practice: each new modification has to be validated by
a CI. Such a setup has been lacking for years, certainly for the same reasons
as mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}),
valid for most kernel subsystems and Open-Source projects in general: no
infrastructure in place that kernel devs can easily use to monitor new patches
sent by email, automatically apply them, run tests, and send a report; and
nobody to pay for developing and maintaining that, for the VMs or the specific
hardware, etc. This doesn't mean no tests were being executed before, just that
it was not done in public: it is easier for a company to run these tests
internally, especially when some specific hardware is needed, or the test suite
is not public. But public CI is not impossible: the DRM subsystem has been
using GitLab for years, with self-hosted runners, for everybody's benefit. It
needs coordination, though, and some brave people to push their managers to
start such projects that will not directly benefit their company. Anyway, it is
great to have Meta's support here.

I'm glad I was able to share some knowledge to help get this in place. The
environment is a bit different, mainly because no "public CIs" are involved.
When I look at the different issues I had in the past, and the workarounds I
put in place to support our different tests on such "public CIs", I now think I
could probably have done a lot more in the same time if I had been able to
control everything. On the other hand, it is great that anybody can maintain
it, and that it is not tied to a single company -- which is currently not the
case with the Netdev CI, even if a lot of components are in a
[public repository](https://github.com/linux-netdev/nipa/).

Having the CI executed in a new environment also means new issues specific to
this environment! That was the case here, especially because the tests are
running without KVM acceleration. Fortunately, some of the fixes needed for the
LKFT CI can be re-used here.

Talking about fixes, Paolo managed to find what was causing some subtests to be
[unstable](https://github.com/multipath-tcp/mptcp_net-next/issues/468) in very
slow environments. It took a bit of time to identify the root cause, but it
appears to be due to the use of a TCP-specific helper on an MPTCP socket. As
mentioned in the [commit message](https://git.kernel.org/netdev/net/c/b6c620dc43cc),
funnily enough, fuzzers and static checkers are happy, as the accessed memory
still belongs to the `mptcp_sock` structure, but to a completely unrelated
field. Even from a functional perspective, the check being done was always
skipped, until an unrelated [refactoring](https://git.kernel.org/netdev/net/c/d5fed5addb2b)
exposed the issue. In short, always be careful when casting a structure into
another one! BTW, that's not the first time such an issue has been introduced,
because it is easy to mix up MPTCP and TCP socket structures. I'm working on
adding new small checks in debug mode to avoid such issues in the future.
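
The hazard is easy to demonstrate outside the kernel. Here is a self-contained
C sketch with simplified structures, not the real `tcp_sock`/`mptcp_sock`
layouts:

```c
#include <stdio.h>

/* Simplified stand-ins, not the real kernel structures. */
struct tcp_sock_like {
	int common;	/* field shared by both structures */
	int snd_una;	/* TCP-specific field */
};

struct mptcp_sock_like {
	int common;	/* same offset as in tcp_sock_like */
	int token;	/* MPTCP-specific field, same offset as snd_una */
};

/* A "TCP-specific helper": fine on a TCP socket, silently wrong on
 * anything else casted to it. */
static int tcp_helper(struct tcp_sock_like *tp)
{
	return tp->snd_una;
}

int main(void)
{
	struct mptcp_sock_like msk = { .common = 0, .token = 42 };

	/* The cast compiles, and the access stays inside the structure,
	 * so fuzzers and static checkers are happy... but the helper
	 * reads 'token' while believing it is 'snd_una'. */
	int val = tcp_helper((struct tcp_sock_like *)&msk);

	printf("helper saw snd_una=%d (actually the MPTCP token)\n", val);
	return 0;
}
```

In the kernel, the small debug checks mentioned above could take the form of,
for instance, a `DEBUG_NET_WARN_ON_ONCE(!sk_is_tcp(sk))` at the top of
TCP-specific helpers -- that exact form is only an illustration here.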

One last thing: to avoid having entire kselftests, including hundreds of
subtests, ignored because of a single unstable subtest, I'm adding support for
tracking individual subtests. We would feel better skipping just one or two
subtests while we are trying to improve the situation, rather than having
hundreds of them ignored for days, adding unneeded pressure.
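
kselftests report their results in the TAP format, where an individual subtest
can carry a `# SKIP` directive. A small C sketch showing the kind of output
this enables (the subtest names are made up):

```c
#include <stdio.h>

int main(void)
{
	/* Made-up subtest names: the point is only the TAP output, where
	 * a single unstable subtest is skipped while the others keep
	 * their value. */
	printf("TAP version 13\n");
	printf("1..3\n");
	printf("ok 1 - connect ipv4\n");
	printf("ok 2 - connect ipv6 # SKIP unstable in slow environments\n");
	printf("ok 3 - connect with fallback\n");
	return 0;
}
```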

## Misc. Maintenance Work

Mainly to show that maintenance work is always full of unexpected tasks that
need to be done "quickly", here is a bunch of tasks from this month, on top of
the classic reviews, meetings, questions, bugs, etc.:

- BPF: we have some code in progress in our tree that is not fully ready yet.
  This means we continuously have to adapt it when changes are done upstream.
  It was the case recently, to adapt to
  [different](https://git.kernel.org/netdev/net/c/f6be98d19985) `bpf_struct_ops`
  [changes](https://git.kernel.org/netdev/net/c/4cbb270e115b).

- `mptcpd` Debian packages: two new versions --
  [0.12-3](https://tracker.debian.org/news/1494263/accepted-mptcpd-012-3-source-into-unstable/) and
  [0.12-4](https://tracker.debian.org/news/1499261/accepted-mptcpd-012-4-source-into-unstable/) --
  have been released to fix some issues due to changes in the test
  infrastructure, and in how the systemd dependency is handled.

- `packetdrill`: Packetdrill is a very handy tool to validate networking
  features. We maintain an out-of-tree version supporting MPTCP. This version
  is not "upstreamable" as it is, because the old experimental code needs a
  serious refactoring, but it still works quite well. When trying to find a fix
  for the QEMU bug mentioned above, the build environment for the tests has
  been updated, which broke our branches dedicated to the different stable
  kernel versions -- each branch only contains the tests valid for its version,
  since MPTCP support has been added progressively. To fix compilation issues
  with newer compiler versions, but also to ease maintenance later, each stable
  branch has been synced with the modifications from the upstream repository
  maintained by Google.

## What's next?

Even though I would prefer to add new features, there is still some work needed
on the CI side:
- using [self-hosted runners](https://github.com/multipath-tcp/mptcp_net-next/issues/474)
to improve the stability of some tests
- switching to [virtme-ng](https://github.com/multipath-tcp/mptcp_net-next/issues/472)
because the `virtme` tool we use is no longer maintained
- validating [MPTCP BPF tests](https://github.com/multipath-tcp/mptcp_net-next/issues/406)
to track regressions in this area
- tracking [flaky tests](https://github.com/multipath-tcp/mptcp_net-next/issues/473)
to be able to easily identify them

But once done, that will help the whole MPTCP community, and I will be able to
focus on other things!
