---
layout: post
title: "CI CI CI"
---

The [previous post]({% post_url 2024-01-01-Angel-Project %}) predicted more
work on the CI side, and... that was correct! A new environment causing
unexpected kernel crashes, new CI instances finding new bugs, third-party tools
causing issues, etc. Read on to find out why January seems to be the month
dedicated to CI activities!

<!--more-->

## CI

In addition to the work around the MPTCP CI, some modifications have also been
made for other CIs validating the MPTCP selftests.

### MPTCP CI

As mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}),
the migration to GitHub Actions continued in January: adding email
notifications, displaying test results and extra metrics, enabling the
workflows on new patches, etc. In short, re-adding many bits we had before with
Cirrus CI, needed to keep the workflow close to what it was before.

Unfortunately, the migration came with some surprises! The first one is a small
increase in the number of "flaky" tests. This can partially be explained by the
fact that KVM acceleration cannot be used on GitHub Actions free runners.
Packetdrill tests seem to suffer more than the others: unlike the other tests,
they are executed in parallel, and it looks like they take all or most of the
available CPU resources. Reducing the limit to a maximum of 2 tests in parallel
greatly improved the results, but there are still some rare hiccups.
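
The idea behind that limit is a simple throttle. Here is a minimal C sketch of
it -- a hypothetical runner with made-up script names, not the actual
packetdrill integration -- which never keeps more than two test processes
running at the same time:

```c
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PARALLEL 2	/* the limit that stabilised our runs */

int main(void)
{
	/* Hypothetical test commands: the real runner iterates over the
	 * packetdrill scripts instead. */
	const char * const tests[] = { "./test_a.sh", "./test_b.sh",
				       "./test_c.sh", "./test_d.sh" };
	int ntests = sizeof(tests) / sizeof(tests[0]);
	int running = 0;

	for (int i = 0; i < ntests; i++) {
		/* once the limit is reached, wait for one child to finish
		 * before launching the next test */
		if (running == MAX_PARALLEL) {
			wait(NULL);
			running--;
		}

		pid_t pid = fork();

		if (pid == 0) {
			execl("/bin/sh", "sh", "-c", tests[i], (char *)NULL);
			_exit(127); /* exec failed */
		}
		if (pid > 0)
			running++;
	}

	/* reap the remaining children */
	while (running-- > 0)
		wait(NULL);

	return 0;
}
```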

The second surprise initially looked bad: some new kernel panics have been
[detected](https://github.com/multipath-tcp/mptcp_net-next/issues/471). They
were not caused by MPTCP code -- they happened while doing some "ping" between
net namespaces -- but they were reported by the CI ~25-30% of the time. The bug
has been [reported](https://lore.kernel.org/netdev/[email protected]/T/)
to the Netdev community, and quickly identified as an issue somewhere in x86
code. The issue was not easy to reproduce, but with determination, and hours
of tests, a [patch](https://lore.kernel.org/all/[email protected]/T/)
causing it has been identified. It turns out the patch accidentally removed
some code that was preventing a bug in QEMU's TCG code from being triggered.
Before reporting this to the QEMU community, I tried to reproduce the issue
with the latest stable version of QEMU -- requiring an update of other
dependencies, including a fix for virtme, which is no longer maintained -- and
I didn't manage to. Still, the QEMU version from the last Ubuntu stable release
doesn't include this fix. This has been reported to the
[Ubuntu bug tracker](https://bugs.launchpad.net/bugs/2051965).
After another long bisecting session, I managed to identify the commits fixing
the bug. Sadly, these commits have not been backported on QEMU's side, and
their commit messages were lacking explanations about how the bug was initially
found, and when it was introduced. That doesn't help to get fixes backported,
nor to quickly identify which patch fixed someone else's issue. In the end,
quite some time has been spent on a bug impacting us, but not caused by MPTCP.
On the other hand, it helped to improve kernel panic detection, quickly
stopping the VM in case of such critical issues, and the fix will likely become
available to all Ubuntu users.
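
To give an idea of what this detection can look like, here is a minimal C
sketch -- my illustration, not the CI's actual code -- that scans a VM's
console output line by line and stops at the first panic signature:

```c
#include <stdio.h>
#include <string.h>

/* A few signatures of critical kernel errors; this list is an
 * assumption for the example, not a copy of the CI's one. */
static const char * const panic_markers[] = {
	"Kernel panic",
	"BUG:",
	"Oops:",
};

/* Return 1 when the VM should be stopped immediately, instead of
 * letting the tests continue (or hang) after a crash. */
static int is_critical(const char *line)
{
	for (size_t i = 0; i < sizeof(panic_markers) / sizeof(panic_markers[0]); i++)
		if (strstr(line, panic_markers[i]))
			return 1;
	return 0;
}

int main(void)
{
	char line[4096];

	/* e.g. piped from the QEMU serial console */
	while (fgets(line, sizeof(line), stdin)) {
		fputs(line, stdout);
		if (is_critical(line)) {
			fputs("critical error detected: stopping the VM\n", stderr);
			return 1;
		}
	}
	return 0;
}
```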

Not related to the migration, but some other subtests started to fail a few
days in a row. It turns out it was due to the latest version of IPRoute2, more
precisely to a [regression](https://lore.kernel.org/netdev/[email protected]/)
in the `ss` tool we use for some tests. A
[fix](https://lore.kernel.org/netdev/[email protected]/)
has been sent and applied, but not backported. Even if it breaks our selftests,
and might break scripts using `ss` to monitor connections or to report info,
the [issue is not big enough](https://lore.kernel.org/netdev/[email protected]/)
to justify a new bug-fix version. In other words, if you want to validate MPTCP
selftests, don't use IPRoute2 6.7.0. To reduce the number of users running
tests with this buggy version, note that the Debian package, used by a few CIs,
has been [patched](https://salsa.debian.org/kernel-team/iproute2/-/commit/14cfd95e)
with the fix.
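
To avoid being bitten silently, a test environment could check the tool's
version first. A small sketch in C, assuming `ss -V` prints a string containing
the iproute2 version number (the exact output format is an assumption here):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[256] = "";

	/* `ss -V` prints the iproute2 version, e.g. something like
	 * "ss utility, iproute2-6.7.0"; the format is assumed here. */
	FILE *fp = popen("ss -V", "r");

	if (!fp)
		return 1;
	if (!fgets(buf, sizeof(buf), fp))
		buf[0] = '\0';
	pclose(fp);

	/* a naive substring match is enough for this sketch */
	if (strstr(buf, "6.7.0")) {
		fprintf(stderr, "warning: iproute2 6.7.0 detected: 'ss' has "
			"a known regression breaking MPTCP selftests\n");
		return 1;
	}
	return 0;
}
```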

### LKFT CI

Linaro's Linux Kernel Functional Test framework, or LKFT, tests an impressive
number of kernel versions, from old stable ones to linux-next. They have been
validating MPTCP selftests for years, and that's great -- thanks for doing
that! It is especially useful to find issues due to a bad backport in stable
versions, or to prevent issues by testing linux-next. One important
particularity when validating stable kernel versions is that they follow the
stable team's [recommendations](https://lore.kernel.org/stable/[email protected]/):
running the kselftests suite from the latest stable version (e.g. 6.7.4) when
validating an (old) stable kernel (e.g. 5.15.148). That is an important
assumption to keep in mind when writing kselftests, and I don't think it is
well known. Supporting all kernels is not easy, especially in the networking
area, where some features/behaviours are not directly exposed to userspace.
Some MPTCP kselftests have to look at `/proc/kallsyms` or use other (ugly?)
[workarounds](https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net)
to predict what behaviour is expected, depending on the kernel being used. But
it is important to do something to support these old kernels: otherwise, big
kselftests with many different subtests end up always marked as "failed" when
validating new stable releases, removing all their value.
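
As an illustration, such a probe can be as simple as looking up a symbol in
`/proc/kallsyms`. A C sketch, where the symbol name is hypothetical, not one
actually used by the real selftests:

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the running kernel exposes the given symbol in
 * /proc/kallsyms, 0 otherwise: a cheap way to guess whether a feature
 * exists when it is not directly visible from userspace. */
static int kernel_has_symbol(const char *sym)
{
	char line[512];
	int found = 0;
	FILE *fp = fopen("/proc/kallsyms", "r");

	if (!fp)
		return 0;

	while (fgets(line, sizeof(line), fp)) {
		/* each line looks like: "<addr> <type> <name> [module]" */
		char name[256];

		if (sscanf(line, "%*s %*s %255s", name) == 1 &&
		    !strcmp(name, sym)) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found;
}

int main(void)
{
	/* hypothetical symbol name, for illustration only */
	if (!kernel_has_symbol("mptcp_pm_add_addr_echo"))
		printf("feature not available, skipping the related subtests\n");
	return 0;
}
```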

Test results from LKFT are regularly analysed, to check that no tests have been
skipped by accident, and that no real issues are being ignored there. Recently,
some modifications have been done to support more environments: adding missing
kernel configs, increasing timeouts, and other techniques to support very slow
environments.

### Netdev CI

Great news for the Networking subsystem: a new
[CI environment](https://netdev.bots.linux.dev/status.html) has been put in
place by Jakub Kicinski, supported by Meta. It currently runs most tests
available in the kernel source code, including ours for MPTCP: kselftests,
KUnit, and more. The initial goal is to validate patches sent to the mailing
list, and to publish reports on [Patchwork](https://patchwork.kernel.org/project/netdevbpf/list/).
That's an important addition, something we have done with MPTCP for a long
time, and part of common practice: each new modification has to be validated by
a CI. Such a setup has been lacking for years, certainly for the same reasons
as mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}),
valid for most kernel subsystems and Open-Source projects in general: no
infrastructure in place that kernel devs can easily use to monitor new patches
sent by email, automatically apply them, run tests, and send a report; and
nobody to pay for developing and maintaining that, for the VMs or the specific
hardware, etc. This doesn't mean no tests were being executed before, just that
it was not done in public: it is easier for a company to run these tests
internally, especially when some specific hardware is needed, or the test suite
is not public. But public CI is not impossible: the DRM subsystem has been
using GitLab for years, with self-hosted runners, for everybody's benefit. It
needs coordination, though, and some brave people to push their managers to
start such projects that will not directly benefit their company. Anyway, it is
great to have Meta's support here.

I'm glad I was able to share some knowledge to help get this in place. The
environment is a bit different, mainly because no "public CIs" are involved.
When I look at the different issues I had in the past, and the workarounds I
put in place to support our different tests on such "public CIs", I now think I
could probably have done a lot more in the same time if I had been able to
control everything. On the other hand, it is great that anybody can maintain
it, and that it is not tied to a single company -- which is currently not the
case with the Netdev CI, even if a lot of components are in a
[public repository](https://github.com/linux-netdev/nipa/).

Having the CI executed in a new environment also means new issues specific to
this environment! That was the case here, especially because the tests are
running without KVM acceleration. Fortunately, some of the fixes needed for the
LKFT CI can be re-used here.

Talking about fixes, Paolo managed to find what was causing some subtests to be
[unstable](https://github.com/multipath-tcp/mptcp_net-next/issues/468) in very
slow environments. It took a bit of time to identify the root cause, but it
appears to be due to the use of a TCP-specific helper on an MPTCP socket. As
mentioned in the [commit message](https://git.kernel.org/netdev/net/c/b6c620dc43cc),
funnily enough, fuzzers and static checkers are happy, as the accessed memory
still belongs to the `mptcp_sock` structure, but to a completely unrelated
field. Even from a functional perspective, the check being done was always
skipped, until an unrelated [refactoring](https://git.kernel.org/netdev/net/c/d5fed5addb2b)
exposed the issue. In short, always be careful when casting a structure into
another one! BTW, that's not the first time such an issue has been introduced,
because it is easy to mix up MPTCP and TCP socket structures. I'm working on
adding new small checks in debug mode to avoid such issues in the future.
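
The hazard is easy to demonstrate outside the kernel. Here is a self-contained
C sketch with simplified structures, not the real `tcp_sock`/`mptcp_sock`
layouts:

```c
#include <stdio.h>

/* Simplified stand-ins, not the real kernel structures. */
struct tcp_sock_like {
	int common;	/* field shared by both structures */
	int snd_una;	/* TCP-specific field */
};

struct mptcp_sock_like {
	int common;	/* same offset as in tcp_sock_like */
	int token;	/* MPTCP-specific field, same offset as snd_una */
};

/* A "TCP-specific helper": fine on a TCP socket, silently wrong on
 * anything else casted to it. */
static int tcp_helper(struct tcp_sock_like *tp)
{
	return tp->snd_una;
}

int main(void)
{
	struct mptcp_sock_like msk = { .common = 0, .token = 42 };

	/* The cast compiles, and the access stays inside the structure,
	 * so fuzzers and static checkers are happy... but the helper
	 * reads 'token' while believing it is 'snd_una'. */
	int val = tcp_helper((struct tcp_sock_like *)&msk);

	printf("helper saw snd_una=%d (actually the MPTCP token)\n", val);
	return 0;
}
```

In the kernel, the small debug checks mentioned above could take the form of,
for instance, a `DEBUG_NET_WARN_ON_ONCE(!sk_is_tcp(sk))` at the top of
TCP-specific helpers -- that exact form is only an illustration here.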

One last thing: to avoid having entire kselftests, including hundreds of
subtests, ignored because of a single unstable subtest, I'm adding support for
tracking individual subtests. We would feel better skipping just one or two
subtests while we are trying to improve the situation, rather than having
hundreds of them ignored for days, adding unneeded pressure.
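
kselftests report their results in the TAP format, where an individual subtest
can carry a `# SKIP` directive. A small C sketch showing the kind of output
this enables (the subtest names are made up):

```c
#include <stdio.h>

int main(void)
{
	/* Made-up subtest names: the point is only the TAP output, where
	 * a single unstable subtest is skipped while the others keep
	 * their value. */
	printf("TAP version 13\n");
	printf("1..3\n");
	printf("ok 1 - connect ipv4\n");
	printf("ok 2 - connect ipv6 # SKIP unstable in slow environments\n");
	printf("ok 3 - connect with fallback\n");
	return 0;
}
```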

## Misc. Maintenance Work

Mainly to show that maintenance work is always full of unexpected tasks that
need to be done "quickly", here is a bunch of tasks from this month, on top of
the classic reviews, meetings, questions, bugs, etc.:

- BPF: we have some code in progress in our tree that is not fully ready yet.
  This means we continuously have to adapt it when changes are done upstream.
  It was the case recently, to adapt to
  [different](https://git.kernel.org/netdev/net/c/f6be98d19985) `bpf_struct_ops`
  [changes](https://git.kernel.org/netdev/net/c/4cbb270e115b).

- `mptcpd` Debian packages: two new versions --
  [0.12-3](https://tracker.debian.org/news/1494263/accepted-mptcpd-012-3-source-into-unstable/) and
  [0.12-4](https://tracker.debian.org/news/1499261/accepted-mptcpd-012-4-source-into-unstable/) --
  have been released to fix some issues due to changes in the test
  infrastructure, and in how the systemd dependency is handled.

- `packetdrill`: Packetdrill is a very handy tool to validate networking
  features. We maintain an out-of-tree version supporting MPTCP. This version
  is not "upstreamable" as it is, because the old experimental code needs a
  serious refactoring, but it still works quite well. When trying to find a fix
  for the QEMU bug mentioned above, the build environment for the tests has
  been updated, which broke our branches dedicated to the different stable
  kernel versions -- each branch only contains the tests valid for its version,
  since MPTCP support has been added progressively. To fix compilation issues
  with newer compiler versions, but also to ease maintenance later, each stable
  branch has been synced with the modifications from the upstream repository
  maintained by Google.

## What's next?

Even though I would prefer to add new features, there is still some work needed
on the CI side:
- using [self-hosted runners](https://github.com/multipath-tcp/mptcp_net-next/issues/474)
to improve the stability of some tests
- switching to [virtme-ng](https://github.com/multipath-tcp/mptcp_net-next/issues/472)
because the `virtme` tool we use is no longer maintained
- validating [MPTCP BPF tests](https://github.com/multipath-tcp/mptcp_net-next/issues/406)
to track regressions in this area
- tracking [flaky tests](https://github.com/multipath-tcp/mptcp_net-next/issues/473)
to be able to easily identify them

But once done, that will help the whole MPTCP community, and I will be able to
focus on other things!
