From 9a19ab720fc4295f1f2df7260099440a8284e45c Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)"
Date: Mon, 12 Feb 2024 22:11:23 +0100
Subject: [PATCH] post: new CI CI CI post

For January. A bit long, but there was a lot of maintenance this month :)

Signed-off-by: Matthieu Baerts (NGI0)
---
 _posts/2024-02-02-CI-CI-CI.md | 204 ++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 _posts/2024-02-02-CI-CI-CI.md

diff --git a/_posts/2024-02-02-CI-CI-CI.md b/_posts/2024-02-02-CI-CI-CI.md
new file mode 100644
index 0000000..b55cd60
--- /dev/null
+++ b/_posts/2024-02-02-CI-CI-CI.md
@@ -0,0 +1,204 @@
+---
+layout: post
+title: "CI CI CI"
+---
+
+The [previous post]({% post_url 2024-01-01-Angel-Project %}) predicted more
+work on the CI side, and... that was correct! A new environment causing
+unexpected kernel crashes, new CI instances finding new bugs, third-party
+tools causing issues, etc. Read on to find out why January seems to have been
+the month dedicated to CI activities!
+
+## CI
+
+In addition to the work around the MPTCP CI, some modifications have also
+been made for other CIs validating the MPTCP selftests.
+
+### MPTCP CI
+
+As mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}),
+the migration to GitHub Actions continued in January: adding email
+notifications, displaying test results and extra metrics, enabling the
+workflows on new patches, etc. In short, re-adding many of the bits we had
+before with Cirrus-CI, needed to keep the workflow close to what it was
+before.
+
+Unfortunately, the migration came with some unexpected surprises! The first
+one is a small increase in the number of "flaky" tests. This can partially be
+explained by the fact that KVM acceleration cannot be used on GitHub Actions
+with the free runners. The Packetdrill tests seem to suffer more than the
+others: unlike the other tests, they are executed in parallel, and it looks
+like they take all or most of the available CPU resources. Reducing the limit
+to a maximum of 2 tests in parallel greatly improved the results, but there
+are still some rare hiccups.
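+Concretely, capping the parallelism comes down to running the tests through a
+small bounded worker pool. The sketch below only illustrates that idea and is
+not the actual CI script: the test directory, and the use of Python (the real
+helpers are shell scripts) are assumptions.
+
+```python
+#!/usr/bin/env python3
+"""Sketch: run packetdrill scripts with a bounded worker pool."""
+import subprocess
+import sys
+from concurrent.futures import ThreadPoolExecutor
+from pathlib import Path
+
+# Without KVM acceleration, more workers than this starve the CPU.
+MAX_PARALLEL = 2
+
+def run_one(script: Path) -> bool:
+    """Run a single packetdrill script; return True if it passed."""
+    res = subprocess.run(["packetdrill", str(script)], capture_output=True)
+    return res.returncode == 0
+
+def main() -> int:
+    scripts = sorted(Path("tests/mptcp").glob("*.pkt"))  # hypothetical path
+    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
+        results = list(pool.map(run_one, scripts))
+    failed = [s.name for s, ok in zip(scripts, results) if not ok]
+    print(f"{len(scripts) - len(failed)}/{len(scripts)} passed")
+    return 1 if failed else 0
+
+if __name__ == "__main__":
+    sys.exit(main())
+```
+
+A thread pool is enough here because each worker just waits on a subprocess:
+what matters is the bound on concurrent `packetdrill` processes, not Python
+concurrency.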
+The second surprise initially looked bad: some new kernel panics have been
+[detected](https://github.com/multipath-tcp/mptcp_net-next/issues/471). They
+were not due to MPTCP code -- the panics happened while doing some "ping"
+between net namespaces -- but the CI was hitting them ~25-30% of the time.
+The bug has been
+[reported](https://lore.kernel.org/netdev/98724dcd-ddf3-4f78-a386-f966ffbc9528@kernel.org/T/)
+to the Netdev community, and quickly identified as an issue somewhere in x86
+code. It was not easy to reproduce, but with determination and hours of
+tests, a [patch](https://lore.kernel.org/all/06cb540e-34ff-4dcd-b936-19d4d14378c9@kernel.org/T/)
+causing the issue has been identified. It turns out the patch accidentally
+removed some code that was hiding a bug in QEMU's TCG code. Before reporting
+this to the QEMU community, I tried to reproduce the issue with the latest
+stable version of QEMU -- which required updating other dependencies,
+including a fix for virtme, a tool that is no longer maintained -- and I
+didn't manage to. Still, the QEMU version from the latest Ubuntu stable
+release doesn't include this fix, which has been reported on the
+[Ubuntu bug tracker](https://bugs.launchpad.net/bugs/2051965). After another
+long bisecting session, I managed to identify the commits fixing the bug.
+Sadly, these commits have not been backported on QEMU's side, and their
+commit messages were lacking explanations about how the bug was initially
+found, and when it was introduced. This doesn't help to get fixes backported,
+nor to quickly identify which patch fixed someone else's issue. In the end,
+quite some time has been spent on a bug impacting us, but not caused by
+MPTCP. On the other hand, it helped to improve the kernel panic detection --
+the VM is now quickly stopped when such critical issues are hit -- and the
+fix will likely become available to all Ubuntu users.
+
+Not related to the migration: some other subtests started to fail a few days
+in a row. It turned out to be due to the latest version of iproute2, more
+precisely to a
+[regression](https://lore.kernel.org/netdev/f0abd655-0c10-4190-97bf-34ffb92f98ad@kernel.org/)
+in the `ss` tool we use in some tests. A
+[fix](https://lore.kernel.org/netdev/20240113-ss-fix-ext-col-disabled-v1-1-cf99a7381dec@kernel.org/)
+has been sent and applied, but not backported. Even if it breaks our
+selftests, and might break scripts using `ss` to monitor connections or to
+report info, the
+[issue is not big enough](https://lore.kernel.org/netdev/20240115090548.04eb7f6f@hermes.local/)
+to justify a new bug-fix version. In other words, if you want to validate the
+MPTCP selftests, don't use iproute2 6.7.0. To reduce the number of users
+running tests with this buggy version, the Debian package, used by a few CIs,
+has been [patched](https://salsa.debian.org/kernel-team/iproute2/-/commit/14cfd95e)
+with the fix.
+
+### LKFT CI
+
+Linaro's Linux Kernel Functional Test framework, or LKFT, tests a very
+impressive number of kernel versions, from old stable ones to linux-next.
+They have been validating the MPTCP selftests for years, and that's great:
+thanks for doing that! It is especially useful to find issues caused by a bad
+backport in stable versions, or to prevent issues by testing linux-next. One
+important particularity when validating stable kernel versions is that they
+follow the stable team's
+[recommendations](https://lore.kernel.org/stable/ZAG8dla274kYfxoK@kroah.com/):
+running the kselftests suite from the latest stable version (e.g. 6.7.4) when
+validating an (old) stable kernel (e.g. 5.15.148). This is an important
+assumption to be aware of when writing kselftests, and I don't think it is
+well known. It is not easy to support all kernels, especially in the
+networking area, where some features and behaviours are not directly exposed
+to userspace. Some MPTCP kselftests have to look at `/proc/kallsyms` or use
+other (ugly?)
+[workarounds](https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net)
+to predict what behaviour to expect from the kernel being tested. But it is
+important to do something to support these old kernels: otherwise, big
+kselftests with many different subtests would always be marked as "failed"
+when validating new stable releases, removing all their value.
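+To illustrate the `/proc/kallsyms` trick, here is a minimal sketch. It only
+shows the idea: the real selftests are shell scripts, not Python, and the
+symbol name below is just an example of something one could probe for.
+
+```python
+#!/usr/bin/env python3
+"""Sketch: skip a subtest when the running kernel lacks a feature."""
+from pathlib import Path
+
+def kernel_has_symbol(name: str) -> bool:
+    """Return True if the running kernel exposes the given symbol."""
+    try:
+        with Path("/proc/kallsyms").open() as kallsyms:
+            for line in kallsyms:
+                # Each line looks like: "<address> <type> <symbol> [module]".
+                parts = line.split()
+                if len(parts) >= 3 and parts[2] == name:
+                    return True
+    except OSError:
+        pass  # e.g. not running on Linux
+    return False
+
+# Example symbol only: pick one tied to the feature under test.
+if kernel_has_symbol("mptcp_pm_nl_work"):
+    print("feature present: running the subtest")
+else:
+    print("ok 1 # SKIP feature not available on this kernel")
+```
+
+Probing for internal symbols is fragile -- names can change at any time --
+but when a feature is not exposed to userspace, it is sometimes the only way
+to guess what an old kernel supports.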
+Regularly, the test results from LKFT are analysed, to check that no tests
+have been skipped by accident, and that no real issues are being ignored
+there. Recently, some modifications have been made to support more
+environments: adding missing kernel configs, increasing timeouts, applying
+other techniques to cope with very slow environments, and more.
+
+### Netdev CI
+
+Great news for the Networking subsystem: a new
+[CI environment](https://netdev.bots.linux.dev/status.html) has been put in
+place by Jakub Kicinski, supported by Meta. It currently runs most tests
+available in the kernel source code, including ours for MPTCP: kselftests,
+KUnit, and more. The initial goal is to validate patches that are sent to the
+mailing list, and to publish reports on
+[Patchwork](https://patchwork.kernel.org/project/netdevbpf/list/). That's an
+important addition: having each new modification validated by a CI is common
+practice, and something we have done with MPTCP for a long time. Yet such a
+CI has been lacking here for years, certainly for the same reasons mentioned
+in my [previous post]({% post_url 2024-01-01-Angel-Project %}), valid for
+most kernel subsystems and Open-Source projects in general: there is no
+ready-made infrastructure that kernel developers can easily use to monitor
+new patches sent by email, apply them automatically, run tests, and send a
+report; and who would pay to develop and maintain all that, plus the VMs or
+the specific hardware? This doesn't mean no tests were being executed, just
+that it was not done in public: it is easier for a company to run these tests
+internally, especially when specific hardware is needed, or when the test
+suite is not public. But it is not impossible: the DRM subsystem has been
+using GitLab for years, with self-hosted runners, for everybody's benefit.
+It does need coordination, though, and some brave people willing to push
+their managers to start projects that will not directly benefit their own
+company. Anyway, it is great to have Meta's support here.
+
+I'm glad I was able to share some knowledge to help get this in place. The
+environment is a bit different, mainly because no "public CIs" are involved.
+When I look at the different issues I have had in the past, and the
+workarounds I put in place to get our different tests supported on such
+"public CIs", I now think I could probably have done a lot more in the same
+amount of time had I been able to control everything. On the other hand, a
+setup based on "public CIs" can be maintained by anybody and is not tied to a
+single company -- which is currently not the case with the Netdev CI, even if
+a lot of its components are in a
+[public repository](https://github.com/linux-netdev/nipa/).
+
+Having a CI executed in a new environment also means new issues specific to
+this new environment! That was the case here, especially because the tests
+are running without KVM acceleration. Fortunately, some of the fixes needed
+for the LKFT CI can be re-used here.
+
+Talking about fixes, Paolo managed to find what was causing some subtests to
+be [unstable](https://github.com/multipath-tcp/mptcp_net-next/issues/468) in
+very slow environments. It took a bit of time to identify the root cause, but
+it turned out to be the use of a TCP-specific helper on an MPTCP socket. As
+mentioned in the
+[commit message](https://git.kernel.org/netdev/net/c/b6c620dc43cc), funnily
+enough, fuzzers and static checkers were happy, as the accessed memory still
+belongs to `struct mptcp_sock` -- it is simply completely unrelated. Even
+from a functional perspective, the check being done was always skipped, until
+an unrelated [refactoring](https://git.kernel.org/netdev/net/c/d5fed5addb2b)
+exposed the issue. In short, always be careful when casting a structure into
+another one! By the way, that's not the first time such an issue has been
+introduced, because it is easy to mix up the MPTCP and TCP socket structures.
+I'm working on adding new small checks in debug mode to catch such issues in
+the future.
+
+One last thing: in order not to have entire kselftests, with hundreds of
+subtests, being ignored because of a single unstable subtest, I'm adding
+support for tracking individual subtests. We would feel better skipping just
+one or two subtests while we are trying to improve the situation, rather than
+having hundreds of them ignored for days, adding unneeded pressure.
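+To give an idea of what subtests support means in practice, here is a minimal
+sketch. It assumes the kselftests emit standard TAP lines ("ok N name",
+"not ok N name", with an optional "# SKIP" suffix); the "known unstable" list
+and the Python form are assumptions, not the actual CI implementation.
+
+```python
+#!/usr/bin/env python3
+"""Sketch: report per-subtest results from kselftest TAP output."""
+import re
+import sys
+
+# Matches "ok 1 name", "not ok 2 - name", "not ok 3 name # SKIP reason"...
+TAP_LINE = re.compile(r"^(ok|not ok) \d+(?: -)? (.+?)( # SKIP.*)?$")
+
+KNOWN_UNSTABLE = {"packetdrill_fastopen"}  # hypothetical subtest name
+
+def report(stream) -> int:
+    failed = []
+    for line in stream:
+        match = TAP_LINE.match(line.strip())
+        if not match:
+            continue  # plan lines, diagnostics, etc.
+        status, name, skipped = match.groups()
+        if status == "not ok" and not skipped:
+            if name in KNOWN_UNSTABLE:
+                print(f"ignored (known unstable): {name}")
+            else:
+                failed.append(name)
+    for name in failed:
+        print(f"failed: {name}")
+    return 1 if failed else 0
+
+if __name__ == "__main__":
+    sys.exit(report(sys.stdin))
+```
+
+With something like this, a single unstable subtest can be ignored -- and
+tracked -- on its own, instead of masking the results of all its siblings.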
+## Misc. Maintenance Work
+
+Mainly to show that maintenance work is always full of unexpected tasks that
+need to be handled "quickly", here is a bunch of things that had to be done
+this month, on top of the classic reviews, meetings, questions, bugs, etc.:
+
+- BPF: we have some code in progress in our tree that is not fully ready yet.
+  This means we continuously have to adapt our code when changes land
+  upstream. It was the case recently, to adapt to
+  [different](https://git.kernel.org/netdev/net/c/f6be98d19985)
+  `bpf_struct_ops`
+  [changes](https://git.kernel.org/netdev/net/c/4cbb270e115b).
+
+- `mptcpd` Debian packages: two new versions --
+  [0.12-3](https://tracker.debian.org/news/1494263/accepted-mptcpd-012-3-source-into-unstable/) and
+  [0.12-4](https://tracker.debian.org/news/1499261/accepted-mptcpd-012-4-source-into-unstable/) --
+  have been released to fix some issues caused by changes in the test
+  infrastructure, and in how the systemd dependency is handled.
+
+- `packetdrill`: Packetdrill is a very handy tool to validate networking
+  features. We maintain an out-of-tree version supporting MPTCP. This version
+  is not "upstreamable" as it is, because the old experimental code needs a
+  serious refactoring, but it still works quite well. While looking for a fix
+  for the QEMU bug mentioned above, the build environment for the tests has
+  been updated, which broke our branches dedicated to the different stable
+  kernel versions -- each branch only contains the tests valid for its
+  version, because MPTCP support has been added progressively. To fix
+  compilation issues with newer compiler versions, but also to ease the
+  maintenance later, each stable branch has been synced with modifications
+  from the upstream repository maintained by Google.
+
+## What's next?
+
+Even if I would prefer to add new features, there is still some work needed
+on the CI side:
+
+- using [self-hosted runners](https://github.com/multipath-tcp/mptcp_net-next/issues/474)
+  to improve the stability of some tests
+- switching to [virtme-ng](https://github.com/multipath-tcp/mptcp_net-next/issues/472),
+  because the `virtme` tool we use is no longer maintained
+- validating [MPTCP BPF tests](https://github.com/multipath-tcp/mptcp_net-next/issues/406)
+  to track regressions in this area
+- tracking [flaky tests](https://github.com/multipath-tcp/mptcp_net-next/issues/473)
+  to be able to identify them easily
+
+Once all of this is done, it will help the whole MPTCP community, and I will
+be able to focus on other things!