|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "CI CI CI" |
| 4 | +--- |
| 5 | + |
| 6 | +The [previous post]({% post_url 2024-01-01-Angel-Project %}) was predicting more |
| 7 | +work on the CI side, and... That was correct! A new environment causing |
| 8 | +unexpected kernel crashes, new CI instances finding new bugs, third party tools |
| 9 | +causing issues, etc. Read on to find out why January seems to be the month |
| 10 | +dedicated to CI activities! |
| 11 | + |
| 12 | +<!--more--> |
| 13 | + |
| 14 | +## CI |
| 15 | + |
| 16 | +In addition to the work around MPTCP CI, some modifications have also been done |
| 17 | +for other CIs validating MPTCP selftests. |
| 18 | + |
| 19 | +### MPTCP CI |
| 20 | + |
| 21 | +As mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}), |
| 22 | +the migration to Github Actions continued in January: adding email |
| 23 | +notifications, display test results and extra metrics, enabling them on new |
| 24 | +patches, etc. In short, re-adding many bits we had before with Cirrus-CI, and |
| 25 | +needed to keep the workflow closed to what it was before. |
| 26 | + |
| 27 | +Unfortunately, the migration came with some unexpected surprises! The first one |
| 28 | +is a small increase in the number of "flaky" tests. This can partially be |
| 29 | +explained by the fact KVM acceleration cannot be used on GitHub Actions when |
| 30 | +using free runners. Packetdrill tests seem to suffer more than the others. |
| 31 | +Compared to the other tests, they are executed in parallel. It looks like they |
| 32 | +take all/most of the available CPU resources. Reducing the limit to have maximum |
| 33 | +2 tests in parallel greatly improved the results, but there are still some rare |
| 34 | +hiccups. |
| 35 | + |
| 36 | +The second surprise initially looked bad: some new kernel panics have been |
| 37 | +[detected](https://github.com/multipath-tcp/mptcp_net-next/issues/471). They |
| 38 | +were not due to MPTCP code -- it was happening while doing some "ping" between |
| 39 | +net namespaces -- but it was reported by the CI ~25-30% of the time. The bug has |
| 40 | +been [reported ](https://lore.kernel.org/netdev/[email protected]/T/) |
| 41 | +to the Netdev community, and quickly identified as an issue somewhere in x86 |
| 42 | +code. It was not easy to reproduce the issue, but with determination, and hours |
| 43 | +of tests, a [patch ](https://lore.kernel.org/all/[email protected]/T/) |
| 44 | +causing the issue has been identified. It turns out, the patch accidentally |
| 45 | +removed some code that was preventing a bug in QEmu's TCG code. Before reporting |
| 46 | +this to the QEmu's community, I tried to reproduce the issue with the last |
| 47 | +stable version of QEmu -- requiring an update of other dependences, including |
| 48 | +a fix for virtme that is no longer maintained -- and I didn't manage to. Still, |
| 49 | +the QEmu's version from the last Ubuntu stable release doesn't include this fix. |
| 50 | +This has been reported to the [Ubuntu bug tracker](https://bugs.launchpad.net/bugs/2051965). |
| 51 | +After another long bisecting session, I just managed to identify the commits |
| 52 | +fixing the bug. Sadly, these commits have not been backported on QEmu's side, |
| 53 | +and the commit message was lacking explanations about how the bug was initially |
| 54 | +found, and when the bug was introduced. This doesn't help to get fixes |
| 55 | +backported, and to quickly identify which patch fixed someone else's issue. At |
| 56 | +the end, quite some time has been spent on a bug impacting us, but not due to |
| 57 | +MPTCP. On the other hand, it helped to improve kernel panic detection, quickly |
| 58 | +stopping the VM in case of such critical issues, and the fix will likely be |
| 59 | +available to all Ubuntu users. |
| 60 | + |
| 61 | +Not related to the migration, but some other subtests have started to fail for a |
| 62 | +few days in a row. It turns out it was due to the last version of IPRoute2, more |
| 63 | +precisely due to a [regression ](https://lore.kernel.org/netdev/[email protected]/) |
| 64 | +in the `ss` tool we use for some tests. A |
| 65 | +[fix ](https://lore.kernel.org/netdev/[email protected]/) |
| 66 | +has been sent, applied, but not backported. Even if it breaks our selftests, and |
| 67 | +it might break scripts using `ss` to monitor connections or to report info, the |
| 68 | +[issue is not big enough ](https://lore.kernel.org/netdev/[email protected]/) |
| 69 | +to justify a new bug-fix version. In other words, if you want to validate MPTCP |
| 70 | +selftests, don't use IPRoute 6.7.0. To reduce the number of users running tests |
| 71 | +with this buggy version, note that the version on Debian, used by a few CIs, has |
| 72 | +been [patched](https://salsa.debian.org/kernel-team/iproute2/-/commit/14cfd95e) |
| 73 | +with the fix. |
| 74 | + |
| 75 | +### LKFT CI |
| 76 | + |
| 77 | +Linaro's Linux Kernel Functional Test framework, or LKFT, test a very impressive |
| 78 | +number of kernel versions. From old stable ones, to linux-next. They have been |
| 79 | +validating MPTCP selftests for years, and that's great, thanks guys for doing |
| 80 | +that! It is especially useful to find issues due to a bad backport in stable |
| 81 | +versions, or to prevent issues by testing linux-next. One important |
| 82 | +particularity when validating stable kernel versions, is that they follow stable |
| 83 | +team [recommendations ](https://lore.kernel.org/stable/[email protected]/): |
| 84 | +running the kselftests suite from the last stable version (e.g. 6.7.4) when |
| 85 | +validating (old) stable kernel (e.g. 5.15.148). It is an important assumption to |
| 86 | +know when writing kselftests, and I don't think this is well known. I don't |
| 87 | +think that's easy to support all kernels, especially in the networking area, |
| 88 | +where some features/behaviours are not directly exposed to the userspace. Some |
| 89 | +MPTCP kselftests have to look at `/proc/kallsyms` or use other (ugly?) |
| 90 | +[workarounds](https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net) |
| 91 | +to predict what we are supposed to have, depending on the kernel that is being |
| 92 | +used. But it is important to do something to support these old kernels, not to |
| 93 | +have big kselftests, with many different subtests, always marked as "failed" |
| 94 | +when validating new stable releases, removing all their value. |
| 95 | + |
| 96 | +Regularly, test results from LKFT are analysed, to check if some tests have not |
| 97 | +been skipped by accident, and if there are no real issues being ignored there. |
| 98 | +Recently, some modifications have been done to support more environments, add |
| 99 | +missing kernel configs, increase timeout, and other techniques trying to support |
| 100 | +very slow environments, and more. |
| 101 | + |
| 102 | +### Netdev CI |
| 103 | + |
| 104 | +Great news for the Networking subsystem: a new |
| 105 | +[CI environment](https://netdev.bots.linux.dev/status.html) has been put in |
| 106 | +place by Jakub Kicinski, supported by Meta. It currently runs most tests |
| 107 | +available in the kernel source code, including ours for MPTCP: kselftests, |
| 108 | +KUnit, and more. The initial goal is to validate patches that are sent on the |
| 109 | +mailing list, and send reports on [Patchwork](https://patchwork.kernel.org/project/netdevbpf/list/). |
| 110 | +That's an important addition, something we have done with MPTCP for a long time, |
| 111 | +and is part of the common practice: each new modification has to be validated by |
| 112 | +a CI. Certainly due to the same reasons as mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}) |
| 113 | +and valid for most kernel subsystems, and Open-Source project in general -- no |
| 114 | +infrastructure in place that can be easily used by kernel dev to monitor new |
| 115 | +patches sent by email, to automatically apply, then run tests, and send a |
| 116 | +report ; who would pay to develop and maintain that, for the VM or the specific |
| 117 | +hardware, etc. --, this has been lacking for years. This doesn't mean no tests |
| 118 | +were being executed, it's just that this was not done in public: it is easier |
| 119 | +for a company to run these tests internally, especially when some specific |
| 120 | +hardware is needed, or the test suite is not public. But that's not impossible: |
| 121 | +DRM subsystem has been using GitLab for years, with self-hosted runners, for |
| 122 | +everybody's benefit. But it needs coordination, and some brave people to push |
| 123 | +their manager to start such projects that are not going to be directly |
| 124 | +beneficial for this company. Anyway, here, that's great to have Meta's support. |
| 125 | + |
| 126 | +I'm glad I was able to share some knowledge to help a bit having this in place. |
| 127 | +The environment is a bit different, mainly because no "public CIs" are involved. |
| 128 | +When I look at the different issues I had in the past, and the workarounds I put |
| 129 | +in place to have our different tests supported on such "public CIs", I'm now |
| 130 | +thinking I could have probably done a lot more with the same time if I was able |
| 131 | +to control everything. But on the other hand, that's great anybody can maintain |
| 132 | +it, and it is not tied to a single company -- which is currently not the case |
| 133 | +with the Netdev CI, even if a lot of components are in a |
| 134 | +[public repository](https://github.com/linux-netdev/nipa/). |
| 135 | + |
| 136 | +Having CI executed in a new environment, also means new issues specific to this |
| 137 | +new environment! This was the case, especially because the tests are running |
| 138 | +without KVM acceleration. It is good that some of the fixes needed for LKFT CI |
| 139 | +can be re-used here. |
| 140 | + |
| 141 | +Talking about fixes, Paolo managed to find what was causing some subtests to be |
| 142 | +[unstable](https://github.com/multipath-tcp/mptcp_net-next/issues/468) in very |
| 143 | +slow environments. It took a bit of time to identify the root cause, but it |
| 144 | +appears to be due to the use of a TCP-specific helper on an MPTCP socket. As |
| 145 | +mentioned in the [commit message](https://git.kernel.org/netdev/net/c/b6c620dc43cc), |
| 146 | +funnily enough, fuzzers and static checkers are happy, as the accessed memory |
| 147 | +still belongs to the `mptcp_sock struct`, but completely unrelated. Even from a |
| 148 | +functional perspective, the check being done was always skipped before. Until an |
| 149 | +unrelated [refactoring](https://git.kernel.org/netdev/net/c/d5fed5addb2b) which |
| 150 | +exposed the issue. In short, always be careful when casting a structure into |
| 151 | +another one! BTW, that's not the first time such an issue has been introduced |
| 152 | +because that's easy to mix-up MPTCP and TCP socket structures. I'm working on |
| 153 | +adding new small checks in debug mode to avoid such issues in the future. |
| 154 | + |
| 155 | +One last thing, not to have entire kselftests, including hundreds of subtests, |
| 156 | +being ignored due to a single unstable subtest, I'm adding subtests support. We |
| 157 | +would feel better skipping just one or two subtests while we are trying to |
| 158 | +improve the situation, than having hundreds of them being ignored for days, |
| 159 | +adding unneeded pressure. |
| 160 | + |
| 161 | +## Misc. Maintenance Work |
| 162 | + |
| 163 | +Mainly to expose that the maintenance work is always full of unexpected stuff |
| 164 | +that needs to be done "quickly", here are a bunch of tasks that needed to be |
| 165 | +done this month, on top of the classic reviews, meetings, questions, bugs, etc.: |
| 166 | + |
| 167 | +- BPF: we have some code in progress in our tree that is not fully ready yet. |
| 168 | + But then it means that we continuously have to adapt our code when some |
| 169 | + changes are done upstream. It was the case recently, to adapt to |
| 170 | + [different](https://git.kernel.org/netdev/net/c/f6be98d19985) `bpf_struct_ops` |
| 171 | + [changes](https://git.kernel.org/netdev/net/c/4cbb270e115b). |
| 172 | + |
| 173 | +- `mptcpd` debian packages: two new versions -- |
| 174 | + [0.12-3](https://tracker.debian.org/news/1494263/accepted-mptcpd-012-3-source-into-unstable/) and |
| 175 | + [0.12-4](https://tracker.debian.org/news/1499261/accepted-mptcpd-012-4-source-into-unstable/) -- |
| 176 | + have been released to fix some issues due to some changes in the test |
| 177 | + infrastructure, and how SystemD dependence is handled. |
| 178 | + |
| 179 | +- `packetdrill`: Packetdrill is a very handy tool to validate networking |
| 180 | + features. We maintain an out-of-tree version supporting MPTCP. This version is |
| 181 | + not "upstreamable" as it is, because the old experimental code needs a serious |
| 182 | + refactoring. But still, it works quite well. Anyway, when trying to find a fix |
| 183 | + for the bug we had with QEmu (see above), the build environment for the tests |
| 184 | + has been updated, breaking our branches dedicated to different stable versions |
| 185 | + containing tests valid for the specific versions: MPTCP support has been added |
| 186 | + progressively. To fix compilation issues when using newer compiler versions, |
| 187 | + but also to help with the maintenance later, each stable branch has been |
| 188 | + synced with modifications from the upstream repository maintained by Google. |
| 189 | + |
| 190 | +## What's next? |
| 191 | + |
| 192 | +Even if I would prefer to add new features, there is still some work needed for |
| 193 | +the CI: |
| 194 | +- using [self-hosted runners](https://github.com/multipath-tcp/mptcp_net-next/issues/474) |
| 195 | + to improve the stability of some tests |
| 196 | +- switching to [virtme-ng](https://github.com/multipath-tcp/mptcp_net-next/issues/472) |
| 197 | + because the `virtme` tool we use is no longer maintained |
| 198 | +- validating [MPTCP BPF tests](https://github.com/multipath-tcp/mptcp_net-next/issues/406) |
| 199 | + to track regressions in this area |
| 200 | +- tracking [flaky tests](https://github.com/multipath-tcp/mptcp_net-next/issues/473) |
| 201 | + to be able to easily identify them |
| 202 | + |
| 203 | +But once done, that will help the whole MPTCP community, and I will be able to |
| 204 | +focus on other things! |
0 commit comments