---
layout: post
title: "CI CI CI"
---

The [previous post]({% post_url 2024-01-01-Angel-Project %}) predicted more
work on the CI side, and... that was correct! A new environment causing
unexpected kernel crashes, new CI instances finding new bugs, third-party tools
causing issues, etc. Read on to find out why January turned out to be the month
dedicated to CI activities!

<!--more-->

## CI

In addition to the work on the MPTCP CI, some modifications have also been made
for other CIs validating the MPTCP selftests.

### MPTCP CI

As mentioned in my [previous post]({% post_url 2024-01-01-Angel-Project %}),
the migration to GitHub Actions continued in January: adding email
notifications, displaying test results and extra metrics, enabling the
workflows on new patches, etc. In short, re-adding many bits we had before with
Cirrus-CI, and that were needed to keep the workflow close to what it was
before.

Unfortunately, the migration came with some unexpected surprises! The first one
is a small increase in the number of "flaky" tests. This can partially be
explained by the fact that KVM acceleration cannot be used on GitHub Actions
when using free runners. Packetdrill tests seem to suffer more than the others:
unlike the other tests, they are executed in parallel, and they appear to take
most of the available CPU resources. Reducing the limit to a maximum of 2 tests
running in parallel greatly improved the results, but there are still some rare
hiccups.

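To illustrate the idea, here is a minimal sketch in C of capping the number of
test processes running at the same time. This is not the actual CI script: the
runner, the way tests are passed on the command line, and the hard-coded limit
of 2 are only there to mirror what is described above.

```c
/* Minimal sketch (not the real CI scripts): run each packetdrill script
 * passed on the command line, with at most two child processes in flight
 * at a time, so the tests do not starve a slow, non-KVM runner of CPU. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PARALLEL 2

int main(int argc, char *argv[])
{
	int running = 0;

	for (int i = 1; i < argc; i++) {
		if (running == MAX_PARALLEL) {
			wait(NULL);	/* block until one test finishes */
			running--;
		}

		pid_t pid = fork();
		if (pid == 0) {
			/* child: run one test script */
			execlp("packetdrill", "packetdrill", argv[i], (char *)NULL);
			perror("execlp");
			exit(127);
		} else if (pid > 0) {
			running++;
		}
	}

	while (running-- > 0)
		wait(NULL);	/* reap the remaining children */

	return 0;
}
```

Invoked as, say, `./run_capped tests/*.pkt`, it keeps two packetdrill instances
busy and waits for the rest, which is the behaviour the reduced limit gives us.
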
The second surprise initially looked bad: some new kernel panics have been
[detected](https://github.com/multipath-tcp/mptcp_net-next/issues/471). They
were not due to MPTCP code -- they happened while doing some "ping" between
net namespaces -- but they were reported by the CI ~25-30% of the time. The bug
has been [reported](https://lore.kernel.org/netdev/[email protected]/T/)
to the Netdev community, and quickly identified as an issue somewhere in the
x86 code. It was not easy to reproduce, but with determination and hours of
tests, the [patch](https://lore.kernel.org/all/[email protected]/T/)
causing the issue has been identified. It turns out the patch accidentally
removed some code that was preventing a bug in QEmu's TCG code from being hit.
Before reporting this to the QEmu community, I tried to reproduce the issue
with the latest stable version of QEmu -- requiring an update of other
dependencies, including a fix for virtme, which is no longer maintained -- and
I didn't manage to. Still, the QEmu version from the latest Ubuntu stable
release doesn't include this fix, which has been reported to the
[Ubuntu bug tracker](https://bugs.launchpad.net/bugs/2051965). After another
long bisecting session, I managed to identify the commits fixing the bug.
Sadly, these commits have not been backported on QEmu's side, and their commit
messages were lacking explanations about how the bug was initially found, and
when it was introduced. That doesn't help to get fixes backported, nor to
quickly identify which patch fixed someone else's issue. In the end, quite some
time has been spent on a bug impacting us, but not caused by MPTCP. On the
other hand, it helped to improve the kernel panic detection, quickly stopping
the VM in case of such critical issues, and the fix will likely become
available to all Ubuntu users.

Not related to the migration, but some other subtests started to fail for a few
days in a row. It turns out it was due to the latest version of IPRoute2, more
precisely due to a [regression](https://lore.kernel.org/netdev/[email protected]/)
in the `ss` tool we use for some tests. A
[fix](https://lore.kernel.org/netdev/[email protected]/)
has been sent and applied, but not backported. Even if it breaks our selftests,
and it might break scripts using `ss` to monitor connections or to report info,
the [issue is not big enough](https://lore.kernel.org/netdev/[email protected]/)
to justify a new bug-fix version. In other words, if you want to validate the
MPTCP selftests, don't use IPRoute2 6.7.0. To reduce the number of users running
tests with this buggy version, note that the version shipped by Debian, used by
a few CIs, has been [patched](https://salsa.debian.org/kernel-team/iproute2/-/commit/14cfd95e)
with the fix.

### LKFT CI

Linaro's Linux Kernel Functional Test framework, or LKFT, tests a very
impressive number of kernel versions, from old stable ones to linux-next. They
have been validating the MPTCP selftests for years, and that's great, thanks
for doing that! It is especially useful to find issues due to a bad backport in
stable versions, or to prevent issues by testing linux-next. One important
particularity when they validate stable kernel versions is that they follow the
stable team's [recommendations](https://lore.kernel.org/stable/[email protected]/):
running the kselftests suite from the latest stable version (e.g. 6.7.4) when
validating an (old) stable kernel (e.g. 5.15.148). That is an important
assumption to keep in mind when writing kselftests, and I don't think it is
well known. It is not easy to support all kernels, especially in the networking
area, where some features and behaviours are not directly exposed to userspace.
Some MPTCP kselftests have to look at `/proc/kallsyms` or use other (ugly?)
[workarounds](https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net)
to predict which behaviour to expect, depending on the kernel being used. But
it is important to do something to support these old kernels, otherwise big
kselftests with many different subtests would always be marked as "failed" when
validating new stable releases, removing all their value.

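As an illustration of this kind of workaround, here is a small sketch in C:
before exercising a feature, check whether the running kernel exposes a given
symbol in `/proc/kallsyms`, and report the subtest as skipped otherwise. The
symbol and subtest names are made up for the example, and the real selftests do
this with shell helpers rather than a standalone C program.

```c
/* Sketch only: decide whether to run a subtest by looking for a kernel
 * symbol in /proc/kallsyms. Symbol and subtest names are illustrative. */
#include <stdio.h>
#include <string.h>

static int kernel_has_symbol(const char *symbol)
{
	char line[256];
	FILE *f = fopen("/proc/kallsyms", "r");
	int found = 0;

	if (!f)
		return 0;

	while (!found && fgets(line, sizeof(line), f))
		found = strstr(line, symbol) != NULL;

	fclose(f);
	return found;
}

int main(void)
{
	printf("TAP version 13\n");
	printf("1..1\n");

	/* The leading space avoids matching when the name is only the tail
	 * of a longer symbol. */
	if (!kernel_has_symbol(" mptcp_pm_nl_set_flags")) {
		printf("ok 1 - pm_set_flags # SKIP feature not available on this kernel\n");
		return 0;
	}

	/* ... exercise the feature here ... */
	printf("ok 1 - pm_set_flags\n");
	return 0;
}
```
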
Regularly, the test results from LKFT are analysed, to check that no tests have
been skipped by accident, and that no real issues are being ignored there.
Recently, some modifications have been done to support more environments:
adding missing kernel configs, increasing timeouts, applying other techniques
to support very slow environments, and more.

### Netdev CI

Great news for the networking subsystem: a new
[CI environment](https://netdev.bots.linux.dev/status.html) has been put in
place by Jakub Kicinski, supported by Meta. It currently runs most tests
available in the kernel source code, including ours for MPTCP: kselftests,
KUnit, and more. The initial goal is to validate patches sent to the mailing
list, and to publish reports on [Patchwork](https://patchwork.kernel.org/project/netdevbpf/list/).
That's an important addition, something we have done for MPTCP for a long time,
and part of common practice: each new modification has to be validated by a CI.
Such a setup has been lacking for years in most kernel subsystems, and in
open-source projects in general, certainly for the same reasons mentioned in my
[previous post]({% post_url 2024-01-01-Angel-Project %}): there is no
ready-made infrastructure kernel developers can easily use to monitor new
patches sent by email, apply them automatically, run tests, and send a report;
and who would pay to develop and maintain all that, plus the VMs or the
specific hardware, etc.? This doesn't mean no tests were being executed, just
that it was not done in public: it is easier for a company to run these tests
internally, especially when specific hardware is needed, or the test suite is
not public. But it is not impossible: the DRM subsystem has been using GitLab
for years, with self-hosted runners, for everybody's benefit. It just needs
coordination, and some brave people to push their managers to start projects
that will not directly benefit their own company. Anyway, it is great to have
Meta's support here.

I'm glad I was able to share some knowledge to help get this in place. The
environment is a bit different, mainly because no "public CIs" are involved.
When I look at the issues I hit in the past, and the workarounds I put in place
to get our different tests supported on such "public CIs", I now think I could
probably have done a lot more in the same time if I had been able to control
everything. On the other hand, it is great that anybody can maintain our setup,
and that it is not tied to a single company -- which is currently not the case
with the Netdev CI, even if a lot of its components are in a
[public repository](https://github.com/linux-netdev/nipa/).

Having the CI executed in a new environment also means new issues specific to
this new environment! That was the case here, especially because the tests run
without KVM acceleration. The good news is that some of the fixes needed for
the LKFT CI can be re-used here.

Talking about fixes, Paolo managed to find what was causing some subtests to be
[unstable](https://github.com/multipath-tcp/mptcp_net-next/issues/468) in very
slow environments. It took a bit of time to identify the root cause, but it
turned out to be due to the use of a TCP-specific helper on an MPTCP socket. As
mentioned in the [commit message](https://git.kernel.org/netdev/net/c/b6c620dc43cc),
funnily enough, fuzzers and static checkers are happy, because the accessed
memory still belongs to `struct mptcp_sock`, but to a completely unrelated
field. Even from a functional perspective, the check being done was always
skipped before, until an unrelated [refactoring](https://git.kernel.org/netdev/net/c/d5fed5addb2b)
exposed the issue. In short, always be careful when casting a structure into
another one! By the way, that's not the first time such an issue has been
introduced, because it is easy to mix up the MPTCP and TCP socket structures.
I'm working on adding new small checks in debug mode to avoid such issues in
the future.

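To make the pitfall concrete, here is a small userspace sketch, not the kernel
code: two socket-like structures share a common header, and a helper written
for one of them is handed the other. The cast compiles, the read stays inside
valid memory, so tools stay quiet, but the value comes from an unrelated field.
All structure and field names are invented for the illustration, as is the
debug-style check in the helper.

```c
/* Illustration only, with invented types: what happens when a helper meant
 * for one structure is used on another one sharing the same header. */
#include <assert.h>
#include <stdio.h>

struct common_sock {
	int protocol;			/* 6 for TCP, 262 for MPTCP */
};

struct tcp_like_sock {
	struct common_sock base;
	int snd_wnd;			/* what the helper expects to read */
};

struct mptcp_like_sock {
	struct common_sock base;
	int subflow_count;		/* what actually sits at that offset */
};

/* Helper written with TCP-like sockets in mind. */
static int tcp_helper_get_snd_wnd(struct common_sock *sk)
{
	/* A small debug-mode check like this would catch the mix-up early. */
	assert(sk->protocol == 6);

	return ((struct tcp_like_sock *)sk)->snd_wnd;
}

int main(void)
{
	struct mptcp_like_sock msk = {
		.base = { .protocol = 262 },
		.subflow_count = 3,
	};

	/* Valid memory, wrong meaning: without the assert above, this would
	 * print the subflow count (3) as if it were a TCP send window. */
	printf("snd_wnd? %d\n", tcp_helper_get_snd_wnd(&msk.base));
	return 0;
}
```
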
One last thing: in order not to have entire kselftests, including hundreds of
subtests, ignored because of a single unstable subtest, I'm adding subtest
support. We would rather skip just one or two subtests while we are trying to
improve the situation, than have hundreds of them ignored for days, adding
unneeded pressure.

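Here is a rough sketch of what per-subtest reporting enables on the CI side,
assuming KTAP-style output: failures from the one or two subtests known to be
unstable can be ignored, while any other failing subtest still fails the run.
The subtest name in the deny-list is invented for the example.

```c
/* Sketch only: read KTAP-style result lines on stdin and ignore failures
 * from subtests that are known to be flaky, instead of throwing away the
 * whole kselftest. The flaky-subtest name below is illustrative. */
#include <stdio.h>
#include <string.h>

static const char *known_flaky[] = {
	"mptcp_connect_long_transfer",	/* hypothetical unstable subtest */
	NULL,
};

static int is_known_flaky(const char *line)
{
	for (int i = 0; known_flaky[i]; i++)
		if (strstr(line, known_flaky[i]))
			return 1;
	return 0;
}

int main(void)
{
	char line[512];
	int failed = 0;

	while (fgets(line, sizeof(line), stdin)) {
		if (strncmp(line, "not ok", 6) != 0)
			continue;	/* only failing subtests matter here */

		if (is_known_flaky(line))
			fprintf(stderr, "ignoring known-flaky failure: %s", line);
		else
			failed++;
	}

	return failed ? 1 : 0;
}
```
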
## Misc. Maintenance Work

Mainly to show that maintenance work is always full of unexpected stuff that
needs to be done "quickly", here are a bunch of tasks that needed to be done
this month, on top of the classic reviews, meetings, questions, bugs, etc.:

- BPF: we have some work-in-progress code in our tree that is not fully ready
yet, which means we continuously have to adapt it when changes land upstream.
That was the case recently, to follow
[different](https://git.kernel.org/netdev/net/c/f6be98d19985) `bpf_struct_ops`
[changes](https://git.kernel.org/netdev/net/c/4cbb270e115b).

- `mptcpd` Debian packages: two new versions --
[0.12-3](https://tracker.debian.org/news/1494263/accepted-mptcpd-012-3-source-into-unstable/) and
[0.12-4](https://tracker.debian.org/news/1499261/accepted-mptcpd-012-4-source-into-unstable/) --
have been released to fix some issues caused by changes in the test
infrastructure, and in how the systemd dependency is handled.

- `packetdrill`: Packetdrill is a very handy tool to validate networking
features. We maintain an out-of-tree version supporting MPTCP. This version is
not "upstreamable" as it is, because the old experimental code needs a serious
refactoring, but it still works quite well. Anyway, while trying to find a fix
for the QEmu bug mentioned above, the build environment for the tests has been
updated, and that broke our branches dedicated to the different stable kernel
versions -- each one containing only the tests valid for that version, as MPTCP
support has been added progressively. To fix compilation issues with newer
compiler versions, but also to ease the maintenance later, each stable branch
has been synced with the modifications from the upstream repository maintained
by Google.

## What's next?

Even if I would prefer to add new features, there is still some work needed on
the CI side:
- using [self-hosted runners](https://github.com/multipath-tcp/mptcp_net-next/issues/474)
to improve the stability of some tests
- switching to [virtme-ng](https://github.com/multipath-tcp/mptcp_net-next/issues/472),
because the `virtme` tool we currently use is no longer maintained
- validating [MPTCP BPF tests](https://github.com/multipath-tcp/mptcp_net-next/issues/406)
to track regressions in this area
- tracking [flaky tests](https://github.com/multipath-tcp/mptcp_net-next/issues/473)
to be able to identify them easily

But once done, that will help the whole MPTCP community, and I will be able to
focus on other things!
