I was recently looking into using BPF socket iterators in conjunction
with the bpf_sock_destroy() kfunc as a means to forcefully destroy a
set of UDP sockets connected to a deleted backend [1]. The intent is to
use BPF iterators + kfuncs in lieu of INET_DIAG infrastructure to
destroy sockets in order to simplify Cilium's system requirements. Aditi
describes the scenario in [2], the patch series that introduced
bpf_sock_destroy() for this very purpose:

> This patch set adds the capability to destroy sockets in BPF. We plan
> to use the capability in Cilium to force client sockets to reconnect
> when their remote load-balancing backends are deleted. The other use
> case is on-the-fly policy enforcement where existing socket
> connections prevented by policies need to be terminated.
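
For concreteness, here's roughly the shape of the iterator program
involved (a minimal sketch modeled on the bpf_sock_destroy() selftests;
the program name, BACKEND_PORT, and the single-port filter are
illustrative only, not Cilium's actual logic):

/* SPDX-License-Identifier: GPL-2.0 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical port of the deleted backend. */
#define BACKEND_PORT 8080

extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

char _license[] SEC("license") = "GPL";

SEC("iter/udp")
int destroy_udp_to_backend(struct bpf_iter__udp *ctx)
{
	struct udp_sock *udp_sk = ctx->udp_sk;
	struct sock *sk = (struct sock *)udp_sk;

	if (!sk)
		return 0;

	/* Only destroy sockets connected to the deleted backend. */
	if (sk->__sk_common.skc_dport != bpf_htons(BACKEND_PORT))
		return 0;

	bpf_sock_destroy((struct sock_common *)sk);
	return 0;
}

Attaching this program and reading the resulting iterator fd from
userspace drives one pass over all UDP sockets, which is exactly where
the per-socket visit guarantees discussed below start to matter.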

One would want and expect an iterator to visit every socket that existed
before the iterator was created, if not exactly once, then at least
once; otherwise, we could accidentally skip a socket that we intended to
destroy. With the iterator implementation as it exists today, this is
the behavior you would observe in the vast majority of cases.

However, in the process of reviewing [2] and some follow-up fixes to
bpf_iter_udp_batch() ([3] [4]) by Martin, it occurred to me that there
are situations where BPF socket iterators may repeat, or worse, skip
sockets altogether even if they existed prior to iterator creation,
making BPF iterators as a mechanism to achieve the goal stated above
slightly buggy.

This RFC highlights some of these scenarios by extending
prog_tests/sock_iter_batch.c to illustrate the conditions under which
sockets can be skipped or repeated, and proposes a way to achieve
exactly-once semantics for socket iterators, in all cases, with respect
to sockets that existed prior to the start of iteration.

I'm hoping to raise awareness of this issue generally, if it's not
already common knowledge, and to get some feedback on the viability of
the proposed improvement.

THE PROBLEM
===========
Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket, which is a measure of the number of matching sockets
from the current bucket that have been seen or processed by the
iterator. On subsequent iterations, if the current bucket has
unprocessed items, we skip at least iter->offset matching items in the
bucket before adding any remaining items to the next batch. The intent
seems to be to skip any items we've already seen, but iter->offset
isn't always an accurate measure of "things already seen". There are a
variety of scenarios where the underlying bucket changes between reads,
leading to either repeated or skipped sockets. Two such scenarios are
illustrated below and reproduced by the selftests.
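
The failure mode is easier to see with a toy userspace model of that
resume logic. This is illustrative only and not the kernel code, which
lives in bpf_iter_udp_batch() and bpf_iter_tcp_batch():

/* Toy model: a bucket is an array read head to tail, and a resumed
 * read skips "offset" entries before visiting the rest. It only
 * mirrors the bookkeeping described above. */
#include <stdio.h>

int main(void)
{
	char bucket[] = { 'A', 'B', 'C', 'D' };	/* head -> tail */
	int len = 4, offset, i;

	/* First read(): visit A, remember offset = 1. */
	printf("saw %c\n", bucket[0]);
	offset = 1;

	/* close(A): A leaves the bucket, which is now B->C->D. */
	for (i = 1; i < len; i++)
		bucket[i - 1] = bucket[i];
	len--;

	/* Next read(): skip offset (1) entries, so B is never seen
	 * even though it existed before the iterator was created. */
	for (i = offset; i < len; i++)
		printf("saw %c\n", bucket[i]);
	return 0;
}

This prints A, C, D: B is never visited, which is exactly the "Skip A
Socket" sequence in the first table below.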

Skip A Socket
+------+--------------------+--------------+---------------+
| Time | Event              | Bucket State | Bucket Offset |
+------+--------------------+--------------+---------------+
| 1    | read(iter_fd) -> A | A->B->C->D   | 1             |
| 2    | close(A)           | B->C->D      | 1             |
| 3    | read(iter_fd) -> C | B->C->D      | 2             |
| 4    | read(iter_fd) -> D | B->C->D      | 3             |
| 5    | read(iter_fd) -> 0 | B->C->D      | -             |
+------+--------------------+--------------+---------------+

Iteration sees these sockets: [A, C, D]
B is skipped.

Repeat A Socket
+------+--------------------+---------------+---------------+
| Time | Event              | Bucket State  | Bucket Offset |
+------+--------------------+---------------+---------------+
| 1    | read(iter_fd) -> A | A->B->C->D    | 1             |
| 2    | connect(E)         | E->A->B->C->D | 1             |
| 3    | read(iter_fd) -> A | E->A->B->C->D | 2             |
| 4    | read(iter_fd) -> B | E->A->B->C->D | 3             |
| 5    | read(iter_fd) -> C | E->A->B->C->D | 4             |
| 6    | read(iter_fd) -> D | E->A->B->C->D | 5             |
| 7    | read(iter_fd) -> 0 | E->A->B->C->D | -             |
+------+--------------------+---------------+---------------+

Iteration sees these sockets: [A, A, B, C, D]
A is repeated.

If we consider corner cases like these, semantics are neither
at-most-once, nor at-least-once, nor exactly-once. Repeating a socket
during iteration is perhaps less problematic than skipping it
altogether as long as the BPF program is aware that duplicates are
possible; however, in an ideal world, we could process each socket
exactly once. There are some constraints that make this a bit more
difficult:

1) Despite batch resize attempts inside both bpf_iter_udp_batch() and
   bpf_iter_tcp_batch(), we have to deal with the possibility that our
   batch size cannot contain all items in a bucket at once.
2) We cannot hold a lock on the bucket between iterations, meaning that
   the structure can change in lots of interesting ways.

PROPOSAL
========
Can we achieve exactly-once semantics for socket iterators even in the
face of concurrent additions to or removals from the current bucket? If
we ignore the possibility of signed 64-bit rollover, then yes. This
series replaces the current offset-based scheme used for progress
tracking with a scheme based on a monotonically increasing version
number. It works as follows:

* Assign index numbers to sockets in the bucket's linked list such that
  they are monotonically increasing as you read from head to tail.

  * Every time a socket is added to a bucket, increment the hash
    table's version number, ver.
  * If the socket is being added to the head of the bucket's linked
    list, set sk->idx to -1*ver.
  * If the socket is being added to the tail of the bucket's linked
    list, set sk->idx to ver.

  Ex: append_head(C), append_head(B), append_tail(D), append_head(A),
  append_tail(E) results in the following state.

  A  -> B  -> C  -> D -> E
  -4    -2    -1    3    5

* As we iterate through a bucket, keep track of the last index number
  we've seen for that bucket, iter->prev_idx.
* On subsequent iterations, skip ahead in the bucket until we see a
  socket whose index, sk->idx, is greater than iter->prev_idx.

Since we always iterate from head to tail and indexes are always
increasing in that direction, we can be sure that any socket whose index
is greater than iter->prev_idx has not yet been seen. Any socket whose
index is less than or equal to iter->prev_idx has either been seen
before or was added since we last saw that bucket. In either case, it's
safe to skip it (any new sockets did not exist when we created the
iterator).
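
To make the bookkeeping concrete, here's a toy userspace model of the
scheme (a sketch of the idea only, not the actual patch; in the kernel
the index would live on the socket and the version counter on the hash
table):

#include <stdio.h>
#include <stddef.h>

struct snode { long long idx; struct snode *next; };

static long long ver;	/* hash table's version number */

static void add_head(struct snode **head, struct snode *sk)
{
	sk->idx = -(++ver);	/* head inserts get -1*ver */
	sk->next = *head;
	*head = sk;
}

static void add_tail(struct snode **head, struct snode *sk)
{
	sk->idx = ++ver;	/* tail inserts get ver */
	sk->next = NULL;
	while (*head)
		head = &(*head)->next;
	*head = sk;
}

/* Resume: the first socket whose index exceeds the last index seen. */
static struct snode *resume(struct snode *head, long long prev_idx)
{
	for (; head; head = head->next)
		if (head->idx > prev_idx)
			return head;
	return NULL;
}

int main(void)
{
	struct snode a, b, c, d, e, *head = NULL;

	/* The example above: A->B->C->D->E with indexes -4 -2 -1 3 5. */
	add_head(&head, &c);
	add_head(&head, &b);
	add_tail(&head, &d);
	add_head(&head, &a);
	add_tail(&head, &e);

	/* The last read stopped after A (prev_idx = -4); resuming lands
	 * on B even if A was closed or new sockets were added at the
	 * head in the meantime. */
	printf("resume at idx %lld\n", resume(head, -4)->idx);	/* -2 */
	return 0;
}

Removals need no extra bookkeeping in this scheme: a removed socket
simply disappears from the list, and since indexes are never reused,
iter->prev_idx still identifies the correct place to resume.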

SOME ALTERNATIVES
=================
1. One alternative I considered was simply counting the number of
   removals that have occurred per bucket, remembering this count
   between calls to bpf_iter_(tcp|udp)_batch() as part of the iterator
   state, then using it to detect whether any removals have occurred
   since the last read. If so, we would need to walk back iter->offset
   by at least that much to avoid skips. This approach is simpler but
   may repeat sockets; a rough sketch follows this list.
2. Don't allow partial batches; always make sure we capture all sockets
   in a bucket in a batch. bpf_iter_(tcp|udp)_batch() already have some
   logic to try one time to resize the batch, but as far as I know,
   this isn't viable, since we have to contend with the fact that
   bpf_iter_(tcp|udp)_realloc_batch() may not be able to grab more
   memory.
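
For reference, alternative 1 would look roughly like this (removal_cnt
and prev_removal_cnt are hypothetical fields used only for
illustration, not existing kernel state):

struct bucket_state { unsigned int removal_cnt; };
struct iter_state { unsigned int offset; unsigned int prev_removal_cnt; };

/* Walk the resume offset back by however many removals happened since
 * the last read so nothing is skipped, accepting possible repeats. */
static void walk_back_offset(struct iter_state *iter,
			     const struct bucket_state *b)
{
	unsigned int removed = b->removal_cnt - iter->prev_removal_cnt;

	iter->offset = removed < iter->offset ? iter->offset - removed : 0;
	iter->prev_removal_cnt = b->removal_cnt;
}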

Anyway, maybe everyone already knows this can happen and isn't
overly concerned, since the possibility of skips or repeats is small,
but I thought I'd highlight it just in case. It certainly seems like
something we'd want to avoid if we can help it, and with a few
adjustments, we can.

-Jordan

[1]: https://github.com/cilium/cilium/issues/37907
[2]: https://lore.kernel.org/bpf/[email protected]/
[3]: https://lore.kernel.org/netdev/[email protected]/
[4]: https://lore.kernel.org/netdev/[email protected]/