I was recently looking into using BPF socket iterators in conjunction
with the bpf_sock_destroy() kfunc as a means to forcefully destroy a
set of UDP sockets connected to a deleted backend [1]. The intent is to
use BPF iterators + kfuncs in lieu of INET_DIAG infrastructure to
destroy sockets in order to simplify Cilium's system requirements. Aditi
describes the scenario in [2], the patch series that introduced
bpf_sock_destroy() for this very purpose:

> This patch set adds the capability to destroy sockets in BPF. We plan
> to use the capability in Cilium to force client sockets to reconnect
> when their remote load-balancing backends are deleted. The other use
> case is on-the-fly policy enforcement where existing socket
> connections prevented by policies need to be terminated.

One would want and expect an iterator to visit every socket that existed
before the iterator was created, if not exactly once, then at least
once; otherwise, we could accidentally skip a socket that we intended to
destroy. With the iterator implementation as it exists today, this is
the behavior you would observe in the vast majority of cases.

However, in the process of reviewing [2] and some follow-up fixes to
bpf_iter_udp_batch() ([3] [4]) by Martin, it occurred to me that there
are situations where BPF socket iterators may repeat, or worse, skip
sockets altogether even if they existed prior to iterator creation,
making BPF iterators a slightly buggy mechanism for achieving the goal
stated above.

This RFC highlights some of these scenarios, extending
prog_tests/sock_iter_batch.c to illustrate the conditions under which
sockets can be skipped or repeated, and proposes a solution for
achieving exactly-once semantics for socket iterators in all cases, as
it relates to sockets that existed prior to the start of iteration.

I'm hoping to raise awareness of this issue generally if
it's not already common knowledge and get some feedback on the viability
of the proposed improvement.

THE PROBLEM
===========
Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket, which is a measure of the number of matching sockets
from the current bucket that have been seen or processed by the
iterator. On subsequent iterations, if the current bucket has
unprocessed items, we skip at least iter->offset matching items in the
bucket before adding any remaining items to the next batch. The intent
seems to be to skip any items we've already seen, but iter->offset
isn't always an accurate measure of "things already seen". There are a
variety of scenarios where the underlying bucket changes between reads,
leading to either repeated or skipped sockets. Two such scenarios are
illustrated below and reproduced by the selftests.

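To make that concrete, here is a heavily simplified sketch of the
offset-based resume step. The struct and helper names below are
hypothetical stand-ins rather than the actual code in
bpf_iter_(tcp|udp)_batch(), but they capture the shape of the logic:

/* Heavily simplified sketch of the existing offset-based resume step;
 * iter_state_sketch, iter_sk_match(), and batch_add_sk() are made-up
 * stand-ins for the real iterator state and helpers.
 */
struct iter_state_sketch {
	int offset;	/* matching sockets already seen in this bucket */
};

static bool iter_sk_match(struct iter_state_sketch *iter, struct sock *sk);
static bool batch_add_sk(struct iter_state_sketch *iter, struct sock *sk);

static void batch_bucket_sketch(struct iter_state_sketch *iter,
				struct hlist_head *bucket)
{
	struct sock *sk;
	int skipped = 0;

	hlist_for_each_entry(sk, bucket, sk_node) {
		if (!iter_sk_match(iter, sk))
			continue;
		/* Treat the first iter->offset matching sockets as ones
		 * we already processed and skip them. If sockets were
		 * removed ahead of that point since the last read, we
		 * skip sockets we never saw; if sockets were inserted
		 * at the head, we repeat sockets we already saw.
		 */
		if (skipped++ < iter->offset)
			continue;
		if (!batch_add_sk(iter, sk))
			break;
	}
}

The tables below show how that goes wrong.
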
Skip A Socket
+------+--------------------+--------------+---------------+
| Time | Event              | Bucket State | Bucket Offset |
+------+--------------------+--------------+---------------+
| 1    | read(iter_fd) -> A | A->B->C->D   | 1             |
| 2    | close(A)           | B->C->D      | 1             |
| 3    | read(iter_fd) -> C | B->C->D      | 2             |
| 4    | read(iter_fd) -> D | B->C->D      | 3             |
| 5    | read(iter_fd) -> 0 | B->C->D      | -             |
+------+--------------------+--------------+---------------+

Iteration sees these sockets: [A, C, D]
B is skipped.

Repeat A Socket
+------+--------------------+---------------+---------------+
| Time | Event              | Bucket State  | Bucket Offset |
+------+--------------------+---------------+---------------+
| 1    | read(iter_fd) -> A | A->B->C->D    | 1             |
| 2    | connect(E)         | E->A->B->C->D | 1             |
| 3    | read(iter_fd) -> A | E->A->B->C->D | 2             |
| 4    | read(iter_fd) -> B | E->A->B->C->D | 3             |
| 5    | read(iter_fd) -> C | E->A->B->C->D | 4             |
| 6    | read(iter_fd) -> D | E->A->B->C->D | 5             |
| 7    | read(iter_fd) -> 0 | E->A->B->C->D | -             |
+------+--------------------+---------------+---------------+

Iteration sees these sockets: [A, A, B, C, D]
A is repeated.

If we consider corner cases like these, semantics are neither
at-most-once, nor at-least-once, nor exactly-once. Repeating a socket
during iteration is perhaps less problematic than skipping it
altogether as long as the BPF program is aware that duplicates are
possible; however, in an ideal world, we could process each socket
exactly once. There are some constraints that make this a bit more
difficult:

1) Despite batch resize attempts inside both bpf_iter_udp_batch() and
   bpf_iter_tcp_batch(), we have to deal with the possibility that our
   batch size cannot contain all items in a bucket at once.
2) We cannot hold a lock on the bucket between iterations, meaning that
   the structure can change in lots of interesting ways.

PROPOSAL
========
Can we achieve exactly-once semantics for socket iterators even in the
face of concurrent additions to or removals from the current bucket? If
we ignore the possibility of signed 64-bit rollover, then yes. This
series replaces the current offset-based scheme used for progress
tracking with a scheme based on a monotonically increasing version
number. It works as follows:

* Assign index numbers to sockets in the bucket's linked list such that
  they are monotonically increasing as you read from head to tail (a
  rough sketch of this insert-side bookkeeping follows the list).

  * Every time a socket is added to a bucket, increment the hash
    table's version number, ver.
  * If the socket is being added to the head of the bucket's linked
    list, set sk->idx to -1*ver.
  * If the socket is being added to the tail of the bucket's linked
    list, set sk->idx to ver.

  Ex: append_head(C), append_head(B), append_tail(D), append_head(A),
  append_tail(E) results in the following state.

       A -> B -> C -> D -> E
      -4   -2   -1    3    5
* As we iterate through a bucket, keep track of the last index number
  we've seen for that bucket, iter->prev_idx.
* On subsequent iterations, skip ahead in the bucket until we see a
  socket whose index, sk->idx, is greater than iter->prev_idx.

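As a minimal illustration of that insert-side bookkeeping, assuming a
new per-table version counter and a new per-socket index (modeled here
with stand-in types rather than the real struct sock and hash table
structures):

/* Hypothetical sketch of the insert-side bookkeeping. The version
 * counter and per-socket index are assumed new fields, modeled with
 * stand-in types instead of the real socket/hash table structures.
 */
struct sk_entry_sketch {
	s64 idx;			/* stands in for sk->idx */
	struct hlist_node node;
};

struct table_sketch {
	atomic64_t ver;			/* bumped on every insert */
};

static void add_head_sketch(struct table_sketch *tbl,
			    struct hlist_head *bucket,
			    struct sk_entry_sketch *e)
{
	/* Head inserts get a negative index so they sort before every
	 * socket already in the bucket.
	 */
	e->idx = -atomic64_inc_return(&tbl->ver);
	hlist_add_head_rcu(&e->node, bucket);
}

static void add_tail_sketch(struct table_sketch *tbl,
			    struct hlist_head *bucket,
			    struct sk_entry_sketch *e)
{
	/* Tail inserts get a positive index so they sort after every
	 * socket already in the bucket.
	 */
	e->idx = atomic64_inc_return(&tbl->ver);
	hlist_add_tail_rcu(&e->node, bucket);
}

Running the append_head()/append_tail() sequence from the example above
through these helpers produces the A(-4) through E(5) layout shown.
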
Since we always iterate from head to tail and indexes are always
increasing in that direction, we can be sure that any socket whose index
is greater than iter->prev_idx has not yet been seen. Any socket whose
index is less than or equal to iter->prev_idx has either been seen
before or was added since we last saw that bucket. In either case, it's
safe to skip them (any new sockets did not exist when we created the
iterator).

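On the iterator side, the resume step would then skip by index instead
of by count. Again a rough sketch, reusing the sk_entry_sketch stand-in
from above along with hypothetical iterator state and helpers:

/* Hypothetical sketch of the index-based resume step that would replace
 * the offset-based skip; the iterator state and helpers are stand-ins.
 */
struct idx_iter_sketch {
	s64 prev_idx;	/* highest index batched from this bucket so far */
};

static bool iter_sk_match_idx(struct idx_iter_sketch *iter,
			      struct sk_entry_sketch *e);
static bool batch_add_idx(struct idx_iter_sketch *iter,
			  struct sk_entry_sketch *e);

static void batch_bucket_by_idx(struct idx_iter_sketch *iter,
				struct hlist_head *bucket)
{
	struct sk_entry_sketch *e;

	hlist_for_each_entry(e, bucket, node) {
		if (!iter_sk_match_idx(iter, e))
			continue;
		/* Indexes increase from head to tail, so anything at or
		 * below prev_idx was either already visited or was
		 * inserted at the head after the iterator was created;
		 * both are safe to skip.
		 */
		if (e->idx <= iter->prev_idx)
			continue;
		if (!batch_add_idx(iter, e))
			break;
		iter->prev_idx = e->idx;
	}
}

Presumably prev_idx would start at S64_MIN for a bucket we haven't
visited yet so that the first pass batches from the head.
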
SOME ALTERNATIVES
=================
1. One alternative I considered was simply counting the number of
   removals that have occurred per bucket, remembering this count
   between calls to bpf_iter_(tcp|udp)_batch() as part of the iterator
   state, and using it to detect whether the bucket has changed. If any
   removals have occurred, we would need to walk iter->offset back by
   at least that much to avoid skips. This approach is simpler but may
   repeat sockets (a rough sketch follows this list).
2. Don't allow partial batches; always make sure we capture all sockets
   in a bucket in a batch. bpf_iter_(tcp|udp)_batch() already have some
   logic to try one time to resize the batch, but as far as I know,
   this isn't viable, since we have to contend with the fact that
   bpf_iter_(tcp|udp)_realloc_batch() may not be able to grab more
   memory.

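For completeness, alternative 1 amounts to bookkeeping roughly like the
following (hypothetical fields again); it bounds skips but, as noted,
still allows repeats:

/* Hypothetical sketch of alternative 1: rewind iter->offset by however
 * many removals the bucket has seen since we last left it, trading
 * possible repeats for no skips.
 */
struct removal_iter_sketch {
	int offset;		/* matching sockets already seen */
	u32 prev_removals;	/* bucket removal count at last visit */
};

static void rewind_offset_for_removals(struct removal_iter_sketch *iter,
				       u32 bucket_removals)
{
	u32 removed = bucket_removals - iter->prev_removals;

	if (removed < iter->offset)
		iter->offset -= removed;
	else
		iter->offset = 0;
	iter->prev_removals = bucket_removals;
}
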
Anyway, maybe everyone already knows this can happen and isn't
overly concerned, since the possibility of skips or repeats is small,
but I thought I'd highlight the possibility just in case. It certainly
seems like something we'd want to avoid if we can help it, and with a
few adjustments, we can.

-Jordan

[1]: https://github.com/cilium/cilium/issues/37907
[2]: https://lore.kernel.org/bpf/[email protected]/
[3]: https://lore.kernel.org/netdev/[email protected]/
[4]: https://lore.kernel.org/netdev/[email protected]/