I was recently looking into using BPF socket iterators in conjunction
with the bpf_sock_destroy() kfunc as a means to forcefully destroy a
set of UDP sockets connected to a deleted backend [1]. The intent is to
use BPF iterators + kfuncs in lieu of INET_DIAG infrastructure to
destroy sockets in order to simplify Cilium's system requirements. Aditi
describes the scenario in [2], the patch series that introduced
bpf_sock_destroy() for this very purpose:

> This patch set adds the capability to destroy sockets in BPF. We plan
> to use the capability in Cilium to force client sockets to reconnect
> when their remote load-balancing backends are deleted. The other use
> case is on-the-fly policy enforcement where existing socket
> connections prevented by policies need to be terminated.

One would want and expect an iterator to visit every socket that existed
before the iterator was created, if not exactly once, then at least
once; otherwise, we could accidentally skip a socket that we intended to
destroy. With the iterator implementation as it exists today, this is
the behavior you would observe in the vast majority of cases.

However, in the process of reviewing [2] and some follow-up fixes to
bpf_iter_udp_batch() ([3] [4]) by Martin, it occurred to me that there
are situations where BPF socket iterators may repeat, or worse, skip
sockets altogether even if they existed prior to iterator creation,
making BPF iterators a slightly buggy mechanism for achieving the goal
stated above.

This RFC highlights some of these scenarios, extending
prog_tests/sock_iter_batch.c to illustrate the conditions under which
sockets can be skipped or repeated, and proposes a solution for
achieving exactly-once semantics for socket iterators in all cases, as
it relates to sockets that existed prior to the start of iteration.

I'm hoping to raise awareness of this issue generally if
it's not already common knowledge and get some feedback on the viability
of the proposed improvement.

THE PROBLEM
===========
Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket, which is a measure of the number of matching sockets
from the current bucket that have been seen or processed by the
iterator. On subsequent iterations, if the current bucket has
unprocessed items, we skip at least iter->offset matching items in the
bucket before adding any remaining items to the next batch. The intent
seems to be to skip any items we've already seen, but iter->offset
isn't always an accurate measure of "things already seen". There are a
variety of scenarios where the underlying bucket changes between reads,
leading to either repeated or skipped sockets. Two such scenarios are
illustrated below and reproduced by the selftests.

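To make that concrete, here is a heavily simplified sketch of the
offset-based resume step. The struct and helper names below are
hypothetical stand-ins rather than the actual code in
bpf_iter_(tcp|udp)_batch(), but they capture the shape of the logic:

/* Heavily simplified sketch of the existing offset-based resume step;
 * iter_state_sketch, iter_sk_match(), and batch_add_sk() are made-up
 * stand-ins for the real iterator state and helpers.
 */
struct iter_state_sketch {
	int offset;	/* matching sockets already seen in this bucket */
};

static bool iter_sk_match(struct iter_state_sketch *iter, struct sock *sk);
static bool batch_add_sk(struct iter_state_sketch *iter, struct sock *sk);

static void batch_bucket_sketch(struct iter_state_sketch *iter,
				struct hlist_head *bucket)
{
	struct sock *sk;
	int skipped = 0;

	hlist_for_each_entry(sk, bucket, sk_node) {
		if (!iter_sk_match(iter, sk))
			continue;
		/* Treat the first iter->offset matching sockets as ones
		 * we already processed and skip them. If sockets were
		 * removed ahead of that point since the last read, we
		 * skip sockets we never saw; if sockets were inserted
		 * at the head, we repeat sockets we already saw.
		 */
		if (skipped++ < iter->offset)
			continue;
		if (!batch_add_sk(iter, sk))
			break;
	}
}

The tables below show how that goes wrong.
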
Skip A Socket
+------+--------------------+--------------+---------------+
| Time | Event              | Bucket State | Bucket Offset |
+------+--------------------+--------------+---------------+
| 1    | read(iter_fd) -> A | A->B->C->D   | 1             |
| 2    | close(A)           | B->C->D      | 1             |
| 3    | read(iter_fd) -> C | B->C->D      | 2             |
| 4    | read(iter_fd) -> D | B->C->D      | 3             |
| 5    | read(iter_fd) -> 0 | B->C->D      | -             |
+------+--------------------+--------------+---------------+

Iteration sees these sockets: [A, C, D]
B is skipped.

Repeat A Socket
+------+--------------------+---------------+---------------+
| Time | Event              | Bucket State  | Bucket Offset |
+------+--------------------+---------------+---------------+
| 1    | read(iter_fd) -> A | A->B->C->D    | 1             |
| 2    | connect(E)         | E->A->B->C->D | 1             |
| 3    | read(iter_fd) -> A | E->A->B->C->D | 2             |
| 4    | read(iter_fd) -> B | E->A->B->C->D | 3             |
| 5    | read(iter_fd) -> C | E->A->B->C->D | 4             |
| 6    | read(iter_fd) -> D | E->A->B->C->D | 5             |
| 7    | read(iter_fd) -> 0 | E->A->B->C->D | -             |
+------+--------------------+---------------+---------------+

Iteration sees these sockets: [A, A, B, C, D]
A is repeated.

If we consider corner cases like these, semantics are neither
at-most-once, nor at-least-once, nor exactly-once. Repeating a socket
during iteration is perhaps less problematic than skipping it
altogether as long as the BPF program is aware that duplicates are
possible; however, in an ideal world, we could process each socket
exactly once. There are some constraints that make this a bit more
difficult:

1) Despite batch resize attempts inside both bpf_iter_udp_batch() and
   bpf_iter_tcp_batch(), we have to deal with the possibility that our
   batch size cannot contain all items in a bucket at once.
2) We cannot hold a lock on the bucket between iterations, meaning that
   the structure can change in lots of interesting ways.

PROPOSAL
========
Can we achieve exactly-once semantics for socket iterators even in the
face of concurrent additions to or removals from the current bucket? If
we ignore the possibility of signed 64-bit rollover, then yes. This
series replaces the current offset-based scheme used for progress
tracking with a scheme based on a monotonically increasing version
number. It works as follows:

* Assign index numbers to sockets in the bucket's linked list such that
  they are monotonically increasing as you read from head to tail (a
  rough sketch of this insert-side bookkeeping follows the list).

  * Every time a socket is added to a bucket, increment the hash
    table's version number, ver.
  * If the socket is being added to the head of the bucket's linked
    list, set sk->idx to -1*ver.
  * If the socket is being added to the tail of the bucket's linked
    list, set sk->idx to ver.

  Ex: append_head(C), append_head(B), append_tail(D), append_head(A),
  append_tail(E) results in the following state.

       A -> B -> C -> D -> E
      -4   -2   -1    3    5
* As we iterate through a bucket, keep track of the last index number
  we've seen for that bucket, iter->prev_idx.
* On subsequent iterations, skip ahead in the bucket until we see a
  socket whose index, sk->idx, is greater than iter->prev_idx.

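As a minimal illustration of that insert-side bookkeeping, assuming a
new per-table version counter and a new per-socket index (modeled here
with stand-in types rather than the real struct sock and hash table
structures):

/* Hypothetical sketch of the insert-side bookkeeping. The version
 * counter and per-socket index are assumed new fields, modeled with
 * stand-in types instead of the real socket/hash table structures.
 */
struct sk_entry_sketch {
	s64 idx;			/* stands in for sk->idx */
	struct hlist_node node;
};

struct table_sketch {
	atomic64_t ver;			/* bumped on every insert */
};

static void add_head_sketch(struct table_sketch *tbl,
			    struct hlist_head *bucket,
			    struct sk_entry_sketch *e)
{
	/* Head inserts get a negative index so they sort before every
	 * socket already in the bucket.
	 */
	e->idx = -atomic64_inc_return(&tbl->ver);
	hlist_add_head_rcu(&e->node, bucket);
}

static void add_tail_sketch(struct table_sketch *tbl,
			    struct hlist_head *bucket,
			    struct sk_entry_sketch *e)
{
	/* Tail inserts get a positive index so they sort after every
	 * socket already in the bucket.
	 */
	e->idx = atomic64_inc_return(&tbl->ver);
	hlist_add_tail_rcu(&e->node, bucket);
}

Running the append_head()/append_tail() sequence from the example above
through these helpers produces the A(-4) through E(5) layout shown.
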
Since we always iterate from head to tail and indexes are always
increasing in that direction, we can be sure that any socket whose index
is greater than iter->prev_idx has not yet been seen. Any socket whose
index is less than or equal to iter->prev_idx has either been seen
before or was added since we last saw that bucket. In either case, it's
safe to skip them (any new sockets did not exist when we created the
iterator).

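On the iterator side, the resume step would then skip by index instead
of by count. Again a rough sketch, reusing the sk_entry_sketch stand-in
from above along with hypothetical iterator state and helpers:

/* Hypothetical sketch of the index-based resume step that would replace
 * the offset-based skip; the iterator state and helpers are stand-ins.
 */
struct idx_iter_sketch {
	s64 prev_idx;	/* highest index batched from this bucket so far */
};

static bool iter_sk_match_idx(struct idx_iter_sketch *iter,
			      struct sk_entry_sketch *e);
static bool batch_add_idx(struct idx_iter_sketch *iter,
			  struct sk_entry_sketch *e);

static void batch_bucket_by_idx(struct idx_iter_sketch *iter,
				struct hlist_head *bucket)
{
	struct sk_entry_sketch *e;

	hlist_for_each_entry(e, bucket, node) {
		if (!iter_sk_match_idx(iter, e))
			continue;
		/* Indexes increase from head to tail, so anything at or
		 * below prev_idx was either already visited or was
		 * inserted at the head after the iterator was created;
		 * both are safe to skip.
		 */
		if (e->idx <= iter->prev_idx)
			continue;
		if (!batch_add_idx(iter, e))
			break;
		iter->prev_idx = e->idx;
	}
}

Presumably prev_idx would start at S64_MIN for a bucket we haven't
visited yet so that the first pass batches from the head.
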
SOME ALTERNATIVES
=================
1. One alternative I considered was simply counting the number of
   removals that have occurred per bucket, remembering this count
   between calls to bpf_iter_(tcp|udp)_batch() as part of the iterator
   state, and using it to detect whether the bucket has changed. If any
   removals have occurred, we would need to walk iter->offset back by
   at least that much to avoid skips. This approach is simpler but may
   repeat sockets (a rough sketch follows this list).
2. Don't allow partial batches; always make sure we capture all sockets
   in a bucket in a batch. bpf_iter_(tcp|udp)_batch() already have some
   logic to try one time to resize the batch, but as far as I know,
   this isn't viable, since we have to contend with the fact that
   bpf_iter_(tcp|udp)_realloc_batch() may not be able to grab more
   memory.

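For completeness, alternative 1 amounts to bookkeeping roughly like the
following (hypothetical fields again); it bounds skips but, as noted,
still allows repeats:

/* Hypothetical sketch of alternative 1: rewind iter->offset by however
 * many removals the bucket has seen since we last left it, trading
 * possible repeats for no skips.
 */
struct removal_iter_sketch {
	int offset;		/* matching sockets already seen */
	u32 prev_removals;	/* bucket removal count at last visit */
};

static void rewind_offset_for_removals(struct removal_iter_sketch *iter,
				       u32 bucket_removals)
{
	u32 removed = bucket_removals - iter->prev_removals;

	if (removed < iter->offset)
		iter->offset -= removed;
	else
		iter->offset = 0;
	iter->prev_removals = bucket_removals;
}
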
Anyway, maybe everyone already knows this can happen and isn't
overly concerned, since the possibility of skips or repeats is small,
but I thought I'd highlight the possibility just in case. It certainly
seems like something we'd want to avoid if we can help it, and with a
few adjustments, we can.

-Jordan

[1]: https://github.com/cilium/cilium/issues/37907
[2]: https://lore.kernel.org/bpf/[email protected]/
[3]: https://lore.kernel.org/netdev/[email protected]/
[4]: https://lore.kernel.org/netdev/[email protected]/