-
Notifications
You must be signed in to change notification settings - Fork 673
IP allocation design
See the requirements.
Basic idea: We divide the IP address space between all nodes, and adjust that division dynamically in order to ensure a supply of IPs to every nodes. This allows nodes to allocate and free individual IPs locally in a very efficient manner, generally without synchronous interaction with other nodes. It also provides some inherent partition tolerance.
Each node has two interfaces: a command interface which the container infrastructure (e.g. the Weave script, a Docker plugin) uses to request IP addresses, and a messaging interface which it uses to exchange information with other nodes on the network.
The key problems to solve are:
- How do nodes get hold of IP addresses when they start up or are running out of IPs?
- How do we avoid leaks of addresses due to nodes dying and other failures?
Our approach is founded on the following ideas:
- Reservations. Instead of dealing with individual IP addresses, we operate on them in groups, so called "reservations". These have a more space efficient representation than random collections of IPs, and include a notion of "merging" - certain reservations can be combined into larger reservations with a smaller space footprint.
- Heirs. We introduce a notion of "heirs" for IP addresses, extending it from their to reservations and, finally, nodes. The idea is that for every IP address we can identify a unique node which should reclaim it in the event it leaked.
- CRDT. All reservations in the system are recorded in a map from node ids to reservations. This map is a CRDT. Nodes only ever update their own entries (though they may delete other entries). This makes the data structure inherently convergent. Moreover, the algorithms which operate on the map can tolerate inconsistencies between entries.
- Gossipping. Updates to the CRDT are communicated between nodes through gossipping, which scales well and copes well with changing topologies and partitions. The gossipping is piggybacking on the weave router's peer gossiping and is therefore topology-aware, which decreases message loss and duplication. Due to the aforementioned tolerance for inconsistency between entries, the information contained in the CRDT can be gossipped incrementally, though we do require gossipping to propagate all information everywhere eventually. [1]
- Protocol. Nodes transfer reservations between each other through a mini messaging protocol. The protocol is conjoined with the CRDT gossiping, thus ensuring that the recipient has up to date information about the sender. The messaging is async, with no delivery or ordering guarantees.
[1] In practice this can only be guaranteed when the rate of change in the topology is below a certain threshold.
The allocator works within a single subnet, for example 10.0.0.0/8. This is specified to every peer at start-up, and must be consistent between peers.
We think of this IP space as a ring, with reservations representing segments of the ring labelled by the node holding the reservation. Reservations that are adjacent in the ring and have the same label can be merged. Gaps in the ring represent leaked IP addresses; we can think of the gaps as representing "lost reservations".
We require that every IP address has a unique "heir", in order to avoid ambiguity and races when cleaning up after node or network failure. For the simple reservation representation of segments on a ring, we define the heir to be the predecessor.
To illustrate, in the space 10.0.0.0/8, the heir of a segment beginning with address 10.5.1.32 is the segment ending with address 10.5.1.31. Since it is a ring the arithmetic wraps around: the heir of a segment beginning with address 10.0.0.0 is the segment ending with address 10.255.255.255.
The CRDT contains, for each node, a set of reservations and a boolean "has some free IPs" flag.
The CRDT is implemented with help of a vector clock of all map entries. When a node updates its own map entry, the clock for its entry is advanced. This allows gossipees to monotonically merge entries with their own, i.e. they will overwrite entries they hold which are older than those received.
A crucial property of this data structure and the protocol we describe below is that, despite nodes potentially having wildly different ideas of the reservations held by other nodes, no two nodes will simultaneously think that they hold a reservation containing the same IP.
The CRDT does not need to be consistent, e.g. it is ok for it contain multiple entries containing the same reservations. Consequently we can gossip individual entries rather than the entire map.
Each node sends out a gossip broadcast when it gains or loses a reservation, and when its state of 'has free space' changes.
Nodes are given an IP range when starting up. Each node must be given the same range, i.e. the range from which all nodes are meant to make their reservations. The range is part of the CRDT, in addition to the reservation map.
We deal with concurrent start-up through a process of leader election. In essence, the node with the lowest id claims the entire range for itself.
When a node starts up, it initialises the CRDT map to contain just an empty entry for itself. It then waits for map updates and/or commands. Until a node receives a command to allocate or claim an IP address, it does not care what else is going on, but it does need to keep track of it.
When a node first receives such a command, it consults its map. If it sees any other nodes that do claim to have some reservations, then it proceeds with normal allocation [see below]. Otherwise, if no such update has been received (i.e. we've either received no update or only updates with all node entries being empty), then the node starts a leader election. If this node has the lowest id, then the node claims the entire IP range for itself, setting its map entry to reservation(s) covering the entire range.
If it sees that another node has the lowest ID, it sends a message to that node proposing that it be leader. A node receiving such a message proceeds as above: only if it sees no other nodes with reservations and it sees itself as the node with the lowest ID does it take charge.
Note that the chosen leader can be a node other than the one with the overall lowest id. This happens in the event a node with a lower id starts up just as another node has made the decision to become leader. That is ok; the new node wouldn't itself claim to be leader until some time has passed, enough time to receive an update from the leader and thus preempt that.
Failures:
- two nodes that haven't communicated with each other yet can each decide to be leader -> todo?
- prospective leader dies before sending map -> detect node failure (see below)
- repeated appearance and sudden death (after sending map) of nodes with lower ids - perhaps make the MSB of the id a time-stamp? Then the max delay is bounded by the clock difference.
We have a "node recovery mode", where a node rejoins the network after the IP allocator (and/or perhaps other parts of the weave infrastructure) has been restarted. The container infrastructure (e.g. the weave script) tells it what IPs are locally allocated by inspecting the local containers, and it learns from the received map what its previous reservations were, and reclaims them.
When a node wants more IPs it contacts another node, asking it for a donation of reservations. We select the node which a) has some free IPs, and b) has the maximum number of reservations which can be merged with ours.[1]
[1] There is an assumption here that the ability of a node to donate mergeable reservations is reasonably correlated with what reservations it holds. This may not be the case when the reservations have been heavily fragmented by IP allocation and freeing. Therefore nodes should attempt to minimise such fragmentation in their alloc/free logic.
The chosen node replies with a list of reservations, selected with a preference for a) reservations that can be merged with the requester, and b) small reservations. Collectively, the reservations should add up to approximately half of the free IPs in reservations of that node. Special case: if the node only has one free IP, it does donate that, except when that is the only reservation it holds.
If the returned list is empty, the requester repeats the node selection process, contacting a different node this time (since the original requestee will now show as having no free IPs). Otherwise the requester updates its map entry to add the donated reservations, merged with existing reservations if possible.
When the map indicates no free IPs, the command to allocate an IP is failed back to the requestor.
Failures:
- request delayed --> requester gives up after a while and tries the next best node. Any subsequent response is ignored. -> we've leaked some IPs. See below.
- request lost
- target node dead
- target node unreachable --> requester gives up after a while and tries the next best node.
- response delayed --> requester should ignore it -> we've leaked some IPs. See below.
- response lost
- requester node dead
- requester node unreachable --> we've leaked some IPs. See below.
In some scenarios, such as recovery, a node may want to claim specific IPs. For every such IP, one of the following is the case:
- the IP is contained in a reservation held by this node
- the IP is contained in a reservation held by another node
- the IP is not contained in any reservation, i.e. it is a potentially leaked IP.
For (3), the node waits until the IP has been reclaimed, after which we are in (1) and (2).
For (1), all we need to do is mark the IP as allocated.
For (2), we talk to the other node in a protocol similar to reservation. The request contains all the IPs we want to claim. The response contains a list of reservations containing one or more of the given IPs.
If the returned list is empty, the accompanying map update will show the target node holding no reservations containing any of the IPs we want to claim. We repeat the process, talking to another node this time. Ultimately, we either manage to get reservations for all IPs, or we run out of nodes to talk to, i.e. we are left with some IPs we want to claim that are not contained in any reservation. See (3) above. Rinse, repeat.
If the IP is already allocated by another node, then the command is failed back to the requestor.
We could have no node shutdown protocol at all and instead handle this the same way as node failure (see below). But failure detection is comparatively expensive and takes time, during which the IPs in reservations of the failed node cannot be allocated by other nodes. So the protocol here is purely an optimisation.
When a node leaves, it donates its reservations to one other node by first removing its own map entry (this entails erecting a tombstone, see below), and then sending a message to the selected node, containing all the reservations. The target node is chosen based on its ability to merge the reservation with its own; essentially the reverse of the criteria used when selecting nodes for donation.
After sending the message, the node terminates - it does not wait for any response. Overall, there is very little work performed by the node on shutdown, very little, so as to not unduly delay it.
Failures:
- message lost
- target node dead
- target node unreachable --> we've leaked some IPs. See below.
By examining the entire map, a node can determine which reservations are lost. If it holds a reservation that is the heir of a lost reservation then it is the heir of that reservation and can claim that reservation as its own, updating its map entry accordingly. Since a node's knowledge of its own reservations is authoritative, and no two nodes simultaneously hold a reservation for the same IP, we can be sure that at most one[1] node will claim a lost reservation.
[1] recall the definition of the heir relationship, in particular the 'bridging' requirement. This guarantees that some lost reservation will have an heir. But not necessarily every lost reservation. However, transitively, once one lost reservation has been reclaimed by its heir, another remaining lost reservations will have an heir.
Nodes wait for a while before claiming a lost reservations. This is to allow for reservation transfers to occur as part of the reservation and node shutdown protocols, and for the result of that to propagate. Note that "waiting for a while" is more complex than it may first appear. We need to be able to reclaim reservations even when their heir nodes keep changing due to reservations getting transferred between nodes. One way to cope with that is for each node to maintain age information on the lost reservations, updating that information whenever receiving/performing a map update. Lost reservations of the same age can be merged, if permitted by the chosen reservation representation. Nodes then only reclaim lost reservations which have a recorded age beyond a certain threshold. This creates a dual trigger: reclaiming can happen when a node acquires a reservation (which is the heir of a sufficiently old lost reservation) or through the passage of time.
Failures:
- reservation reclaimed too early --> as a result at some point some node will attempt to construct a map containing overlapping reservations. That in itself does not indicate an erroneous condition though, since we do not require the map to be consistent. Nevertheless, any node holding a reservation can expect to eventually see its map updated to show no overlaps with these reservations. Therefore, genuine overlaps are detectable after all. And potentially recoverable: We could ask the affected nodes to talk to each other and attempt to divide the overlapping reservations between themselves. As long as the nodes haven't actually handed out the same IP then everything is fine. And if there is a genuine conflict then we could pause or kill one of the offending containers.
TODO Can we end up in a situation where a reservation moves rapidly between nodes, with none of corresponding map updates ever making it to the node which is the potential heir of that reservation? It would result in the reservation being reclaimed in error. Surely unlikely, but can we quantify this somehow?
If we haven't heard from a node for a while[1] then any other node can marks that node as dead by removing its entry from the map. Deletion is performed by erecting a tombstone - setting the vector clock slot of the map entry to the max value, thus ensuring that the map entry will always be chosen during a CRDT merge operation.
[1] This requires some heart-beating; which gossipping tends to incorporate anyway. The duration to wait before declaring a node as dead must be long enough to span likely partitioning periods.
Once a node has been marked as dead, its reservations are effectively leaked and can be reclaimed as explained above.
NB Bad things will happen if a node rejoins with the same id after it has been declared dead. A partition rejoin is one scenario where that can happen. We can detect that at least - the node will be receiving a map update with a tombstone for itself.
TODO We may want some way of garbage collecting tombstones.
It would be nice if we were able to identify a single node to perform this failure marking, so we don't needlessly perform the same cleanup operation on multiple nodes. Various options for identifying a single node:
- node with the lowest id overall - this is rather unbalanced
- node with the "next lowest" id compared to dead node - this can still be rather unbalanced, especially if ids have the time-stamp as the MSB.
Here's another idea... when marking a node as dead, we move its reservations into another part of the CRDT, which collects reservations from all dead nodes. Alive nodes can then remove reservations from that data structure for which they are the heir. It should be possible to construct a convergent data structure for this.