This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

IP allocation design

Matthias Radestock edited this page Dec 15, 2014 · 22 revisions

Basic idea: We divide the IP address space between all nodes, and adjust that division dynamically in order to ensure a supply of IPs to every node. This allows nodes to allocate and free individual IPs locally in a very efficient manner, generally without synchronous interaction with other nodes. It also provides some inherent partition tolerance.

The key problems to solve are:

  • How do nodes get hold of IP addresses when they start up or run out of IPs?
  • How do we avoid leaks of addresses due to nodes dying and other failures?

Our approach is founded on the following ideas:

  1. Reservations. Instead of dealing with individual IP addresses, we operate on them in groups, so called "reservations". These have a more space efficient representation than random collections of IPs, and include a notion of "merging" - certain reservations can be combined into larger reservations with a smaller space footprint.
  2. Heirs. We introduce a notion of "heirs" for IP addresses, extending it from there to reservations and, finally, nodes. The idea is that for every IP address we can identify a unique node which should reclaim it in the event it is leaked.
  3. CRDT. All reservations in the system are recorded in a map from node ids to reservations. This map is a CRDT. Nodes only ever update their own entries (though they may delete other entries). This makes the data structure inherently convergent. Moreover, the algorithms which operate on the map can tolerate inconsistencies between entries.
  4. Gossipping. Updates to the CRDT are communicated between nodes through gossipping, which scales well and copes well with changing topologies and partitions. The gossipping piggybacks on the weave router's peer gossiping and is therefore topology-aware, which decreases message loss and duplication. Due to the aforementioned tolerance for inconsistency between entries, the information contained in the CRDT can be gossipped incrementally, though we do require gossipping to propagate all information everywhere eventually. [1]
  5. Protocol. Nodes transfer reservations between each other through a mini messaging protocol. The protocol is conjoined with the CRDT gossiping, thus ensuring that the recipient has up to date information about the sender. The messaging is async, with no delivery or ordering guarantees.

[1] In practice this can only be guaranteed when the rate of change in the topology is below a certain threshold.

reservations

There are several suitable representations for reservations.

One quite simple and straightforward scheme is to think of the IP space as a ring, with reservations representing segments of the ring labelled by the node holding the reservation. Reservations that are adjacent in the ring and have the same label can be merged. Gaps in the ring represent leaked IP addresses; we can think of the gaps as representing "lost reservations".
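As an illustrative sketch only (not the actual implementation), the ring scheme can be modelled with half-open segments labelled by node id; all names here are made up for the example:

```python
def merge(segments):
    """Merge adjacent same-label segments.

    segments: list of (start, end, node_id) tuples over the IP ring,
    with end exclusive. Adjacent segments with the same label combine
    into one, giving the more space-efficient representation."""
    merged = []
    for start, end, node in sorted(segments):
        if merged and merged[-1][1] == start and merged[-1][2] == node:
            merged[-1] = (merged[-1][0], end, node)
        else:
            merged.append((start, end, node))
    return merged

def gaps(segments, ring_size):
    """Uncovered ranges of the ring, i.e. the "lost reservations"."""
    lost, pos = [], 0
    for start, end, _ in sorted(segments):
        if start > pos:
            lost.append((pos, start))
        pos = end
    if pos < ring_size:
        lost.append((pos, ring_size))
    return lost
```

For example, `merge([(8, 16, 'a'), (0, 8, 'a'), (16, 24, 'b')])` collapses node a's two adjacent segments into `(0, 16, 'a')`, and `gaps` over the result reports everything from 24 to the end of the ring as lost.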

For a possibly more compact representation, which also happens to neatly align with sub-nets and CIDR notation, think of the IP space as a binary tree, with reservations being represented as leaves labelled by the node holding the reservation for all IPs below it. Unlabelled leaves then represent lost IP addresses. Non-leaf nodes get the label of the left-most labelled leaf of their children. The heir of a lost reservation is the node identified by the label on the sibling of the unlabelled leaf corresponding to the lost reservation.

heirs

We require that every IP address has a unique heir. Formally,

Forall x,y,z in U. heir(x,z) & heir(y,z) -> x=y

where U is the universal set of all IP addresses we are dealing with, aka 'IP space', and heir(x,y) reads as "x is the heir of y".

Furthermore, we require that the heir relationship bridges any possible division of the IP space:

X+Y=U & X#Y=0 & X=/=0 & Y=/=0 -> Exists x in X, y in Y. heir(x,y)

where + is set union and # is set intersection.

(This does require the IP space to contain at least two addresses. It also ensures that an IP cannot be its own heir).

For the simple reservation representation of segments on a ring, we can define the heir to be the predecessor. This satisfies the above criteria; the proof is left as an exercise for the reader.

We can extend the heir relationship from individual IP addresses to reservations, by defining it as the predecessor relationship on ring segments.
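Continuing the illustrative ring sketch from above (hypothetical names, not the implementation), predecessor-heirship amounts to finding the segment whose end coincides with the start of the lost segment, wrapping around the ring:

```python
def heir(segments, lost_start, ring_size):
    """Return the node id holding the ring segment that immediately
    precedes a lost segment starting at lost_start, i.e. the segment
    whose (exclusive) end equals lost_start modulo the ring size.

    segments: list of (start, end, node_id) tuples, end exclusive.
    Returns None if no held segment precedes the lost one."""
    for _, end, node in segments:
        if end % ring_size == lost_start % ring_size:
            return node
    return None
```

E.g. with segments `[(0, 8, 'a'), (16, 24, 'b')]` on a ring of 32, the gap starting at 8 is heired by node a, the gap starting at 24 by node b; and a segment ending at 32 wraps around to heir a gap starting at 0.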

CRDT and gossipping

The CRDT is implemented with the help of a vector clock over all map entries. When a node updates its own map entry, the clock for its entry is advanced. This allows gossipees to monotonically merge entries with their own, i.e. they will overwrite entries they hold which are older than those received.

A crucial property of this data structure and the protocol we describe below is that, despite nodes potentially having wildly different ideas of the reservations held by other nodes, no two nodes will simultaneously think that they hold a reservation containing the same IP.

The CRDT does not need to be consistent, e.g. it is ok for it to contain multiple entries containing the same reservations. Consequently we can gossip individual entries rather than the entire map.
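The per-entry merge rule can be sketched as follows (an illustrative model, assuming a simple per-entry clock value rather than the full vector clock):

```python
def merge_entry(local_map, node_id, clock, reservations):
    """Monotonically merge one received map entry into local_map.

    local_map: node_id -> (clock, reservations). The received entry
    replaces the local one only if its clock is newer, so repeated or
    out-of-order gossip of individual entries is harmless."""
    have = local_map.get(node_id)
    if have is None or clock > have[0]:
        local_map[node_id] = (clock, reservations)
```

Because the merge is idempotent and commutative per entry, it does not matter in which order, or how often, entries arrive: the map converges.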

TODO figure out exactly what should be gossipped when; we need to ensure that all entries get propagated everywhere eventually, with a propagation delay no worse than log(number-of-nodes) and per-node message rates and volumes no worse than log(number-of-nodes).

initialisation

Nodes are given an IP range when starting up. We assume that each node is given the same range, i.e. the range from which all nodes are meant to make their reservations. The range is part of the CRDT, in addition to the reservation map.

TODO deal with nodes being given different ranges and multiple ranges; the effect should be that all ranges specified anywhere are used for reservations everywhere.

We need to deal with concurrent start-up. We do this by electing a leader - the node with the lowest id - that claims the entire range for itself.

When a node starts up, it initialises the CRDT map to contain just an empty entry for itself. It then waits for map updates. If it receives an update containing entries for other nodes that do claim to have some reservations, then it proceeds with normal allocation [see below]. Otherwise, if, after a while [1], no such update has been received (i.e. we've either received no update or only updates with all node entries being empty [2]), and the node is the node with the lowest id, then the node claims the entire IP range for itself, setting its map entry to reservation(s) covering the entire range.

[1] This time can be quite short; it needs to be just long enough for any alive nodes to gossip their state. Unless we want to cater for situations where a partition has occurred and all nodes on our side were subsequently stopped, or we are trying to add a new node to a network of existing nodes which is presently unreachable. We could conceivably have two start-up modes for nodes; one which permits the above range grabbing, and another where we simply wait for a non-empty map. This mode selection could happen based on whether any IP ranges were specified on start-up. However, one issue here is that weave backs off connection re-establishment quite aggressively, so a node that starts up and cannot establish any outbound connections due to firewalls, may have to wait a very long time for inbound connections to occur. TODO work this out.

[2] Ordinarily nodes will always hold some reservations, i.e. once a node has managed to get hold of some reservations it is never going to transfer them all entirely. Therefore the situation of all node entries being empty only arises on formation of a new network.

Note that the chosen leader can be a node other than the one with the lowest id. This happens in the event a node with a lower id starts up just as another node has made the decision to become leader. That is ok; the new node wouldn't itself claim to be leader until some time has passed, enough time to receive an update from the leader and thus preempt that.
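The start-up decision described above can be summarised as follows (a toy model with invented names; `wait_expired` stands for the "after a while" timer from [1]):

```python
def startup_decision(my_id, received_map, wait_expired):
    """What a freshly started node should do next.

    received_map: node_id -> list of reservations, as learned via
    gossip (including our own, initially empty, entry)."""
    # someone already holds reservations -> normal allocation protocol
    if any(res for node, res in received_map.items() if node != my_id):
        return "allocate"
    # all entries empty, timer expired, and we have the lowest id ->
    # become leader and claim the entire IP range
    if wait_expired and all(node >= my_id for node in received_map):
        return "claim-entire-range"
    # otherwise keep waiting for map updates
    return "wait"
```

Note how a node that is not the lowest id keeps waiting even after the timer expires, which is what allows a just-started lower-id node to preempt it, and conversely why a leader that has already claimed the range cannot be preempted: its non-empty entry sends everyone else down the "allocate" path.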

Failures:

  • prospective leader dies before sending map -> detect node failure (see below)
  • repeated appearance and sudden death (after sending map) of nodes with lower ids - perhaps make the MSB of the id a time-stamp? Then the max delay is bounded by the clock difference.

TODO We probably want to allow for a "node recovery mode", where a node rejoins the network after the IP allocator (and/or perhaps other parts of the weave infrastructure) has been restarted. A node could learn what IPs are locally allocated by inspecting the local containers, and then learn from the received map what its previous reservations were.

reservation

When a node wants more IPs it contacts another node, asking it for a donation of reservations. The criteria for selecting that node are:

  • proximity
  • number of available IPs
  • mergeability of donations with our existing reservations, so that we can reduce fragmentation

TODO figure out what selection criteria we should use. This also has an impact on what additional data about nodes we need to include in the CRDT, and the update frequency of that.

For now, we select the node which a) has some free IPs, and b) has the maximum number of reservations which can be merged with ours.[1] This requires the addition of just a boolean "has some free IPs" flag to the CRDT.

[1] There is an assumption here that the ability of a node to donate mergeable reservations is reasonably correlated with what reservations it holds. This may not be the case when the reservations have been heavily fragmented by IP allocation and freeing. Therefore nodes should attempt to minimise such fragmentation in their alloc/free logic.

The chosen node replies with a list of reservations, selected with a preference for a) reservations that can be merged with the requester's, and b) small reservations. Collectively, the reservations should add up to approximately half of the free IPs held by that node. Special case: if the node has only one free IP, it donates that too, except when it is the only reservation the node holds.

If the returned list is empty, the requester repeats the node selection process, contacting a different node this time (since the original requestee will now show as having no free IPs). Otherwise the requester updates its map entry to add the donated reservations, merged with existing reservations if possible.

When the map indicates that no node has free IPs, we wait for a map update that indicates a change of conditions.
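The donor-selection criterion described above - has free IPs, maximises mergeability - can be sketched like this (illustrative names; `mergeability` stands for whatever scoring function we settle on in the TODO above):

```python
def choose_donor(entries, mergeability):
    """Pick a node to request a donation from.

    entries: node_id -> has_free_ips flag (the boolean added to the
    CRDT). mergeability(node_id) scores how well that node's
    reservations are expected to merge with our own; both the flag and
    the scoring function are assumptions for this sketch."""
    candidates = [n for n, has_free in entries.items() if has_free]
    if not candidates:
        return None   # no free IPs anywhere: wait for a map update
    return max(candidates, key=mergeability)
```

On a failed or empty response, the requester simply calls this again; the previous donor's flag will by then show no free IPs, so a different candidate is picked.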

Failures:

  • request delayed --> requester gives up after a while and tries the next best node. Any subsequent response is ignored. -> we've leaked some IPs. See below.
  • request lost
  • target node dead
  • target node unreachable --> requester gives up after a while and tries the next best node.
  • response delayed --> requester should ignore it -> we've leaked some IPs. See below.
  • response lost
  • requester node dead
  • requester node unreachable --> we've leaked some IPs. See below.

TODO handle case where we run out of "next best nodes". Probably we should just sleep and try again, essentially waiting for either better luck next time (i.e. less message loss/delay), some node recovery / partition-healing, or a map update.

TODO might it be worth considering proactive donation, perhaps based on gossiped "time until I run out of IPs based on current allocation rate"?

node shutdown

We could have no node shutdown protocol at all and instead handle this the same way as node failure (see below). But failure detection is comparatively expensive and takes time, during which the IPs in reservations of the failed node cannot be allocated by other nodes. So the protocol here is purely an optimisation.

When a node leaves, it donates its reservations to one other node by first removing its own map entry (this entails erecting a tombstone, see below), and then sending a message to the selected node, containing all the reservations. The target node is chosen based on its ability to merge the reservation with its own; essentially the reverse of the criteria used when selecting nodes for donation.

After sending the message, the node terminates - it does not wait for any response. Overall, very little work is performed by the node on shutdown, so as not to unduly delay it.

Failures:

  • message lost
  • target node dead
  • target node unreachable --> we've leaked some IPs. See below.

reclaiming leaked IPs

By examining the entire map, a node can determine which reservations are lost. If it holds the reservation that is the heir of a lost reservation, then it is the heir of that lost reservation and can claim it as its own, updating its map entry accordingly. Since a node's knowledge of its own reservations is authoritative, and no two nodes simultaneously hold a reservation for the same IP, we can be sure that at most one[1] node will claim a lost reservation.

[1] Recall the definition of the heir relationship, in particular the 'bridging' requirement. This guarantees that some lost reservation will have an heir - but not necessarily every lost reservation. However, transitively, once one lost reservation has been reclaimed by its heir, another remaining lost reservation will have an heir.

Nodes wait for a while before claiming a lost reservation. This is to allow for reservation transfers to occur as part of the reservation and node shutdown protocols, and for the result of that to propagate. Note that "waiting for a while" is more complex than it may first appear. We need to be able to reclaim reservations even when their heir nodes keep changing due to reservations getting transferred between nodes. One way to cope with that is for each node to maintain age information on the lost reservations, updating that information whenever receiving/performing a map update. Lost reservations of the same age can be merged, if permitted by the chosen reservation representation. Nodes then only reclaim lost reservations which have a recorded age beyond a certain threshold. This creates a dual trigger: reclaiming can happen when a node acquires a reservation (which is the heir of a sufficiently old lost reservation) or through the passage of time.
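The age-tracking idea might look like this (an illustrative sketch with invented names; `now` is passed in explicitly so the behaviour is deterministic):

```python
class LostTracker:
    """Track how long each lost reservation has been continuously lost,
    and report the ones old enough to be reclaimed by their heir."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.first_seen = {}   # lost segment -> time first observed

    def update(self, lost_segments, now):
        """Called on every map update with the currently-lost segments."""
        # forget gaps that are covered again: their reservation was
        # transferred/claimed, so their age resets if they reappear
        self.first_seen = {s: t for s, t in self.first_seen.items()
                           if s in lost_segments}
        for s in lost_segments:
            self.first_seen.setdefault(s, now)

    def reclaimable(self, now):
        """Lost segments older than the threshold."""
        return [s for s, t in self.first_seen.items()
                if now - t >= self.threshold]
```

A gap that briefly disappears (because some node's entry covering it arrived late) has its age reset, which is exactly the conservatism we want before declaring a reservation leaked.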

Failures:

  • reservation reclaimed too early --> as a result at some point some node will attempt to construct a map containing overlapping reservations. That in itself does not indicate an erroneous condition though, since we do not require the map to be consistent. Nevertheless, any node holding a reservation can expect to eventually see its map updated to show no overlaps with these reservations. Therefore, genuine overlaps are detectable after all. And potentially recoverable: We could ask the affected nodes to talk to each other and attempt to divide the overlapping reservations between themselves. As long as the nodes haven't actually handed out the same IP then everything is fine. And if there is a genuine conflict then we could pause or kill one of the offending containers.

TODO Can we end up in a situation where a reservation moves rapidly between nodes, with none of the corresponding map updates ever making it to the node which is the potential heir of that reservation? It would result in the reservation being reclaimed in error. Surely unlikely, but can we quantify this somehow?

node failure

If we haven't heard from a node for a while[1] then any other node can mark that node as dead by removing its entry from the map. Deletion is performed by erecting a tombstone - setting the vector clock slot of the map entry to the max value, thus ensuring that the map entry will always be chosen during a CRDT merge operation.

[1] This requires some heart-beating; which gossipping tends to incorporate anyway. The duration to wait before declaring a node as dead must be long enough to span likely partitioning periods.
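The tombstone mechanics follow directly from the merge rule: with the clock slot at its maximum value, no later genuine update can out-rank the deletion. A minimal sketch (illustrative names; a single per-entry clock stands in for the vector clock slot):

```python
TOMBSTONE_CLOCK = float('inf')   # stands in for the maximum clock value

def erect_tombstone(crdt_map, dead_node):
    """Delete a node's entry: empty reservations, maximal clock, so the
    deletion wins every subsequent merge."""
    crdt_map[dead_node] = (TOMBSTONE_CLOCK, None)

def merge_entry(crdt_map, node, clock, reservations):
    """Standard monotonic merge: newer clock wins."""
    have = crdt_map.get(node)
    if have is None or clock > have[0]:
        crdt_map[node] = (clock, reservations)
```

This also illustrates the rejoin hazard noted below: a late update from the supposedly dead node itself can never displace the tombstone, which is how the node can detect it has been declared dead.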

Once a node has been marked as dead, its reservations are effectively leaked and can be reclaimed as explained above.

NB Bad things will happen if a node rejoins with the same id after it has been declared dead. A partition rejoin is one scenario where that can happen. We can detect that at least - the node will be receiving a map update with a tombstone for itself.

TODO We may want some way of garbage collecting tombstones.

Can we do better?...

It would be nice if we were able to identify a single node to perform this failure marking, so we don't needlessly perform the same cleanup operation on multiple nodes. Various options for identifying a single node:

  1. node with the lowest id overall - this is rather unbalanced
  2. node with the "next lowest" id compared to dead node - this can still be rather unbalanced, especially if ids have the time-stamp as the MSB.

Here's another idea... when marking a node as dead, we move its reservations into another part of the CRDT, which collects reservations from all dead nodes. Alive nodes can then remove reservations from that data structure for which they are the heir. It should be possible to construct a convergent data structure for this.
