Experience TLA+ in action by specifying distributed termination detection on a ring, due to Shmuel Safra. Each git commit introduces a new TLA+ concept. Go back to the very first commit to follow along!
Click either one of the buttons to launch a zero-install IDE to give the TLA+ specification language a try:
(=> Screencast how to create the TLA+ Codespace)
v01a: Termination of pleasingly parallel
For this tutorial, we assume that the distributed system nodes are organized as a ring, with one the (unique) leader[^1]. If we further assume that nodes execute independent computations, (global) termination detection becomes trivial--the leader initiates a token transfer around the ring, and each node passes the token to its next neighbor, iff the node finished its computation. When the initiator receives back the token, it knows that all (other) nodes have terminated.
This problem is too simple, and we don't need TLA+ to model it.
[^1] Perhaps by some leader election algorithm.
A more interesting problem is to look at a "collaborative" computation, which implies that nodes can re-activate each other. For example, the result of a computation at node 23 is (atomically!) sent to and further processed at node 42. With the previous protocol, node 42 might have already passed on the token, causing the initiator to eventually detect (global) termination; a bug that is at least difficult to reproduce with testing! A solution is offered in EWD840:
- Initiator sends a "stateful" token around the ring
- Each node remembers if it activated another node
- Activation taints the token (when the activator gets the token)
- Initiator keeps running rounds until it receives an untainted token
What happens if we loosen the restriction that message delivery is atomic (it seldom is)? Clearly, we are back at square one:
- Node 23 sends a message to 42
- 23 taints the token
- Initiator starts a new round
- Node 42 received the fresh token before receiving the activation message from 23
- Boom!
The fix proposed in Shmuel Safra's EWD998, is to count in-flight messages. But will this work?
Throughout the chapters of this tutorial, we will use the TLA+ specification language to model EWD998, and check interesting properties.
TLA+ is all about abstraction, and, as we will later see, has first-class support to connect different levels of abstraction. Let's use this and write a basic spec that either falsifies our design above, or gives us sufficient confidence to invest in writing a more detailed spec.
(Credit: Stephan Merz wrote AsyncTerminationDetection)
Instead of modeling message channels, let alone modeling the transport layer, we will write a spec that models:
- A ring of N nodes
- The activation status of each node
- The number of messages pending[^2] at a node
- A send action
- A receive action
- A terminate action
- The initial configuration of the system
Please switch to AsyncTerminationDetection.tla and read its comments. From here on, the tutorial continues there...
[^2] It's difficult to (efficiently) count pending messages in an implementation. In a TLA+ spec, we don't care about that notion of efficiency. Also, all variables are global.