Multiple Coordinators #22
So in the end, this issue revolves around the transport layer of the control protocol? May I rename this issue to talk about the transport layer of the control protocol, or do you think it deserves a separate issue? |
Here are my ideas regarding the transport layer of the control protocol. Basic ideas:
Open questions:
- How do Coordinators connect to each other?
- How do we achieve the routing of messages?
An example of the Coordinator Connector (This is my current test implementation):
In this system the Coordinators do not know that they are tricked by the Connector, and names just have to be unique for one Coordinator. You need, however, an additional Connector, and names get changed along the routing path. |
let's use "ID" instead of "name" -- a name to me implies something expressive like "Temperature logger", which is not necessarily unique, nor really compact (for the protocol). I think it's OK if a Component has a "name" (e.g. "Keithley2000 PSU from pymeasure") and an "ID" (some random string, or something composed in another way)
Just to clarify this, "its node" can also operate in DTM (in which case that would also use a DEALER socket); a queue etc. is only used in LTM.
IMO, this one; the others seem unnecessarily complicated:
Agreed with this. I also like the namespaced names. We could do this: Every Coordinator needs to know its components, it can compile and distribute that list (+ later updates), e.g. via a "routing" message. Then other Coordinators have that information, so a Component can easily request the list of all Components in the network (like an address book)! |
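A rough sketch of what such a "routing" message could contain (the field names are purely illustrative, not part of any agreed protocol):

```python
# Hypothetical address-book update a Coordinator could compile and distribute.
address_book_update = {
    "type": "routing",                 # message kind, name assumed
    "coordinator": "CO1",              # the sending Coordinator
    "components": ["CO1.A", "CO1.B"],  # its currently connected Components
}

# Receiving Coordinators merge such updates into a local address book:
address_book: dict[str, list[str]] = {}
address_book[address_book_update["coordinator"]] = address_book_update["components"]
```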
I would like to avoid an additional component if we can. |
Quick sketch. Let's assume this structure:

```mermaid
flowchart TB
    subgraph node1
        CO1-->A
        CO1-->B
    end
    subgraph node2
        CO2-->C
        CO2-->D
    end
```
Address book exchange:

```mermaid
sequenceDiagram
    CO1->>CO2: I know CO1.A, CO1.B
    CO2->>CO1: I know CO2.C, CO2.D
```
Local routing:

```mermaid
sequenceDiagram
    A->>CO1: sender: A, recipient: B
    CO1->>B: sender: A, recipient: B
```
Inter-coordinator routing:

```mermaid
sequenceDiagram
    A->>CO1: sender: A, recipient: CO2.C
    Note over CO1: "I know that guy!"
    CO1->>CO2: sender: CO1.A, recipient: CO2.C
    CO2->>C: sender: CO1.A, recipient: C
    Note over C: "C now knows the full address of A"
```
|
I understood a node as using LTM. If a node does not use LTM, the node does not need an internal Coordinator, as all Components can connect to an external Coordinator. The Coordinators need one DEALER socket for each Coordinator-Coordinator connection, as the DEALER sends messages to any connected peer (it does not know which one), while the ROUTER cannot initiate a conversation (it needs to know the addresses).
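As a minimal pyzmq sketch of such a Coordinator-Coordinator link (addresses, ports, and identities are made up):

```python
import zmq

context = zmq.Context.instance()

# Each Coordinator binds one ROUTER socket: it can answer anyone who
# connects, but cannot initiate a conversation on its own.
router = context.socket(zmq.ROUTER)
router.bind("tcp://*:12300")

# For each outgoing Coordinator-Coordinator connection it opens a DEALER,
# which actively connects to the peer's ROUTER.
dealer = context.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"CO1")  # let the peer's ROUTER know who we are
dealer.connect("tcp://node2.example.com:12300")
dealer.send_multipart([b"sender: CO1.A, recipient: CO2.C, ..."])
```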
I think you missed renaming the sender to "CO1.A" at some point on the path (probably between the Coordinators). Similarly, the recipient should be only "C" between CO2 and C. I'd say the "home Coordinator" adds its own name space to outgoing messages and strips it from incoming ones. So the second message (in your example) would be "sender 'CO1.A', recipient 'CO2.C'" and the third one "sender 'CO1.A', recipient 'C'".

Regarding routing: We could have a PUB-SUB proxy (very reliable, almost no code for us), where the Coordinators exchange all Component connects and disconnects (for fast discovery) and regularly publish the list of their connected Components plus their own address (host and port), such that a Coordinator may create a connection to the other one. That way all Coordinators know all active Components (from the lists plus updates) and do not have to ask the other Coordinators whether they know "C", for example.

Edit: I thought more about the problem. A recently started Coordinator may announce its presence via that channel and the already running Coordinators connect to the new one. One (minor) question is whether all Coordinators use a DEALER port for all other Coordinators (so all outgoing messages go through DEALER) or only one of the two Coordinators uses the DEALER port (I'd say both ways are similar in complexity, and it could be decided later). |
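The appending/stripping rule could look roughly like this (a sketch, assuming plain string sender/recipient fields):

```python
NAMESPACE = "CO1"  # this Coordinator's name space

def prefix_outgoing_sender(sender: str) -> str:
    """Add our name space to a local sender before inter-Coordinator routing."""
    return sender if "." in sender else f"{NAMESPACE}.{sender}"

def strip_incoming_recipient(recipient: str) -> str:
    """Remove our own name space so the local Component sees just its name."""
    prefix = NAMESPACE + "."
    return recipient[len(prefix):] if recipient.startswith(prefix) else recipient

# Following the example above:
assert prefix_outgoing_sender("A") == "CO1.A"  # message leaving CO1 towards CO2
# and on CO2's side (with NAMESPACE = "CO2"):
# strip_incoming_recipient("CO2.C") == "C"
```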
Yes, but not necessarily. Inside a Node you have the option LTM or DTM, outside only DTM.
Ah, so 4 Coordinators would each need 3 DEALER ports?
Indeed, thanks!
Agreed, that's what I intended, too. Your routing remarks sound sensible. I can't really help/assist with zmq intricacies, unfortunately. |
Either each needs 3 DEALER ports (symmetric connection), or they need on average 1.5 DEALER ports, if one of the two Coordinators uses its ROUTER port to communicate with the other one. Just a side note on that "Coordinator coordination system" via PUB-SUB: We can use the same proxy and protocol defined in #3. In fact, we would just have three identical proxy servers (on different ports) for three uses: Coordinator coordination, data exchange, log messages. The proxy server itself is a single call (in Python you just call `zmq.proxy`).
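For illustration, a complete XPUB-XSUB proxy in pyzmq really is only a few lines (the port numbers are arbitrary):

```python
import zmq

context = zmq.Context.instance()

xsub = context.socket(zmq.XSUB)  # publishers (e.g. Coordinators) connect here
xsub.bind("tcp://*:11100")
xpub = context.socket(zmq.XPUB)  # subscribers connect here
xpub.bind("tcp://*:11101")

zmq.proxy(xsub, xpub)  # blocks forever, forwarding messages and subscriptions
```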
In the summer I read the full zmq guide before I implemented my system. Now this information and experience is very helpful. |
Let's keep the focus on the original question -- how to deal with multiple Coordinators. |
Yeah, I already noticed you two are quite experienced :D |
To summarize:
Whether a port connects or binds is marked in bold. |
You are just so fast, I don't really keep up anymore... Did I get it right that the Coordinators use the (missing) heartbeat from the Actors to detect "disconnects"? EDIT: found the LTM and DTM abbreviations |
That, or because a Component explicitly "signs out".
Thanks for that missing part. I edited my message. |
Ah, okay, sure, we can always have a "sign out" message in a shutdown method. That sounds quite sensible; I did not think about it. |
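Both mechanisms together (explicit sign-out plus heartbeat timeout) could look roughly like this on the Coordinator side; the timeout value and message names below are assumptions:

```python
import time

HEARTBEAT_TIMEOUT = 10  # seconds without any message counts as a disconnect

last_seen: dict[str, float] = {}  # Component name -> time of last message

def handle_message(sender: str, message_type: str) -> None:
    if message_type == "SIGN_OUT":
        last_seen.pop(sender, None)           # explicit sign-out
    else:
        last_seen[sender] = time.monotonic()  # any message counts as a heartbeat

def prune_disconnected() -> None:
    now = time.monotonic()
    for component, seen in list(last_seen.items()):
        if now - seen > HEARTBEAT_TIMEOUT:
            del last_seen[component]          # missing heartbeat: treat as signed out
```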
The summary sounds great!
Could send address book updates, too. Could be useful for a Director to know which Actors are available (e.g. to populate a GUI). |
Yes, we should add the possibility to get the whole address book, but that is not an issue of the communication between Coordinators. Regarding stripping / appending the name space, I started a new discussion in #27, but that does not change the basic principles. |
Just an idea: we could regularly, but rarely (every half an hour or so), request a current Component list in order to update the local list, just in case some information got lost. |
That could be part of a regular "resync" exchange -- that will maybe not remain the only thing to be synched (clocks, e.g.?). |
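Such a resync could be as simple as a slow timer loop (a sketch; the `request_component_list` helper is hypothetical):

```python
import time

RESYNC_INTERVAL = 30 * 60  # "every half an hour or so"

def resync_loop(peers: list[str], request_component_list) -> None:
    while True:
        time.sleep(RESYNC_INTERVAL)
        for peer in peers:
            request_component_list(peer)  # refresh the local address book entry
```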
@bklebel had the idea to do the "Coordinator's announcement" via the control protocol instead of the data protocol (PUB-SUB). Whatever the way is, we need one central server whose address is known (be it a normal Coordinator or an XPUB-XSUB proxy), such that Coordinators may connect to the known address and get the information about other Coordinators. Advantages:
Disadvantages:
|
Regarding using the control protocol for address book updates: Another advantage is that the Coordinators are self-sufficient (they do not need another communication channel). Implementation: now that every Coordinator has a DEALER to each other, they can send a message via each DEALER channel. The same works for newly started Coordinators:
This setup is great, as we do not need any "central Coordinator". Any Coordinator serves as an entry point to the network. For reliability, we could give a list of addresses, such that a new Coordinator tries to contact one after the other until it finds a running one; that way the network can rebuild itself if the Coordinators restart (as OS services). |
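A sketch of that entry-point logic (socket types as discussed above; the message content and timeout are made up):

```python
import zmq

def join_network(known_addresses: list[str], my_name: str) -> zmq.Socket:
    """Try the known addresses one after the other until a Coordinator answers."""
    context = zmq.Context.instance()
    for address in known_addresses:
        dealer = context.socket(zmq.DEALER)
        dealer.setsockopt(zmq.IDENTITY, my_name.encode())
        dealer.connect(address)
        dealer.send(b"COORDINATOR_SIGN_IN")  # announce ourselves, message name assumed
        if dealer.poll(timeout=1000):        # wait up to 1 s for a reply
            return dealer                    # found a running Coordinator
        dealer.close(linger=0)               # dead entry point, try the next one
    raise ConnectionError("no running Coordinator found")
```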
I like this approach of self-sufficient Coordinators without a "central" one. We could probably look to mesh network algorithms for how to efficiently deal with updates/resyncs after Coordinators have disappeared/reappeared.
These "addresses" would in fact be of the ROUTER sockets of all previously known Coordinators, right? I guess the service/process can store that somewhere on disk as it should not change too often. Then on restart that info is already available. Maybe same with a Coordinator's list of connected Components? |
Yes, exactly.
I think so too, although I am not sure how to put that into the protocol itself - "MUST store on disk" (but we do not say where/how)? Regarding the list of connected Components, I am not so sure: although this would stay the same for quite a long time, the Components will always try to talk to the Control Coordinator anyway, so the Coordinator will notice them soon enough, especially with heartbeats. In the end, this particular question is more about reliability and the implementation, I think. |
We could prescribe just that implementations must "persist" that info, without saying how. We could also make that optional - it's a convenience feature, and we could offer a path for manual discovery of other Coordinators. Re: Component connections: consider that after a Coordinator restart, all incoming connection senders will be unknown to the Coordinator, and will thus be refused (unknown sender). |
That's my stance: We do not need it for proper routing. It could be done externally (starting the Coordinator with a set of command line parameters etc.) or hard-coded in the startup script. We could require that a Coordinator shall accept a list of addresses to connect to at startup as a parameter.
Right, as these addresses are IP addresses and port numbers. We could add a "store configuration to disk" command, which could also be useful for some Actors etc.
The list of connected Components is useless, as the ZMQ connection identity will be different at reconnect. And you do not know whether an old client will come back.
Yes. As we require signing in, Components have to sign in again after a Coordinator restart. |
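Persisting could stay implementation-defined, e.g. as a small JSON file (the file name here is an arbitrary choice, not prescribed by the protocol):

```python
import json
from pathlib import Path

STATE_FILE = Path("coordinator_addresses.json")

def store_addresses(addresses: list[str]) -> None:
    STATE_FILE.write_text(json.dumps(addresses))

def load_addresses() -> list[str]:
    """Fall back to an empty list, e.g. if addresses come from the command line."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return []
```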
Ah, good to know! So, the logic would/could be that the fresh Coordinator sends a "you're unknown to me" response, the Component will `SIGN_IN`, and then can re-send its message.
|
Exactly.
You should wait a bit before resending, however, such that the other side has time to sign in as well.
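On the Component side, that logic might look like this (a sketch; the error text and the `sign_in` helper are assumptions):

```python
import time

def send_with_resignin(socket, message: bytes, sign_in) -> None:
    socket.send(message)
    reply = socket.recv()
    if reply == b"ERROR: unknown sender":  # fresh Coordinator does not know us
        sign_in()                          # register with the Coordinator again
        time.sleep(0.5)                    # give other Components time to sign in too
        socket.send(message)               # then re-send the original message
```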
|
Since it comes up again and again, let's collect here the notes on the possible ways to use multiple Coordinators (either in one system or across systems).