Enhancement Bootstrapping Process #112
Comments
For testing purposes I would like to opt for solution B. This comes with the disadvantage that the next servers can take up to 24 hours to function properly (see #113).
Thank you @fredo for this structured issue. @GataKamsky this might need some legal clarification with respect to being an operator if we handle the whitelisting process of RSB providers. Can you put that on the open issues list for Jim? For testing our release now, let's go for short term solution B.
As discussed earlier in private, I think it makes sense to change the room_ensurer to only require its own and the "first" server. Other than that I don't think we should spend much (or any really) time on this. Bootstrapping a completely new federation (where all servers are "new") is such a rare occurrence that it doesn't seem worth fixing what is essentially a cosmetic problem that will solve itself after some time.
In addition to that, we should have a fallback: if the first server is down, align with the second, and so on.
Problem Description
Currently, the bootstrapping process has some hiccups until the federation and the services work. The broadcast rooms will only be created if every server in the list is online for at least 60 seconds. After that, the rooms will be created by the first server in the list and the others will ensure the rooms afterward. This will also take about 60 seconds for each server.
The PFS and the MS will be restarted and should run successfully after the rooms are created. The restart process is handled by docker-compose and probably has a backoff effect built in.
Problem Cause
The cause of this problem lies in the room ensurer, which runs only if it can log in to all other matrix servers. If it fails to connect, it will sleep for 60s. All matrix servers will keep doing this until the last server in the well_known_server list is reachable.
The Room Ensurer
The room ensurer ensures that on all matrix servers the room ids align with the alias for the broadcast rooms. This makes sure not only that a public room with the broadcast room alias exists, but also that it is actually the same room (room_id) on all servers, meaning that all servers share the same room.
The rule is that if there is a mismatch, a server will always use the room_id of the first server in the list. If it finds a mismatch on another server, it will simply give a warning but cannot do anything about it. If all servers follow the same rule, every server will eventually use the room of the first server in the list.
Furthermore, the room ensurer will also create the rooms only if it is the first server in the list.
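The rule above can be sketched as a small pure function. This is only an illustration of the decision logic as described in this issue, not the actual room_ensurer code: plan_alignment, its arguments, the action names and the warning text are all hypothetical, and the room ids are assumed to have been looked up already for a single broadcast room alias.

```python
from typing import Dict, List, Optional, Tuple


def plan_alignment(
    servers: List[str],
    own_server: str,
    room_ids: Dict[str, Optional[str]],
) -> Tuple[Optional[str], List[str]]:
    """Decide what own_server should do for one broadcast room alias.

    room_ids maps each server to the room_id its alias resolves to,
    or None if the alias does not exist there yet (hypothetical input).
    Returns (action, warnings) with action in {"create", "realign", None}.
    """
    first = servers[0]
    reference = room_ids.get(first)
    warnings: List[str] = []

    if reference is None:
        # Only the first server in the list is allowed to create the room.
        action = "create" if own_server == first else None
        return action, warnings

    action = None
    for server in servers[1:]:
        if room_ids.get(server) != reference:
            if server == own_server:
                # Our own alias points elsewhere: adopt the first server's room.
                action = "realign"
            else:
                # A mismatch on another server can only be reported.
                warnings.append(f"{server} is not aligned with {first}")
    return action, warnings
```

For example, with servers ["a", "b", "c"] and room ids {"a": "!1", "b": "!2", "c": "!3"}, server "b" would realign itself to "!1" and only warn about "c".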
Solutions
Since the bootstrapping process is a one-time event, it does not have a big impact. After the release, some orchestration will still be needed so that people are aware of it.
Short term solution A - Leave it as it is
One solution is that we do not change anything. This also means that neither service will work until all matrix servers are up and running. To speed up the process, orchestration between the setups would be helpful.
Short term solution B - Start with a single server in the list
We could also start with a single server in the list. This would at least provide a working services network at the beginning for the initial service provider. Every other provider would have to go through the process of being added to the list. What this means for the other providers is addressed in issue #113.
Long term solution - Make it work even if not every server is online
The room ensurer for the first server in the list can actually create the rooms even if the other matrix servers are not online. This should not open any attack vectors at all. In such a setup the first server could already start running its services and make raiden functional even if the other servers aren't online yet.
Currently, every room ensurer will ensure the rooms between all servers in the list. This is actually not necessary. Since every server will align with the broadcast rooms of the first server in the list, the other servers would only need to ensure rooms with this server and not with the others. For the other servers it only gives warnings anyway, so dropping those checks would not change anything at the moment.
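To make the difference between the current behaviour and the long term proposal concrete, here is a minimal sketch of which servers a room ensurer has to wait for. The function names and the relaxed flag are hypothetical and only model the description in this issue; they do not exist in the room_ensurer.

```python
from typing import Iterable, List, Set


def servers_to_wait_for(servers: List[str], own_server: str, relaxed: bool) -> Set[str]:
    """Return the other servers that must be reachable before own_server can run."""
    if not relaxed:
        # Current behaviour: log in to every other server in the list.
        return set(servers) - {own_server}
    first = servers[0]
    if own_server == first:
        # Proposal: the first server creates the rooms without waiting for anyone.
        return set()
    # Proposal: every other server only needs the first server.
    return {first}


def can_run(servers: List[str], own_server: str, online: Iterable[str], relaxed: bool) -> bool:
    """True if all servers that own_server depends on are currently online."""
    return servers_to_wait_for(servers, own_server, relaxed) <= set(online)
```

With servers ["a", "b", "c"] and only "a" online, can_run is False for "a" under the current behaviour, but True for "a" with relaxed=True. Server "b" could start as soon as "a" is reachable, regardless of "c".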