Enhancement Bootstrapping Process #112

Open
fredo opened this issue Mar 30, 2020 · 4 comments

Comments

@fredo
Contributor

fredo commented Mar 30, 2020

Problem Description

Currently, the bootstrapping process has some hiccups until the federation and the services work. The broadcast rooms will only be created once every server in the list has been online for at least 60 seconds. After that, the rooms will be created by the first server in the list, and the others will ensure the rooms afterward. This also takes about 60 seconds for each server.
The PFS and the MS will be restarted and should run successfully after the rooms are created. The restart process is handled by docker-compose and probably has a backoff effect built in.

Problem Cause

The cause of this problem lies in the room ensurer, which runs only if it can log in to all other matrix servers. If it fails to connect, it sleeps for 60s. All matrix servers will keep doing this until the last server in the well_known_server list is reachable.
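The blocking behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual room ensurer code: `wait_for_all_servers`, `can_log_in`, and `RETRY_INTERVAL` are hypothetical names, and the login check is passed in as a callable so the sketch stays independent of any Matrix client library.

```python
import time
from typing import Callable, Iterable

RETRY_INTERVAL = 60  # seconds; the 60s sleep mentioned in the issue


def wait_for_all_servers(
    servers: Iterable[str],
    can_log_in: Callable[[str], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until every server accepts a login; return the number of retries.

    Hypothetical helper: `can_log_in` stands in for whatever login check
    the real room ensurer performs.
    """
    servers = list(servers)
    retries = 0
    while not all(can_log_in(server) for server in servers):
        # A single unreachable server is enough to stall the whole run.
        sleep(RETRY_INTERVAL)
        retries += 1
    return retries
```

Because every server runs the same loop, the whole federation waits for the slowest server before any room is created.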

The Room Ensurer

The room ensurer ensures that on all matrix servers the room ids align with the alias for the broadcast rooms. This makes sure not only that a public room with the broadcast room alias exists, but also that it is actually the same room (room_id) on all servers, meaning that all servers share the same room.
The rule is that if there is a mismatch, it will always use the room_id of the first server in the list. If it finds a mismatch on another server, it will simply log a warning but cannot do anything about it. If all servers follow the same rule, every server will eventually use the room of the first server in the list.
Furthermore, the room ensurer will create the rooms only if its own server is the first server in the list.
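A rough sketch of this rule, for one broadcast room alias. All names here (`ensure_room`, the `room_ids` mapping, the returned action strings) are assumptions made for illustration and do not reflect the real implementation.

```python
import logging
from typing import Dict, List, Optional

log = logging.getLogger(__name__)


def ensure_room(
    own_server: str,
    servers: List[str],
    room_ids: Dict[str, Optional[str]],
) -> Optional[str]:
    """Return the action own_server should take: "create", "realign", or None.

    `room_ids` maps each server to the room_id its broadcast alias resolves
    to, or None if the alias does not exist on that server (hypothetical
    data shape).
    """
    first = servers[0]
    reference = room_ids.get(first)

    if reference is None:
        # Only the first server in the list may create the room.
        return "create" if own_server == first else None

    action = None
    for server in servers:
        if room_ids.get(server) == reference:
            continue
        if server == own_server:
            # Our own alias points elsewhere: adopt the first server's room.
            action = "realign"
        else:
            # A mismatch on a foreign server can only be reported.
            log.warning(
                "Room mismatch on %s: %s != %s",
                server, room_ids.get(server), reference,
            )
    return action
```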

Solutions

Since the bootstrapping process is a one-time event, it does not have a big impact. After the release, though, some orchestration will be needed so that people are aware of it.

Short term solution A - Leave it as it is

One solution is that we do not change anything. This also means that neither service will work until all matrix servers are up and running. To speed up the process, orchestration between the setups would be helpful.

Short term solution B - Start with a single server in the list

We could also start with a single server in the list. This would at least provide a working services network at the beginning for the initial service provider. Every other provider would have to go through the process of being added to the list. What this means for the other providers is addressed in issue #113.

Long term solution - Make it work even if not every server is online

The room ensurer for the first server in the list can actually create the rooms even if the other matrix servers are not online. This should not open any attack vectors. In such a setup, the first server could already start running its services and make Raiden functional even if the other servers aren't online yet.
Currently, every room ensurer ensures the rooms between all servers in the list. This is not actually necessary. Since every server aligns with the broadcast rooms of the first server in the list, the other servers would only need to ensure rooms with this server and not with the others. A mismatch on another server only produces a warning anyway, so dropping those checks would not change anything at the moment.
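The relaxed precondition could look roughly like this. It is only a sketch of the proposal: `servers_required` and `can_run` are hypothetical helpers, and `reachable` stands for whatever set of servers the ensurer managed to log in to.

```python
from typing import List, Set


def servers_required(own_server: str, servers: List[str]) -> Set[str]:
    """Servers that must be reachable before own_server's ensurer can run.

    Under the proposal this is only the server itself plus the first server
    in the list (these are the same server for the first entry).
    """
    return {own_server, servers[0]}


def can_run(own_server: str, servers: List[str], reachable: Set[str]) -> bool:
    """True if every required server is in the reachable set."""
    return servers_required(own_server, servers) <= reachable
```

With a list of `["a", "b", "c"]`, server `a` could run as soon as it is up itself, and `b` would only wait for `a`, regardless of whether `c` is online.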

@fredo
Contributor Author

fredo commented Mar 30, 2020

For testing purposes I would like to opt for solution B. This comes with the disadvantage that the next servers can take up to 24 hours to function properly (see #113).

@Dominik1999
Contributor

Thank you @fredo for this structured issue.

@GataKamsky this might need some legal clarification with respect to being an operator if we handle the whitelisting process of RSB providers. Can you put that on the open issues list for Jim?

For testing our release now, let's go for Short term solution B.

@ulope
Collaborator

ulope commented Mar 30, 2020

As discussed earlier in private I think it makes sense to change the room_ensurer to only require its own and the "first" server.

Other than that I don't think we should spend much (or any really) time on this. Bootstrapping a completely new federation (where all servers are "new") is such a rare occurrence that it doesn't seem worth fixing what is essentially a cosmetic problem that will solve itself after some time.

@fredo
Contributor Author

fredo commented Mar 31, 2020

In addition to that, we should have something like: if the first server is down, align with the second, and so on.
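One way to read this fallback idea, as a minimal sketch: pick the first server in list order that is currently reachable and align with it. The function name and the `reachable` argument are hypothetical.

```python
from typing import Iterable, List, Optional


def pick_reference_server(
    servers: List[str], reachable: Iterable[str]
) -> Optional[str]:
    """Return the first server, in list order, that is currently reachable.

    Returns None if no listed server is reachable.
    """
    reachable_set = set(reachable)
    for server in servers:
        if server in reachable_set:
            return server
    return None
```

Note that this only works as intended if all servers see the same set of reachable servers; if their views differ, they could pick different reference servers, so the idea would need more thought before being implemented.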
