[WIP] MSC3814: Dehydrated devices with SSSS #3814
Conversation
`/dehydrated_device/{device_id}/events` to obtain the next batch.

```
POST /dehydrated_device/{device_id}/events
```
Why is this a POST and not a GET like /sync and /messages?
IIRC, the rationale was because the call has side-effects (deleting the device).
It's a bit weird that it doesn't follow the pattern of /messages, /events or /sync imo. I'll try implementing it as a GET without the device deletion first and see how that works out, I think.
A GET endpoint with side-effects seems like a big no-no to me. Everyone expects a GET request to have approximately zero side-effects.
oh, but we're also proposing removing the side-effects? SGTM in that case
Yup, the current implementation no longer automatically deletes the device on the server side, but relies on the client to delete/create a new device. So we're going to try to make this a GET.
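To make the proposed shape concrete, here is a minimal sketch of the GET variant under discussion, in Python with `requests`; the unstable prefix and the response fields are assumptions based on this MSC, not a finalized API.

```python
import requests

BASE = "https://example.org/_matrix/client/unstable/org.matrix.msc3814.v1"
HEADERS = {"Authorization": "Bearer <access_token>"}

resp = requests.get(
    f"{BASE}/dehydrated_device/DEHYDRATED1/events",
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
body = resp.json()  # expected shape: {"events": [...], "next_batch": "..."}
# No side effects: the device is only replaced when a client explicitly
# uploads a new dehydrated device, rather than being deleted by this call.
```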
Once a client calls `POST /dehydrated_device/{device_id}/events`, the server
can delete the device (though not necessarily its to-device messages). Once a
Why should the server delete the device? Shouldn't this rather be done by the client explicitly in a delete call?
IMO it is not quite obvious that fetching the events should "break" the device. A client might fail to restore properly, and now you have lost all the intermediate sessions. Instead, the client should replace the device once it is reasonably sure it restored successfully and has uploaded the megolm keys to online backup.
The idea is that when the client starts getting events, it signals its intention to use the dehydrated device: it has been "claimed", so it shouldn't be used by anyone else. At this point, there isn't much that can be done if the client, e.g., fails to decrypt some messages. If it fails to decrypt messages with the dehydrated device, it's unlikely that leaving the device around will fix anything in the future -- any future attempts would likely fail as well. So the best thing to do is to replace the dehydrated device with a new one anyway.
I'm not insistent on this endpoint deleting the dehydrated device, but I think that once you start using a dehydrated device, you'll want to create a new device no matter what.
It's more that if the device fetches the first few events, but then the user closes the browser and it never gets to upload the new device, then you have effectively thrown the dried fish out the window without properly getting to use it. So in that case the server should either delete the device only when it deletes the first few messages (i.e. when the client paginates with a next token), or just wait for the user to upload a new device. You CAN still use the same dried device from another device, I think: all of the messages will be PRE KEY messages, so you can decrypt them as long as you haven't deleted the one-time keys from the pickled device. So even if a client downloads the first batch of messages and then starts on the next batch, and the first batch gets deleted, a different client should still be able to pick up from there.
I agree that you want to create a new device no matter what, but that can just be done by uploading a new one instead of implicitly doing it when receiving messages.
Yeah, the issue of a client starting to load events and then dying somehow is possible, but seems like it would be extremely rare.
I think another consideration is that a client could forget to replace the dehydrated device. If the device gets deleted automatically, then it makes it obvious that the client didn't do that.
In any event, I think it's fine to try it out with GET and without automatically deleting the device, and see how it goes.
There's also an issue if there's a connection problem during the POST request; the client presumably won't be able to retry the request because the device will have been deleted.
We generally try to design our APIs so that they work / can be retried even if the request fails half-way through for some reason.
Actually, the idea is that the dehydrated device is "deleted" in the sense that no other client can claim it, and if a client queries for the dehydrated device, it won't be returned. But the events associated with it are still there and can be retrieved (until the events get deleted as described elsewhere in the MSC), so if the client retries the POST request, it will still get the events.
Yes, the problem is just that if the client fails to replace the device, then a failed login attempt will break device dehydration until the next successful login, since there is no way to receive messages in the meantime (while that would work fine if the device were just kept).
I actually ran into another race condition here in production. We currently have it implemented that the PUT of a new device removes the old device. However, since uploading a new device takes several requests (claim new device, upload keys, sign it, upload encrypted device), we run into a race condition, where the user closes the browser window during one of the steps and maybe only signs back in later. That means we have an unhydrateable device and we again lose messages over the gap. Ideally there would be some way to make this atomic to prevent this race condition.
I agree that having this be a `PUT` is the wrong pattern here. If the whole idea is for a client to fetch to-device events and room keys, then we should design the flows in a way that minimizes the risk of losing such room keys.
As it stands, the flows can't be resumed once you start fetching the to-device events, and fetching the to-device events will be by far the longest operation here.
What would allow perfect resumption is:

1. The dehydrated device only gets deleted if the client requests so.
2. The to-device events only get deleted when the dehydrated device gets deleted, as opposed to the current mechanism, where to-device events get deleted once the server sees a `next_batch` from a previous request.

No. 2 would ensure that, even if a device that attempts a rehydration gets stopped and deleted mid-rehydration, another, new device can restart the rehydration process.
> I actually ran into another race condition here in production. We currently have it implemented that the PUT of a new device removes the old device. However, since uploading a new device takes several requests (claim new device, upload keys, sign it, upload encrypted device),

Agree here as well: `PUT` of a new device should happen in a single request which should upload the dehydrated device, its device keys and any one-time keys. I implemented a draft version of this behavior in this patch: matrix-org/synapse@777b305
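As a rough illustration of the atomic upload suggested above, a single `PUT` carrying the device, its keys, and one-time keys might look like the following sketch; the field names are modeled on `/keys/upload` and this MSC and are assumptions, not a finalized schema.

```python
import requests

BASE = "https://example.org/_matrix/client/unstable/org.matrix.msc3814.v1"
HEADERS = {"Authorization": "Bearer <access_token>"}

payload = {
    "device_id": "DEHYDRATED1",
    "device_data": {
        "algorithm": "org.matrix.msc3814.v1.olm",
        "device_pickle": "<encrypted pickled Olm account>",
    },
    "initial_device_display_name": "Dehydrated device",
    "device_keys": {},     # signed device keys, as for /keys/upload (elided)
    "one_time_keys": {},   # signed curve25519 one-time keys (elided)
    "fallback_keys": {},   # signed fallback key (elided)
}
# One request: either the whole dehydrated device exists, or none of it does,
# so an interrupted upload cannot leave an unhydrateable device behind.
resp = requests.put(f"{BASE}/dehydrated_device", headers=HEADERS, json=payload)
resp.raise_for_status()
```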
dehydration algorithm `m.dehydration.v1.olm` will be called
`org.matrix.msc3814.v1.olm`. The SSSS name for the dehydration key will be
`org.matrix.msc3814` instead of `m.dehydrated_device`.
Client implementation: https://gitlab.com/famedly/company/frontend/famedlysdk/-/merge_requests/1111
Server implementation: matrix-org/synapse#13581
Neither is merged yet, and the dehydrated device format is notably still missing.
losing any megolm keys that were sent to the dehydrated device, but the client
would likely have received those megolm keys itself.

Alternatively, the client could perform a `/sync` for the dehydrated device,
Does this still work with v2? Can we still sync on the dehydrated device?
You can't sync as a different device in this proposal. You can fetch the events for that device, but this proposal implicitly deletes the device in that case, which means you can't keep the device alive after that. So imo your only option is to replace it (which is somewhat easy to do, but you might need to authenticate the new signature upload/device?).
If the client is able to decrypt the data and wants to use the dehydrated
device, the client retrieves the to-device messages sent to the dehydrated
device by calling `POST /dehydrated_device/{device_id}/events`, where
Why include a `device_id` if you can only have a single dehydrated device? It implies that you can provide more than one (and causes additional error checking of whether the provided device ID matches the dehydrated device ID).
I agree that the device ID is redundant. Though since this MSC was written, a new use case for this endpoint has been found.
In the sliding sync world, we have split the fetching of to-device events out into a separate sync loop. One of the biggest problems of the existing `/sync` mechanism is that you get too much data all at once, and the downloading and processing of that data prevents the UI from being updated.
To-device events are one of those things that are not directly related to what a client will want to display in a room or room list, so putting them into a separate sync loop allows the main loop to quickly send updates while to-device processing moves along in the background. More info here: matrix-org/matrix-rust-sdk#1928
I think that old sync could handle such a split as well, so I would suggest renaming the endpoint to `/sync/to_device/{device_id}`, where `device_id` might be optional and used only in the case of a dehydrated device.
The reason for the `device_id` parameter is that, while one client is fetching events, another client could create a new dehydrated device. Without the `device_id` parameter, the server could think that the client wants to fetch the events for the new device, which, if there are any, it won't be able to decrypt, since they are for a device it doesn't know about. With the `device_id` parameter, the server will at least be able to say that there are no more events (since the device has been replaced by a new one).
This sounds quite racy to me -- how does the server know that one dehydrated device is claimed? How would the client know to make a new one instead of claim the old one?
> how does the server know that one dehydrated device is claimed?

It's OK for multiple clients to rehydrate the same device (unlike in the previous proposal), because it never becomes a real device. So the server can just wait until some client fetches all the events before dropping the device.

> How would the client know to make a new one instead of claim the old one?

Making a new device and rehydrating an old one are two different use cases. Rehydration happens after you log in, and you're setting up encryption and trying to get keys. It only happens once in the device's lifetime. Creating a new dehydrated device would happen after you've already set up your encryption and already attempted to rehydrate a device.
batches. For the last batch of messages, the server will still send a
`next_batch` token, and return an empty `events` array when called with that
token, so that it knows that the client has successfully received all the
messages and can clean up all the to-device messages for that device.
I'm failing to track what's going on here. I think it is:

1. Client requests with no `next_batch`; server returns some messages and a `next_batch` of `A`.
2. Client requests with `next_batch=A`; server returns some messages and a `next_batch` of `B`.
3. Client requests with `next_batch=B`; server returns an empty array of messages and a `next_batch` of `C`.
4. Client requests with `next_batch=C` and discards any response.

I'm unsure about steps 3 & 4. Why would the server provide a `next_batch` if it is already out of messages?
I think that in your example, the "last batch" of messages is step 2, where the client requests with `next_batch=A`. So the server still returns a `next_batch` of `B` even though it knows that the next call will be empty. Then at step 3, when the client requests with `next_batch=B`, the server discards all messages and returns an empty array of messages and no `next_batch`. I agree that the text could be clearer.
dehydrated devices *must not* encrypt to devices marked as being a dehydrated
device if they are not cross-signed. Clients should also drop any to-device
messages from a device marked as being a dehydrated device, since dehydrated
devices should not be sending messages. Clients can also use this flag to, for
In the case of encrypted messages, we can identify the sending device using the sender key, or MSC4147 device info. 2 questions:
- What about unencrypted messages? Even if the event content contains a device id, we can't trust that a malicious homeserver didn't just fake it. Should we just allow all unencrypted to-device messages?
- What if the sender key is not found, maybe because the device is logged out (and there is no MSC4147 device info)? Should we assume the device is not dehydrated and process it as normal? This gives a way to work around the messages being dropped, but it's hard to see what else we can do in this case.
cc @dkasak
- This only applies to encrypted messages, since those are the only ones that can be reliably tied to a device. We can also drop verification requests from dehydrated devices, but I think that would be optional.
- This would be a best-effort thing. If we really have no idea about the device, then we would keep the message.
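A best-effort filter along the lines described above might look like this sketch; `device_for_sender_key` is a hypothetical lookup into the client's device store, and the `dehydrated` flag on the device record is likewise assumed.

```python
def should_drop(event: dict, device_for_sender_key) -> bool:
    """Return True if a to-device event can be attributed to a dehydrated device."""
    sender_key = event.get("content", {}).get("sender_key")
    if sender_key is None:
        # Unencrypted events can't be reliably attributed to a device; keep them.
        return False
    device = device_for_sender_key(event["sender"], sender_key)
    if device is None:
        # Unknown device (e.g. already logged out): best effort, keep the event.
        return False
    return device.get("dehydrated", False)
```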
Co-authored-by: Andy Balaam <[email protected]>
Changelog: Added support for [MSC3814](matrix-org/matrix-spec-proposals#3814), allowing users to export an Olm Account in a serialized format. The serialized account can be uploaded to a server for safekeeping until needed.
To reduce the chances of one-time key exhaustion, if the user has an active
client, it should periodically (e.g. once a week) replace the dehydrated device
with a new dehydrated device with new one-time keys. If a client does this,
then it runs the risk of losing any megolm keys that were sent to the dehydrated
device, but the client should have received those megolm keys itself, so this
should not be a problem.
I believe this is the approach taken in the sample implementation? We should probably hoist this sentence ("clients should periodically replace the dehydrated device") into the main proposal: I wouldn't necessarily read through all of "Potential issues" if I were writing an impl.
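A minimal sketch of the periodic-replacement behaviour, assuming hypothetical client-side helpers for persistence and device creation:

```python
import time

ROTATION_INTERVAL = 7 * 24 * 60 * 60  # one week, in seconds

def maybe_rotate(load_last_rotation, save_last_rotation, create_dehydrated_device):
    """Replace the dehydrated device if the stored one is older than a week."""
    last = load_last_rotation()  # epoch seconds of the last rotation, or None
    if last is None or time.time() - last > ROTATION_INTERVAL:
        create_dehydrated_device()  # uploads a replacement with fresh one-time keys
        save_last_rotation(time.time())
```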
uploaded in the same request. This change should prevent the creation of
dehydrated devices which do not support end-to-end encryption.

To upload a new dehydrated device, a client will use `PUT /dehydrated_device`.
We should maybe write down how we know whether the server supports the feature. AIUI the sample impl calls `GET /dehydrated_device` and checks for an `M_UNRECOGNISED` error?
Is this intended to be an optional feature for servers?
users who do not expect to see another device.
- Don't send key forwarding or verification requests to the dehydrated device,
  since it will not respond to them.
- Don't send room keys to the dehydrated device if the user has a sufficient
This does not seem to be a heuristic that can be implemented (at least for foreign devices). It would mean that the sending device must know which devices of the receiving user are considered to be active/online.
It also seems that fallback keys already solve this issue (see also Potential issues -> One-time key exhaustion), but they are not yet taken into account. As suggested in the same section, I would prefer to re-create the dehydrated device periodically (as also mentioned here).
A 96-bit (12-byte) nonce is randomly generated; each time a device is
dehydrated, a new nonce must be generated.

The plain-text is encrypted with ChaCha20-Poly1305 as defined in
This is a really weird decision. It makes implementation a lot harder for SDK developers without any benefit, because AES is also considered safe and is far better supported on all platforms. Browser SubtleCrypto, for example, does NOT support ChaCha20-Poly1305. So why not re-use an algorithm that is already used in the Matrix spec, like AES-CTR-256 with HMAC-SHA-256?
There is a brief summary of why the ChaCha algorithms were introduced over AES in the RFC. I think ideally the MSC should motivate its choice explicitly though.
As I understand it, the advantages mentioned in the RFC (performance, cache-collision timing attacks) are not relevant for this use case. The disadvantages mentioned above, on the other hand, strongly outweigh them in my opinion.
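For concreteness, a minimal sketch of the encryption step quoted above using the Python `cryptography` package; deriving the actual dehydration key from SSSS is out of scope here, so a random stand-in key is used.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = os.urandom(32)    # stand-in for the key derived from SSSS
nonce = os.urandom(12)  # 96-bit nonce, freshly generated on every dehydration
cipher = ChaCha20Poly1305(key)
ciphertext = cipher.encrypt(nonce, b"<pickled Olm account>", None)
assert cipher.decrypt(nonce, ciphertext, None) == b"<pickled Olm account>"
```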
and clients are recommended to create a fallback key for the dehydrated device.

To reduce the chances of one-time key exhaustion, if the user has an active
client, it should periodically (e.g. once a week) replace the dehydrated device
If the dehydrated device is rotated regularly without any kind of synchronisation between the active devices, there will be a lot of federation and sync traffic due to the unnecessarily frequent device-list changes.
Therefore, we need to know either when the dehydrated device was created or, even better, when it has run out of one-time keys.