crypto: events can be sent without an up-to-date member list causing UTDs #3529

kegsay · 2024-06-11T08:49:40Z

I've seen two bug reports from the same user on EXA which shows the following:

Get invited to a DM over federation
Accept the invite
Send some messages
Those messages are UTD.
~45s later send more messages, they are not UTD.

The SDK is supposed to hit /members prior to sending encrypted events to ensure they have an up-to-date member list. This is done in

matrix-rust-sdk/crates/matrix-sdk/src/room/futures.rs

Lines 166 to 168 in 0c97c85

    
           if !room.are_members_synced() { 
        
               room.sync_members().await?; 
        
           }

I see no evidence of a /members hit when sending the event.

However, this is complicated by the several confounding factors:

the app itself also asks for /members and then immediately cancels the job, causing kotlinx.coroutines.JobCancellationException: Job was cancelled; job=SupervisorJobImpl{Cancelling}@6f96c4a - this is normal behaviour when you back out of a room quickly before it can fully load.
sync_members will not actually sync the members if the user is invited and history visibility is HistoryVisibility::WorldReadable | HistoryVisibility::Invited
the SS request doesn't return the member event of the inviter in this scenario because it's before the join event and the client doesn't required_state: [[m.room.member, *]].
the client is stuck unable to upload OTKs due to Lost OTK, leading to "OneTime key already exists" error and later UTDs #1415 which is causing /keys/upload to HTTP 400.

These factors mean there's several things which could be going wrong here:

Can calling Room.members() break the state machine such that it doesn't call it again when you send an event?
Is there a race condition where you can send an event after you join but before the rust SDK layer believes you are joined, so sync_members doesn't actually sync members?
The SDK does eventually learn that the inviter is in the room, but how? SS isn't telling the SDK directly, so what mechanism is being used here? /messages? That doesn't make sense as it's historical state.
Does the inability to upload OTKs cause the state machine to terminate early or something?

There's a lot of moving parts here but I've done due diligence:

haproxy logs on the server show that /members is only hit one time, just after the /join (this was triggered by EXA)
haproxy logs show the response size is about right for 2 members to be returned in the JSON, indicating that the server likely did include the inviter in the response.

The text was updated successfully, but these errors were encountered:

poljar · 2024-06-11T12:28:43Z

Is there a race condition where you can send an event after you join but before the rust SDK layer believes you are joined, so sync_members doesn't actually sync members?

I don't think so, due to this check happening at the start of the send:

matrix-rust-sdk/crates/matrix-sdk/src/room/futures.rs

Line 169 in abda959

room.ensure_room_joined()?;

Which does this check:

matrix-rust-sdk/crates/matrix-sdk/src/room/mod.rs

Lines 2524 to 2529 in abda959

    
           let state = self.state(); 
        
           if state == RoomState::Joined { 
        
               Ok(()) 
        
           } else { 
        
               Err(Error::WrongRoomState(WrongRoomState::new("Joined", state))) 
        
           }

Later on, the same piece of data is used to skip the /members request:

matrix-rust-sdk/crates/matrix-sdk/src/room/mod.rs

Lines 517 to 522 in abda959

    
           if let RoomState::Invited = self.inner.state() { 
        
               return matches!( 
        
                   self.inner.history_visibility(), 
        
                   HistoryVisibility::WorldReadable | HistoryVisibility::Invited 
        
               ); 
        
           }

In conclusion, if the SDK doesn't think you are joined, it doesn't let you send a message.

Does the inability to upload OTKs cause the state machine to terminate early or something?

I think this is an unrelated issue, failing to upload OTKs would produce UTDs on the EX side, not on the EW side.

poljar · 2024-06-11T13:28:31Z

the app itself also asks for /members and then immediately cancels the job, causing kotlinx.coroutines.JobCancellationException: Job was cancelled; job=SupervisorJobImpl{Cancelling}@6f96c4a - this is normal behaviour when you back out of a room quickly before it can fully load.

I think this might be the problem, though the cancellation shouldn't set the members_synced flag to true, so next time you try to send a message it should retry the /members request.

To further expand, we have a request deduplication handler:

matrix-rust-sdk/crates/matrix-sdk/src/deduplicating_handler.rs

Lines 29 to 38 in abda959

    
           /// Handler that properly deduplicates function calls given a key uniquely 
        
           /// identifying the call kind, and will properly report error upwards in case 
        
           /// the concurrent call failed. 
        
           /// 
        
           /// This is handy for deduplicating per-room requests, but can also be used in 
        
           /// other contexts. 
        
           pub(crate) struct DeduplicatingHandler<Key> { 
        
               /// Map of outstanding function calls, grouped by key. 
        
               inflight: DeduplicatedRequestMap<Key>, 
        
           }

The handler allows the first caller to acquire a lock and the lock is held for the duration of the call. The second caller then tries to acquire the same lock, though it will do so only after the first caller is done.

The first caller gets the result of the operation and shares it through the lock with the rest of them. So a failure would be propagated correctly to the rest of the callers, but a cancellation, like EX does here, might not be a failure due to this critical line:

matrix-rust-sdk/crates/matrix-sdk/src/deduplicating_handler.rs

Lines 65 to 67 in abda959

    
           // Assume a successful request; we'll modify the result in case of failures 
        
           // later. 
        
           let request_mutex = Arc::new(Mutex::new(Ok(())));

bnjbvr · 2024-06-11T13:31:25Z

the app itself also asks for /members and then immediately cancels the job, causing kotlinx.coroutines.JobCancellationException: Job was cancelled; job=SupervisorJobImpl{Cancelling}@6f96c4a - this is normal behaviour when you back out of a room quickly before it can fully load.

We also have a deduplicating mechanism for some queries, including /members: if there's already another caller doing this query, then the second query is deduplicated.

Now, I didn't think about cancellability when I implemented that deduplication (one extra proof that making it possible to cancel any future makes reasoning really hard). So what could happen:

the app calls /members
we try to send an event, which requires calling /members because we don't know who's in the room
since there's already a /members query, the one called when sending in the event waits for the former (caused by the app) to finish
the app cancels the first query
???

Ideally, the SDK would "transmit" ownership of the deduplicated query to one of the waiters on the result, but I'm not sure that's what's happening here...

poljar · 2024-06-11T13:39:52Z

Ideally, the SDK would "transmit" ownership of the deduplicated query to one of the waiters on the result, but I'm not sure that's what's happening here...

I think we assume a successful result, look at my previous reply.

bnjbvr · 2024-06-11T13:41:02Z

Indeed, I hadn't seen your answer yet while typing mine. Good catch!

kegsay · 2024-06-11T14:09:04Z

So the failure scenario would be:

Call room.members() to trigger a /members request
Whilst this is ongoing, send a message / do something to trigger the member sync logic?
Cancel the room.members() call - How is this done at an FFI level? There is no cancel function or anything.
This causes the member sync code to return a success when it shouldn't, which propagates to the send message code?

poljar · 2024-06-11T14:30:02Z

Cancel the room.members() call - How is this done at an FFI level? There is no cancel function or anything.

Not sure how it's done in this case, as I haven't checked what EXA is doing, but we provide a general cancellation primitive:

matrix-rust-sdk/bindings/matrix-sdk-ffi/src/task_handle.rs

Lines 4 to 39 in abda959

    
           /// A task handle is a way to keep the handle a task running by itself in 
        
           /// detached mode. 
        
           /// 
        
           /// It's a thin wrapper around [`JoinHandle`]. 
        
           #[derive(uniffi::Object)] 
        
           pub struct TaskHandle { 
        
               handle: JoinHandle<()>, 
        
           } 
        
           impl TaskHandle { 
        
               // Create a new task handle. 
        
               pub fn new(handle: JoinHandle<()>) -> Self { 
        
                   Self { handle } 
        
               } 
        
           } 
        
           #[uniffi::export] 
        
           impl TaskHandle { 
        
               // Cancel a task handle. 
        
               pub fn cancel(&self) { 
        
                   debug!("Cancelling the task handle"); 
        
                   self.handle.abort(); 
        
               } 
        
               /// Check whether the handle is finished. 
        
               pub fn is_finished(&self) -> bool { 
        
                   self.handle.is_finished() 
        
               } 
        
           } 
        
           impl Drop for TaskHandle { 
        
               fn drop(&mut self) { 
        
                   self.cancel(); 
        
               } 
        
           }

This causes the member sync code to return a success when it shouldn't, which propagates to the send message code?

Yes, correct. The previous steps are correct as well.

bnjbvr · 2024-06-11T14:58:59Z

Fwiw, I'm working on adding a test showing the deduplicating handler failure (in this case), and making it more robust against this particular cancelling race.

This doesn't mean it will solve this particular issue entirely (since it's unclear whether that was the root cause or not), but at least it'll make the whole deduplication handling more robust.

poljar · 2024-06-11T15:05:14Z

Fwiw, I'm working on adding a test showing the deduplicating handler failure (in this case), and making it more robust against this particular cancelling race.

Ah nice, thanks.

kegsay · 2024-06-13T07:48:02Z

Manually confirmed that #3533 fixes the issue. Whilst EXA still produces kotlinx.coroutines.JobCancellationException: Job was cancelled; the SDK does now retry /members requests.

kegsay added the bug Something isn't working label Jun 11, 2024

bnjbvr mentioned this issue Jun 11, 2024

deduplicating handler: handle cancellation by repeating the query, if needs be #3533

Merged

kegsay closed this as completed Jun 13, 2024

richvdh mentioned this issue Jul 4, 2024

Unable To Decrypt meta issue element-hq/element-meta#245

Open

87 tasks

poljar mentioned this issue Jul 4, 2024

Chat history for members before they joined shows: Unable to decrypt - Unknown message inbox element-hq/element-x-android#2226

Closed

MarcWadai mentioned this issue Jul 9, 2024

Can we have more info about why messages are undecrypted ? tchapgouv/tchap-web-v4#904

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crypto: events can be sent without an up-to-date member list causing UTDs #3529

crypto: events can be sent without an up-to-date member list causing UTDs #3529

kegsay commented Jun 11, 2024 •

edited

Loading

poljar commented Jun 11, 2024

poljar commented Jun 11, 2024

bnjbvr commented Jun 11, 2024

poljar commented Jun 11, 2024

bnjbvr commented Jun 11, 2024

kegsay commented Jun 11, 2024

poljar commented Jun 11, 2024

bnjbvr commented Jun 11, 2024

poljar commented Jun 11, 2024

kegsay commented Jun 13, 2024

crypto: events can be sent without an up-to-date member list causing UTDs #3529

crypto: events can be sent without an up-to-date member list causing UTDs #3529

Comments

kegsay commented Jun 11, 2024 • edited Loading

poljar commented Jun 11, 2024

poljar commented Jun 11, 2024

bnjbvr commented Jun 11, 2024

poljar commented Jun 11, 2024

bnjbvr commented Jun 11, 2024

kegsay commented Jun 11, 2024

poljar commented Jun 11, 2024

bnjbvr commented Jun 11, 2024

poljar commented Jun 11, 2024

kegsay commented Jun 13, 2024

kegsay commented Jun 11, 2024 •

edited

Loading