
[$250] Chat - Group is created with duplicate user when chat is started quickly on a slow network #49931

Open
1 of 6 tasks
lanitochka17 opened this issue Sep 30, 2024 · 49 comments
Assignees
Labels
Bug Something is broken. Auto assigns a BugZero manager. External Added to denote the issue can be worked on by a contributor Reviewing Has a PR in review Weekly KSv2

Comments

@lanitochka17

lanitochka17 commented Sep 30, 2024

If you haven’t already, check out our contributing guidelines for onboarding and email [email protected] to request to join our Slack channel!


Version Number: 9.0.41-2
Reproducible in staging?: Y
Reproducible in production?: Y
If this was caught during regression testing, add the test name, ID and link from TestRail: N/A
Email or phone of affected tester (no customers): [email protected]
Issue reported by: Applause - Internal Team

Action Performed:

  1. Go to https://staging.new.expensify.com/.
  2. Open the Developer Tools and go to the Network tab, then set the network speed to 3G
  3. Click on the FAB > Start Chat
  4. Enter a user's email, then quickly click on it before the fallback avatar changes to the correct user avatar
  5. Send a message

Expected Result:

The chat with the selected user should remain a 1-on-1 chat. The chat should not change into a group with duplicate users

Actual Result:

After sending the message, the chat turns into a group with duplicate users. It only corrects itself back to a single chat after switching to another chat and then returning to it

Workaround:

Unknown

Platforms:

Which of our officially supported platforms is this issue occurring on?

  • Android: Native
  • Android: mWeb Chrome
  • iOS: Native
  • iOS: mWeb Safari
  • MacOS: Chrome / Safari
  • MacOS: Desktop

Screenshots/Videos

Add any screenshot/video evidence

Bug6619285_1727614560806.1.mp4

View all open jobs on GitHub

Upwork Automation - Do Not Edit
  • Upwork Job URL: https://www.upwork.com/jobs/~021843049310841478139
  • Upwork Job ID: 1843049310841478139
  • Last Price Increase: 2024-10-13
  • Automatic offers:
    • DylanDylann | Contributor | 104420901
    • nkdengineer | Contributor | 104651259
Current Issue Owner: @nkdengineer
@lanitochka17 lanitochka17 added Daily KSv2 Bug Something is broken. Auto assigns a BugZero manager. labels Sep 30, 2024

melvin-bot bot commented Sep 30, 2024

Triggered auto assignment to @RachCHopkins (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details. Please add this bug to a GH project, as outlined in the SO.

@RachCHopkins
Contributor

I don't understand what's happening in that video. I can't replicate this.

image

I'm on 2 bars of 4G.

Is the person reporting it saying that the two users other than themselves are, in fact, the same user?

@RachCHopkins
Contributor

Please clarify the "actual result"

@melvin-bot melvin-bot bot added the Overdue label Oct 3, 2024
@RachCHopkins
Contributor

Waiting on a response/clarification from QA team.

@melvin-bot melvin-bot bot removed the Overdue label Oct 4, 2024
@mvtglobally

team is still able to reproduce

group.chat.mp4

@RachCHopkins
Contributor

Ok, to sum up, it can be reproduced and we don't know what/who the third user is.

@RachCHopkins RachCHopkins added the External Added to denote the issue can be worked on by a contributor label Oct 6, 2024
@melvin-bot melvin-bot bot changed the title Chat - Group is created with duplicate user when chat is started quickly on a slow network [$250] Chat - Group is created with duplicate user when chat is started quickly on a slow network Oct 6, 2024

melvin-bot bot commented Oct 6, 2024

Job added to Upwork: https://www.upwork.com/jobs/~021843049310841478139

@melvin-bot melvin-bot bot added the Help Wanted Apply this label when an issue is open to proposals by contributors label Oct 6, 2024

melvin-bot bot commented Oct 6, 2024

Triggered auto assignment to Contributor-plus team member for initial proposal review - @parasharrajat (External)

@nkdengineer
Contributor

nkdengineer commented Oct 7, 2024

Edited by proposal-police: This proposal was edited at 2024-10-07T12:12:00Z.

Proposal

Please re-state the problem that we are trying to solve in this issue.

After sending the message, the chat turns into a group with duplicate users. It only corrects itself back to a single chat after switching to another chat and then returning to it

What is the root cause of that problem?

When the OpenReport API is called, its success data is added to queuedOnyxUpdates and held until all write requests are complete before being merged into Onyx:

function queueOnyxUpdates(updates: OnyxUpdate[]): Promise<void> {
    queuedOnyxUpdates = queuedOnyxUpdates.concat(updates);
    return Promise.resolve();
}

We already have logic to flush the queue once all requests are complete, here:

// The queue can be paused when we sync the data with backend so we should only update the Onyx data when the queue is empty
if (PersistedRequests.getAll().length === 0) {
    flushOnyxUpdatesQueue();
}

But in this case, after the AddComment API is called, the queue is paused:

function saveUpdateInformation(updateParams: OnyxUpdatesFromServer) {
    // If we got here, that means we are missing some updates on our local storage. To
    // guarantee that we're not fetching more updates before our local data is up to date,
    // let's stop the sequential queue from running until we're done catching up.
    SequentialQueue.pause();
    // Always use set() here so that the updateParams are never merged and always unique to the request that came in

After the GetMissingOnyxMessages API call, the queue becomes active again, but because the persisted request queue is already empty, nothing triggers the flush of queuedOnyxUpdates. This leaves the DM participants with both the optimistic entry and the real entry for the same user.

What changes do you think we should make in order to solve the problem?

  • In the unpause function, we should flush queuedOnyxUpdates if there are no remaining requests. We should add this update after this line (see the sketch after this list).

Note that we should only flush queuedOnyxUpdates when numberOfPersistedRequests is 0, because merging this data into Onyx while there are still pending requests in the queue could cause a race condition: a later request can update data for the same key as a previous one.

if (numberOfPersistedRequests === 0) {
    flushOnyxUpdatesQueue();
}

isQueuePaused = false;

  • As for why the GetMissingOnyxMessages API is called, I think that may be expected behavior from the BE. Regardless, the frontend problem can occur in other cases, so we need to fix it, and that fix also resolves this issue.
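
For illustration only, here is a minimal sketch of how the unpause function in SequentialQueue might look with this change applied. The early return, the log line, and the closing flush() call are assumptions about the surrounding code rather than quotes from the repository; the guarded flushOnyxUpdatesQueue() call is the proposed addition.

function unpause() {
    if (!isQueuePaused) {
        return; // assumed guard: nothing to do if the queue is not paused
    }

    // Assumed context: count the write requests still waiting in the persisted queue.
    const numberOfPersistedRequests = PersistedRequests.getAll().length;
    Log.info(`[SequentialQueue] Unpausing the queue and flushing ${numberOfPersistedRequests} requests`);

    // Proposed addition: if nothing is left to process, flush the queued Onyx updates now
    // so optimistic data (such as the duplicate participant) is cleaned up immediately
    // instead of waiting for the next write request to drain the queue.
    if (numberOfPersistedRequests === 0) {
        flushOnyxUpdatesQueue();
    }

    isQueuePaused = false;
    flush(); // assumed: resume processing of any remaining requests
}

With this in place, flushing no longer depends on another write request completing after the queue is unpaused.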

What alternative solutions did you explore? (Optional)

@c3024
Contributor

c3024 commented Oct 7, 2024

Proposal

Please re-state the problem that we are trying to solve in this issue.

On slow networks, if we create a report and comment on it, a group chat with a duplicate user is created.

What is the root cause of that problem?

For write requests, we do not write the response onyxData and the corresponding successData/failureData/finallyData to Onyx directly. Instead, we write them to queuedOnyxUpdates to prevent the replay effect.

// For most requests we can immediately update Onyx. For write requests we queue the updates and apply them after the sequential queue has flushed to prevent a replay effect in
// the UI. See https://github.com/Expensify/App/issues/12775 for more info.
const updateHandler: (updates: OnyxUpdate[]) => Promise<unknown> = request?.data?.apiRequestType === CONST.API_REQUEST_TYPE.WRITE ? QueuedOnyxUpdates.queueOnyxUpdates : Onyx.update;

These updates are flushed only after all the PersistedRequests are cleared:

if (PersistedRequests.getAll().length === 0) {
    flushOnyxUpdatesQueue();
}

We remove a request from persistent requests only after the response is received and processed with all middlewares.
currentRequestPromise = Request.processWithMiddleware(requestToProcess, true)
    .then((response) => {
        // A response might indicate that the queue should be paused. This happens when a gap in onyx updates is detected between the client and the server and
        // that gap needs resolved before the queue can continue.
        if (response?.shouldPauseQueue) {
            Log.info("[SequentialQueue] Handled 'shouldPauseQueue' in response. Pausing the queue.");
            pause();
        }
        PersistedRequests.remove(requestToProcess);

So, on slow networks, when we call OpenReport, its response and corresponding request’s successData, etc., are added to queuedOnyxUpdates. This cannot be flushed immediately because there was already an AddComment request added to the PersistedRequests.
For some reason that needs to be fixed, the backend sends a previousUpdateID for AddComment that is larger than the previous lastUpdateID/lastClientUpdateID. This makes doesClientNeedToBeUpdated true.
if (requestsToIgnoreLastUpdateID.includes(request.command) || !OnyxUpdates.doesClientNeedToBeUpdated(Number(response?.previousUpdateID ?? 0))) {

As a result, the AddComment response is not processed directly with OnyxUpdates.apply. Instead, it is passed to saveUpdateInformation, and the queue is paused. While the queue is paused, the updates waiting in queuedOnyxUpdates are not flushed, so OpenReport's successData sits in the queue, the optimistic user detail is not removed, and the report appears to be a group report.

What changes do you think we should make in order to solve the problem?

  1. First, it needs to be checked why the previousUpdateID for the AddComment response is sent as a higher value than the OpenReport call’s lastUpdateID/lastClientUpdateID. Fixing this issue should resolve the problem.
  2. However, even with an accidental call to GetMissingOnyxMessages, the previous updates should not be left in limbo without being written to Onyx until a new write request is made. The updates in the queue need to be written to Onyx after all the missing updates are obtained with GetMissingOnyxMessages. Currently, this is not handled anywhere. Only when the next write request completes and the persisted requests array is empty do all queuedOnyxUpdates get applied.
    We can see this in the present bug as well: if we send another message, everything updates correctly (the backend also does not send an incorrect previousUpdateID) because the queue is unpaused. After processing this request, when the persisted requests are empty, the queuedOnyxUpdates (which include the previous OpenReport and AddComment responses’ onyxData, successData, etc.) are flushed.
    To achieve this, after fetching the missing messages, we should flush the queue finally by changing this:
    DeferredOnyxUpdates.getMissingOnyxUpdatesQueryPromise()?.finally(finalizeUpdatesAndResumeQueue);

    to:
DeferredOnyxUpdates.getMissingOnyxUpdatesQueryPromise()?.finally(finalizeUpdatesAndResumeQueue).then(() => QueuedOnyxUpdates.flushQueue());

What alternative solutions did you explore? (Optional)


melvin-bot bot commented Oct 10, 2024

@parasharrajat, @RachCHopkins Whoops! This issue is 2 days overdue. Let's get this updated quick!

@melvin-bot melvin-bot bot added the Overdue label Oct 10, 2024
@RachCHopkins
Contributor

@parasharrajat do you like any of the proposals here?


melvin-bot bot commented Oct 13, 2024

📣 It's been a week! Do we have any satisfactory proposals yet? Do we need to adjust the bounty for this issue? 💸

@parasharrajat
Member

@RachCHopkins I won't be available from 16 Oct for a few days, please reassign.

@melvin-bot melvin-bot bot removed the Overdue label Oct 14, 2024
@DylanDylann
Contributor

I can take over this issue as C+


melvin-bot bot commented Oct 14, 2024

@parasharrajat @RachCHopkins this issue was created 2 weeks ago. Are we close to approving a proposal? If not, what's blocking us from getting this issue assigned? Don't hesitate to create a thread in #expensify-open-source to align faster in real time. Thanks!

@melvin-bot melvin-bot bot removed the Help Wanted Apply this label when an issue is open to proposals by contributors label Oct 15, 2024
@c3024
Contributor

c3024 commented Oct 22, 2024

That is fine. It is necessary for PersistedRequests to be empty before flushing the queue. While I explained it in the next AddComment part, I missed considering it in my solution.
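
For clarity, a rough sketch of what the adjusted change could look like with that emptiness check added (this sketches the idea discussed above, not code from the eventual PR):

DeferredOnyxUpdates.getMissingOnyxUpdatesQueryPromise()
    ?.finally(finalizeUpdatesAndResumeQueue)
    .then(() => {
        // Only flush the queued Onyx updates once no persisted write requests remain;
        // otherwise a later request could race with updates for the same Onyx key.
        if (PersistedRequests.getAll().length === 0) {
            QueuedOnyxUpdates.flushQueue();
        }
    });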

@aldo-expensify
Contributor

aldo-expensify commented Oct 22, 2024

hmm, thanks @c3024 for that detailed explanation: #49931 (comment)

I gave it a first read, I didn't understand 100% because I'm not very familiar with all that, but I think I understood that there is a backend bug related to the onyx updates ids being passed around. If this is true, I'd prefer to prioritize and fix that first, and then see again where we are at with this bug.

I'll give it some time later today to try to reproduce your detailed steps and see if I can understand/see the bug.


melvin-bot bot commented Oct 25, 2024

@RachCHopkins, @aldo-expensify, @DylanDylann Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

@melvin-bot melvin-bot bot added the Overdue label Oct 25, 2024
@aldo-expensify
Contributor

I'll give it some time later today to try to reproduce your detailed steps and see if I can understand/see the bug.

I failed that day and then I forgot to come back. I'll try to take a better look at this ASAP.

@DylanDylann
Contributor

not overdue, waiting for @aldo-expensify

@melvin-bot melvin-bot bot removed the Overdue label Oct 26, 2024

melvin-bot bot commented Oct 28, 2024

@RachCHopkins @aldo-expensify @DylanDylann this issue is now 4 weeks old, please consider:

  • Finding a contributor to fix the bug
  • Closing the issue if BZ has been unable to add the issue to a VIP or Wave project
  • If you have any questions, don't hesitate to start a discussion in #expensify-open-source

Thanks!

@melvin-bot melvin-bot bot added the Overdue label Oct 28, 2024
@aldo-expensify
Contributor

Sorry, still haven't had time for this yet.

@RachCHopkins
Contributor

Waiting on Aldo. Not overdue.

@melvin-bot melvin-bot bot removed the Overdue label Oct 29, 2024
@aldo-expensify
Contributor

aldo-expensify commented Oct 29, 2024

  1. For all write requests, the backend sends a previousUpdateID and a lastUpdateID. In the above video, the lastUpdateID sent back in the OpenReport response is 2468230213. This means the backend considers this ID to be the last time the client device was updated.
  2. Then, when we made an AddComment request, the server’s response returned a previousUpdateID of 2468230219. This means the backend is informing the client that the freshest update it sent previously to all devices is 2468230219. This is larger than the lastUpdateID (2468230213) saved to the device. The backend is suggesting to the client that there are updates between the OpenReport response (2468230213) and the AddComment response (2468230219). However, this is incorrect because it is impossible for there to be any updates in between, as this is the only client making changes to this account. The backend is sending an incorrect previousUpdateID of 2468230219 when it should be sending the lastUpdateID from the OpenReport response, which is 2468230213.
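
As a simplified illustration of the check described above, using the IDs from the video (this sketches the comparison only; it is not the actual doesClientNeedToBeUpdated implementation, and clientNeedsToBeUpdated is a hypothetical name):

// The client stores the lastUpdateID it has applied (2468230213 from the OpenReport response).
// When the AddComment response arrives carrying previousUpdateID 2468230219, the client concludes
// it has missed updates, pauses the queue, and fetches the gap with GetMissingOnyxMessages.
function clientNeedsToBeUpdated(previousUpdateIDFromServer: number, lastUpdateIDAppliedToClient: number): boolean {
    return lastUpdateIDAppliedToClient < previousUpdateIDFromServer;
}

clientNeedsToBeUpdated(2468230219, 2468230213); // true, even though no other client made changes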

I wasn't able to reproduce this with a slow connection:

image image

but then with these steps I was able to reproduce in dev:

  1. Go offline
  2. Open a chat with a user you haven't chatted before (it can be an account that doesn't exist)
  3. Send a message
  4. Go online
  5. Check the OpenReport's last lastUpdateID and the AddComment's previousUpdateID

I'll investigate what are the updates in between and why these are not matching now.

Update: In the end, I was able to reproduce without having to go offline. I think this happens sometimes and sometimes not.

@aldo-expensify
Contributor

Hmm... I think the lastUpdateID and previousUpdateID mismatch comes from the OpenReport command forwarding a CreateReportAction command to create the whisper "Hi there! This is your first time chatting with...". Sometimes the CreateReportAction finishes fast enough that its updates are counted in the previousUpdateID returned by OpenReport, and sometimes it does not.

Here: https://github.com/Expensify/Auth/blob/d3481121749cc204c713c771a13454dbe0165fbe/auth/lib/Report.cpp#L7751

@aldo-expensify
Contributor

Changing this behaviour in the backend sounds tricky to do. We forward the command to keep OpenReport fast and to reduce DB conflicts.

@aldo-expensify
Contributor

Assigning @nkdengineer since the proposal to fix the frontend makes sense to me too.


melvin-bot bot commented Oct 29, 2024

📣 @nkdengineer 🎉 An offer has been automatically sent to your Upwork account for the Contributor role 🎉 Thanks for contributing to the Expensify app!

Offer link
Upwork job
Please accept the offer and leave a comment on the Github issue letting us know when we can expect a PR to be ready for review 🧑‍💻
Keep in mind: Code of Conduct | Contributing 📖

@melvin-bot melvin-bot bot added Reviewing Has a PR in review Weekly KSv2 and removed Daily KSv2 labels Oct 30, 2024

melvin-bot bot commented Oct 31, 2024

⚠️ Looks like this issue was linked to a Deploy Blocker here

If you are the assigned CME please investigate whether the linked PR caused a regression and leave a comment with the results.

If a regression has occurred and you are the assigned CM follow the instructions here.

If this regression could have been avoided please consider also proposing a recommendation to the PR checklist so that we can avoid it in the future.

@DylanDylann
Contributor


melvin-bot bot commented Nov 1, 2024

⚠️ Looks like this issue was linked to a Deploy Blocker here

If you are the assigned CME please investigate whether the linked PR caused a regression and leave a comment with the results.

If a regression has occurred and you are the assigned CM follow the instructions here.

If this regression could have been avoided please consider also proposing a recommendation to the PR checklist so that we can avoid it in the future.

@aldo-expensify
Contributor

@RachCHopkins I'll be off next week (for the whole week). I don't think this bug is very important, so I think it would be fine to wait until after that. If we don't want to wait, feel free to reassign to another engineer.

The PR #51712 is a bit tricky to review because it is changing some low level behaviour, and I'm not sure if it may cause more harm than good. Someone with more knowledge about the network queue may know better.
