Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show poll-failure details #868

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Show poll-failure details #868

wants to merge 4 commits into from

Conversation

PIG208
Copy link
Member

@PIG208 PIG208 commented Aug 6, 2024

Ready for review.

This fixes #555.

@PIG208 PIG208 changed the title WIP Show polling failure details WIP Show poll-failure details Aug 6, 2024
@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from f712ca5 to 08f8815 Compare August 8, 2024 03:51
@PIG208
Copy link
Member Author

PIG208 commented Aug 8, 2024

A question to be answered: how do we determine if an error is transient?

Currently we handle 3 classes of errors in UpdateMachine:

  1. ZulipApiException: The event queue is lost. This is commonly caused by inactivity.
  2. Server5xxException, NetworkException: A server-side bug, or hiccups due to unstable Internet connection.
  3. default: Could be anything that happens in the client code during the getEvents call.

I think it is usually safe to assume that 1 is recoverable, while we are less certain about that for 2 and 3. An idea is to define a number of retries for the client to start displaying error messages. And then in a follow-up PR the client can try to do something different (which is #186).

@PIG208
Copy link
Member Author

PIG208 commented Aug 8, 2024

For non-transient errors, we will probably show a modal instead.

@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from d3a14ee to 54436ee Compare August 8, 2024 23:35
@gnprice
Copy link
Member

gnprice commented Aug 12, 2024

If there's a ZulipApiException other than the event queue being gone, then that's likely non-transient — it's either a bug in the client, or some kind of unexpected incompatibility between server and client. A retry is likely to just hit the same error again.

(Those fall into the default case.)

@gnprice
Copy link
Member

gnprice commented Aug 12, 2024

Other than that point, the description above all looks right.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Quick comments on the high-level behavior.)

Comment on lines 795 to 796
if (accumlatedTransientFailureCount > transientFailureCountNotifyThreshold) {
reportErrorToUserInDialog('Failed to reach server. Will retry: $e');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this case as a toast. It's a server 5xx or a network exception, so it's unlikely there's anything the client should be doing differently about it — no need to get in the user's way and potentially annoy them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's start showing the toasts once it's been a few seconds since we started retrying. The simple way to do that is to make the count threshold about 5, rather than 10 — at the 6th failure, it's been an expected 1.55s since the first failure, and we'll be backing off for an expected 1.6s before the next retry.

@PIG208 PIG208 force-pushed the report-error branch 9 times, most recently from 837a6d0 to b64c99c Compare August 12, 2024 21:29
@PIG208 PIG208 marked this pull request as ready for review August 12, 2024 21:29
@PIG208 PIG208 added the maintainer review PR ready for review by Zulip maintainers label Aug 12, 2024
@PIG208 PIG208 requested a review from chrisbobbe August 12, 2024 21:30
@PIG208 PIG208 changed the title WIP Show poll-failure details Show poll-failure details Aug 12, 2024
Copy link
Collaborator

@chrisbobbe chrisbobbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Comments below.

Then one other comment is, for all the new poll-failure feedback we're giving the user, let's identify which server it's about. This is particularly relevant when the issue is a network problem, but applies whenever the feedback can be significantly delayed. The user might have given up waiting for this account, like a minute or more ago, and is doing other things—maybe even in a different account—when the feedback appears.

lib/log.dart Outdated
@@ -30,3 +33,45 @@ bool debugLog(String message) {
}());
return true;
}

/// Display an error message.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Display an error message in a [SnackBar]." I like how the summary line on reportErrorToUserInDialog's dartdoc says how the error is displayed.

lib/log.dart Outdated
/// This shows a [SnackBar] containing the message after app startup,
/// otherwise logs it to the console.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific about "after app startup"? I can imagine confused readers wondering how the snackbar could be shown at any time other than after the app starts up.

lib/log.dart Outdated
/// This shows a [SnackBar] containing the message after app startup,
/// otherwise logs it to the console.
///
/// See also: [ZulipApp._reportErrorToUserBriefly]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a private method, and also kind of feels like an implementation detail. Is this the kind of information that callers need, to understand if reportErrorToUserBriefly is right for them, and what its expectations/guarantees are like? (Similarly with the dartdoc on reportErrorToUserInDialog.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary goal of including this line is to have a back link to where the setters get used, since it is not obvious. I think it would be more appropriate to exclude that bit from the dart doct, though, considering the point that it's some implementation detail.

lib/log.dart Outdated
/// otherwise logs it to the console.
///
/// See also: [ZulipApp._reportErrorToUserBriefly]
void Function(String message) get reportErrorToUserBriefly => (message) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about reportErrorToUserInSnackbar, to be more informative, and more parallel with reportErrorToUserInDialog?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds like a good choice.

lib/log.dart Outdated

String? get debugLastReportedError => _debugLastReportedError;
String? _debugLastReportedError;
String? takeLastReportedError() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be debugTakeLastReportedError?

// After app startup, reportErrorToUserInDialog displays a dialog.
reportErrorToUserInDialog('test error message');
await tester.pump();
check(finder.evaluate()).single;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe be strengthened with checkErrorDialog; just a thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a bit repetitive, but it should be good to also check the title.

test('retries on Server5xxException', () {
checkRetry(() => connection.prepare(httpStatus: 500, body: 'splat'));
});
group('retries', () {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store test [nfc]: Make a group for retry tests.

Let's leave off the [nfc] in the commit message: #870 (comment)

@@ -801,6 +801,7 @@ class UpdateMachine {
assert(debugLog('Error polling event queue for $store: $e\n'
'Backing off and retrying even though may be hopeless…'));
// TODO tell user on non-transient error in polling
reportErrorToUserInDialog('Error loading server data. Will retry: $e');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store: Reoprt non-transient polling errors.

commit-message nit: spelling

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this TODO can be removed:

            // TODO tell user on non-transient error in polling

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can this be a translated string? If that's not straightforward, let's include a TODO(i18n).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.

Yes, that's #186. Added a line mentioning that.

@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from 96cd1ec to a434078 Compare August 16, 2024 04:16
@PIG208
Copy link
Member Author

PIG208 commented Aug 16, 2024

The PR has been updated. Thanks for the review!

I'm not entirely sure about the quality of the translation strings. These messages will appear more technical at the current stage for the ease of debugging. However, polishing user facing errors might be a post-launch task.

@PIG208
Copy link
Member Author

PIG208 commented Sep 10, 2024

Thanks for the review! The PR has been updated.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

The changes in the revision look good, modulo a couple of comments below.

I've also updated the build on my device, so I'll see how the UX works out.

Would you take the reportErrorToUserBriefly commit and the one it depends on, and send those as a separate PR (which this one can then stack on top of)? That way we can get those changes merged very soon, so they can be used in other features.

I'll take a pass over the commit messages but otherwise I think those two commits are basically ready. For the later commits that actually start giving the user feedback, I'll want a bit of bake time seeing the UX.

@@ -160,6 +160,23 @@
"message": {"type": "String", "example": "Invalid format"}
}
},
"errorLoadingServerDataShort": "Error connecting to Zulip. Retrying…",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"errorLoadingServerDataShort": "Error connecting to Zulip. Retrying…",
"errorConnectingToServerShort": "Error connecting to Zulip. Retrying…",

(didn't update that part in my previous suggestion, oops)

Comment on lines 406 to 409
// This overrides the function globally, but we don't need to worry about
// leaking the state because that is only relevant to widget tests.
// See `lib/log.dart ` for details.
reportErrorToUserBriefly = logAndReportErrorToUserBriefly;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to just systematically avoid leaking state between tests.

In general the way to do that is to (a) set it up from within a test case, and then (b) use addTearDown to ensure it gets reset when the test case is done. Given the function definition above, I think code like

  test('some test', () {
    reportErrorToUserBriefly = logAndReportErrorToUserBriefly;
    addTearDown(reportErrorToUserBriefly = defaultReportErrorToUserBriefly);
    // … rest of test …

should do it.

When the setup is more complicated than that, that's the sort of thing that we write helper functions named like "prepareFoo" for.

@gnprice
Copy link
Member

gnprice commented Sep 11, 2024

Quick observation on the UX:

I reopened the app after a while. It showed the loading indicator at top for a while, then successfully reconnected.

But then for some time after that, it showed a chain of several rounds of snack bars saying there was an error, followed by one saying "reconnecting". Each one lasts several seconds, and there were about three of them that started after the loading indicator finished.

I hit the "details" link on a couple of the errors. Both were NetworkException, saying the host name chat.zulip.org couldn't be resolved.

@PIG208
Copy link
Member Author

PIG208 commented Sep 11, 2024

Opened #935 for the two commits implementing the report error feature,
and #936 for the other nfc commit.

@PIG208 PIG208 force-pushed the report-error branch 4 times, most recently from beac99b to 9cd37ea Compare September 12, 2024 20:54
@PIG208 PIG208 removed the integration review Added by maintainers when PR may be ready for integration label Sep 13, 2024
@PIG208 PIG208 marked this pull request as draft September 24, 2024 01:06
@PIG208 PIG208 marked this pull request as ready for review September 27, 2024 23:01
@PIG208 PIG208 added the integration review Added by maintainers when PR may be ready for integration label Sep 27, 2024
@PIG208 PIG208 requested a review from gnprice September 27, 2024 23:01
@PIG208
Copy link
Member Author

PIG208 commented Sep 28, 2024

Updated the PR to ignore NetworkException where the underlying cause is a SocketException, added some more documentation and cleaned up some diffs to make them easier to read. Feel free to install a build of this PR to test it out!

We extend the checkRetry helper to support simulating multiple
consecutive errors.

We delay reporting transient errors until we have retried a certain
number of times, because they might recover on their own.

Signed-off-by: Zixuan James Li <[email protected]>
Usually when a client goes back from sleep or lose network connection,
there will be a bunch of errors when polling event queue.  This known
error will always be a `NetworkException`, so we make a separate case
for it.

Previously, we may report these errors, and it ended up being spammy.
This filters out the more interesting one to report instead.

Signed-off-by: Zixuan James Li <[email protected]>
@gnprice
Copy link
Member

gnprice commented Oct 4, 2024

Thanks for the update! Installing on my phone now.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've observed this on my phone for a while; and it's been pretty quiet, which is good.

After seeing it in action, I think we can leave out the last commit, with the "Reconnecting to {serverUrl}…" message. It feels redundant with the loading indicator at the top; and the fact that the toast disappears on its own schedule while the loading indicator can be still going seems incongruous, and likely to be confusing.

Otherwise this generally all looks good; just a few comments below.

(I also feel like the UpdateMachine.poll method has gotten kind of unwieldy; maybe I'll take a stab at reorganizing that after merging this PR. It's already kind of unwieldy in main.)

Comment on lines +517 to +535
if (i != numFailedRequests - 1) {
// Skip polling backoff unless this is the final iteration.
// This allows the next `updateMachine.debugAdvanceLoop` call to
// trigger the next request without wait.
async.flushTimers();
}

if (!expectNotifyError) {
// No matter how many retries there have been, no request should
// result in an error message.
check(takeLastReportedError()).isNull();
continue;
}
if (i < numFailedRequests - 1) {
// The error message should not appear until the `updateMachine`
// has retried the given number of times.
check(takeLastReportedError()).isNull();
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this test (or set of tests) gets fairly tangly to follow.

I think the main thing is that it'd be helpful to test two aspects separately: there's the existing test cases that just check that we retry, and let's have separate new test cases that just check the error-reporting.

Then I think both sets of tests can be a lot simpler than this — most of the logic is only relevant for one aspect or the other. It's OK that some other fragments of code will be duplicated between them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately and more tactically, I think it'd help to pull the last iteration here outside of the loop — have the loop only concern itself with the first N-1 requests. That would avoid these multiple conditionals about whether it's the last iteration.

Comment on lines +857 to +865
if (e.cause is! SocketException) {
// Heuristic check to only report interesting errors to the user.
// This currently ignores [SocketException], which typically
// occurs when the client goes back from sleep.
// TODO: Investigate if there are other cases of [SocketException]
// that might turn out to be interesting.
//
// See also:
// * [NetworkException.cause], which is the underlying exception.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (e.cause is! SocketException) {
// Heuristic check to only report interesting errors to the user.
// This currently ignores [SocketException], which typically
// occurs when the client goes back from sleep.
// TODO: Investigate if there are other cases of [SocketException]
// that might turn out to be interesting.
//
// See also:
// * [NetworkException.cause], which is the underlying exception.
if (e.cause is! SocketException) {
// Heuristic check to only report interesting errors to the user.
// A [SocketException] is common when the app returns from sleep.

I think the TODO isn't actionable.

The "See also" isn't needed — .cause is just above, and the reader can follow that to its definition using an IDE.

Then while at it I tightened the sentence about SocketException. No need to say what the code does, as the code here speaks for itself; only to explain why we chose to make the code do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration review Added by maintainers when PR may be ready for integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Show detailed poll-failure feedback, in beta
3 participants