Show poll-failure details #868

PIG208 · 2024-08-06T21:41:06Z

Ready for review.

This fixes #555.

PIG208 · 2024-08-08T04:00:42Z

A question to be answered: how do we determine if an error is transient?

Currently we handle 3 classes of errors in UpdateMachine:

ZulipApiException: The event queue is lost. This is commonly caused by inactivity.
Server5xxException, NetworkException: A server-side bug, or hiccups due to unstable Internet connection.
default: Could be anything that happens in the client code during the getEvents call.

I think it is usually safe to assume that 1 is recoverable, while we are less certain about that for 2 and 3. An idea is to define a number of retries for the client to start displaying error messages. And then in a follow-up PR the client can try to do something different (which is #186).

PIG208 · 2024-08-08T22:07:20Z

For non-transient errors, we will probably show a modal instead.

gnprice · 2024-08-12T01:08:06Z

If there's a ZulipApiException other than the event queue being gone, then that's likely non-transient — it's either a bug in the client, or some kind of unexpected incompatibility between server and client. A retry is likely to just hit the same error again.

(Those fall into the default case.)

gnprice · 2024-08-12T01:10:20Z

Other than that point, the description above all looks right.

gnprice

(Quick comments on the high-level behavior.)

gnprice · 2024-08-12T01:21:37Z

lib/model/store.dart

+            if (accumlatedTransientFailureCount > transientFailureCountNotifyThreshold) {
+              reportErrorToUserInDialog('Failed to reach server. Will retry: $e');


Let's leave this case as a toast. It's a server 5xx or a network exception, so it's unlikely there's anything the client should be doing differently about it — no need to get in the user's way and potentially annoy them.

Then let's start showing the toasts once it's been a few seconds since we started retrying. The simple way to do that is to make the count threshold about 5, rather than 10 — at the 6th failure, it's been an expected 1.55s since the first failure, and we'll be backing off for an expected 1.6s before the next retry.

chrisbobbe

Thanks! Comments below.

Then one other comment is, for all the new poll-failure feedback we're giving the user, let's identify which server it's about. This is particularly relevant when the issue is a network problem, but applies whenever the feedback can be significantly delayed. The user might have given up waiting for this account, like a minute or more ago, and is doing other things—maybe even in a different account—when the feedback appears.

chrisbobbe · 2024-08-15T19:01:59Z

lib/log.dart

@@ -30,3 +33,45 @@ bool debugLog(String message) {
  }());
  return true;
 }
+
+/// Display an error message.


How about "Display an error message in a [SnackBar]." I like how the summary line on reportErrorToUserInDialog's dartdoc says how the error is displayed.

chrisbobbe · 2024-08-15T19:05:29Z

lib/log.dart

+/// This shows a [SnackBar] containing the message after app startup,
+/// otherwise logs it to the console.


Can we be more specific about "after app startup"? I can imagine confused readers wondering how the snackbar could be shown at any time other than after the app starts up.

chrisbobbe · 2024-08-15T19:07:47Z

lib/log.dart

+/// This shows a [SnackBar] containing the message after app startup,
+/// otherwise logs it to the console.
+///
+/// See also: [ZulipApp._reportErrorToUserBriefly]


This is a private method, and also kind of feels like an implementation detail. Is this the kind of information that callers need, to understand if reportErrorToUserBriefly is right for them, and what its expectations/guarantees are like? (Similarly with the dartdoc on reportErrorToUserInDialog.)

The primary goal of including this line is to have a back link to where the setters get used, since it is not obvious. I think it would be more appropriate to exclude that bit from the dart doct, though, considering the point that it's some implementation detail.

chrisbobbe · 2024-08-15T19:08:54Z

lib/log.dart

+/// otherwise logs it to the console.
+///
+/// See also: [ZulipApp._reportErrorToUserBriefly]
+void Function(String message) get reportErrorToUserBriefly => (message) {


How about reportErrorToUserInSnackbar, to be more informative, and more parallel with reportErrorToUserInDialog?

Yeah that sounds like a good choice.

chrisbobbe · 2024-08-15T19:10:06Z

lib/log.dart

+
+String? get debugLastReportedError => _debugLastReportedError;
+String? _debugLastReportedError;
+String? takeLastReportedError() {


Should this be debugTakeLastReportedError?

chrisbobbe · 2024-08-15T19:18:50Z

test/widgets/app_test.dart

+      // After app startup, reportErrorToUserInDialog displays a dialog.
+      reportErrorToUserInDialog('test error message');
+      await tester.pump();
+      check(finder.evaluate()).single;


Could maybe be strengthened with checkErrorDialog; just a thought.

Could be a bit repetitive, but it should be good to also check the title.

chrisbobbe · 2024-08-15T19:20:31Z

test/model/store_test.dart

-    test('retries on Server5xxException', () {
-      checkRetry(() => connection.prepare(httpStatus: 500, body: 'splat'));
-    });
+    group('retries', () {


store test [nfc]: Make a group for retry tests.

Let's leave off the [nfc] in the commit message: #870 (comment)

chrisbobbe · 2024-08-15T19:21:19Z

lib/model/store.dart

@@ -801,6 +801,7 @@ class UpdateMachine {
            assert(debugLog('Error polling event queue for $store: $e\n'
                'Backing off and retrying even though may be hopeless…'));
            // TODO tell user on non-transient error in polling
+            reportErrorToUserInDialog('Error loading server data. Will retry: $e');


store: Reoprt non-transient polling errors.

commit-message nit: spelling

Also this TODO can be removed:

// TODO tell user on non-transient error in polling

Also, can this be a translated string? If that's not straightforward, let's include a TODO(i18n).

Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.

Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.

Yes, that's #186. Added a line mentioning that.

PIG208 · 2024-08-16T04:19:29Z

The PR has been updated. Thanks for the review!

I'm not entirely sure about the quality of the translation strings. These messages will appear more technical at the current stage for the ease of debugging. However, polishing user facing errors might be a post-launch task.

PIG208 · 2024-09-28T05:09:27Z

Updated the PR to ignore NetworkException where the underlying cause is a SocketException, added some more documentation and cleaned up some diffs to make them easier to read. Feel free to install a build of this PR to test it out!

gnprice · 2024-10-04T22:18:06Z

Thanks for the update! Installing on my phone now.

gnprice

OK, I've observed this on my phone for a while; and it's been pretty quiet, which is good.

After seeing it in action, I think we can leave out the last commit, with the "Reconnecting to {serverUrl}…" message. It feels redundant with the loading indicator at the top; and the fact that the toast disappears on its own schedule while the loading indicator can be still going seems incongruous, and likely to be confusing.

Otherwise this generally all looks good; just a few comments below.

(I also feel like the UpdateMachine.poll method has gotten kind of unwieldy; maybe I'll take a stab at reorganizing that after merging this PR. It's already kind of unwieldy in main.)

gnprice · 2024-10-17T22:31:45Z

test/model/store_test.dart

+            if (i != numFailedRequests - 1) {
+              // Skip polling backoff unless this is the final iteration.
+              // This allows the next `updateMachine.debugAdvanceLoop` call to
+              // trigger the next request without wait.
+              async.flushTimers();
+            }
+
+            if (!expectNotifyError) {
+              // No matter how many retries there have been, no request should
+              // result in an error message.
+              check(takeLastReportedError()).isNull();
+              continue;
+            }
+            if (i < numFailedRequests - 1) {
+              // The error message should not appear until the `updateMachine`
+              // has retried the given number of times.
+              check(takeLastReportedError()).isNull();
+              continue;
+            }


The logic in this test (or set of tests) gets fairly tangly to follow.

I think the main thing is that it'd be helpful to test two aspects separately: there's the existing test cases that just check that we retry, and let's have separate new test cases that just check the error-reporting.

Then I think both sets of tests can be a lot simpler than this — most of the logic is only relevant for one aspect or the other. It's OK that some other fragments of code will be duplicated between them.

Separately and more tactically, I think it'd help to pull the last iteration here outside of the loop — have the loop only concern itself with the first N-1 requests. That would avoid these multiple conditionals about whether it's the last iteration.

gnprice · 2024-10-17T22:39:47Z

lib/model/store.dart

+            if (e.cause is! SocketException) {
+              // Heuristic check to only report interesting errors to the user.
+              // This currently ignores [SocketException], which typically
+              // occurs when the client goes back from sleep.
+              // TODO: Investigate if there are other cases of [SocketException]
+              //   that might turn out to be interesting.
+              //
+              // See also:
+              //  * [NetworkException.cause], which is the underlying exception.


Suggested change

if (e.cause is! SocketException) {

// Heuristic check to only report interesting errors to the user.

// This currently ignores [SocketException], which typically

// occurs when the client goes back from sleep.

// TODO: Investigate if there are other cases of [SocketException]

// that might turn out to be interesting.

//

// See also:

// * [NetworkException.cause], which is the underlying exception.

if (e.cause is! SocketException) {

// Heuristic check to only report interesting errors to the user.

// A [SocketException] is common when the app returns from sleep.

I think the TODO isn't actionable.

The "See also" isn't needed — .cause is just above, and the reader can follow that to its definition using an IDE.

Then while at it I tightened the sentence about SocketException. No need to say what the code does, as the code here speaks for itself; only to explain why we chose to make the code do that.

PIG208 · 2024-10-21T20:38:47Z

Thanks for the feedback! Dropped the last commit and previous changes relevant to the tests. A new group of tests is now responsible for testing the error reporting feature.

gnprice

Thanks for the revision! Just small comments below, all in the tests.

And as a note to myself, there's a followup I mentioned at #868 (review) I might do after merge.

gnprice · 2024-10-22T01:17:07Z

test/model/store_test.dart

@@ -652,6 +655,90 @@ void main() {
    test('retries on MalformedServerResponseException', () {
      checkRetry(() => connection.prepare(httpStatus: 200, body: 'nonsense'));
    });
+
+    group('reportErrorToUserBriefly', () {


nit:

Suggested change

group('reportErrorToUserBriefly', () {

group('error reporting', () {

Otherwise it sounds like these are tests of reportErrorToUserBriefly.

gnprice · 2024-10-22T01:22:18Z

test/model/store_test.dart

+      late void Function() prepareError;
+
+      void pollAndFail(FakeAsync async) {
+        prepareError();
+        updateMachine.debugAdvanceLoop();


Suggested change

late void Function() prepareError;

void pollAndFail(FakeAsync async) {

prepareError();

updateMachine.debugAdvanceLoop();

void pollAndFail(FakeAsync async) {

updateMachine.debugAdvanceLoop();

Instead the callers can say connection.prepare(…) (i.e., the body of what they would set prepareError to) just before each call to pollAndFail. That isn't really any longer — it makes a very slight duplication of the content of those connection.prepare calls — and it makes the control flow substantially more straightforward.

gnprice · 2024-10-22T01:24:12Z

test/model/store_test.dart

+          // This would skip the pending polling backoff.
+          async.flushTimers();


Suggested change

// This would skip the pending polling backoff.

async.flushTimers();

// This skips the pending polling backoff.

async.flushTimers();

I.e. this does skip that backoff, right?

Or am I missing a subtlety where it doesn't actually do so, but only would do so if something were different?

PIG208 · 2024-10-23T18:31:35Z

Thanks! Ready for review now.

We delay reporting transient errors until we have retried a certain number of times, because they might recover on their own. Signed-off-by: Zixuan James Li <[email protected]>

Usually when a client goes back from sleep or lose network connection, there will be a bunch of errors when polling event queue. This known error will always be a `NetworkException`, so we make a separate case for it. Previously, we may report these errors, and it ended up being spammy. This filters out the more interesting one to report instead. Fixes: zulip#555 Signed-off-by: Zixuan James Li <[email protected]>

gnprice · 2024-10-24T00:39:19Z

Looks good — merging.

PIG208 changed the title ~~WIP Show polling failure details~~ WIP Show poll-failure details Aug 6, 2024

PIG208 force-pushed the report-error branch 2 times, most recently from f712ca5 to 08f8815 Compare August 8, 2024 03:51

PIG208 force-pushed the report-error branch 2 times, most recently from d3a14ee to 54436ee Compare August 8, 2024 23:35

gnprice reviewed Aug 12, 2024

View reviewed changes

PIG208 force-pushed the report-error branch 9 times, most recently from 837a6d0 to b64c99c Compare August 12, 2024 21:29

PIG208 marked this pull request as ready for review August 12, 2024 21:29

PIG208 added the maintainer review PR ready for review by Zulip maintainers label Aug 12, 2024

PIG208 requested a review from chrisbobbe August 12, 2024 21:30

PIG208 changed the title ~~WIP Show poll-failure details~~ Show poll-failure details Aug 12, 2024

chrisbobbe reviewed Aug 15, 2024

View reviewed changes

PIG208 force-pushed the report-error branch 2 times, most recently from 96cd1ec to a434078 Compare August 16, 2024 04:16

PIG208 force-pushed the report-error branch from a434078 to 148e14e Compare August 16, 2024 04:20

PIG208 requested a review from chrisbobbe August 20, 2024 18:12

PIG208 force-pushed the report-error branch from 148e14e to cb21135 Compare August 21, 2024 21:19

PIG208 force-pushed the report-error branch 4 times, most recently from beac99b to 9cd37ea Compare September 12, 2024 20:54

PIG208 removed the integration review Added by maintainers when PR may be ready for integration label Sep 13, 2024

PIG208 marked this pull request as draft September 24, 2024 01:06

PIG208 force-pushed the report-error branch from 9cd37ea to 63293f6 Compare September 27, 2024 22:59

PIG208 marked this pull request as ready for review September 27, 2024 23:01

PIG208 added the integration review Added by maintainers when PR may be ready for integration label Sep 27, 2024

PIG208 requested a review from gnprice September 27, 2024 23:01

PIG208 assigned gnprice Sep 27, 2024

PIG208 force-pushed the report-error branch from 63293f6 to 90370a7 Compare September 30, 2024 22:01

gnprice reviewed Oct 17, 2024

View reviewed changes

PIG208 force-pushed the report-error branch 2 times, most recently from 5f75ee2 to 6c58f92 Compare October 21, 2024 20:38

PIG208 requested a review from gnprice October 21, 2024 20:44

gnprice reviewed Oct 22, 2024

View reviewed changes

PIG208 force-pushed the report-error branch from 6c58f92 to 32d08c9 Compare October 22, 2024 14:29

PIG208 added 2 commits October 23, 2024 17:38

store: Report polling errors with details

4fdeb27

We delay reporting transient errors until we have retried a certain number of times, because they might recover on their own. Signed-off-by: Zixuan James Li <[email protected]>

gnprice force-pushed the report-error branch from 32d08c9 to 33c365a Compare October 24, 2024 00:38

gnprice merged commit 33c365a into zulip:main Oct 24, 2024

PIG208 deleted the report-error branch October 24, 2024 15:26

gnprice mentioned this pull request Oct 24, 2024

Slow sometimes to replace event queue after expiry #809

Open

gnprice mentioned this pull request Nov 23, 2024

Loading indicator slow to go away when returning to app, though in fact up to date #979

Open

		if (accumlatedTransientFailureCount > transientFailureCountNotifyThreshold) {
		reportErrorToUserInDialog('Failed to reach server. Will retry: $e');

		/// This shows a [SnackBar] containing the message after app startup,
		/// otherwise logs it to the console.

	group('reportErrorToUserBriefly', () {
	group('error reporting', () {

		// This would skip the pending polling backoff.
		async.flushTimers();

Show poll-failure details #868

Show poll-failure details #868

Uh oh!

Conversation

PIG208 commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PIG208 commented Aug 8, 2024

Uh oh!

PIG208 commented Aug 8, 2024

Uh oh!

gnprice commented Aug 12, 2024

Uh oh!

gnprice commented Aug 12, 2024

Uh oh!

gnprice left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chrisbobbe left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PIG208 commented Aug 16, 2024

Uh oh!

PIG208 commented Sep 28, 2024

Uh oh!

gnprice commented Oct 4, 2024

Uh oh!

gnprice left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PIG208 commented Oct 21, 2024

Uh oh!

gnprice left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

PIG208 commented Aug 6, 2024 •

edited

Loading