
Exceptions are not properly handled in GrpcClient #297

Closed
marcuslyth opened this issue Jan 10, 2025 · 3 comments · Fixed by #298

Comments

@marcuslyth
Contributor

marcuslyth commented Jan 10, 2025

Hi,

We have had some issues with NotLeaderException and StatusRuntimeException (UNAVAILABLE) resulting in a permanent disconnect that is only solved by a restart of the disconnected client.

This happens consistently whenever a leader election changes which node in our cluster acts as the leader.

From what I can see, the issue seems to be isolated to appending to streams; persistent subscriptions (as far as I am currently aware) seem to reconnect. That might not be the case, but I have not looked into that part further at this point.

I have managed to reproduce this with debug logs, as well as locally on my machine, by connecting to a cluster and then killing the leader and attempting to write to a stream.
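For reference, a rough sketch of that reproduction (assuming a local insecure three-node cluster; the connection string, stream name, and payload are illustrative placeholders, not the actual setup):

// Hypothetical reproduction: cluster endpoints and payload are placeholders.
import com.eventstore.dbclient.EventData;
import com.eventstore.dbclient.EventStoreDBClient;
import com.eventstore.dbclient.EventStoreDBClientSettings;
import com.eventstore.dbclient.EventStoreDBConnectionString;
import java.util.Collections;

public class AppendAfterLeaderKill {
    public static void main(String[] args) throws Exception {
        EventStoreDBClientSettings settings = EventStoreDBConnectionString.parseOrThrow(
                "esdb://localhost:2111,localhost:2112,localhost:2113?tls=false");
        EventStoreDBClient client = EventStoreDBClient.create(settings);

        EventData event = EventData
                .builderAsJson("test-event", Collections.singletonMap("value", 1))
                .build();

        // Kill the current leader node here, then run the append.
        // The append fails with a CompletionException wrapping NotLeaderException,
        // and the client stays disconnected until it is restarted.
        client.appendToStream("repro-stream", event).get();
    }
}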

Whilst debugging the code I think I have identified the issue, which is also apparent if you look at this debug log:

RunWorkItem[baeaec4e-165e-4d9f-bc83-0f4b9e60b42c] completed exceptionally: java.util.concurrent.CompletionException: com.eventstore.dbclient.NotLeaderException

My guess is that the exception produced by com.eventstore.dbclient.ClientTelemetry#traceAppend is not handled in com.eventstore.dbclient.GrpcClient#runWithArgs because GrpcClient expects a NotLeaderException or StatusRuntimeException but receives a CompletionException.

So if I am correct, this should be solvable either by not wrapping the exception in ClientTelemetry or by unwrapping CompletionExceptions in the GrpcClient.

Something as simple as:

// com/eventstore/dbclient/GrpcClient.java:83
if (error instanceof CompletionException) {
    error = error.getCause();
}

Should theoretically solve this, at least for this scenario.
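For context, a standalone illustration (not the library's code) of why the instanceof check fails without unwrapping: CompletableFuture wraps exceptions thrown inside async stages in a CompletionException, so a whenComplete callback receives the wrapper rather than the original exception:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class UnwrapDemo {
    public static void main(String[] args) {
        CompletableFuture<Object> future = CompletableFuture.supplyAsync(() -> {
            throw new IllegalStateException("stand-in for NotLeaderException");
        });

        future.whenComplete((value, error) -> {
            // The callback sees the wrapper, not the original exception.
            System.out.println(error.getClass().getSimpleName()); // CompletionException
            Throwable cause = (error instanceof CompletionException) ? error.getCause() : error;
            System.out.println(cause.getClass().getSimpleName()); // IllegalStateException
        }).exceptionally(t -> null).join();
    }
}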

Another potential solution would be to not wrap the exception at all, but it would require more intrusive changes to the Future chain. For example:

// com/eventstore/dbclient/ClientTelemetry.java:108
.exceptionallyCompose(throwable -> {
    span.setStatus(StatusCode.ERROR);
    span.recordException(throwable);
    span.end();
    return CompletableFuture.failedFuture(throwable);
}).thenApply(writeResult -> {
    span.setStatus(StatusCode.OK);
    span.end();
    return writeResult;
});
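Note that exceptionallyCompose is only available from Java 12 onwards, so this variant assumes the client can rely on that baseline; the upside is that the original throwable keeps propagating down the Future chain instead of being re-wrapped.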

I could provide a PR for this if you agree with me that it should be changed.

If it is of interest, I can provide a small (but dirty) unit test that loosely reproduces this, but as I mentioned above, killing the leader node and then attempting an appendToStream should trigger this error.

@marcuslyth
Contributor Author

I tested this against our clustered test environment, and unwrapping the exception seems to fix the issue.

@marcuslyth
Contributor Author

I created a draft PR with the smallest change that solved the issue for me.

@ylorph

ylorph commented Jan 20, 2025

@RagingKore
