
Exceptions are not properly handled in GrpcClient #297

Closed
marcuslyth opened this issue Jan 10, 2025 · 3 comments · Fixed by #298

Comments

@marcuslyth
Contributor

marcuslyth commented Jan 10, 2025

Hi,

We have had some issues with NotLeaderException and StatusRuntimeException (UNAVAILABLE) resulting in a permanent disconnect that is only solved by a restart of the disconnected client.

This happens consistently whenever a leader election changes which node in our cluster acts as the leader.

From what I can see, the issue seems to be isolated to appending to streams; persistent subscriptions (as far as I am currently aware) seem to reconnect. That might not be the case, but I have not looked into that part further at this point.

I have managed to reproduce this with debug logs, as well as locally on my machine, by connecting to a cluster and then killing the leader and attempting to write to a stream.
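For reference, a rough sketch of that reproduction (assuming a local insecure three-node cluster; the connection string, stream name, and payload are illustrative placeholders, not the actual setup):

// Hypothetical reproduction: cluster endpoints and payload are placeholders.
import com.eventstore.dbclient.EventData;
import com.eventstore.dbclient.EventStoreDBClient;
import com.eventstore.dbclient.EventStoreDBClientSettings;
import com.eventstore.dbclient.EventStoreDBConnectionString;
import java.util.Collections;

public class AppendAfterLeaderKill {
    public static void main(String[] args) throws Exception {
        EventStoreDBClientSettings settings = EventStoreDBConnectionString.parseOrThrow(
                "esdb://localhost:2111,localhost:2112,localhost:2113?tls=false");
        EventStoreDBClient client = EventStoreDBClient.create(settings);

        EventData event = EventData
                .builderAsJson("test-event", Collections.singletonMap("value", 1))
                .build();

        // Kill the current leader node here, then run the append.
        // The append fails with a CompletionException wrapping NotLeaderException,
        // and the client stays disconnected until it is restarted.
        client.appendToStream("repro-stream", event).get();
    }
}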

Whilst debugging the code I think I have identified the issue, which is also apparent if you look at this debug log:

RunWorkItem[baeaec4e-165e-4d9f-bc83-0f4b9e60b42c] completed exceptionally: java.util.concurrent.CompletionException: com.eventstore.dbclient.NotLeaderException

My guess is that the exception produced by com.eventstore.dbclient.ClientTelemetry#traceAppend is not handled in com.eventstore.dbclient.GrpcClient#runWithArgs because GrpcClient expects a NotLeaderException or StatusRuntimeException but receives a CompletionException.

So if I am correct, this should be solvable either by not wrapping the exception in ClientTelemetry or by unwrapping CompletionExceptions in the GrpcClient.

Something as simple as:

// com/eventstore/dbclient/GrpcClient.java:83
if (error instanceof CompletionException) {
    error = error.getCause();
}

Should theoretically solve this, at least for this scenario.
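For context, a standalone illustration (not the library's code) of why the instanceof check fails without unwrapping: CompletableFuture wraps exceptions thrown inside async stages in a CompletionException, so a whenComplete callback receives the wrapper rather than the original exception:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class UnwrapDemo {
    public static void main(String[] args) {
        CompletableFuture<Object> future = CompletableFuture.supplyAsync(() -> {
            throw new IllegalStateException("stand-in for NotLeaderException");
        });

        future.whenComplete((value, error) -> {
            // The callback sees the wrapper, not the original exception.
            System.out.println(error.getClass().getSimpleName()); // CompletionException
            Throwable cause = (error instanceof CompletionException) ? error.getCause() : error;
            System.out.println(cause.getClass().getSimpleName()); // IllegalStateException
        }).exceptionally(t -> null).join();
    }
}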

Another potential solution would be to not wrap the exception at all, but it would require more intrusive changes to the Future chain. For example:

// com/eventstore/dbclient/ClientTelemetry.java:108
.exceptionallyCompose(throwable -> {
    span.setStatus(StatusCode.ERROR);
    span.recordException(throwable);
    span.end();
    return CompletableFuture.failedFuture(throwable);
}).thenApply(writeResult -> {
    span.setStatus(StatusCode.OK);
    span.end();
    return writeResult;
});
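Note that exceptionallyCompose is only available from Java 12 onwards, so this variant assumes the client can rely on that baseline; the upside is that the original throwable keeps propagating down the Future chain instead of being re-wrapped.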

I could provide a PR for this if you agree with me that it should be changed.

If it is of interest, I can provide a small (but dirty) unit test that loosely reproduces this, but as I mentioned above, killing the leader node and then attempting an appendToStream should trigger this error.

@marcuslyth
Contributor Author

I tested this against our clustered test environment, and unwrapping the exception seems to fix the issue.

@marcuslyth
Contributor Author

I created a draft PR with the smallest change that solved the issue for me.

@ylorph

ylorph commented Jan 20, 2025

@RagingKore
