We are using SASL OAuth for authentication and most of the time it works fine. However, occasionally the token expires and the refresh function (passed into the SetOAuthBearerTokenRefreshHandler method) is not fired, leaving the consumer stuck in an error state where the following message is emitted over and over:
SASL authentication error: {"status":"JWT_EXPIRED"} (after 5054ms in state AUTH_REQ, 6 identical error(s) suppressed)
When the consumer gets into this state, it never attempts to refresh its token again; the consumer is effectively dead and the application has to be restarted.
We have multiple environments, and this seemingly only happens to some of the consumers in environments with low traffic on the topic - at least I have never seen the issue in the environments with plenty of traffic. So perhaps this is an issue with too many partitions relative to the message throughput, causing the consumer to spend so long blocked in the Consume method call that the OAuth refresh function isn't invoked before the current token expires. At least that is my guess: once this issue happens, the partitions assigned to the now-dead consumer are re-assigned to other consumers, and there is seemingly enough traffic on those consumers that they keep running without hitting the issue. Originally we were using the Consume(CancellationToken) overload, but we also tried the Consume(TimeSpan) overload with a 30-second timeout and the issue still eventually happens.
So there definitely seems to be a bug here causing the OAuth token refresh handler not to fire if the Consume call blocks for too long. I would also expect that if the consumer starts getting the JWT_EXPIRED error, it should default to trying to refresh its OAuth token instead of staying in an error state and never recovering.
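For reference, our setup looks roughly like the following sketch. The broker address, group id, topic name, and the FetchToken helper are all placeholders for our actual code, not real values:

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "broker:9092",              // placeholder
    GroupId = "my-consumer-group",                 // placeholder
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.OAuthBearer,
};

using var consumer = new ConsumerBuilder<string, string>(config)
    .SetOAuthBearerTokenRefreshHandler((client, cfg) =>
    {
        // Expected to fire before the current token expires;
        // in low-traffic topics it sometimes never does.
        var (accessToken, expiresAtMs, principal) = FetchToken();
        client.OAuthBearerSetToken(accessToken, expiresAtMs, principal);
    })
    .Build();

consumer.Subscribe("my-topic");                    // placeholder
while (true)
{
    // TimeSpan overload so the call returns periodically;
    // the refresh handler still eventually fails to fire.
    var result = consumer.Consume(TimeSpan.FromSeconds(30));
    if (result != null)
    {
        // process result.Message ...
    }
}

// Placeholder: fetches a token from our identity provider and returns
// the access token, its expiry as ms since the Unix epoch, and the principal.
static (string, long, string) FetchToken() => throw new NotImplementedException();
```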
We are using version 2.4.0 of the Confluent.Kafka NuGet package.
How to reproduce
I cannot tell you exactly how to reproduce it as I haven't been able to do so myself, but I can say that this issue happens on topics with a low amount of traffic, where the consumer spends a lot of time in the blocking Consume method.
Checklist
Please provide the following information:
A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
Confluent.Kafka nuget version.
Apache Kafka version.
Client configuration.
Operating system.
Provide logs (with "debug" : "..." as necessary in configuration).
Provide broker log excerpts.
Critical issue.
jgn-epp changed the title from "SASL OAuth is sometimes not refreshed in low throughput scenarios" to "SASL OAuth is sometimes not refreshed and application gets stuck" on Oct 17, 2024.
The callback to refresh the OAuth token is not invoked at all if a newly created consumer performs QueryWatermarkOffsets as its first operation.
Case 1: builder.SetOAuthBearerTokenRefreshHandler -> builder.Build -> consumer.QueryWatermarkOffsets -> Confluent.Kafka.KafkaException: Local: All broker connections are down
Case 2: builder.SetOAuthBearerTokenRefreshHandler -> builder.Build -> consumer.Consume -> consumer.QueryWatermarkOffsets -> works as expected
Case 3: config.SaslOauthbearerMethod = SaslOauthbearerMethod.Oidc -> consumer.QueryWatermarkOffsets -> works as expected
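A minimal sketch of Case 1 (the config object, topic name, and RefreshHandler are placeholders for the reporter's actual setup):

```csharp
using System;
using Confluent.Kafka;

// Case 1: QueryWatermarkOffsets as the very first operation after Build.
// The token refresh handler is never invoked, so no broker connection
// can be authenticated.
using var consumer = new ConsumerBuilder<Ignore, Ignore>(config)
    .SetOAuthBearerTokenRefreshHandler(RefreshHandler)   // placeholder handler
    .Build();

// Throws Confluent.Kafka.KafkaException: Local: All broker connections are down
var offsets = consumer.QueryWatermarkOffsets(
    new TopicPartition("my-topic", 0), TimeSpan.FromSeconds(10));
```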
Facing the same issue.
AWS MSK cluster with IAM authorization.
Case 1 and Case 2 behave the same for us.
But Case 3 does not work for us, because SaslOauthbearerMethod.Oidc requires additional options, such as sasl.oauthbearer.client.id and others.
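Based on Case 2 above, a possible workaround (an assumption on our part, not a documented fix) is to issue a short Consume call first, so the client's poll loop gets a chance to invoke the refresh handler before any metadata request is made. The topic name is a placeholder:

```csharp
consumer.Subscribe("my-topic");                    // placeholder topic

// A brief poll lets the client service the token refresh callback
// before QueryWatermarkOffsets needs an authenticated connection.
consumer.Consume(TimeSpan.FromMilliseconds(500));

var offsets = consumer.QueryWatermarkOffsets(
    new TopicPartition("my-topic", 0), TimeSpan.FromSeconds(10));
```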