SASL OAuth is sometimes not refreshed and application gets stuck #2329

Open · 3 of 8 tasks
jgn-epp opened this issue Oct 17, 2024 · 3 comments

jgn-epp commented Oct 17, 2024

Description

We are using SASL OAuth for authentication and most of the time it works fine. However, occasionally the token expires and the refresh function (passed to the SetOAuthBearerTokenRefreshHandler method) is not fired, leaving the consumer stuck in an error state where the following message is emitted over and over:

SASL authentication error: {"status":"JWT_EXPIRED"} (after 5054ms in state AUTH_REQ, 6 identical error(s) suppressed)

Once the consumer is in this state, it never tries to refresh its token again; it is effectively dead and the application has to be restarted.
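For context, this is roughly how the handler is wired up; a minimal sketch, where FetchToken and the connection settings are hypothetical placeholders for our identity-provider call and environment:

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "broker:9093",            // placeholder
    GroupId = "my-consumer-group",               // placeholder
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.OAuthBearer,
};

using var consumer = new ConsumerBuilder<Ignore, string>(config)
    .SetOAuthBearerTokenRefreshHandler((client, oauthConfig) =>
    {
        try
        {
            // FetchToken() is a hypothetical stand-in for the call to the
            // identity provider that returns a fresh JWT, its expiry and principal.
            var (token, expiresAtUtc, principal) = FetchToken();
            client.OAuthBearerSetToken(
                token,
                new DateTimeOffset(expiresAtUtc).ToUnixTimeMilliseconds(),
                principal);
        }
        catch (Exception e)
        {
            // Tell librdkafka the refresh failed so it can retry later.
            client.OAuthBearerSetTokenFailure(e.Message);
        }
    })
    .Build();

// Hypothetical token acquisition; replace with the real identity-provider call.
static (string Token, DateTime ExpiresAtUtc, string Principal) FetchToken() =>
    throw new NotImplementedException();
```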

We have multiple environments, and this only seems to happen for some of the consumers in environments with low traffic on the topic; I have never seen the issue in environments with plenty of traffic. My guess is that there are too many partitions relative to the message throughput, so the consumer spends so long blocked in the Consume call that the OAuth refresh function is not invoked before the current token expires. That would also explain why, once the issue happens, the partitions assigned to the now-dead consumer are reassigned to other consumers, which then see enough traffic to keep running without hitting the issue. Originally we were using the Consume(CancellationToken) overload; we switched to the Consume(TimeSpan) overload with a 30 second timeout (sketched below), but the issue still eventually happens.
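For reference, the poll loop now looks roughly like this; a sketch assuming the consumer built in the snippet above and a placeholder topic name:

```csharp
consumer.Subscribe("my-topic");   // placeholder topic name

var cts = new CancellationTokenSource();
while (!cts.IsCancellationRequested)
{
    // Consume(TimeSpan) returns null when no message arrives within the
    // timeout, so the loop regains control regularly even on a quiet topic.
    var result = consumer.Consume(TimeSpan.FromSeconds(30));
    if (result == null) continue;

    Console.WriteLine($"Consumed {result.TopicPartitionOffset}");
}
```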

So there definitely seems to be a bug here that prevents the OAuth token refresh handler from firing when the Consume call blocks for too long. I would also expect that when the consumer starts getting the JWT_EXPIRED error, it would try to refresh its OAuth token rather than staying in an error state and never recovering.
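As an application-level workaround idea (not a fix for the underlying bug), one could watch for authentication errors via the error handler and rebuild the consumer when they appear; a sketch, assuming ErrorCode.Local_Authentication is the code reported for this failure (worth verifying against debug logs):

```csharp
var authFailed = false;

using var consumer = new ConsumerBuilder<Ignore, string>(config)
    .SetOAuthBearerTokenRefreshHandler((c, _) => { /* set token as in the sketch above */ })
    .SetErrorHandler((c, error) =>
    {
        // If the token was never refreshed and authentication starts failing,
        // flag the consumer so the application can dispose and recreate it
        // instead of restarting the whole process.
        if (error.Code == ErrorCode.Local_Authentication)
        {
            authFailed = true;
        }
    })
    .Build();

// The application's poll loop (omitted here) would check authFailed and
// rebuild the consumer when it is set.
```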

We are using version 2.4.0 of the Confluent.Kafka NuGet package.

How to reproduce

I cannot give exact reproduction steps, as I have not been able to reproduce it on demand, but the issue occurs on topics with low traffic where the consumer spends a lot of time blocked in the Consume method.

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version.
  • Apache Kafka version.
  • Client configuration.
  • Operating system.
  • Provide logs (with "debug" : "..." as necessary in configuration).
  • Provide broker log excerpts.
  • Critical issue.
jgn-epp changed the title from "SASL OAuth is sometimes not refreshed in low throughput scenarios" to "SASL OAuth is sometimes not refreshed and application gets stuck" on Oct 17, 2024
anchitj (Member) commented Oct 29, 2024

Can you please provide some debug logs from the time this issue happened?
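For anyone gathering these, a minimal sketch of how debug logging can be enabled on the client (the category list is a suggestion; "all" is the most verbose option):

```csharp
var config = new ConsumerConfig
{
    // ... existing settings ...
    // "security" covers the SASL/OAuth handshake; "broker" and "protocol"
    // show connection state transitions such as AUTH_REQ.
    Debug = "security,broker,protocol",
};

// Optionally route the logs somewhere useful instead of stderr.
var builder = new ConsumerBuilder<Ignore, string>(config)
    .SetLogHandler((c, log) =>
        Console.WriteLine($"{log.Level} {log.Facility} {log.Message}"));
```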

IharYakimush commented

The callback to refresh the OAuth token is not invoked at all if a newly created consumer performs QueryWatermarkOffsets as its first operation (see the sketch after the list below):

  • Case 1: builder.SetOAuthBearerTokenRefreshHandler -> builder.Build -> consumer.QueryWatermarkOffsets -> Confluent.Kafka.KafkaException: Local: All broker connections are down

  • Case 2: builder.SetOAuthBearerTokenRefreshHandler -> builder.Build -> consumer.Consume -> consumer.QueryWatermarkOffsets -> works as expected

  • Case 3: config.SaslOauthbearerMethod = SaslOauthbearerMethod.Oidc -> consumer.QueryWatermarkOffsets -> works as expected
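A sketch of Case 1 vs Case 2 (topic, partition and the refresh handler body are placeholders):

```csharp
using var consumer = new ConsumerBuilder<Ignore, string>(config)
    .SetOAuthBearerTokenRefreshHandler((c, _) => { /* set token here */ })
    .Build();

var tp = new TopicPartition("my-topic", 0);   // placeholder topic/partition

// Case 1: querying watermarks as the very first operation reportedly fails,
// because the refresh handler has never been invoked and no broker
// connection can authenticate:
// var wm = consumer.QueryWatermarkOffsets(tp, TimeSpan.FromSeconds(10));
// -> Confluent.Kafka.KafkaException: Local: All broker connections are down

// Case 2: an initial Consume call triggers the token refresh, after which
// the same query succeeds:
consumer.Subscribe("my-topic");
consumer.Consume(TimeSpan.FromSeconds(5));
var wm = consumer.QueryWatermarkOffsets(tp, TimeSpan.FromSeconds(10));
```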


amid0 commented Dec 18, 2024

I am facing the same issue, on an AWS MSK cluster with IAM authorization. Case 1 and Case 2 behave the same for me, but Case 3 does not work because SaslOauthbearerMethod.Oidc requires additional options such as sasl.oauthbearer.client.id (see the sketch below).
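For reference, a sketch of the additional settings the built-in Oidc method expects (all values are placeholders; whether this path fits MSK IAM is a separate question):

```csharp
var config = new ConsumerConfig
{
    BootstrapServers = "broker:9093",          // placeholder
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.OAuthBearer,
    SaslOauthbearerMethod = SaslOauthbearerMethod.Oidc,
    // The settings below are the "additional options" mentioned above.
    SaslOauthbearerTokenEndpointUrl = "https://idp.example.com/oauth2/token",
    SaslOauthbearerClientId = "my-client-id",
    SaslOauthbearerClientSecret = "my-client-secret",
    SaslOauthbearerScope = "kafka",            // optional, depends on the IdP
};
```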
