
AWS MSK fresh cluster first apply fails because SASL SCRAM secret association is delayed #364

Open
eroznik opened this issue Dec 12, 2023 · 0 comments


eroznik commented Dec 12, 2023

Hello,

Everything described below is implemented within a single TF module that uses multiple providers and external modules. The module is applied in one run. It is also important to note that this happens only on the first run (i.e. a fresh apply/install); if the user re-applies the stack, it passes successfully. So a re-apply can be considered a workaround, but it is less than ideal.

The approximate TF apply sequence/scenario is:

  1. create the MSK cluster (SASL SCRAM auth enabled)
  2. create the AWS SM secret for the "kafka-provider" user
  3. associate the secret with the MSK cluster
  4. apply the Kafka provider resources (topics, ACLs, ...)
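
The sequence above can be sketched roughly as follows (resource and variable names are illustrative, not taken from the actual module; broker/network configuration is omitted for brevity):

```hcl
resource "aws_msk_cluster" "this" {
  cluster_name           = "example"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  client_authentication {
    sasl {
      scram = true
    }
  }
  # broker_node_group_info etc. omitted for brevity
}

# SCRAM secrets for MSK must use the "AmazonMSK_" name prefix and be
# encrypted with a customer-managed KMS key.
resource "aws_secretsmanager_secret" "kafka_provider" {
  name       = "AmazonMSK_kafka-provider"
  kms_key_id = aws_kms_key.msk.key_id
}

resource "aws_secretsmanager_secret_version" "kafka_provider" {
  secret_id     = aws_secretsmanager_secret.kafka_provider.id
  secret_string = jsonencode({
    username = "kafka-provider"
    password = var.kafka_provider_password
  })
}

resource "aws_msk_scram_secret_association" "this" {
  cluster_arn     = aws_msk_cluster.this.arn
  secret_arn_list = [aws_secretsmanager_secret.kafka_provider.arn]

  depends_on = [aws_secretsmanager_secret_version.kafka_provider]
}

# Kafka provider resource; on a fresh cluster this fails even with the
# explicit dependency, because the association is not yet effective on
# the brokers when Terraform reaches this step.
resource "kafka_topic" "a" {
  name               = "topic-a"
  partitions         = 3
  replication_factor = 3

  depends_on = [aws_msk_scram_secret_association.this]
}
```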

What happens is that even though the resource dependencies are clearly defined and properly detected by TF, so each of the steps above executes one after another, there is simply not enough delay between steps 3 and 4 for MSK to register the secret association and allow the freshly associated SASL SCRAM user to authenticate with the cluster.

Let me add a few logs with timestamps to clarify the problem a bit more:

Terraform:

[2023-12-12T13:42:59.268200Z] module.xxx.msk-association: Creation complete after 1s
[2023-12-12T13:42:59.336920Z] module.xxx.msk-topic-A: Creating...
[2023-12-12T13:43:00.648721Z] module.xxx.msk-topic-A: Creation errored after 2s
[2023-12-12T13:43:01.053532Z] Error: kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n

MSK:

[2023-12-12 13:42:59,538] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:42:59,538] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:42:59,960] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:42:59,960] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:43:00,294] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:43:00,294] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:43:00,629] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:43:00,629] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]

After a re-run approximately 10 minutes later, we can see that MSK picked up the kafka-provider secret and the apply from the Kafka provider went through:

[2023-12-12 13:53:06,635] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:53:06,683] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:53:07,615] INFO Processing Acl change notification for ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL), versionedAcls : Set(User:kafka-provider has ALLOW permission for operations: ALL from hosts: *), zkVersion : 0 (kafka.security.authorizer.AclAuthorizer)

With #251 we gained the ability to run plans when the brokers aren't yet available, which was a great improvement. But as of now, are there any suggested patterns/solutions to the problem described above? Is it "expected" to run the Kafka provider in a separate run/pipeline? Or is there some way to delay the execution?

Also worth noting that AWS only guarantees the secret association takes effect within 10 minutes, so the "required" delay might be as long as 10 minutes.
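
One possible single-run workaround (a sketch, not something the provider endorses) is to insert a fixed delay between the association and the Kafka provider resources using the `time_sleep` resource from the hashicorp/time provider; the resource names below are illustrative:

```hcl
# Give MSK time to propagate the SCRAM secret association to the brokers.
# AWS only guarantees propagation within 10 minutes, so anything shorter
# is a gamble; 10m is the conservative choice.
resource "time_sleep" "wait_for_scram_association" {
  create_duration = "10m"

  depends_on = [aws_msk_scram_secret_association.this]
}

resource "kafka_topic" "a" {
  name               = "topic-a"
  partitions         = 3
  replication_factor = 3

  depends_on = [time_sleep.wait_for_scram_association]
}
```

The obvious downside is that every fresh apply pays the full sleep even when the association propagates faster, which is why a retry/wait mechanism in the provider itself would be preferable.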
