
AWS MSK fresh cluster first apply fails because SASL SCRAM secret association is delayed #364

Open
eroznik opened this issue Dec 12, 2023 · 0 comments


eroznik commented Dec 12, 2023

Hello,

Everything described below is implemented within a single TF module that uses multiple providers and external modules. The module is applied in one run. It is also important to note that this happens only on the first run (i.e. a fresh apply/install); if the user re-applies the stack, it passes successfully. So a re-apply can be considered a workaround, but it is less than ideal.

The approximate TF apply sequence/scenario is:

  1. create the MSK cluster (SASL SCRAM auth enabled)
  2. create the AWS SM secret for the "kafka-provider" user
  3. associate the secret with the MSK cluster
  4. apply the Kafka provider resources (topics, ACLs, ...)
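
The sequence above can be sketched roughly as follows (resource and variable names are illustrative, not taken from the actual module; broker/network configuration is omitted for brevity):

```hcl
resource "aws_msk_cluster" "this" {
  cluster_name           = "example"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  client_authentication {
    sasl {
      scram = true
    }
  }
  # broker_node_group_info etc. omitted for brevity
}

# SCRAM secrets for MSK must use the "AmazonMSK_" name prefix and be
# encrypted with a customer-managed KMS key.
resource "aws_secretsmanager_secret" "kafka_provider" {
  name       = "AmazonMSK_kafka-provider"
  kms_key_id = aws_kms_key.msk.key_id
}

resource "aws_secretsmanager_secret_version" "kafka_provider" {
  secret_id     = aws_secretsmanager_secret.kafka_provider.id
  secret_string = jsonencode({
    username = "kafka-provider"
    password = var.kafka_provider_password
  })
}

resource "aws_msk_scram_secret_association" "this" {
  cluster_arn     = aws_msk_cluster.this.arn
  secret_arn_list = [aws_secretsmanager_secret.kafka_provider.arn]

  depends_on = [aws_secretsmanager_secret_version.kafka_provider]
}

# Kafka provider resource; on a fresh cluster this fails even with the
# explicit dependency, because the association is not yet effective on
# the brokers when Terraform reaches this step.
resource "kafka_topic" "a" {
  name               = "topic-a"
  partitions         = 3
  replication_factor = 3

  depends_on = [aws_msk_scram_secret_association.this]
}
```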

What happens is that even though the resource dependencies are clearly defined and properly detected by TF, so each of the steps above executes one after another, there is simply not enough delay between steps 3 and 4 for MSK to register the secret association and allow the freshly associated SASL SCRAM user to authenticate with the cluster.

Let me add a few logs with timestamps to clarify the problem a bit more:

Terraform:

[2023-12-12T13:42:59.268200Z] module.xxx.msk-association: Creation complete after 1s
[2023-12-12T13:42:59.336920Z] module.xxx.msk-topic-A: Creating...
[2023-12-12T13:43:00.648721Z] module.xxx.msk-topic-A: Creation errored after 2s
[2023-12-12T13:43:01.053532Z] Error: kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n

MSK:

[2023-12-12 13:42:59,538] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:42:59,538] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:42:59,960] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:42:59,960] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:43:00,294] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:43:00,294] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:43:00,629] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:43:00,629] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]

After a re-run approximately 10 minutes later, we can see that MSK picked up the kafka-provider secret and the apply from the Kafka provider went through:

[2023-12-12 13:53:06,635] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:53:06,683] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:53:07,615] INFO Processing Acl change notification for ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL), versionedAcls : Set(User:kafka-provider has ALLOW permission for operations: ALL from hosts: *), zkVersion : 0 (kafka.security.authorizer.AclAuthorizer)

With #251 we gained the ability to run plans when the brokers aren't yet available, which was a great improvement. But as of now, are there any suggested patterns/solutions to the problem described above? Is it "expected" to run the Kafka provider in a separate run/pipeline? Or is there some way to delay the execution?

Also worth noting that AWS only guarantees the secret association takes effect within 10 minutes, so the "required" delay might be as long as 10 minutes.
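
One possible single-run workaround (a sketch, not something the provider endorses) is to insert a fixed delay between the association and the Kafka provider resources using the `time_sleep` resource from the hashicorp/time provider; the resource names below are illustrative:

```hcl
# Give MSK time to propagate the SCRAM secret association to the brokers.
# AWS only guarantees propagation within 10 minutes, so anything shorter
# is a gamble; 10m is the conservative choice.
resource "time_sleep" "wait_for_scram_association" {
  create_duration = "10m"

  depends_on = [aws_msk_scram_secret_association.this]
}

resource "kafka_topic" "a" {
  name               = "topic-a"
  partitions         = 3
  replication_factor = 3

  depends_on = [time_sleep.wait_for_scram_association]
}
```

The obvious downside is that every fresh apply pays the full sleep even when the association propagates faster, which is why a retry/wait mechanism in the provider itself would be preferable.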
