Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safe recreation of topics #1739

Open
szczygiel-m opened this issue Sep 28, 2023 · 10 comments
Open

Safe recreation of topics #1739

szczygiel-m opened this issue Sep 28, 2023 · 10 comments

Comments

@szczygiel-m
Copy link
Contributor

Currently hermes-management has a bug which is caused by deleting topic and then recreating it quickly

The story is similar every time:

  1. someone deletes a topic
  2. topic is recreated
  3. kafka producer has stale metadata for that topic
  4. kafka producer fails to send messages to the brokers
  5. messages are buffered in hermes frontend instances
  6. we need to restart frontend instances in order for the messages to be retransmitted

Issue was thought to be solved with upgrade to kafka client 2.8.2 but again appeared recently. We would like to have a workaround for this.

One of the proposed solutions is to introduce "grace period" for deleted topics. E.g. if someone deletes a topic we should block the creation of topic with same name for long enough so that cluster and kafka producers can be in consistent state. Probably > 5 minutes is enough because metadata is refreshed every 5 minutes. 

@debanjanc01
Copy link

Hey @szczygiel-m , is this up for grabs?

@szczygiel-m
Copy link
Contributor Author

Hi, sure 😄 Assigned

@danigiri
Copy link

hey it may be too late for hacktoberfest but I can take a look at this if for a couple of days, you can assign it to me if you want

@danigiri
Copy link

danigiri commented Oct 28, 2024

Looking at this, high level proposal

  • model: add a list of recent deleted topics in group in ZK, topics contain deletion date any any other minimal metadata needed

  • ZookeeperTopicRepository creation

    • check if topic name exists in the list and is beyond the threshold or not
    • if beyond threshold, remove the old node in the deleted list and create normally
    • otherwise, throw a new exception, similar but not the same as TopicAlreadyExistsException, block creation
  • ZookeeperTopicRepository deletion

    • add node to deleted list
    • also check if it was deleted in the past to replace it

enhancements

  • have a reaper process that cleans deleted topics in the list periodically
  • have the threshold configurable

I'll have a look at this next and see if I can build a concept PR

@danigiri
Copy link

danigiri commented Oct 28, 2024

Looks like there are 3 design approaches:

  1. change topic data model to have a deletion flag, and change existence and removal logic deeply (complex)
  2. add a folder to groups to hold this data, and explicitly call it deletion info
  3. add a folder to groups to hold group historical data, and add the code and data to hold deletion info

I am trying to do 2. at this time, though 3. is kinda the same work in reality, depepending on the philosophy of being more generic or more concise in the data model. 3 is more generic and will make it easier to add more features (without migration impact)

@danigiri
Copy link

Have a PoC written, going through unit tests ATM

@szczygiel-m
Copy link
Contributor Author

Hi, sorry for delay in responding, great that you were willing to take on this task force 😄 assigned. I like the approach which You proposed and which You are currently working on (the one with holding data about deleted topics in zk in "deletion info")

danigiri added a commit to danigiri/hermes that referenced this issue Oct 29, 2024
see  allegro#1739
Deletion creates a node in the group /deletion_time, stores the time there
Throws an exception if not enough time has passed (5 min)
all existing unit tests pass (except the docker)

pending:
- updating with latest upstream
- manual testing in UI and running docker test
- adding unit tests for the added functionality
- checking integration tests
- configuration of the time delay
@danigiri
Copy link

Feel free to wait until TODOs are completed but early feedback welcome.

@danigiri
Copy link

See PR open for review, it'd be awesome if the hacktoberfest accepted tag can be added :)
Meanwhile I will look at at some of the TODOs

@danigiri
Copy link

I will check the remaining TODOs tomorrow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants