[rfc] redesign the process management tree #546

Open

zmstone opened this issue Jan 1, 2023 · 3 comments

Comments

@zmstone
Contributor

zmstone commented Jan 1, 2023

supervisor3 was a bad choice

  • supervisor3 is stale, lagging behind the changes in OTP's supervisor.erl
  • the desired features from supervisor3 can be covered by the standard supervisor plus an additional retry mechanism:
    • child restarts do not cause a supervisor restart (not really necessary, as the children in brod are long lived)
    • delayed restart (we need our own delay-retry loop; see the sketch below this list)
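
For illustration, a minimal sketch of what such a delay-retry loop could look like: a plain gen_server that keeps retrying an arbitrary start function after a fixed delay, instead of relying on supervisor3's delayed-restart feature. The module name, delay value and the shape of StartFun are assumptions made for this sketch, not brod's actual code.

%% Illustrative only: retry a start function after a fixed delay.
-module(delay_retry_sketch).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RETRY_DELAY_MS, 5000). %% made-up delay for the sketch

start_link(Name, StartFun) ->
    gen_server:start_link({local, Name}, ?MODULE, StartFun, []).

init(StartFun) ->
    %% attempt the first start asynchronously
    self() ! try_start,
    {ok, #{start_fun => StartFun, child => undefined}}.

handle_info(try_start, #{start_fun := StartFun} = State) ->
    case StartFun() of
        {ok, Pid} ->
            {noreply, State#{child := Pid}};
        {error, _Reason} ->
            %% delayed retry instead of crashing and letting a supervisor restart us
            erlang:send_after(?RETRY_DELAY_MS, self(), try_start),
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.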

not following OTP design principles

brod_client directly spawn_links the consumer and producer supervisors, which makes those processes unreachable from the application's root supervisor, i.e. it is not a tree.
This makes the release handler unable to traverse all processes in the supervision tree when performing hot upgrades.

design proposal

  • Use a supervision tree for all processes (a rough child-spec sketch follows the diagram below).
  • brod_client should run a delayed-retry loop to start brod_producers_sup and/or brod_consumers_sup, depending on configuration.
  • Once a brod_producers_sup or brod_consumers_sup is started, it should not die on connection restarts (this is in fact the current behaviour: it runs in a loop to poll a new leader connection from brod_client).
brod_sup
  |
  +-- brod_clients_sup
  |          |
  |          +-- brod_client (worker)
  |
  +-- brod_producers_sup
  |           |......
  |
  +-- brod_consumers_sup
              |......
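
For illustration only, a rough sketch of what the top of such a tree could look like as ordinary OTP child specs. The child ids, start MFAs and supervisor flags below are assumptions made for this sketch, not brod's current code, and how brod_client would then populate the two sub-supervisors with dynamic children is left open.

%% Illustrative only: the shape of the proposed brod_sup.
-module(brod_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 10, period => 5},
    Children =
        [#{id => brod_clients_sup,
           start => {brod_clients_sup, start_link, []},
           type => supervisor,
           shutdown => infinity},
         #{id => brod_producers_sup,
           start => {brod_producers_sup, start_link, []},
           type => supervisor,
           shutdown => infinity},
         #{id => brod_consumers_sup,
           start => {brod_consumers_sup, start_link, []},
           type => supervisor,
           shutdown => infinity}],
    {ok, {SupFlags, Children}}.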
@zmstone changed the title from "redesign the process management tree" to "[rfc] redesign the process management tree" on Jan 1, 2023
@v0idpwn
Contributor

v0idpwn commented Jan 2, 2023

Once a brod_producers_sup or brod_consumers_sup is started, it should not die on connection restarts (this is in fact the current behaviour: it runs in a loop to poll a new leader connection from brod_client).

How are producers and consumers expected to behave when a client dies?

@zmstone
Contributor Author

zmstone commented Feb 17, 2023

Once a brod_producers_sup or brod_consumers_sup is started, it should not die on connection restarts (this is in fact the current behaviour: it runs in a loop to poll a new leader connection from brod_client).

How are producers and consumers expected to behave when a client dies?

For consumers, it is just a matter of waiting for a new connection to be established to the partition leader(s).
Producers will need to keep a backlog of pending produce requests (sketched below); in the backlog:

  • a request may expire (if it was sent via an API call with a timeout)
  • a request might be dropped once the backlog reaches its capacity limit
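
A minimal sketch of the backlog behaviour described above, assuming each request is a map carrying an absolute deadline in milliseconds and the backlog has a fixed capacity; the module, record and function names are invented for this sketch and are not brod code.

%% Illustrative only: a bounded backlog with per-request expiry.
-module(produce_backlog_sketch).

-export([new/1, enqueue/2, take_ready/1]).

-record(backlog, {queue = queue:new() :: queue:queue(),
                  size = 0 :: non_neg_integer(),
                  max_size :: pos_integer()}).

new(MaxSize) ->
    #backlog{max_size = MaxSize}.

%% A request is expected to look like #{deadline_ms => ..., payload => ...},
%% with deadline_ms based on erlang:monotonic_time(millisecond).
enqueue(_Req, #backlog{size = Size, max_size = Max} = B) when Size >= Max ->
    %% capacity limit reached: the caller decides whether to drop or error out
    {error, backlog_full, B};
enqueue(Req, #backlog{queue = Q, size = Size} = B) ->
    {ok, B#backlog{queue = queue:in(Req, Q), size = Size + 1}}.

%% Pop the next request that has not yet expired; expired requests are dropped.
take_ready(#backlog{queue = Q, size = Size} = B) ->
    Now = erlang:monotonic_time(millisecond),
    case queue:out(Q) of
        {empty, _} ->
            {empty, B};
        {{value, #{deadline_ms := Deadline}}, Q1} when Deadline =< Now ->
            take_ready(B#backlog{queue = Q1, size = Size - 1});
        {{value, Req}, Q1} ->
            {ok, Req, B#backlog{queue = Q1, size = Size - 1}}
    end.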

@losvedir

This is an "RFC" so here's a "C", heh!

I found myself on this issue after tracking down some error logs my app occasionally emits when shutting down as part of new deploys. The logs look like this:

12/20/2023 7:17:21.434 PM +0000 Last message: {:EXIT, #PID<0.28018.0>, :killed}
      19:17:21.434 [][error] GenServer #PID<0.28046.0> terminating ** (stop) killed

From dumping the process IDs prior to the deploy, I cross-referenced the PIDs and see that 0.28018.0 is a supervisor3 and 0.28046.0 is a brod_producer. There are typically a bunch of them logged like this, all belonging to the same supervisor. The Kafka topic has 64 partitions, and we might get ~10 producers logging like this.

This is in response to a SIGTERM the app receives; since we don't do hot upgrades, we just bring up the new version on a new instance and bring down the old one. NB: this means there might be some rebalancing happening at the time the app is trying to shut down.

It's not the worst thing, but I would like for the app to shut down cleanly.

I know that Erlang apps generally shut down recursively through the supervision tree, in reverse start order. I also know that an OTP supervisor will try to gently stop a child with a timeout as configured by its child_spec, and kill it if that fails. And I know the official guideline is to set the shutdown timeout for supervisors to infinity, to give them time to shut down their subtree, in contrast to workers, which default to 5 seconds.
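
For reference, these are the standard OTP child-spec shutdown values being referred to; the module names below are placeholders, not brod's actual children.

%% A supervisor child: give its whole subtree unlimited time to stop.
#{id => some_sub_sup,
  start => {some_sub_sup, start_link, []},
  type => supervisor,
  shutdown => infinity}.

%% A worker child: default is 5000 ms, after which it is brutally killed.
#{id => some_worker,
  start => {some_worker, start_link, []},
  type => worker,
  shutdown => 5000}.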

I don't know exactly how supervisor3 compares to the official OTP supervisor and whether it follows a similar practice.

I also don't know how best to handle brod_client, which, according to the (excellently diagrammed!) process tree in the brod README, is a worker that has supervisors as children. I suppose it shouldn't have the default 5-second shutdown timeout, since it has children with infinite timeouts.

I haven't quite worked out the timing and exit signals, but I'm currently thinking that my error logs come from brod_client taking too long to shut down when the app receives a SIGTERM, and then being killed by the BEAM. As I think about how it should be configured, I get a little confused: should its shutdown timeout be infinity, like a supervisor's? Or finite (but perhaps longer than the default 5 seconds), since it is also a worker holding TCP connections and such?

So I support the process architecture proposed in this RFC, if for no other reason than that it would be easier to reason about when shutting down the application.
