
Support for persistent locks/semaphores #84

Open
stensonb opened this issue Dec 12, 2014 · 9 comments

Comments

@stensonb

Use case:

My code performs a cluster-level operation that MUST be successful before other members of the cluster are allowed to begin (think "restarting a web service" on a node in a load balanced cluster).

"MUST" here includes when the program dies (either due to it's own exception, or due to a system exception).

I'd like to:

  1. get the lock
  2. perform the operation
  3. remove the lock

If/When the operation (step 2) dies, the lock should persist (so no other members of the cluster perform the operation). Additionally, I'd like to be able to restart the program/application and resume with the same lock.

Currently, locks/semaphores are created with :ephemeral_sequential, which means the lock is automatically removed if/when the operation dies.
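
For reference, this is roughly how the current API gets used — a minimal sketch assuming the zk gem's `with_lock` convenience, with `restart_service!` as a hypothetical placeholder. The ephemeral znode behind it is exactly what disappears when the process dies:

```ruby
require 'zk'

# Minimal sketch of today's behavior, assuming the zk gem's with_lock
# convenience method and a hypothetical restart_service! helper. The lock
# znode is ephemeral, so if the block raises or the process dies, the
# session closes and the lock silently disappears.
ZK.open('localhost:2181') do |zk|
  zk.with_lock('service_restart') do   # 1. get the lock
    restart_service!                   # 2. perform the operation
  end                                  # 3. lock removed here -- and also on crash
end
```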

@slyphon
Contributor

slyphon commented Dec 13, 2014

Why not just have your cluster wait on a node? You can set a watch on a node that hasn't been created yet: have your cluster check for the existence of /path/to/foo, and if it doesn't exist, watch /path/to/foo. When it's created by the process-that-needs-to-do-something-before-the-cluster-can-start, the cluster will be notified of the event and can start running.

Locks were intended for locking around updating a record or running a job, or for ensuring a single-writer setup where you run two writers: one active, one on standby.
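
Roughly, the existence-watch pattern looks like this — a rough sketch assuming the zk gem's `register` / `exists?` watch API; the path and connection string are placeholders:

```ruby
require 'zk'

# Rough sketch: block until /path/to/foo exists, using an existence watch.
ZK.open('localhost:2181') do |zk|
  created = Queue.new

  # Fire a callback when something happens to /path/to/foo.
  sub = zk.register('/path/to/foo') do |event|
    created.push(:created) if event.node_created?
  end

  # Arm the watch; if the node already exists we don't need to wait at all.
  created.push(:already_there) if zk.exists?('/path/to/foo', watch: true)

  created.pop       # blocks here until the node shows up
  sub.unsubscribe

  # ... the cluster member is now clear to start ...
end
```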

@stensonb
Author

Either I didn't explain my requirements well enough, or I'm not understanding what you're suggesting.

Each of my clustered nodes is trying to obtain the lock so it can do something locally (restart a service). Whether the service restarts successfully or the restart fails wildly, that lock is removed (because it's ephemeral), and the other nodes in the cluster proceed with their get-lock-or-block-then-do-stuff loop.

I feel like I'm missing something...

@stensonb
Author

To clarify, WHEN the service fails to start (for whatever reason), I want the node to continue to hold the lock to prevent other nodes from proceeding (maybe the service we're restarting is configured incorrectly, and sequencing through each of them will bring the entire load-balanced solution down).

@tobowers

I think what Jonathan is saying is to not use the actual locking class, and instead have your first process write out a new node when it completes its part of the process. Then have your secondary process watch for the presence of that node.

@stensonb
Author

I think I understand. But that still won't work for me, for a few reasons:

  1. None of my nodes have a higher priority.
  2. While they are required to perform the job in series, the order in which they perform it is irrelevant (and does not need to be deterministic).
  3. The number of nodes in the cluster is dynamic and could/will change during this "locking" process, so having a node depend on another's lock is impractical. Similarly, the number of nodes in the cluster may not be available before the locking sequence begins (an approach of "wait until I see X nodes before I do my thing" would not work, since I don't know what X is a priori).

Finally, once my process completes on any given node, the process running the zk client completes. From a zookeeper cluster perspective, I cannot tell whether the restart was successful or not.

@stensonb
Author

Well, I think I've worked around this. As suggested, I'm not using the built-in lock/semaphore objects. I'm simply doing the following (rough sketch below):

  1. get the lock by creating a persistent sequential node
  2. if I have the lock (my znode has the lowest sequence #), I perform the service restart and then delete my persistent node
  3. if I don't have the lock (my znode is not the lowest sequence #), I block using the ZK::NodeDeletionWatcher class on the node that holds the lock.
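
Roughly like this — a sketch of the above, assuming the zk gem's `create`/`children`/`delete` calls and that `ZK::NodeDeletionWatcher#block_until_deleted` does what its name suggests; the paths and `restart_service!` are placeholders:

```ruby
require 'zk'

LOCK_DIR = '/my_app/restart_lock'

ZK.open('localhost:2181') do |zk|
  zk.mkdir_p(LOCK_DIR)

  # 1. "get the lock" by creating a *persistent* sequential node -- unlike the
  #    ephemeral nodes the built-in lockers use, it survives client death.
  my_path = zk.create("#{LOCK_DIR}/lock-", '', mode: :persistent_sequential)
  my_node = File.basename(my_path)

  loop do
    holders = zk.children(LOCK_DIR).sort
    if holders.first == my_node
      # 2. lowest sequence number: we hold the lock; only delete the node
      #    after the restart has actually succeeded.
      restart_service!            # hypothetical helper
      zk.delete(my_path)
      break
    else
      # 3. not the lowest: block until the current holder's node goes away.
      ZK::NodeDeletionWatcher.new(zk, "#{LOCK_DIR}/#{holders.first}").block_until_deleted
    end
  end
end
```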

@stensonb
Author

Having to implement this workflow myself just to get persistent locks seems like a common case, though.

I still think the idea of expanding the "lock" construct (semaphores too) to support persistent nodes would be a great feature.

Anybody else?

@eric
Member

eric commented Dec 17, 2014

If you have a node that is persistent, how does a client know which node is theirs after their process restarts?

It may be worth trying to separate these concepts into two things:

  1. A directory that contains one ephemeral node per service that is currently active
  2. A task that ensures at least N services are active and when that is met, attempts to get a lock (the existing ephemeral one) and performs a restart
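
A minimal sketch of that split, assuming the zk gem; ACTIVE_DIR, MIN_ACTIVE, and restart_service! are placeholders:

```ruby
require 'zk'
require 'socket'

ACTIVE_DIR = '/my_app/active'
MIN_ACTIVE = 3   # how many services must be up before a restart is allowed

ZK.open('localhost:2181') do |zk|
  zk.mkdir_p(ACTIVE_DIR)

  # 1. Each service advertises itself with an ephemeral node while it's up.
  zk.create("#{ACTIVE_DIR}/#{Socket.gethostname}", '', mode: :ephemeral)

  # 2. The restart task waits until enough services are active, then takes
  #    the existing (ephemeral) lock before touching anything.
  sleep 1 until zk.children(ACTIVE_DIR).size >= MIN_ACTIVE

  zk.with_lock('service_restart') do
    restart_service!   # hypothetical helper
  end
end
```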

@rehevkor5

rehevkor5 commented May 16, 2018

I'd like to do the same thing, for rolling restarts across clusters like Cassandra, Kafka, and even ZooKeeper itself. I only want one machine to be down for a restart/reboot at any given time.

how does a client know which node is theirs after their process restarts

I was thinking of including the machine's IP (it's unique and static, in my case) in the ZK node name. That way the client can always tell whether it already has a lock node, and can delete the right one (rough sketch below).
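
Something along these lines — a rough sketch assuming the zk gem; LOCK_DIR and the address-picking logic are my own placeholders:

```ruby
require 'zk'
require 'socket'

LOCK_DIR = '/my_app/restart_lock'
my_ip    = Socket.ip_address_list.detect(&:ipv4_private?).ip_address

ZK.open('localhost:2181') do |zk|
  zk.mkdir_p(LOCK_DIR)

  # Look for a lock node we created before a crash/restart.
  existing = zk.children(LOCK_DIR).select { |name| name.include?(my_ip) }

  my_node =
    if existing.any?
      existing.first            # reuse (or clean up) our old node
    else
      File.basename(zk.create("#{LOCK_DIR}/lock-#{my_ip}-", '',
                              mode: :persistent_sequential))
    end

  # ... then proceed with the usual lowest-sequence-number check ...
end
```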

A task that ensures at least N services are active

I thought about this as a way to use ephemeral nodes instead of non-ephemeral nodes. Unfortunately, it relies on knowing how many nodes should be up at any given time. I'm not sure if that's always a straightforward thing to determine, and it may remove the advantage of decentralization that ZK affords. You'd probably end up having to create non-ephemeral nodes to record all the machines which are "supposed" to be up, and make sure that you create & delete those at the right time. If you accidentally don't create one, then bad things can happen like rebooting too many machines at once. If you accidentally don't delete one, the end result is the same as if the lock wasn't freed, so you need intervention anyway.

Using non-ephemeral nodes makes it somewhat more likely to encounter locks that are stuck. But in that situation I need a human to intervene anyway, so I just need to create the right tooling. I'm going to try to build my functionality on top of this library: hopefully it won't require too many big changes.
