Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jenkins 31084 graceful shutdown #115

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

akomakom
Copy link
Contributor

@akomakom akomakom commented Jun 5, 2019

Work in progress. Seems to be working so far.

Question: should doGetSlaveReadyForShutdown also check secrets/permissions?

Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great so far! I like the use of a ShutdownHook, and the backend APIs are pretty much exactly what I expected.

Have you given any thought as to how this functionality might be activated or deactivated? The Java API states:

Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the virtual machine is terminated due to user logoff or system shutdown the underlying operating system may only allow a fixed amount of time in which to shut down and exit. It is therefore inadvisable to attempt any user interaction or to perform a long-running computation in a shutdown hook.

For this reason as well as to preserve backwards compatibility of the existing command-line API, I think this new functionality should probably be opt-in rather than opt-out by default. What do you think?



public static class ShutdownHook extends Thread {
private SwarmClient client;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] These can be private final.

try {
client.takeSlaveOfflineAndWait(target, "JVM is shutting down");
} catch (Exception e) {
logger.log(Level.SEVERE, "Unable to perform graceful slave shutdown: ", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if you started a shutdown operation (e.g., by sending a SIGINT with Ctrl-C or a SIGTERM with kill(1)) and then sent another signal (e.g., another SIGINT by pressing Ctrl-C again) while we were in the middle of takeSlaveOfflineAndWait()? I think the expected behavior is that we would stop immediately without continuing to wait in takeSlaveOfflineAndWait(). I think that your current code probably already handles this in that an InterruptedException would be thrown (and you are catching Exception here). It might be worth validating this with a manual test. Even if the code works as expected, might it be worth printing an additional message in this case (e.g., "terminating with extreme prejudice")?


node.toComputer().setTemporarilyOffline(true, new OfflineCause.UserCause(User.current(), reason));

normalResponse(req, rsp, "<response/>"); //TODO: not sure what to send here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure either offhand. Assuming that the default response status is javax.servlet.http.HttpServletResponse.SC_OK, can we rely on the response status alone and simply avoid calling rsp.getCompressedWriter() in the first place for such responses? I can look into this further if you're having trouble.

}

public void doGetSlaveReadyForShutdown(StaplerRequest req, StaplerResponse rsp, @QueryParameter String name) throws IOException {
Jenkins jenkins = Jenkins.getInstance();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding checking for secrets/permissions: I can't see any disadvantages to checking, and I can't see any advantages to not checking. In contrast, if we don't check, we increase the attack surface for potential security issues. Based on the above reasoning, I think I'd prefer to check secrets/permissions.

@akomakom
Copy link
Contributor Author

akomakom commented Jun 6, 2019

This looks great so far! I like the use of a ShutdownHook, and the backend APIs are pretty much exactly what I expected.

Have you given any thought as to how this functionality might be activated or deactivated? The Java API states:

Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the virtual machine is terminated due to user logoff or system shutdown the underlying operating system may only allow a fixed amount of time in which to shut down and exit. It is therefore inadvisable to attempt any user interaction or to perform a long-running computation in a shutdown hook.

For this reason as well as to preserve backwards compatibility of the existing command-line API, I think this new functionality should probably be opt-in rather than opt-out by default. What do you think?

I have mixed feeling about the design.
First off, the expected use case is the opposite of "finish work quickly". I have jobs that take 8 hours, which is how long the ShutdownHook would have to delay VM termination.

Then there is the opt-in. I see two main approaches:

  1. A new flag or signal to enable graceful shutdown behavior.
  2. A second kill required to terminate with prejudice.

Number 2 is not really opt-in since it will require instrumentation changes. It's also not trivial to do using only java's shutdown hooks (without explicitly registering signal hooks), but we might as well require kill -9 the second time to make it easy.

Which leaves these options:

  1. -gracefulShutdown=true changes SIGINT behavior to what I have in this PR. This still violates the "quick" directive for shutdown hooks.
  2. kill -SIGUSR1 (for example) as opposed to a SIGINT/SIGTERM which still kill instantly. (But I have yet to find a signal handling approach that doesn't rely on sun.misc.*)
  3. Some other mechanism, ie an HTTP listener, a file monitor, etc.

I'm in favor of choice 3 because I am not confident about the suitability of shutdown hooks for this use case. HTTP listener is probably too heavy and opens some security vulnerabilities, but a command file (a la nagios.cmd) would work, eg -commandFile /some/path. Then echo 'gracefulShutdown' > /some/path initiates shutdown (details of the file can be hidden by java also, ie java -jar swarm-client.jar gracefulShutdown). Maybe even 2 files - command and status, so that the new process can request shutdown and monitor the results, blocking until ready. Status file can potentially be used for monitoring by external tools.

@akomakom
Copy link
Contributor Author

akomakom commented Jun 6, 2019

Another half-baked idea:

  • Run the jar to shut down the other process, ie java -jar swarm-client.jar .... -gracefulShutdown (same command line + shutdown switch)
  • Communicates with jenkins
  • Jenkins marks original slave offline
  • The original instance determines that it should shut down by also communicating with Jenkins (by polling? by piggybacking on some existing channel?) and exits once jobs finish.

This might not work with unique client names.

Alternatively:

  • Run the jar as above, so that it:
  • Communicates to Jenkins, marks old slave offline.
  • Blocks until jobs finish.
  • Kills the original process.

Again unique client names makes finding the right one to work with difficult if multiple instances are present.

@basil basil added the feature New features and improvements label Jun 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New features and improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants