Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix infinite hang in wait_for_process_to_die() on Windows #4685

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

simpkins
Copy link
Contributor

@simpkins simpkins commented Feb 23, 2025

The ninja generator uses wait_for_process_to_die() to wait for a previous scons daemon process to terminate. It previously had 3 separate implementations: one using psutil if that module was available, plus a fallback implementation for Windows, and one for non-Windows platforms.

I was encountering problems with the fallback implementation on Windows hanging forever, and not being interruptible even with Ctrl-C. Apparently the win32 OpenProcess() function can return a valid process handle even for already terminated processes.

This change adds an extra call to GetExitCodeProcess() to avoid infinitely looping if the process has in fact already exited. I also added a timeout so that this function will eventually fail rather than hanging forever if a process does not exit.

I also removed the psutil implementation: it seemed simpler to me to avoid having multiple separate implementations with different behaviors. This appeared to be the only place in scons that depends on psutil outside of tests. Also note that this wait_for_process_to_die() function is only used by the ninja tool.

I have added unit tests checking the behavior of wait_for_process_to_die(). Without the GetExitCodeProcess() check added by this PR, the simple code in test_wait_for_process_to_die_success() would hang forever on Windows.

Contributor Checklist:

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt and RELEASE.txt (and read the README.rst).
  • I have updated the appropriate documentation

The ninja generator uses wait_for_process_to_die() to wait for a
previous scons daemon process to terminate.  It previously had 3
separate implemementations: one using psutil if that module was not
available, plus a fallback implementation for Windows, and one for
non-Windows platforms.

I was encountering problems with the fallback implementation on Windows
hanging forever, and not being interruptible even with Ctrl-C.
Apparently the win32 OpenProcess() function can return a valid process
handle even for already terminated processes.

This change adds an extra call to GetExitCodeProcess() to avoid
infinitely looping if the process has in fact already exited.  I also
added a timeout so that this function will eventually fail rather than
hanging forever if a process does not exit.

I also removed the psutil implementation: it seemed simpler to me to
avoid having multiple separate implementations with different behaviors.
This appeared to be the only place in scons that depends on psutil
outside of tests.  Also note that this wait_for_process_to_die()
function is only used by the ninja tool.

I have added unit tests checking the behavior of
wait_for_process_to_die().  Without the `GetExitCodeProcess()` check
added by this PR, the simple code in
`test_wait_for_process_to_die_success()` would hang forever.
@bdbaddog
Copy link
Contributor

Does the infinite hang only happen when psutil is not installed?

@simpkins
Copy link
Contributor Author

Does the infinite hang only happen when psutil is not installed?

I didn't actually test installing psutil, but I assume that version did work. The implementation using psutil enumerates all processes on the system, and checks to see if the pid in question is in that list. When this bug occurs, the process in question is not shown in Task Manager or in the output of tasklist.exe, but Windows does successfully return a handle for it if you call OpenProcess() on the pid. (I didn't much information about this behavior in the Windows API docs, but there are various stack overflow questions like https://stackoverflow.com/questions/35296996 that discuss it a bit.)

My main rationale for just removing the psutil implementation was:

  • It makes the code harder to write tests for, if there are several different implementations that can be chosen at runtime. It also seems less desirable from a user support perspective if different users get slightly different behaviors at runtime.
  • The psutil-based version of scanning all processes on the system seems unnecessarily inefficient.
  • This appears to be the only non-test code in SCons that requires psutil. The comments in requirements-dev.txt and CHANGES.txt indicate that psutil is only needed for testing, but that isn't true for this code.

@bdbaddog
Copy link
Contributor

Can you restore the psutil usage in the function, but leave your improved code for when that's not available?

ninja tool use already requires ninja package installation, so using psutil if available seems reasonable.

And then add a blurb in CHANGES.txt and RELEASE.txt and should be no issue to merge.

@bdbaddog
Copy link
Contributor

We can add a note in the docs that psutil can optionally be installed for ninja tool usage.
If we could depend on it being there, then this issue wouldn't have been.
But per project goals, by and large SCons should work with default python library packages.

That said when this goal was set, installing other packages was much much more complicated than it is today.

@simpkins
Copy link
Contributor Author

What is the benefit of keeping the psutil implementation? It seems to just add maintenance complexity to me to have multiple separate implementations that all must be maintained.

If you don't trust the non-psutil version to work properly, it seems better to just remove it and make psutil a hard dependency that is checked on start-up (or perhaps whenever the ninja tool is first encountered in an SConscript file).

@bdbaddog
Copy link
Contributor

IMHO this PR is about fixing the issue you had, which was only in the code used when there was no psutil available.
So any changes should be constrained to that.

If you'd like to float a separate PR about removing psutil, feel free, but I don't see any reason to remove it at this point.
It has long been SCons practice to use better libraries/functions if they are available in a given python, and to build compatibility logic for when it's not present.

Regardless, I don't see how not removing psutil affects your usage in any way

@simpkins
Copy link
Contributor Author

I've updated the code to keep the existing psutil implementation as-is. Let me know if you would prefer I squash this into a single commit and force-push instead.

On a somewhat unrelated note, looking through CHANGES.txt was actually a blast from the past, and rather helpful:

RELEASE 0.98 - Sun, 30 Mar 2008 23:33:05 -0700
...
From Adam Simpkins:

  • Add a --interactive option that starts a session for building (or
    cleaning) targets without re-reading the SConscript files every time.

I think this is the key functionality I really care about, rather than the Ninja generator itself. I mostly wanted to use ninja to avoid having to wait for SCons to start up on incremental rebuilds. I had forgotten that this --interactive behavior ever got merged upstream, and when I was looking for it earlier I didn't see it in the repository history. (Looks like it was squashed in with a bunch of other changes in be25024.) I'm sort of surprised that the --interactive flag still exists and works after all these years.

@mwichmann
Copy link
Collaborator

Yeah, there's a particular range of dates where commits are hard to trace - seems to have been from the period when the project was using subversion and hosted on tigris. - lots if "merged hundreds of revisions via svnmerger from ..." . Before my time :-) but frustrating for prospecting.

Interactive could be pretty useful, I didn't see any real use for it for a long time, but I'm coming around. I'm thinking if someone ever built an IDE plugin, you could configure the build to remain open and be able to quickly rebuild on changes as long as you stay in the editor...

Was actually poking at an Interactive bug quite recently (#3029) if you felt amused to look...

@bdbaddog
Copy link
Contributor

Yeah, there's a particular range of dates where commits are hard to trace - seems to have been from the period when the project was using subversion and hosted on tigris. - lots if "merged hundreds of revisions via svnmerger from ..." . Before my time :-) but frustrating for prospecting.

Interactive could be pretty useful, I didn't see any real use for it for a long time, but I'm coming around. I'm thinking if someone ever built an IDE plugin, you could configure the build to remain open and be able to quickly rebuild on changes as long as you stay in the editor...

Was actually poking at an Interactive bug quite recently (#3029) if you felt amused to look...

Ninja calls back to interactive scons session to build what it can't.

@simpkins
Copy link
Contributor Author

simpkins commented Feb 25, 2025

Interactive could be pretty useful, I didn't see any real use for it for a long time, but I'm coming around. I'm thinking if someone ever built an IDE plugin, you could configure the build to remain open and be able to quickly rebuild on changes as long as you stay in the editor...

Yeah, interactive mode always did feel like a bit of a hack to me, but seemed necessary at the time to make incremental rebuilds fast. If scons takes several seconds to start on a build, this can be a significant fraction of the build time if only a single source file was changged. From 2003-2009 I was at a company that used scons as their build tool, and the start-up overhead was pretty high (they used variant_dir to define a number of build configurations, and SCons re-interpreted every SConscript file for each variant every time it started). Pretty much everyone at the company used --interactive mode for most development rebuilds. I had forgotten that it eventually did get accepted upstream.

Yeah, to do fast incremental rebuilds well it would be nice to spawn a background daemon that can monitor the filesystem for changes so it can be immediately ready to go whenever it is requested to do a build. More recently I was at Facebook for many years, and buck2 does this. The file system monitoring piece is available as a separate project if scons ever wanted to integrate it. (https://github.com/facebook/watchman/). It's actually quite nice and handles a lot of tricky cross-platform issues.

Was actually poking at an Interactive bug quite recently (#3029) if you felt amused to look...

Yeah, I'll see if I can take a look, but no promises. I actually did just run into this myself using --interactive mode to do some rebuilds with the godot project. (I had left the interactive mode running since yesterday, and when I tried a rebuild today it failed with this error. I didn't get a chance to poke around and see what was wrong, though.)

@mwichmann
Copy link
Collaborator

Tagged as "ninja" since that's the sole customer of the routine at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants