Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regression: Windows jobs consistently freezing after git clean, before build script #203

Closed
SethTisue opened this issue Dec 1, 2016 · 27 comments · Fixed by #232
Closed

Comments

@SethTisue
Copy link
Member

e.g. https://scala-ci.typesafe.com/job/scala-2.13.x-integrate-windows/29/consoleFull

I had hoped this was just an intermittent thing, but for several days now all Windows jobs are affected.

I don't think this is the same as #143, but who knows. I did check that longpaths is set to true in user jenkins' global gitconfig

@adriaanm any ideas? I'm stumped. we won't able to release 2.12.1 without finding some solution or workaround, as the -package- jobs hang too, see e.g. the string of failures at https://scala-ci.typesafe.com/view/scala-2.11.x/job/scala-2.12.x-release-package-windows/

@SethTisue SethTisue self-assigned this Dec 1, 2016
@retronym
Copy link
Member

retronym commented Dec 1, 2016

https://scala-ci.typesafe.com/computer/jenkins-worker-windows-publish/script

println "jps -v".execute().text
println "jstack 2328".execute().text

Threads galore stuck in:

"pool-1-thread-447 for channel" #1203 prio=5 os_prio=0 tid=0x000000001e39f800 nid=0xbd8 in Object.wait() [0x0000000023ffe000]
   java.lang.Thread.State: RUNNABLE
	at com.sun.jna.PointerType.<init>(PointerType.java:25)
	at com.sun.jna.ptr.ByReference.<init>(ByReference.java:30)
	at com.sun.jna.ptr.IntByReference.<init>(IntByReference.java:22)
	at com.sun.jna.ptr.IntByReference.<init>(IntByReference.java:18)
	at hudson.util.jna.RegistryKey.open(RegistryKey.java:222)
	at hudson.util.jna.RegistryKey.openReadonly(RegistryKey.java:218)
	at hudson.plugins.cygpath.CygpathLauncherDecorator$GetCygpathTask.getCygwinRoot(CygpathLauncherDecorator.java:117)
	at hudson.plugins.cygpath.CygpathLauncherDecorator$GetCygpathTask.call(CygpathLauncherDecorator.java:142)
	at hudson.plugins.cygpath.CygpathLauncherDecorator$GetCygpathTask.call(CygpathLauncherDecorator.java:112)
	at hudson.remoting.UserRequest.perform(UserRequest.java:120)

@retronym
Copy link
Member

retronym commented Dec 1, 2016

The most similar stack trace I've found on the interwebs is http://pastebin.com/m6mtGbNW posted in the last day. (Was that you, @SethTisue ?)

@retronym
Copy link
Member

retronym commented Dec 1, 2016

https://scala-ci.typesafe.com/threadDump is appears to be easier way to get thread dumps for the entire Jenkins Cluster.

@retronym
Copy link
Member

retronym commented Dec 1, 2016

Maybe related:

https://issues.jenkins-ci.org/browse/JENKINS-16070

, I think that this appears to be a deadlock bug in com.sun.jna code where various classes in this package depend on each other, so if two threads both trigger class-loading of different classes in com.sun.jna then we can get a classloading deadlock, e.g. A depends on B and B depends on A then if both A and B are classloaded by different threads simultaneously then both can deadlock.

@retronym
Copy link
Member

retronym commented Dec 1, 2016

Even closer:

https://issues.jenkins-ci.org/browse/JENKINS-39179

If it were possible to disable the SwapSpaceMonitor – perhaps that might be a way to remove this deadlock. But there seems to be no way to do that.
Even when the swap space monitoring of slaves is disabled, the UserRequest still runs, as it seems to be updating the values, even if that slave condition is set to ignore

Linked JENKINS-38834 – I followed the advice there, to back level to LTS 2.7.4, and we have not seen a deadlock situation yet. (fingers crossed)
So this problem was introduced sometime between 2.7 and 2.19, and at least through our testing persists through 2.26 (last release before the move to remoting 3.0)
At some point, I want to validate if this problem occurs in >=2.27, where the remoting 3.0 was introduced. But as it occurs in the 2.19.2 LTS (latest available LTS as of this comment) I still believe this is critical.

@retronym
Copy link
Member

retronym commented Dec 1, 2016

We could try disabling the cygpath plugin, which seems to form part of a static initializer based deadlock or race condition with the SwapMonitor internal component. If we do that, our scripts will see Jenkins config variables as /bin/foo rather than c:/cygwin/bin/foo, and we'd might need to call cygpath ourselves.

Another workarond would be to patch the cygpath plugin not to use JNA to lookup the location of cygpath.exe on from the registry, and instead rely on the PATH. Here's the patch, just need to figure out how to build it (I think it needs some other project checked out and made known to Maven:

 Non-resolvable parent POM for org.jenkins-ci.plugins:cygpath:1.6-SNAPSHOT:
 Failure to find org.jenkins-ci.plugins:plugin:pom:1.418 in http://repo.jenkins-ci.org/public/ was cached in the local repository, resolution will not be reattempted until the update interval of repo.jenkins-ci.org has elapsed or updates are forced 
and 'parent.relativePath' points at wrong local POM @ line 4, column 11 -> [Help 2]

@adriaanm
Copy link
Contributor

adriaanm commented Dec 1, 2016

How come this just started failing? Could we give wiping the workspace a shot

@SethTisue
Copy link
Member Author

How come this just started failing?

beats the hell out of me, I don't even have a guess

Could we give wiping the workspace a shot

yeah, I should have mentioned, one of the first things I did was cd ~jenkins/workspace; rm -rf scala-2.13.x-integrate-windows and then re-rerun and let Jenkins re-create it. didn't help

@retronym
Copy link
Member

retronym commented Dec 1, 2016

The jenkins slave needs a restart for sure, it won't come good after the JNA race condition has been triggered. From the analysis on those tickets, the problem is a race on startup between different threads initializing JNA. Two spots that do that are the SwapSpaceMonitor (on a timer) and the cygpath-plugin (run just before the build).

@retronym
Copy link
Member

retronym commented Dec 1, 2016

One workaround suggested on the ticket is to:

set JVM Options of the slave with: -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native

This assumes a recent enough version of Jenkins (2.4+.)

That claims to do Class.forName(com.sun.jna.Native) at startup. Whether or not that's enough to to avoid later races, I don't know.

@adriaanm
Copy link
Contributor

adriaanm commented Dec 1, 2016

I moved all workspace folders under y:\workspace to y:\workspace\old and hit rebuild. Seems to have unclogged the pipes.

@SethTisue
Copy link
Member Author

@SethTisue
Copy link
Member Author

@retronym
Copy link
Member

I've noted our woes in https://issues.jenkins-ci.org/browse/JENKINS-39179 and asked for help in configuring a system property on the slave, which is a suggested fix.

@retronym
Copy link
Member

@SethTisue I found the spot to add the JVM option.

image

Should we just manually add -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native and see how things go for a while? If it works, we can roll it into the chef recipe for the master.

@retronym
Copy link
Member

retronym commented Jan 5, 2017

I've manually added this to the config via the Jenkins UI. If we have a run of successful builds, we should burn that into the Chef recipe for the master.

SethTisue added a commit to SethTisue/scala-jenkins-infra that referenced this issue Jun 16, 2017
hopefully fixes scala#203, see remarks there for details
@SethTisue
Copy link
Member Author

SethTisue commented Jun 16, 2017

after a nice long period of quiet, this came up again. all jobs on jenkins-worker-windows-publish were affected

Jason's manual change got lost (presumably a while ago), but I'll PR it now

SethTisue added a commit to SethTisue/scala-jenkins-infra that referenced this issue Jun 20, 2017
hopefully fixes scala#203, see remarks there for details
SethTisue added a commit to SethTisue/scala-jenkins-infra that referenced this issue Jun 20, 2017
hopefully fixes scala#203, see remarks there for details
@SethTisue SethTisue reopened this Sep 21, 2017
@SethTisue
Copy link
Member Author

SethTisue commented Sep 21, 2017

happened again today, all Windows jobs affected, and blowing away y:\workspace didn't fix it

@SethTisue
Copy link
Member Author

example job that hung on git clean even in a freshly cloned working dir: https://scala-ci.typesafe.com/job/scala-2.11.x-integrate-windows/779/console

@SethTisue
Copy link
Member Author

SethTisue commented Sep 21, 2017

https://scala-ci.typesafe.com/threadDump is useful for determining whether git clean completed or not before the hang

the thread dumps showed lots of threads hanging around monitoring swap space. Adriaan found this suspicious, disabled swap space monitoring (at https://scala-ci.typesafe.com/computer/configure), and restarted the node. everything seems to be working now (and all those threads don't show anymore)

@SethTisue
Copy link
Member Author

if this comes back, we should first try blowing away the workspace directory and restarting the node. if that doesn't do it, we should look at the swap space monitoring setting and see whether Adriaan's change needs to be re-applied. if it does, and if it that fixes it, we should PR the change

@SethTisue
Copy link
Member Author

oh, for the record, Adriaan also upgraded Git from 2.11 to 2.14, when we still thought it might actually be freezing during git clean

@SethTisue
Copy link
Member Author

today: all Windows jobs were freezing. deleting the workspaces didn't help. the swap space monitoring setting was still off, but there were a lot of swap space monitoring threads hanging around (though we still don't know if that's even a factor here). rebooting the node (by ssh-ing and doing shutdown -r) seems to have fixed it

@SethTisue
Copy link
Member Author

SethTisue commented Jan 11, 2018

happening again. clearly not fixed. I'll comment further if I ever find that ssh-ing to the Windows worker and doing shutdown -r isn't enough to (temporarily) cure it. having to do that every few weeks isn't great, but isn't horrible either

@SethTisue SethTisue reopened this Jan 11, 2018
@retronym
Copy link
Member

AFAIR, disabling swap space monitoring doesn't actually disable the background threads that start JNA, which race with the cygwin path plugin's use of JNA.

@retronym
Copy link
Member

There has been some activity on the Jenkins ticket, though sadly not to resolution:

I attached a build of an experimental plugin to this page; sources on GitHub: avoid-agent-jna-deadlock-plugin. It may work around the problem, and more easily than the previous workaround of configuring -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native on every agent (since you need merely install the plugin for the workaround to take effect). Without knowing how to reproduce the problem from scratch, I cannot confirm that it helps.

The JNA fix is as yet unreleased—scheduled for JNA 5.0.0 (due to its introducing an incompatible API change). Jenkins still uses 4.2.1. Updating to the current release 4.5.0 would not help in this regard, and I am loath to begin using an unreleased custom build or fork.

The direction we would like to take is to simply avoid using JNA at all from core, unless there is no plausible alternative. That has already been done in the case mentioned here, that of FilePath.deleteRecursive. See also workflow-support PR 48 which may help.

@SethTisue SethTisue removed their assignment Jan 22, 2018
@SethTisue
Copy link
Member Author

haven't seen this in a while, and should be superseded by scala/scala-dev#508 anyway (moving to AppVeyor for Windows testing; Windows publishing was already moved)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants